
Perl Newbie

by zeroc00l (Initiate)
on Aug 24, 2012 at 14:46 UTC (#989561=perlquestion)

zeroc00l has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I'm new to Perl and need it for a problem here at work. I must import 1200 HTML files from our old site into the new site, which is built with Joomla. I found a Perl script that imports everything except the date: the script takes the date from the file itself (its modification time) and, as you can imagine, that is my problem. I want the script to import the date printed in every HTML file, inside a tag that I declared as <date>...</date>.
First I pass the files through a PHP function I wrote to eliminate everything I don't need from the HTML, so I end up with a plain file containing just <title>....</title><p><date>...</date> the text of the news item</p>. I then use the Perl script to import all these files into Joomla, and everything goes fine except that the edit I made doesn't work as I expect. The whole script is posted below. PS: the script was written by Paco Hope.

#!/usr/bin/perl

=pod

=head1 - Import plain HTML files into Joomla database

=head1 SYNOPSIS

This script imports HTML files directly into Joomla's database. It works
with both Joomla 1.0.12 and Joomla 1.5rc1

=head1 DESCRIPTION

This script reads a directory to find all the HTML files and imports them
into a Joomla site. It reads the content of the HTML page and looks at the
<title> tag to determine the title of the page. It looks at the file’s
modification date to determine the “publication date” for Joomla, and then
it makes a MySQL database connection and executes the query.

I hacked the original script together in about an hour. It probably works,
but it is not for the faint of heart. If you barely understand Joomla, and
you’ve never looked at perl before, and you’re working from a Windows PC,
this isn't ready for you yet. If you’re using a Mac or Linux, or you’re
comfortable running Perl on Windows, this is pretty straightforward.

=head1 OPTIONS

This program takes the standard 2 options.

=over 6

=item B<-n>

Dry run. Don't do anything. Just show what would be done. It won't connect
to the database, it won't insert any files. It will just show you the names
and dates of all the files that it would try to insert.

=item B<-D>

Old Database. Use the 1.0.12 database schema. Otherwise, the 1.5 schema is
assumed. Use this option if you use Joomla 1.0.12.

=item B<-r>

Recurse. Descend subdirectories and find all their files and import them, too.
Otherwise, subdirectories are skipped and only the files in directory are
inserted into the site.

=item B<-f>

Use the file name as the title, not the <title> from inside the file. The
current regular expression will strip off '.(asp|aspx|htm|html|shtm|shtml)'
and use the remaining file name.

=item B<-F>

B<Append> the filename to the title of the page. For example, if the page
contains <title>New Page 1</title> and it is named I<menu.htm>, then this
will set the title (in Joomla) to be "New Page 1 menu". Again, as in B<-f>,
common file extensions are stripped off.

=back

=head1 OUTPUT

You'll see a line or two for each file processed.

=head1 ERRORS

The script blows up on errors. More details will follow.

=head1 DIAGNOSTICS

HTML files that have no <title> tags will be named "Article" unless B<-f>
or B<-F> are used.

If you use the wrong database type (e.g., you forget to specify B<-D> when
you should), it will blow up at the SQL level, but keep on truckin. In other
words, you'll get lots and lots of errors.

=head1 EXAMPLES

  perl -r html

=over 2

This will go into the B<html> directory and find pages and upload them.
Because of B<-r>, it will descend into any subdirectories that are found
and process HTML files there, too.

=back

=head1 AUTHOR

Paco Hope <>

=head1 COPYRIGHT

Copyright (C) 2007 Paco Hope <> Distributed under the BSD License.
(See the bottom of this file)

Original from Now at

=head1 SEE ALSO

DBI(3pm), DBD::MySQL(3pm), HTMLL::Parser

=head1 ACKNOWLEDGEMENTS

Thanks to David Glah from Cable & Wireless for the SQL update and
the impetus to do recursion.

=head1 INTERNAL DOCUMENTATION

The remainder of this documentation is per-function, internal documentation.
It is only intended for the developers and maintainers of this code.

=cut

use strict;
use HTML::Parser;
use POSIX qw(strftime);
use DBI;
use DBD::mysql;
use Getopt::Long;

# Here's the MySQL database stuff you need to configure
$db::user      = "root";
$db::passwd    = "zeroc00l";
$db::database  = "hsrgiglio";
$db::hostname  = "localhost";
$db::port      = "3306";
$db::tablename = "jos_content";
$db::ver       = "1.5";    # default

# state for all articles (1=published)
$j::state = 1;

# numeric Joomla section and category where you want the articles inserted
$j::section  = 1;
$j::category = 1;

# numeric creator ID (62 = admin) for all articles
$j::creator = 62;

# By default, do not recurse. Use -r to enable recursion.
$j::recurse = '';

###########
### No need to change anything below here
###########

# this first bit is right out of the HTML::Parser perldoc
sub title_handler {
    return if shift ne "title";
    my $self = shift;
    $self->handler( text => sub { $j::title = shift }, "dtext" );
    $self->handler(
        end => sub { shift->eof if shift eq "title"; },
        "tagname,self"
    );
}

sub date_handler {
    return if shift ne "date";
    my $self = shift;
    $self->handler( text => sub { $j::date = shift }, "dtext" );
    $self->handler(
        end => sub { shift->eof if shift eq "date"; },
        "tagname,self"
    );
}

#sub date_handler {
#    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
#    if ($tagname eq 'date') {
#        end => sub { shift->eof if shift eq "date"; },
#        "tagname,self"
#    }
#}

# Given a file name:
#   Parse it for <title>
#   Get its date from the filesystem
#   Insert it into the Joomla Database
sub insertFile {
    my $file = shift;
    my $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => \&title_handler, "tagname,self" );
    $p->handler( start => \&date_handler, "tagname,self" );
    $p->parse_file($file);

    # Get the mod time on the file, so we can set the creation time of the
    # Joomla article to that time. This blatently taken from perldoc -f stat
    my (
        $dev, $ino, $mode, $nlink, $uid, $gid, $rdev,
        $size, $atime, $mtime, $ctime, $blksize, $blocks
    ) = stat($file);

    # Break $mtime down into its constituent parts.
    # This taken from perldoc -f localtime
    my ( $sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst ) =
        localtime($mtime);

    # make a MySQL compatible date
    my $mysqlDate = strftime( "%F %T",
        $sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst );

    # Open the file and stick its entire contents into $htmlBody
    my $htmlBody;
    open HTMLFILE, "<$file";
    my $numread = read HTMLFILE, $htmlBody, $size;

    if( $j::dryrun ) {
        print " Titolo: \"$j::title\"\n";
#       print " Date: \"$mysqlDate\"\n";
        print " Data: \"$j::date\"\n";
    } else {
#       $db::sth->execute(
#           $j::title, $j::title, $htmlBody, $j::state, $j::section,
#           $j::category, $mysqlDate, $j::creator, $mysqlDate
#       );
        $db::sth->execute(
            $j::title, $j::title, $htmlBody, $j::state, $j::section,
            $j::category, $j::date, $j::creator, $j::date
        );
    }
}

=pod

=head2 sub processDir

Given a directory, process all the entries in the directory. If we have
-r on the command line, then we will recurse into directories that
we find. Otherwise, we skip them.

=cut

sub processDir {
    my $dir   = shift;
    my $entry = "";
    my $dirhandle;

    if ( !opendir( $dirhandle, $dir ) ) {
        warn "can't opendir $dir: $! (continuing)";
        return;
    }

    # Go through all the dir entries, but ignore '.' and '..'
    while ( $entry = readdir($dirhandle) ) {
        next if "$entry" eq ".";
        next if "$entry" eq "..";
        if ( -d "$dir/$entry" ) {
            print "Processing directory $dir/$entry\n";

            # if we have a directory, and we want to recurse, call
            # processDir on it.
            if ($j::recurse) {
                processDir("$dir/$entry");
            }
            next;
        }

        # Note that this ignores symbolic links, too.
        next unless -f "$dir/$entry";
        print " + $entry\n";
        insertFile("$dir/$entry");
    }
    closedir DIR;
}

###
### Begin Main
###

# Default title for our articles, if one isn't defined in the HTML
$j::title = "Article";

# Process command line arguments
GetOptions(
    'r' => \$j::recurse,
    'D' => \$j::dbver,
    'f' => \$j::useFileName,
    'F' => \$j::appendFileName,
    'n' => \$j::dryrun
);

if( $j::dbver ) {
    # -D means use old database (1.0.12)
    $db::ver = "1.0.12";
}

if ( $#ARGV != 0 ) {
    die "need a directory name ($#ARGV)";
} else {
    $j::dir = $ARGV[0];
    if ( !-r $j::dir ) {
        die "can't open $j::dir";
    }
    if ( !-d $j::dir ) {
        die "$j::dir is not a directory";
    }
}

$db::dsn = "DBI:mysql:database=$db::database;host=$db::hostname";

if( $j::dryrun ) {
    print "Would connect to $db::dsn with $db::user and pass xxxx\n";
} else {
    $db::dbh = DBI->connect( $db::dsn, $db::user, $db::passwd );
}

# Now build up the query
my $q = "INSERT INTO `$db::tablename` VALUES ";

# first int is the autoincrement field. We assume that will be set by MySQL
# date: 2007-07-04 21:07:51

# Depending on which version we've been asked to do
if ( $db::ver eq "1.0.12" ) {
    $q .= "(null, ?, ?, ?, '', ?, ?, 0, ?, ?, ?, '', '0000-00-00 00:00:00', ";
    $q .= "0, 0, '0000-00-00 00:00:00', ?, '0000-00-00 00:00:00', '', '', ";
    $q .= "'pageclass_sfx=\\nback_button=\\nitem_title=1\\nlink_titles=\\nintrotext=1\\n";
    $q .= "section=0\\nsection_link=0\\ncategory=0\\ncategory_link=0\\nrating=\\nauthor=\\n";
    $q .= "createdate=\\nmodifydate=\\npdf=\\nprint=\\nemail=\\nkeyref=\\ndocbook_type=', ";
    $q .= "1, 0, 1, '', '', 0, 0)";
} elsif ( $db::ver eq "1.5" ) {
    $q .= "(null, ?, '', ?, ?, '', ?, ?, 0, ?, ?, ?, '', '0000-00-00 00:00:00', ";
    $q .= "0, 0, '0000-00-00 00:00:00', ?, '0000-00-00 00:00:00', '', '', ";
    $q .= "'pageclass_sfx=\\nback_button=\\nitem_title=1\\nlink_titles=\\nintrotext=1\\n";
    $q .= "section=0\\nsection_link=0\\ncategory=0\\ncategory_link=0\\nrating=\\nauthor=\\n";
    $q .= "createdate=\\nmodifydate=\\npdf=\\nprint=\\nemail=\\nkeyref=\\ndocbook_type=', ";
    $q .= "1, 0, 1, '', '', 0, 0,'')";
}

if( $j::dryrun ) {
    print "Using Joomla database schema for version $db::ver\n";
} else {
    # Prepare the query once. We'll execute it many times.
    $db::sth = $db::dbh->prepare($q);
}

print "processing '$j::dir'\n";
processDir($j::dir);

if( ! $j::dryrun ) {
    $db::dbh->disconnect;
}

=pod

=head1 LICENSE

License Terms for this file. This is the BSD License. (

Copyright (c) 2007, Paco Hope
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

=over 2

=item -

Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

=item -

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

=item -

Neither the name of Paco Hope nor the names of its contributors may be
used to endorse or promote products derived from this software without
specific prior written permission.

=back

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.

=cut

Replies are listed 'Best First'.
Re: Perl Newbie
by cheekuperl (Monk) on Aug 24, 2012 at 17:37 UTC
    Could you try to:
    1) reduce the code size by eliminating the (seemingly) irrelevant parts of the code (e.g. database handling)
    2) reformat the problem statement so that it is easier to understand
Re: Perl Newbie
by aitap (Curate) on Aug 24, 2012 at 17:44 UTC
    my (
        $dev, $ino, $mode, $nlink, $uid, $gid, $rdev,
        $size, $atime, $mtime, $ctime, $blksize, $blocks
    ) = stat($file);
    my ( $sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst ) =
        localtime($mtime);
    # make a MySQL compatible date
    my $mysqlDate = strftime( "%F %T",
        $sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst );
    Try replacing this code with the date you parsed from the <date> tags ($j::date). You may need to parse and reformat the date; have a look at Date::Parse, localtime and strftime.
    Sorry if my advice was wrong.
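The reformatting step suggested above might look roughly like the sketch below. The thread never shows what the text inside <date> looks like, so the DD/MM/YYYY input format is purely an assumption, and to_mysql_datetime is a hypothetical helper name. The sketch also swaps in the core module Time::Piece for Date::Parse, only to avoid a CPAN dependency.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: turn a "DD/MM/YYYY" string (assumed format) into
# the "YYYY-MM-DD HH:MM:SS" text that a MySQL DATETIME column accepts.
sub to_mysql_datetime {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # trim whitespace left over from the HTML

    # strptime dies if the text does not match the format, so a bad
    # date is noticed before it reaches the database.
    my $t = Time::Piece->strptime( $text, "%d/%m/%Y" );
    return $t->strftime("%Y-%m-%d %H:%M:%S");
}

print to_mysql_datetime(" 24/08/2012 "), "\n";    # prints 2012-08-24 00:00:00
```

If the real files use another layout, only the format string passed to strptime should need to change.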
      Hi, thank you for the reply. Perhaps my question was badly worded, so I'll try to be clearer: with that script everything works except retrieving the date inside the <date>...</date> tag. I presume the problem is in this subroutine, which I copied from the one that gets the title of the page:
      # this first bit is right out of the HTML::Parser perldoc
      sub title_handler {
          return if shift ne "title";
          my $self = shift;
          $self->handler( text => sub { $j::title = shift }, "dtext" );
          $self->handler(
              end => sub { shift->eof if shift eq "title"; },
              "tagname,self"
          );
      }

      sub date_handler {
          return if shift ne "date";
          my $self = shift;
          $self->handler( text => sub { $j::date = shift }, "dtext" );
          $self->handler(
              end => sub { shift->eof if shift eq "date"; },
              "tagname,self"
          );
      }
      And then when I call it:
      sub insertFile {
          my $file = shift;
          my $p = HTML::Parser->new( api_version => 3 );
          $p->handler( start => \&title_handler, "tagname,self" );
          $p->handler( start => \&date_handler, "tagname,self" );
          $p->parse_file($file);
          .................
      If I delete the call to date_handler, the title handler works, but I lose the date; on the other hand, if I leave everything as I modified it, the date works but I lose the title. I hope it is clearer now... :)
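The symptom described here (only the handler registered last ever fires) is what one would expect from HTML::Parser, which keeps a single handler per event type: the second `$p->handler( start => ... )` call replaces the first. In addition, each handler calls eof at its own closing tag, so parsing would stop after whichever tag is matched first. A minimal sketch of one way around both points uses one set of handlers that dispatches on the tag name. The helper name extract_fields and the sample HTML are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return a hash mapping each wanted tag name to the
# text found inside it. One start/text/end handler set serves all tags.
sub extract_fields {
    my ( $html, @wanted ) = @_;
    my %is_wanted = map { $_ => 1 } @wanted;
    my %found;
    my $current;    # name of the wanted tag we are currently inside

    my $p = HTML::Parser->new( api_version => 3 );

    # Remember which wanted tag has just been opened.
    $p->handler(
        start => sub { my $tag = shift; $current = $tag if $is_wanted{$tag} },
        "tagname"
    );

    # Collect text only while inside a wanted tag; text may arrive in
    # several chunks, hence the append.
    $p->handler(
        text => sub { $found{$current} .= shift if defined $current },
        "dtext"
    );

    # Leave the tag when it closes, but keep parsing (no early eof).
    $p->handler(
        end => sub {
            my $tag = shift;
            undef $current if defined $current && $tag eq $current;
        },
        "tagname"
    );

    $p->parse($html);
    $p->eof;
    return %found;
}

my %f = extract_fields(
    '<title>Some news</title><p><date>24/08/2012</date> the text</p>',
    qw(title date),
);
print "title: $f{title}\n";
print "date:  $f{date}\n";
```

Inside insertFile, the two values could then be copied into $j::title and $j::date before the INSERT is executed.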
        Try running your code in the debugger. Does the $j::date variable get assigned? Perhaps you need it to look like the old one (I mean the same format; try Date::Parse and strftime("%F %T", localtime(str2time($j::date)))).
        Sorry if my advice was wrong.

Node Type: perlquestion [id://989561]
Approved by herveus