
Hi all! I'm new to Perl and need it for a problem here at work. I must import 1200 HTML files from the old site into the new site, which is built with Joomla. I found a Perl script that imports everything I need except the date: it uses each file's modification date, and as you can imagine that is my problem. I want the script to import the date printed in every HTML file, inside a tag that I declared as <date>...</date>
First I pass the files through a PHP function I wrote to eliminate everything I don't need from the HTML, so I end up with a plain file containing just <title>....</title><p><date>...</date> the text of the news item</p>
Then I use the Perl script to import all these files into Joomla. Everything goes fine, except that the edit I made doesn't work as I expect. I'm posting the whole script below, inside code tags. PS: the script was written by Paco Hope.

#!/usr/bin/perl 
=pod

=head1 BulkImport.pl - Import plain HTML files into Joomla database

=head1 SYNOPSIS

This script imports HTML files directly into Joomla's database.
It works with both Joomla 1.0.12 and Joomla 1.5rc1

=head1 DESCRIPTION

This script reads a directory to find all the HTML files and imports them
into a Joomla site. It reads the content of the HTML page and looks at the
<title> tag to determine the title of the page. It looks at the file’s
modification date to determine the “publication date” for Joomla, and then
it makes a MySQL database connection and executes the query.

I hacked the original script together in about an hour. It probably works,
but it is not for the faint of heart. If you barely understand Joomla, and
you’ve never looked at perl before, and you’re working from a Windows PC,
this isn't ready for you yet. If you’re using a Mac or Linux, or you’re
comfortable running Perl on Windows, this is pretty straightforward.

=head1 OPTIONS

This program takes the following options.

=over 6

=item B<-n>

Dry run. Don't do anything. Just show what would be done. It won't connect to
the database, it won't insert any files. It will just show you the names and
dates of all the files that it would try to insert.

=item B<-D>

Old Database. Use the 1.0.12 database schema. Otherwise, the 1.5 schema is
assumed. Use this option if you use Joomla 1.0.12.

=item B<-r>

Recurse. Descend subdirectories and find all their files and import them, too.
Otherwise, subdirectories are skipped and only the files in directory are
inserted into the site.

=item B<-f>

Use the file name as the title, not the <title> from inside the file. The
current regular expression will strip off '.(asp|aspx|htm|html|shtm|shtml)'
and use the remaining file name.

=item B<-F>

B<Append> the filename to the title of the page. For example, if the page
contains <title>New Page 1</title> and it is named I<menu.htm>, then this
will set the title (in Joomla) to be "New Page 1 menu". Again, as in B<-f>,
common file extensions are stripped off.

=back

=head1 OUTPUT

You'll see a line or two for each file processed.

=head1 ERRORS

The script blows up on errors. More details will follow.

=head1 DIAGNOSTICS

HTML files that have no <title> tags will be named "Article" unless B<-f> or
B<-F> are used.

If you use the wrong database type (e.g., you forget to specify B<-D> when you
should), it will blow up at the SQL level, but keep on truckin. In other
words, you'll get lots and lots of errors.

=head1 EXAMPLES

perl BulkImport.pl -r html

=over 2

This will go into the B<html> directory and find pages and upload them.
Because of B<-r>, it will descend into any subdirectories that are found and
process HTML files there, too.

=back

=head1 AUTHOR

Paco Hope <paco@paco.to>

=head1 COPYRIGHT

Copyright (C) 2007 Paco Hope <paco@paco.to> Distributed under the BSD License.
(See the bottom of this file)

Original from http://paco.to/?p=191
Now at http://joomlacode.org/gf/project/bulkimport/

=head1 SEE ALSO

DBI(3pm), DBD::mysql(3pm), HTML::Parser

=head1 ACKNOWLEDGEMENTS

Thanks to David Glah from Cable & Wireless for the SQL update
and the impetus to do recursion.

=head1 INTERNAL DOCUMENTATION

The remainder of this documentation is per-function, internal documentation.
It is only intended for the developers and maintainers of this code.

=cut

use strict;
use HTML::Parser;
use POSIX qw(strftime);
use DBI;
use DBD::mysql;
use Getopt::Long;

# Here's the MySQL database stuff you need to configure
$db::user      = "root";
$db::passwd    = "zeroc00l";
$db::database  = "hsrgiglio";
$db::hostname  = "localhost";
$db::port      = "3306";
$db::tablename = "jos_content";
$db::ver       = "1.5";           # default

# state for all articles (1=published)
$j::state = 1;

# numeric Joomla section and category where you want the articles inserted
$j::section  = 1;
$j::category = 1;

# numeric creator ID (62 = admin) for all articles
$j::creator = 62;

# By default, do not recurse. Use -r to enable recursion.
$j::recurse = '';
###########
### No need to change anything below here
###########

# this first bit is right out of the HTML::Parser perldoc
sub title_handler {
    return if shift ne "title";
    my $self = shift;
    $self->handler( text => sub { $j::title = shift }, "dtext" );
    $self->handler(
        end => sub { shift->eof if shift eq "title"; },
        "tagname,self"
    );
}

sub date_handler {
    return if shift ne "date";
    my $self = shift;
    $self->handler( text => sub { $j::date = shift }, "dtext" );
    $self->handler(
        end => sub { shift->eof if shift eq "date"; },
        "tagname,self"
    );
}

#sub date_handler { 
#    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
#    if ($tagname eq 'date') {
#        end => sub { shift->eof if shift eq "date"; },
#        "tagname,self"
#    }
#}

# Given a file name:
# Parse it for <title>
# Get its date from the filesystem
# Insert it into the Joomla Database
sub insertFile {
    my $file = shift;

    my $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => \&title_handler, "tagname,self" );
    $p->handler( start => \&date_handler, "tagname,self" );
    $p->parse_file($file);


    # Get the mod time on the file, so we can set the creation time of the
    # Joomla article to that time. This blatantly taken from perldoc -f stat
    my (
        $dev,  $ino,   $mode,  $nlink, $uid,     $gid, $rdev,
        $size, $atime, $mtime, $ctime, $blksize, $blocks
    ) = stat($file);

    # Break $mtime down into its constituent parts.
    # This taken from perldoc -f localtime
    my ( $sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst ) =
      localtime($mtime);

    # make a MySQL compatible date
    my $mysqlDate = strftime(
        "%F %T", $sec,  $min,  $hour, $mday,
        $mon,    $year, $wday, $yday, $isdst
    );

    # Open the file and stick its entire contents into $htmlBody
    my $htmlBody;
    open( my $htmlfile, '<', $file ) or die "can't open $file: $!";
    my $numread = read $htmlfile, $htmlBody, $size;
    close $htmlfile;
    if( $j::dryrun ) {
        print "    Titolo: \"$j::title\"\n";
        # print "    Date: \"$mysqlDate\"\n";
        print "    Data: \"$j::date\"\n";
    } else {
        # $db::sth->execute(
        #    $j::title,    $j::title,  $htmlBody,   $j::state, $j::section,
        #    $j::category, $mysqlDate, $j::creator, $mysqlDate
        # );
        $db::sth->execute(
            $j::title,    $j::title,  $htmlBody,   $j::state, $j::section,
            $j::category, $j::date,   $j::creator, $j::date
        );
    }
}

=pod

=head2 sub processDir

Given a directory, process all the entries in the directory. If we have -r on
the command line, then we will recurse into directories that we find.
Otherwise, we skip them.

=cut

sub processDir {
    my $dir   = shift;
    my $entry = "";
    my $dirhandle;

    if ( !opendir( $dirhandle, $dir ) ) {
        warn "can't opendir $dir: $! (continuing)";
        return;
    }

    # Go through all the dir entries, but ignore '.' and '..'
    while ( $entry = readdir($dirhandle) ) {
        next if "$entry" eq ".";
        next if "$entry" eq "..";
        if ( -d "$dir/$entry" ) {
            print "Processing directory $dir/$entry\n";

            # if we have a directory, and we want to recurse, call
            # processDir on it.
            if ($j::recurse) {
                processDir("$dir/$entry");
            }
            next;
        }

        # Note that this ignores symbolic links, too.
        next unless -f "$dir/$entry";

        print "  + $entry\n";
        insertFile("$dir/$entry");
    }
    closedir $dirhandle;
}

###
### Begin Main
###

# Default title for our articles, if one isn't defined in the HTML
$j::title = "Article";

# Process command line arguments
GetOptions( 'r' => \$j::recurse,
            'D' => \$j::dbver,
            'f' => \$j::useFileName,
            'F' => \$j::appendFileName,
            'n' => \$j::dryrun );

if( $j::dbver ) {
    # -D means use old database (1.0.12)
    $db::ver = "1.0.12";
}

if ( $#ARGV != 0 ) {
    die "need a directory name ($#ARGV)";
}
else {
    $j::dir = $ARGV[0];
    if ( !-r $j::dir ) {
        die "can't open $j::dir";
    }
    if ( !-d $j::dir ) {
        die "$j::dir is not a directory";
    }
}

$db::dsn = "DBI:mysql:database=$db::database;host=$db::hostname";
if( $j::dryrun ) {
    print "Would connect to $db::dsn with $db::user and pass xxxx\n";
} else {
    $db::dbh = DBI->connect( $db::dsn, $db::user, $db::passwd )
      or die "can't connect to $db::dsn: $DBI::errstr";
}

# Now build up the query
my $q = "INSERT INTO `$db::tablename` VALUES ";

# first int is the autoincrement field. We assume that will be set by MySQL
# date: 2007-07-04 21:07:51

# Depending on which version we've been asked to do
if ( $db::ver eq "1.0.12" ) {
    $q .= "(null, ?, ?, ?, '', ?, ?, 0, ?, ?, ?, '', '0000-00-00 00:00:00', ";
    $q .= "0, 0, '0000-00-00 00:00:00', ?, '0000-00-00 00:00:00', '', '', ";
    $q .= "'pageclass_sfx=\\nback_button=\\nitem_title=1\\nlink_titles=\\nintrotext=1\\n";
    $q .= "section=0\\nsection_link=0\\ncategory=0\\ncategory_link=0\\nrating=\\nauthor=\\n";
    $q .= "createdate=\\nmodifydate=\\npdf=\\nprint=\\nemail=\\nkeyref=\\ndocbook_type=', ";
    $q .= "1, 0, 1, '', '', 0, 0)";
}
elsif ( $db::ver eq "1.5" ) {
    $q .= "(null, ?, '', ?, ?, '', ?, ?, 0, ?, ?, ?, '', '0000-00-00 00:00:00', ";
    $q .= "0, 0, '0000-00-00 00:00:00', ?, '0000-00-00 00:00:00', '', '', ";
    $q .= "'pageclass_sfx=\\nback_button=\\nitem_title=1\\nlink_titles=\\nintrotext=1\\n";
    $q .= "section=0\\nsection_link=0\\ncategory=0\\ncategory_link=0\\nrating=\\nauthor=\\n";
    $q .= "createdate=\\nmodifydate=\\npdf=\\nprint=\\nemail=\\nkeyref=\\ndocbook_type=', ";
    $q .= "1, 0, 1, '', '', 0, 0,'')";
}

if( $j::dryrun ) {
    print "Using Joomla database schema for version $db::ver\n";
} else {
    # Prepare the query  once. We'll execute it many times.
    $db::sth = $db::dbh->prepare($q);
}

print "processing '$j::dir'\n";
processDir($j::dir);
if( ! $j::dryrun ) {
    $db::dbh->disconnect;
}

=pod

=head1 LICENSE

License Terms for this file. This is the BSD License.
(http://opensource.org/licenses/bsd-license.php)
Copyright (c) 2007, Paco Hope
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

=over 2

=item - 

Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

=item -

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

=item -

Neither the name of Paco Hope nor the names of its contributors may be
used to endorse or promote products derived from this software without
specific prior written permission.

=back

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

=cut
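
A likely reason the edit misbehaves, offered as a guess rather than a confirmed diagnosis: HTML::Parser keeps only one handler per event. The second $p->handler( start => ... ) call in insertFile therefore replaces title_handler with date_handler, and the eof call in an end callback stops parsing before any later tag is seen. Below is a minimal sketch of one way around that, not the original author's code. A single set of handlers collects the text of both tags in one pass. The name extract_title_and_date and the sample HTML are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of BulkImport.pl): returns a hash ref
# holding the text of the first <title> and the first <date> element.
sub extract_title_and_date {
    my ($html) = @_;
    my %found;      # tag name => collected text
    my $current;    # tag whose text is being collected, if any

    my $p = HTML::Parser->new(
        api_version => 3,

        # One start handler dispatches on the tag name, so the two
        # tags no longer compete for the single "start" slot.
        start_h => [
            sub {
                my ($tag) = @_;
                return unless $tag eq 'title' || $tag eq 'date';
                return if exists $found{$tag};    # keep first occurrence
                $found{$tag} = '';
                $current = $tag;
            },
            'tagname'
        ],

        # Text may arrive in several chunks, so append instead of assign.
        text_h => [
            sub { $found{$current} .= $_[0] if defined $current },
            'dtext'
        ],

        # Stop collecting at the matching end tag. No eof here, so the
        # parser keeps going until the whole document has been read.
        end_h => [
            sub {
                my ($tag) = @_;
                undef $current if defined $current && $tag eq $current;
            },
            'tagname'
        ],
    );

    $p->parse($html);
    $p->eof;

    s/^\s+|\s+$//g for values %found;    # trim surrounding whitespace
    return \%found;
}

# Sample input shaped like the cleaned files described in the post.
my $sample = '<title>Island news</title>'
           . '<p><date>2007-07-04 21:07:51</date> the text of the news</p>';
my $info = extract_title_and_date($sample);
print "title: $info->{title}\n";
print "date:  $info->{date}\n";
```

In insertFile, something along these lines could stand in for the two handler registrations, with $j::title and $j::date assigned from the returned hash for each file. That would also stop a title from one file carrying over to the next when a file has no <title>.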
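
The script's own comment shows the date format the query expects (2007-07-04 21:07:51), so the text inside <date>...</date> may need converting before it is passed to execute. Here is a small sketch using the core Time::Piece module. The helper name to_mysql_datetime is hypothetical, and the input format day/month/year is purely an assumption, since the post does not say how the dates are written.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper (not part of BulkImport.pl). The default input
# format '%d/%m/%Y' is an ASSUMPTION; change it to match whatever the
# <date> tags really contain.
sub to_mysql_datetime {
    my ( $text, $format ) = @_;
    return undef unless defined $text;
    $format = '%d/%m/%Y' unless defined $format;

    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    # strptime dies when the text does not match the format,
    # so trap that and report failure as undef.
    my $t = eval { Time::Piece->strptime( $text, $format ) };
    return undef unless $t;

    # MySQL DATETIME layout: YYYY-MM-DD HH:MM:SS
    return $t->strftime('%Y-%m-%d %H:%M:%S');
}

print to_mysql_datetime(' 04/07/2007 '), "\n";
```

If the files already carry dates in the MySQL layout, no conversion is needed and $j::date can be passed through unchanged.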

In reply to Perl Newbie by zeroc00l
