http://qs321.pair.com?node_id=89678

arashi has asked for the wisdom of the Perl Monks concerning the following question:

I would like to search a series of directories and sub-directories for HTML files and return the file names and title tags into an array.

I think I'm clear on the logic of the problem (open dir, read files, parse data, read next dir), but I am currently at a loss as to how to implement it. Could someone please point me in the right direction? I spent some time this morning looking for documentation on how to do this, without success.

Thanks for your help.

Arashi

I'm sure Edison turned himself a lot of colors before he invented the lightbulb. - H.S.
  • Comment on Searching directories for HTML title tags

Re: Searching directories for HTML title tags
by bikeNomad (Priest) on Jun 19, 2001 at 20:15 UTC
Re: Searching directories for HTML title tags
by jorg (Friar) on Jun 19, 2001 at 20:15 UTC
    To set you on your way :
  • Use File::Find to recurse through subdirectories; the module docs provide very clear examples of its usage. You will have access to each file in each directory, so match on .html or .htm to get only the files you want.
  • To get to the title tag you might be able to use HTML::Parser
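To make the first suggestion concrete, here is a minimal sketch of the File::Find step using only core modules. The helper name find_html_files is made up for illustration, and the extension pattern is an assumption about what counts as an HTML file on your server.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every plain file under $root whose name
# ends in .html, .htm or .shtml (any letter case).
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including the directory.
            return unless -f $_;
            push @found, $File::Find::name if /\.s?html?\z/i;
        },
        $root
    );
    my @sorted = sort @found;
    return @sorted;
}
```

Calling find_html_files('/base/path') should then give you the list of paths to feed to whatever parser you pick for the title tag.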

    enjoy!

    Update: bah, bikeNomad is clearly psychic, as he posted exactly the same suggestions... we are actually the same person, I'm Dr. Jekyll

    Jorg

    "Do or do not, there is no try" -- Yoda
Re: Searching directories for HTML title tags
by arashi (Priest) on Jun 20, 2001 at 00:53 UTC
    I'd like to thank everyone who offered their input on my problem; I got everything working. And no, this wasn't for a class. It was a "busy-work" assignment that got handed down at work, and I wanted to use Perl both to save time and to learn something new.

    Here is my completed code:

    use strict;
    use warnings;
    use diagnostics;
    use File::Find;
    use HTML::HeadParser;

    my $parser = new HTML::HeadParser;
    my @data;
    my $path = '/base/path';

    &main;

    sub main {
        find(\&html_files, $path);
        # "or" rather than "||": with "||" the die is attached to the
        # file name string and can never fire
        open OUT, "+>filelist.html" or die "Can not write file";
        print OUT '<html><head><title>File List</title></head><body><center>', "\n",
            '<table border="1" cellpadding="5" cellspacing="0">', "\n";
        foreach my $file (sort @data) {
            my $htmlPage = &fileRead("<$file");
            $parser->parse($htmlPage);
            my $pageTitle = $parser->header('Title');
            if ($pageTitle eq "") {
                $pageTitle = '&nbsp;';
            }
            print OUT '<tr><td>', "\L$file\E", '</td><td>', $pageTitle, '</td></tr>', "\n";
        }
        print OUT '</table></body></html>';
        close OUT;
    }

    sub html_files {
        push @data, $File::Find::name if /\.s?html?$/;
        push @data, $File::Find::name if /\.s?HTML?$/;
        push @data, $File::Find::name if /\.s?htm?$/;
        push @data, $File::Find::name if /\.s?HTM?$/;
    }

    sub fileRead {
        my ($file) = @_;
        my $dataIn = undef;
        open IN, $file or die "Can not open $file";
        while (<IN>) {
            my $temp = $_;
            $dataIn = $dataIn.$temp;
        }
        close IN;
        return $dataIn;
    }

    Arashi

    I'm sure Edison turned himself a lot of colors before he invented the lightbulb. - H.S.
      A couple of quick notes. You can sharpen the regex in html_files():
      sub html_files { push @data, $File::Find::name if /\.s?html?$/i; }
      I'd pass in $File::Find::name as a parameter just to encapsulate things further. merlyn might point out that $ as an anchor also matches just before a newline at the end of the filename, but that shouldn't be a problem here. (\z is safer.) Finally, I don't know why you have m?, but using /i makes the regex case-insensitive, so one pattern covers all four. Saves time.
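A small sketch of the anchor point above, in case it helps to see it run. The two helper names and the sample file names are made up for illustration; the only difference between the patterns is the final anchor.

```perl
use strict;
use warnings;

# Same pattern twice, once anchored with $ and once with \z.
sub is_html_dollar { my ($name) = @_; return $name =~ /\.s?html?$/i  ? 1 : 0 }
sub is_html_z      { my ($name) = @_; return $name =~ /\.s?html?\z/i ? 1 : 0 }

for my $name ("index.html", "INDEX.SHTML", "page.htm", "odd.html\n", "notes.txt") {
    (my $shown = $name) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "%-12s  dollar: %d  z: %d\n", $shown, is_html_dollar($name), is_html_z($name);
}
```

The name with the trailing newline is accepted by the $ version and rejected by the \z version; the ordinary names behave the same under both.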

      I'd also get rid of $temp in fileRead(). Always bugs me. :)
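One way to drop $temp, sketched under the assumption that you are happy to switch to the three-argument open (so the caller passes a plain path instead of "<$file"). The name file_read is just for illustration.

```perl
use strict;
use warnings;

# Read a whole file into one string; dies with the OS error if it cannot be opened.
sub file_read {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    local $/;                 # undef record separator: the next read returns everything
    my $data = <$in>;
    close $in;
    return defined $data ? $data : '';
}
```

The local keeps the change to $/ confined to this sub, so line-by-line reads elsewhere in the program are unaffected.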

Re: Searching directories for HTML title tags
by arturo (Vicar) on Jun 19, 2001 at 20:16 UTC

    I'll just give you some structural information, as I haven't used the latest version of HTML::Parser, a CPAN module that you should seriously consider using.

    One component is File::Find. You can use that to find .html, .htm, .shtml, and whatever extensions constitute "being an HTML file" as far as your webserver is concerned. File::Find will recurse through subdirectories, and it's easy enough to get it to return an array of filenames. As far as finding <title> tags, the most robust solution would be to use HTML::Parser, which takes a lot of different oddities of HTML code into account (e.g. what if the content of the tag extends over two lines?).

    When you say you want to return the data "in an array", I'm assuming that what you want to do is store two pieces of data for each file: the name of the file, and the content of the title tag therein. Depending on your needs, you might try storing this information as a hash, where the keys are the filenames and the values are the corresponding titles. The following code will get you an array of HTML files:

    Update: the code below now uses the correct $File::Find::name, which contains the full path to the file, rather than $_, which is just the name of the file.

    use File::Find;
    # this code doesn't make use of the module, but I really think
    # you should use it in your code =)
    use HTML::Parser;

    my @data;
    find(\&html_files, "/base/path");
    # now process @data, which is a list of filenames

    sub html_files {
        push @data, $File::Find::name if /\.s?html?$/;
    }
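The hash idea (filenames as keys, titles as values) could be sketched roughly like this. Note the assumption: extract_title is a crude regex stand-in so the sketch runs with core modules only; it is not a substitute for HTML::Parser, and both sub names are made up.

```perl
use strict;
use warnings;
use File::Find;

# Crude stand-in for a real HTML parser: text of the first <title> element.
# The /s flag lets the title span several lines.
sub extract_title {
    my ($html) = @_;
    return '' unless $html =~ m{<title[^>]*>(.*?)</title>}is;
    my $title = $1;
    $title =~ s/\s+/ /g;          # collapse line breaks and runs of spaces
    $title =~ s/^\s+|\s+$//g;     # trim both ends
    return $title;
}

# Returns a hash reference: full path => title ('' when no title is found).
sub titles_by_file {
    my ($root) = @_;
    my %title_for;
    find(
        sub {
            return unless -f $_ && /\.s?html?\z/i;
            open my $fh, '<', $_ or return;    # silently skip unreadable files
            local $/;                          # slurp the whole file
            my $html = <$fh>;
            close $fh;
            $title_for{$File::Find::name} = extract_title(defined $html ? $html : '');
        },
        $root
    );
    return \%title_for;
}
```

To print the result, loop over sort keys %$titles, where $titles is the value returned by titles_by_file('/base/path').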

    HTH

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Searching directories for HTML title tags
by clemburg (Curate) on Jun 19, 2001 at 20:16 UTC

    I hope this little piece of Windows Perl is enough to start you up ... assuming it's not homework ...

    > perl -MFile::Find -le "sub wanted {return unless /\.html?$/; my $f = $_; open(FH, '<'.$f); $/=undef; return unless <FH> =~ /<title>([^<]*)<\/title>/i; print $1.' -- in: '.$File::Find::name;}; find(\&wanted, '.')"

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Re: Searching directories for HTML title tags
by wog (Curate) on Jun 19, 2001 at 20:33 UTC
    In addition to the HTML::Parser module mentioned, there is also a module specialized for a task like this called HTML::HeadParser that might be better.
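A rough sketch of how HTML::HeadParser can be used for this, assuming the module is installed (it comes with the HTML-Parser distribution on CPAN). The regex fallback for machines without the module is an addition of this sketch, not something the module provides, and head_title is a made-up name.

```perl
use strict;
use warnings;

# Return the document title, or '' if there is none.
sub head_title {
    my ($html) = @_;
    if (eval { require HTML::HeadParser; 1 }) {
        my $p = HTML::HeadParser->new;     # fresh parser for each document
        $p->parse($html);
        my $title = $p->header('Title');
        return defined $title ? $title : '';
    }
    # Fallback (assumption of this sketch): crude regex, only used when
    # HTML::HeadParser cannot be loaded.
    return $html =~ m{<title[^>]*>(.*?)</title>}is ? $1 : '';
}
```

The parser only looks at the head section, so it can stop early on large pages instead of working through the whole body.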
Re: Searching directories for HTML title tags
by dimmesdale (Friar) on Jun 19, 2001 at 20:18 UTC
    Look at the module File::Find (it ships with Perl). It can traverse the directories for you, saving you the code. Then you can just grep the files you find with a regex such as /\.html?$/. If you'd rather not use File::Find, then your best bet would be to implement a recursive subroutine that returns all the file names as a list (possibly as a hash of hashes if you want to provide more information about the files). If there's anything else, just ask.

    Well, it appears that many people beat me to the question, and all offered similar advice (if not better). Oh, well.
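For completeness, the hand-rolled recursive subroutine mentioned above might look something like the following. This is only a sketch: walk_html is a made-up name, paths are joined with "/", and symlinked directories are skipped to avoid loops, which is a choice of this sketch rather than anything from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: walk $dir by hand and return the paths of HTML-looking files.
sub walk_html {
    my ($dir) = @_;
    opendir my $dh, $dir or return;             # skip directories we cannot read
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my @found;
    for my $entry (sort @entries) {
        my $path = "$dir/$entry";
        if (-d $path && !-l $path) {            # recurse, but do not follow symlinks
            push @found, walk_html($path);
        }
        elsif (-f $path && $entry =~ /\.s?html?\z/i) {
            push @found, $path;
        }
    }
    return @found;
}
```

File::Find does the same job with more care around edge cases, so this is mainly useful for seeing what the module saves you from writing.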