
I recently made a little script to build an index for the increasingly unwieldy Name Space thread. This led me into an area I've not explored before (one of many, I hasten to add): extracting info from HTML. I wanted something that would go through the page and pull out a node number and name for the first post by each monk who had contributed to the thread. What I came up with was this:

#!/usr/bin/perl -w

use strict;
use CGI qw(:standard :cgi-lib);
use LWP::Simple;

my $url = "http://perlmonks.org/index.pl?node_id=110166";
# LWP::Simple's get() returns undef on failure and does not set $!
my $html = get($url) or die "can't get $url\n";
my %names;

#find the names and node ids:
while ($html =~ s/=(\d*?)&lastnode_id=110166">[^<]*<\/A><BR> by <A HREF="\/index\.pl\?node_id=\d*&lastnode_id=110166">(.*?)<\/A> on \w{3} \d{2}, \d{4} at \d{2}:\d{2}//s) {
    $names{$2} = $1 unless $names{$2};
}

# print out a page of links to nodes:
for (sort { lc($a) cmp lc($b) } keys %names) {
    print "<A HREF=\"/index.pl?node_id=$names{$_}&lastnode_id=110166\">$_</A> | ";
}

... which does the job, BUT the regex is big and fat and ugly, and I wondered whether there was a more elegant, less impenetrable way to do it (i.e. I wondered how *many* such ways there were). I looked at HTML::Parser, but (and perhaps my inspection was too cursory) it didn't seem as though it would help me much in pulling out bits of tags, as I need to here. I also felt that the while loop was a bit clumsy, but I couldn't see a quicker way to capture two matches into a hash. I'd be very interested in any suggestions on how to do this better.
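One possible direction, offered as a rough sketch rather than a definitive answer: keep the same pattern, but spread it over several lines with /x, compile it once with qr//, and walk the page with m//g so that $html is not eaten away by s///. The sample HTML below is made up to stand in for the fetched page (the real markup may differ), and the names $thread and $post are just illustrative. I have also tightened \d*? and \d* to \d+, which assumes a node id is never empty.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for the fetched page; real markup may differ.
my $html = <<'HTML';
<A HREF="/index.pl?node_id=110170&lastnode_id=110166">Re: Name Space</A><BR> by <A HREF="/index.pl?node_id=1234&lastnode_id=110166">alice</A> on Sep 05, 2001 at 10:15
<A HREF="/index.pl?node_id=110180&lastnode_id=110166">Re: Re: Name Space</A><BR> by <A HREF="/index.pl?node_id=5678&lastnode_id=110166">Bob</A> on Sep 05, 2001 at 11:02
<A HREF="/index.pl?node_id=110190&lastnode_id=110166">Re: Name Space</A><BR> by <A HREF="/index.pl?node_id=1234&lastnode_id=110166">alice</A> on Sep 06, 2001 at 09:30
HTML

my $thread = 110166;

# Under /x, whitespace in the pattern is ignored, so literal spaces are
# written as [ ]. Using {} as delimiters avoids escaping every slash.
my $post = qr{
    =(\d+)&lastnode_id=$thread">[^<]*</A><BR>   # capture 1: node id of the post
    [ ]by[ ]
    <A[ ]HREF="/index\.pl\?node_id=\d+&lastnode_id=$thread">
    (.*?)</A>                                    # capture 2: name of the monk
    [ ]on[ ]\w{3}[ ]\d{2},[ ]\d{4}[ ]at[ ]\d{2}:\d{2}
}xs;

my %names;
while ($html =~ /$post/g) {
    my ($id, $name) = ($1, $2);
    # Keep only the first post seen for each monk.
    $names{$name} = $id unless exists $names{$name};
}

# Print the links, sorted case-insensitively, without a trailing separator.
print join(' | ',
    map  { qq{<A HREF="/index.pl?node_id=$names{$_}&lastnode_id=$thread">$_</A>} }
    sort { lc($a) cmp lc($b) } keys %names
), "\n";
```

The loop is still there, but m//g in scalar context resumes from where the previous match ended, so the page text is left intact and the two captures land in the hash on each pass.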

§ George Sherston

Edit: chipmunk 2001-11-11

In reply to Extract info from HTML by George_Sherston


