PerlMonks
HTML::TokeParser Tutorial
by crazyinsomniac (Prior) on Jul 24, 2001 at 11:16 UTC [id://99254]
NAME
HTML::TokeParser Tutorial (a.k.a. The CPAN Nodelet Faker)
DESCRIPTION
Want to parse HTML the right (and easy) way? Well, read this tutorial and you can!!! (I'd like to thank damian1301 and derek3000 for asking for help, which made me read the pod, and eventually write this.)
The CPAN Nodelet Faker (What's It Do?)
My example program, The CPAN Nodelet Faker, besides teaching you how to use HTML::TokeParser, fetches the latest 20 modules added to http://search.cpan.org/recent. You can download the source code (without the line numbers, ready to run), as well as this tutorial and sample input/output, from http://crazyinsomniac.perlmonk.org/perl/htmltokeparsertutorial
Why Didn't I Just Use HTML::LinkExtor?
This is an HTML::TokeParser tutorial. Besides, HTML::TokeParser will fit most, if not all, of your HTML parsing needs. And, anyway, HTML::LinkExtor is built on top of HTML::Parser, just like HTML::TokeParser.
HTML::TokeParser
My comments begin with # and are italicized.
DESCRIPTION (mostly verbatim from the pod)
HTML::TokeParser - Alternative HTML::Parser interface
# What's an n worth to ya -- why couldn't he just call it TokenParser?
# Maybe he's a hesher, who knows?
The HTML::TokeParser is an alternative interface to the HTML::Parser class. It basically turns the HTML::Parser inside out. You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document. No need to make a subclass to make the parser do anything. Calling the methods defined by the HTML::Parser base class will be confusing, so don't do that. Use the following methods instead:
FUNCTIONS
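Here is a rough sketch of how the main methods (new, get_tag, get_trimmed_text) fit together. The HTML string and the variable names are made up for illustration; only the method calls come from HTML::TokeParser itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample document (an assumption, not taken from the CPAN page).
my $html = '<html><head><title> A  tiny   page </title></head>'
         . '<body><a href="/one">First</a> and '
         . '<a href="/two">Second <b>link</b></a></body></html>';

# Pass a reference to parse a string; a plain scalar is treated as a filename.
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

# get_tag skips ahead to the next tag with the given name. For a start tag
# it returns [$tag, $attr, $attrseq, $text].
my $title = '';
if ($p->get_tag('title')) {
    # No argument: text up to the next tag, with whitespace collapsed.
    $title = $p->get_trimmed_text;
}

my @links;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};
    # With an end tag: all text up to </a>, even across nested tags like <b>.
    my $text = $p->get_trimmed_text('/a');
    push @links, [ $href, $text ];
}

print "title: $title\n";
print "$_->[0] => $_->[1]\n" for @links;
```

get_tag is handy when you only care about a few tag names; get_token, which the program below uses, hands you every token and lets you decide.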
TRIGGERS (there are only two)
The first trigger looks like:
<a href="/search?dist=cyrillic-2.08">cyrillic-2.08</a>
We're looking for a ``S''tart tag that is called ``a'', and whose href attribute begins with /search?dist=
The second trigger looks like:
<tr><td colspan=2> 115 distributions have been uploaded since 15th July 2001 </td></tr>
We're looking for a ``S''tart tag that is called ``td'', which has a ``colspan'' attribute whose value is ``2''
The catch phrase is: distributions have been uploaded
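To make the triggers concrete, here is a minimal sketch that runs both sample snippets through get_token. It is not the program from this tutorial: it checks the parsed attribute hash where the program matches the raw tag text for colspan, and the names @hits and $summary are made up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The two sample snippets from the text, glued into one fragment.
my $html = <<'HTML';
<a href="/search?dist=cyrillic-2.08">cyrillic-2.08</a>
<tr><td colspan=2> 115 distributions have been uploaded since 15th July 2001 </td></tr>
HTML

my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my @hits;      # hrefs that matched the anchor trigger
my $summary;   # text that matched the td trigger
while (my $token = $p->get_token) {
    # Start-tag tokens look like ["S", $tag, $attr, $attrseq, $text].
    next unless $token->[0] eq 'S';
    my ($tag, $attr) = @{$token}[1, 2];

    if ($tag eq 'a' and defined $attr->{href}
                    and $attr->{href} =~ m{^/search\?dist=}) {
        push @hits, $attr->{href};
    }
    elsif ($tag eq 'td' and ($attr->{colspan} || '') eq '2') {
        # Text up to the next tag, whitespace trimmed.
        my $text = $p->get_trimmed_text;
        $summary = $text if $text =~ /distributions have been uploaded/;
    }
}

print "link:    $_\n" for @hits;
print "summary: $summary\n" if defined $summary;
```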
LINE-by-LINE CODE EXPLANATION
Lines 1-5: self-explanatory (see perlman if you don't understand)
Lines 6-8: unbuffer output (autoflush)
Line 9: $cpanurl is the URL of the recently added CPAN modules
Lines 11-13: Declare the array that will contain the latest 20 modules. Initialize the scalar that will contain the number of modules that were added, along with the date. Attempt to ``download'' the page, and load its contents into $rawHTML using LWP::Simple::get.
Line 15: check to make sure get($cpanurl) returned something. We don't wanna create an entire HTML::TokeParser object if we have no data to feed it.
Line 18: create a new HTML::TokeParser object ($tp). The die statement is left over from when I passed it a filename, but it doesn't hurt much, and something can always go wrong.
--- Lines 22-77: START. Like Line 21 says, a generic HTML::TokeParser loop ;)
Line 25: dereference $token, shift the first value (token type), save it to $ttype.
Line 27: check to see if we have a start tag (as if you couldn't tell)
Line 29: since it was a start tag, $token is supposed to have 4 more values for us (which for clarity, we've named $tag, $attr, $attrseq, $rawtxt)
Line 31: check to see if we have an anchor (link)
Lines 32-36: since we have an anchor, fetch the value of href, as well as the text in between the opening and closing anchor tag. Since there can be other tokens in between (ex: <a href=""> ... <B>...</a>), even though this particular page won't have any, we use the explicit $tp->get_trimmed_text("/a");
Lines 40-42: push onto @newest20 an array reference, containing the value of the href attribute of our anchor, as well as the text enclosed by the anchor, but only if the href attribute contains our first trigger (/search?dist=)
Line 44: Since our $tag was not an anchor, we test to see if it is a ``td'' with a colspan of 2 (our second trigger).
Line 48: Since we do have a $tag that fits the general description, we go ahead and get the trimmed text up until the next token. (Comments follow, of the same importance as those on Lines 32-36)
Lines 58-59: if the trimmed text ($p_text) contains the catch phrase from our second trigger, we assign it to $lastupdated, thus completing half of our task.
Lines 61-73: if it's not a start tag, check to see if it's any other token type we recognize, and do nothing with that information, since for this particular program we don't need to.
Line 75: break out of the while loop if we got our latest 20 modules.
--- Lines 22-77: END. The end of the generic HTML::TokeParser loop.
Lines 79-80: at this point we don't need $rawHTML or $tp anymore, and since they're not going out of scope till the end of the program, we explicitly undef them.
Line 82: output the number of distributions that have been uploaded, but only if we were able to extract that information ($lastupdated contains something).
Lines 84-91: loop through @newest20 perl style, and output html anchors to the modules.
Lines 93-94: It never hurts to be explicit (end of the program).
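The difference between the two flavors of get_trimmed_text is easy to miss, so here is a small sketch on a made-up table cell (the cell content and variable names are assumptions). It shows how far each call moves the parser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up cell with a link inside it.
my $html = '<td colspan=2> 115 uploads <a href="/x">more</a> here </td>';

# Without an argument: stop at the very next tag.
my $p1 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p1->get_tag('td');
my $short = $p1->get_trimmed_text;
my $next1 = $p1->get_token;     # the <a> start tag is still waiting for us

# With "/td": gather all text up to the closing </td>, walking past <a>.
my $p2 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p2->get_tag('td');
my $long  = $p2->get_trimmed_text('/td');
my $next2 = $p2->get_token;     # the parser has moved on to the </td> end tag

print "no argument: '$short', next token: $next1->[0] $next1->[1]\n";
print "with /td:    '$long', next token: $next2->[0] $next2->[1]\n";
```

In the second case the anchor inside the cell has already been consumed, which is why the program avoids passing /td on a page where the interesting links sit inside the cell.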
LINE NUMBERED CODE LISTING
 1: #!/usr/bin/perl -w
 2:
 3: use strict ;                 # fun with whitespace
 4: use LWP::Simple;             # what's that? {provides get($url), just `perldoc`}
 5: require HTML::TokeParser;    # Why? because
 6:
 7: $|=1;                        # un buffer
 8:
 9: my $cpanurl = 'http://search.cpan.org/recent';
10:
11: my @newest20;                # the top 20
12: my $lastupdated = '';        # $n distributions have been uploaded since $date
13: my $rawHTML = get($cpanurl); # attempt to d/l the page to mem
14:
15: die "LWP::Simple messed up $!" unless ($rawHTML);
16: # Habit. if it's empty, TokeParser would notice
17:
18: my $tp = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";
19:
20:
21: # And now -- a generic HTML::TokeParser loop
22:
23: while (my $token = $tp->get_token)
24: {
25:     my $ttype = shift @{ $token };
26:
27:     if($ttype eq "S") # start tag?
28:     {
29:         my($tag, $attr, $attrseq, $rawtxt) = @{ $token };
30:
31:         if($tag eq "a")
32:         {
33:             my $a_href = $attr->{'href'};
34:             my $a_encl = $tp->get_trimmed_text("/$tag");
35:
36:             # be sure you understand what get_trimmed_text or get_text are doing
37:             # calling either (as well as get_tag) can drastically change
38:             # the cursor position
39:             # in general calling the no argument version is preferable here
40:
41:             push ( @newest20 , [ $a_href, $a_encl ] )
42:                 if( defined $a_href and $a_href =~ /\/search\?dist\=/ );
43:         }
44:         elsif( ($tag eq "td") and ($rawtxt =~ /colspan=2/m) )
45:         {
46:             # as opposed to checking the hash like exists $attr->{colspan}
47:
48:             my $p_text = $tp->get_trimmed_text; # p for potential
49:
50:             # fetches the "trimmed" up until the next "token"
51:             # passing /td to get_trimmed_text is not advisable, because
52:             # TokeParser would slurp all the text until the next closing /td
53:             # which would in effect cause us to skip halfway down the file
54:             # missing our target links (and pretty much all of them)
55:             # we could always call unget_token, but this is hard.
56:             # like swimming up river (but not as enjoyable)
57:
58:             $lastupdated = $p_text
59:                 if($p_text =~ /distributions have been uploaded/m);
60:         }
61:     } # since we know what we're looking for, no need for the rest of these
62:     elsif($ttype eq "T") # text?
63:     {
64:     }
65:     elsif($ttype eq "C") # comment?
66:     {
67:     }
68:     elsif($ttype eq "E") # end tag?
69:     {
70:     }
71:     elsif($ttype eq "D") # declaration?
72:     {
73:     }
74:
75:     last if(scalar @newest20 == 20); # we disappear once we get 20
76:
77: } # endof while (my $token = $tp->get_token)
78:
79: undef $rawHTML; # no more raw html
80: undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)
81:
82: print "<H5> $lastupdated </H5>\n" if($lastupdated); # just in case we miss it
83:
84: for my $arrayref (@newest20)
85: {
86:     print "<A HREF='http://search.cpan.org",
87:           $arrayref->[0], # the link straight from href
88:           "'>",
89:           $arrayref->[1], # the link text
90:           "</A><BR>\n";
91: }
92:
93: exit;
94: __END__
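If all you want is the links, the loop can probably be shortened with get_tag. The following is only a sketch of that idea, not a replacement for the listing: newest_links is a hypothetical helper, and the sample HTML and distribution names are made up so it runs without touching the network.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# newest_links is a hypothetical helper, not part of the original program.
# It returns up to $max [href, text] pairs whose href contains /search?dist=
sub newest_links {
    my ($html, $max) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
    my @found;
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href and $href =~ m{/search\?dist=};
        push @found, [ $href, $p->get_trimmed_text('/a') ];
        last if @found >= $max;
    }
    return @found;
}

# Made-up stand-in for the downloaded page.
my $sample = join "\n",
    '<a href="/">home</a>',
    '<a href="/search?dist=Foo-1.0">Foo-1.0</a>',
    '<a href="/search?dist=Bar-0.2">Bar-0.2</a>',
    '<a href="/search?dist=Baz-3.1">Baz-3.1</a>';

my @newest = newest_links($sample, 2);
print "$_->[0] ($_->[1])\n" for @newest;
```

This variant skips the td trigger entirely, so it would not pick up the "distributions have been uploaded" line; the get_token loop above is the one that handles both.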
Song in A minor
AM came from out the maze
Hitch-hiked on a 56k
Scratched his head, then tickled his 'board
Scratched his ass, and then was bored
He said, hey baby, PLEASE! do my work for me
She said, no way baby, i'm not that lonely
And the perled monks go: doo doo doo..

Crazy came from planet x
Saw some monk showin' his pecks
Scratched his head, then pounded his 'board
Checked politely, consider this node
He said, hey troll, take a walk on to /dev/null
Troll said, what, hey i'm not dumb
And the perled monks go: dasright R TT FF MM.