http://qs321.pair.com?node_id=334559

perleager has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I just decided to embark on learning everything about LWP :)

I bought the Perl & LWP book and wanted to start by learning some parsing: extracting headlines from a given news site. The chapter on using tokens to extract headlines uses the BBC News site as its example. That example no longer works because the HTML around each headline has changed, so I decided to use Reuters news headlines (the business section) instead. I'm having a bit of trouble with my code: it prints nothing, so I figure I'm not doing the tokenizing part right (I do have all the modules installed).

So the first thing I did was look for the headlines in the page source. The pattern for each headline is:

<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>

...and so on for each headline displayed.


Here's my code to extract the headlines using HTML::TokeParser:

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    print "Content-type: text/html\n\n";

    my $filename = 'temp.html';
    open FH, ">$filename";
    print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
    close FH;

    my $stream = HTML::TokeParser->new('$filename')
        || die "Couldn't read HTML file $filename: $!";

    while(my $token = $stream->get_token) {
        if ($token->[0] eq 'S' and $token->[1] eq 'td' and
            ($token->[2]{'class'} || '') eq 'earlyHeadline') {
            my(@next) = ($stream->get_token);
            if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and
                defined $next[0][2]{'href'} ) {
                #early headline found for business section/grab a href portion
                print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
                next Token;
            }
        }
    }


The code looks for <td class="earlyHeadline">, then checks whether the next token is the opening <a href="..."> tag. But the line that should print the URL prints nothing =(. Can anyone point out what I'm doing wrong? Am I even on the right track?

Thanks,

Anthony
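
For comparison, here is a minimal sketch of the token loop described above, run against an in-memory copy of the sample markup instead of the live page. It is only an illustration under a few assumptions: the helper name headline_links is hypothetical, the sample HTML is the single row quoted in the post, and the base URL is the page address used in the script. Two details differ from the script above and may be worth checking there. The parser is handed a reference to the document string, whereas 'temp.html' wrapped as '$filename' in single quotes is not interpolated by Perl, so the constructor would most likely fail and the die message would go to the error log, not the page. Relative links are also resolved against the page URL and not against the local file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper (not from the book or the post): returns the absolute
# URL of every <a href> that directly follows a <td class="earlyHeadline">.
sub headline_links {
    my ($html, $base_url) = @_;

    # A scalar reference tells HTML::TokeParser to parse the string itself
    # instead of treating the argument as a file name.
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML: $!";

    my @links;
    while (my $token = $stream->get_token) {
        # Skip everything except <td class="earlyHeadline"> start tags.
        next unless $token->[0] eq 'S'
            and $token->[1] eq 'td'
            and ($token->[2]{class} || '') eq 'earlyHeadline';

        my $next = $stream->get_token or last;
        if ($next->[0] eq 'S' and $next->[1] eq 'a'
            and defined $next->[2]{href}) {
            # Resolve the relative href against the page it came from.
            push @links, URI->new_abs($next->[2]{href}, $base_url)->as_string;
        }
        else {
            # Not a link: put the token back so the main loop sees it.
            $stream->unget_token($next);
        }
    }
    return @links;
}

# Sample row copied from the post; the base URL is the page from the script.
my $sample = '<tr><td class="earlyHeadline">'
           . '<a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">'
           . 'SEC Targets More Fortune 500 Names</a></td></tr>';
my $base = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

print "$_\n" for headline_links($sample, $base);
```

In a CGI setting, the same helper could be fed the string returned by LWP::Simple's get() directly, which would avoid the temporary file altogether.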