Hey,
I just decided to embark on learning everything about LWP :)
I bought the Perl & LWP book and wanted to start by learning some parsing: extracting headlines from a news site. The chapter on using tokens to extract headlines uses the BBC News site as its example. However, that example no longer works because the HTML around each headline has changed, so I decided to use the Reuters news headlines instead (Reuters business section). I'm having a bit of trouble with my code. The problem is that it prints nothing, so I figure I'm not doing the tokenizing part right (I do have all the modules installed).
So first, I looked for the headlines in the page source. The pattern is:
<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>
...etc etc as each headline is displayed
Here's my code to extract the headlines using HTML::TokeParser:
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;
print "Content-type: text/html\n\n";
my $filename = 'temp.html';
open FH, ">$filename";
print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
close FH;
my $stream = HTML::TokeParser->new('$filename')
  || die "Couldn't read HTML file $filename: $!";
while(my $token = $stream->get_token) {
  if ($token->[0] eq 'S' and $token->[1] eq 'td' and
      ($token->[2]{'class'} || '') eq 'earlyHeadline') {
    my(@next) = ($stream->get_token);
    if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and defined $next[0][2]{'href'} ) {
      #early headline found for business section/grab a href portion
      print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
      next Token;
    }
  }
}
The code looks for a <td class="earlyHeadline"> start tag, then checks whether the next token is an <a> tag with an href attribute. The line that should print the URL prints nothing, though =(. Can anyone point out what I'm doing wrong? Am I even on the right track?
Thanks,
Anthony
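For comparison, here is a minimal sketch of the same token loop with a few likely trouble spots changed. In the posted code, '$filename' is in single quotes, so HTML::TokeParser is handed the literal string $filename rather than temp.html; `next Token` refers to a label that is never declared; URI is never loaded explicitly; and the base passed to URI->new_abs is the local file name rather than the page URL. The sub name extract_headlines, the sample HTML string, and parsing from a string instead of a fetched file are assumptions made here so the sketch runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: returns a list of [absolute_url, headline_text] pairs.
sub extract_headlines {
    my ($html, $base_url) = @_;

    # A reference to a scalar tells TokeParser to parse the string itself.
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML";

    my @headlines;
    TOKEN: while (my $token = $stream->get_token) {
        # Only <td class="earlyHeadline"> start tags are interesting.
        next TOKEN unless $token->[0] eq 'S' and $token->[1] eq 'td';
        next TOKEN unless ($token->[2]{class} || '') eq 'earlyHeadline';

        my $next = $stream->get_token or last TOKEN;
        unless ($next->[0] eq 'S' and $next->[1] eq 'a'
                and defined $next->[2]{href}) {
            # Not a link: put the token back so the outer loop sees it.
            $stream->unget_token($next);
            next TOKEN;
        }

        # Collect the link text up to the closing </a>.
        my $text = $stream->get_trimmed_text('/a');
        my $url  = URI->new_abs($next->[2]{href}, $base_url)->as_string;
        push @headlines, [ $url, $text ];
    }
    return @headlines;
}

# Sample markup modelled on the pattern quoted in the post (assumed, not fetched).
my $sample = '<table><tr><td class="earlyHeadline">'
           . '<a href="newsArticle.jhtml?type=businessNews&amp;storyID=4511892&amp;section=news">'
           . 'SEC Targets More Fortune 500 Names</a></td></tr></table>';
my $base = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

for my $headline (extract_headlines($sample, $base)) {
    print "$headline->[1]\n  $headline->[0]\n";
}
```

To run this against the live page, the string could come from LWP::Simple's get($base), which returns undef on failure, so it is worth checking the result with defined before parsing.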