Hey,

I just decided to embark on learning everything about LWP :)

I went out and bought the Perl & LWP book, and I'm starting out by learning some parsing: extracting headlines from a given news site. The chapter on using tokens to extract headlines uses the BBC News site as its example. However, that example no longer works because the HTML markup for each headline is now different, so I decided to use the Reuters news headlines instead (the Reuters business section). I'm having a bit of trouble with my code. The problem is that it prints nothing, so I figure I'm not doing the tokenizing part right (I do have all the modules installed).

So first, I looked for the headlines in the page source. The pattern I found is:

<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>
... and so on, once for each headline displayed.


Here's my code to extract the headlines using HTML::TokeParser:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;

print "Content-type: text/html\n\n";

my $filename = 'temp.html';
open FH, ">$filename";
print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
close FH;

my $stream = HTML::TokeParser->new('$filename')
  || die "Couldn't read HTML file $filename: $!";

while(my $token = $stream->get_token) {
  if ($token->[0] eq 'S' and $token->[1] eq 'td' and
      ($token->[2]{'class'} || '') eq 'earlyHeadline') {
    my(@next) = ($stream->get_token);
    if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and
        defined $next[0][2]{'href'} ) {
      #early headline found for business section/grab a href portion
      print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
      next Token;
    }
  }
}


The code looks for a <td class="earlyHeadline"> start tag, then checks whether the next token is an <a> start tag with an href attribute. But the line that should print the URL prints nothing =(. Can anyone point out what I'm doing wrong? Am I even on the right track?

Thanks,

Anthony

In reply to HTML::TokeParser help - parsing headlines by perleager
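
For comparison, here is a minimal sketch of the same token walk, under some stated assumptions. It parses a sample string copied from the headline pattern quoted in the post rather than fetching the live Reuters page, and the sub name extract_headlines is hypothetical. It differs from the posted code in four ways: the document is passed to HTML::TokeParser->new without single quotes (the single-quoted '$filename' is never interpolated), the URI module is loaded explicitly, the base passed to URI->new_abs is the page URL rather than the local file name, and the loop carries a label so that "next" has something to refer to.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Assumed base URL: the page the original post fetches.
my $base = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

# Sample markup copied from the headline pattern quoted in the post,
# plus one non-headline row to show that it is skipped.
my $html = <<'HTML';
<table>
<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>
<tr><td class="other"><a href="ignored.jhtml">Not a headline</a></td></tr>
</table>
HTML

# Hypothetical helper: returns a list of [headline text, absolute URL] pairs.
sub extract_headlines {
    my ($doc, $base_url) = @_;

    # A scalar reference makes TokeParser parse the string itself,
    # not open a file with that name.
    my $stream = HTML::TokeParser->new(\$doc)
        or die "Couldn't parse document";

    my @found;
    TOKEN: while (my $token = $stream->get_token) {
        # Only <td class="earlyHeadline"> start tags are interesting.
        next TOKEN unless $token->[0] eq 'S'
            and $token->[1] eq 'td'
            and ($token->[2]{class} || '') eq 'earlyHeadline';

        # The token right after the <td> must be an <a href="..."> start tag.
        my $next = $stream->get_token or last TOKEN;
        unless ($next->[0] eq 'S' and $next->[1] eq 'a'
                and defined $next->[2]{href}) {
            $stream->unget_token($next);   # put it back for the main loop
            next TOKEN;
        }

        my $url  = URI->new_abs($next->[2]{href}, $base_url);
        my $text = $stream->get_trimmed_text('/a');   # link text up to </a>
        push @found, [$text, "$url"];
    }
    return @found;
}

print "$_->[0]\n  $_->[1]\n" for extract_headlines($html, $base);
```

To run this against a saved page instead, pass the file name as a plain (unquoted or double-quoted) scalar, as in HTML::TokeParser->new($filename), and check that the file was actually written before parsing it.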
