HTML::TokeParser help - parsing headlines

by perleager (Pilgrim)
on Mar 07, 2004 at 00:33 UTC (#334559=perlquestion)

perleager has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I just decided to embark on learning everything about LWP :)

I went out to buy the Perl & LWP book to start out with learning some parsing by extracting headlines from a given news site. In the chapter about using tokens to extract headlines, they use the BBC News site as the example site. However, since the example no longer works because the HTML around each headline has changed, I decided to use Reuters news headlines (the Reuters business section). I'm having a bit of trouble with my code. The problem is that it prints out nothing, so I figure I'm not doing the token part right (I do have all the modules installed).

So first, I looked for the headlines in the source. I found that the pattern goes like this:

<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>
...etc etc as each headline is displayed


Here's my code to extract the headlines using HTML::TokeParser:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;

print "Content-type: text/html\n\n";

my $filename = 'temp.html';
open FH, ">$filename";
print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
close FH;

my $stream = HTML::TokeParser->new('$filename')
    || die "Couldn't read HTML file $filename: $!";

while(my $token = $stream->get_token) {
    if ($token->[0] eq 'S' and $token->[1] eq 'td' and
        ($token->[2]{'class'} || '') eq 'earlyHeadline') {
        my(@next) = ($stream->get_token);
        if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a'
            and defined $next[0][2]{'href'} ) {
            #early headline found for business section/grab a href portion
            print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
            next Token;
        }
    }
}


The code looks for the <td class="earlyHeadline">, and then the next part looks for the "a href". But the line that should print the URL prints nothing =(. Can anyone point out what I'm doing wrong? Am I even on the right track?

Thanks,

Anthony

Replies are listed 'Best First'.
Re: HTML::TokeParser help - parsing headlines
by Enlil (Parson) on Mar 07, 2004 at 01:41 UTC

    I believe you are on the right track. The first thing that I see that you are doing wrong is at the following line:

    my $stream = HTML::TokeParser->new('$filename') || die "Couldn't read HTML file $filename: $!";
    Since $filename is enclosed in single quotes, it will not interpolate, so you are looking for a file literally named $filename instead of the just-created file called 'temp.html'.
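
A minimal sketch of the quoting difference described above. The variable names $literal and $interpolated are illustrative only and do not appear in the original script:

```perl
use strict;
use warnings;

my $filename = 'temp.html';

# Single quotes: no interpolation, so this is the literal text "$filename".
my $literal = '$filename';

# Double quotes (or simply the bare variable) yield the variable's value.
my $interpolated = "$filename";

print "single-quoted: $literal\n";       # prints: single-quoted: $filename
print "double-quoted: $interpolated\n";  # prints: double-quoted: temp.html
```

In the script itself, passing $filename without any quotes at all is enough.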

    Second, you have:

    print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
    But you don't have use URI; at the top of your file, so the package and its method are missing when you call it.

    Lastly, you have next Token;, but you don't have a label Token, and realistically you don't need the next there either, as the loop will go straight into its next iteration whether or not the next is there.

    Upon making these three changes I believe the code will work as you intend.

    -enlil
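
A minimal sketch of the original loop with the three changes above applied. To keep it self-contained it parses an inline HTML snippet instead of fetching the live Reuters page; the snippet, the $base value, and the @links array are assumptions made for illustration and are not from the original post. The base URL uses http://www.reuters.com/ rather than $filename, following the suggestion made elsewhere in this thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;                       # needed for URI->new_abs

# Hypothetical stand-in for the downloaded page.
my $html = <<'HTML';
<table>
<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?storyID=4511892">SEC Targets More Fortune 500 Names</a></td></tr>
<tr><td class="other"><a href="ignored.jhtml">Not a headline</a></td></tr>
</table>
HTML

my $base = 'http://www.reuters.com/';

# A scalar reference tells HTML::TokeParser to parse the string itself.
my $stream = HTML::TokeParser->new(\$html)
    or die "Couldn't parse the HTML snippet";

my @links;
while (my $token = $stream->get_token) {
    # Only start tags of the form <td class="earlyHeadline"> are of interest.
    next unless $token->[0] eq 'S'
        and $token->[1] eq 'td'
        and ($token->[2]{class} || '') eq 'earlyHeadline';

    # The token right after the td should be the <a href="..."> start tag.
    my $next = $stream->get_token or last;
    if ($next->[0] eq 'S' and $next->[1] eq 'a' and defined $next->[2]{href}) {
        push @links, URI->new_abs($next->[2]{href}, $base);
    }
    # No "next" needed here: the while loop moves on by itself.
}

print "$_\n" for @links;   # prints: http://www.reuters.com/newsArticle.jhtml?storyID=4511892
```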

Re: HTML::TokeParser help - parsing headlines
by Ovid (Cardinal) on Mar 07, 2004 at 04:21 UTC

    If you switch to HTML::TokeParser::Simple, I think you'll be happy with how much clearer the logic is.

    use strict;
    use HTML::TokeParser::Simple;
    use LWP::Simple;
    use URI;

    my $url = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';
    my $stream = HTML::TokeParser::Simple->new(\get($url))
        || die "Couldn't read $url: $!";

    while(my $token = $stream->get_token) {
        next unless $token->is_start_tag('td') and
            ($token->return_attr('class') || '') eq 'earlyHeadline';
        my $next = $stream->get_token;
        if ($next->is_start_tag('a')) {
            print URI->new_abs($next->return_attr('href'), $url), "\n";
        }
    }

    Cheers,
    Ovid

    New address of my CGI Course.

      Hey,

      I adjusted my code, and it now extracts the URLs from the Reuters headlines.

      However, when printing out the urls it looks like:

      http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512094§ion=news
      http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512054§ion=news
      http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512041§ion=news

      If you copy and paste one of those URLs, it will bring you to a blank Reuters template, partly because the end of the URL, where it has "§ion=news", should really be "'&'section=news".

      Somehow it's translating the "'&'section=news" into "§ion=news".

      Could it be because I'm using MIME-Base32 and not MIME-Base64 module? --I'm on a Windows machine.

      Adjusted code:
      #!/usr/bin/perl -w
      use strict;
      use HTML::TokeParser;
      use LWP::Simple;
      use URI;

      print "Content-type: text/html\n\n";

      my $filename = 'temp.html';
      open FH, ">$filename";
      print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
      close FH;

      my $stream = HTML::TokeParser->new($filename)
          || die "Couldn't read HTML file $filename: $!";

      while(my $token = $stream->get_token) {
          if ($token->[0] eq 'S' and $token->[1] eq 'td' and
              ($token->[2]{'class'} || '') eq 'earlyHeadline') {
              my(@next) = ($stream->get_token);
              if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a'
                  and defined $next[0][2]{'href'} ) {
                  #early headline found for business section/grab a href portion
                  print URI->new_abs($next[0][2]{'href'}, 'http://www.reuters.com/'), "\n";
              }
          }
      }
      Thank you,
      Anthony
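
One possible explanation, not confirmed anywhere in this thread: the script sends Content-type: text/html, so a browser viewing the output may read the "&sect" at the start of "&section=news" as the HTML entity for the section sign. If that is what is happening, the URLs themselves are fine and only their display is affected, which would point to HTML escaping rather than to any MIME encoding module. A minimal sketch of escaping a URL before printing it into an HTML page, assuming HTML::Entities is installed (the $url and $safe names are illustrative):

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Example URL shaped like the ones in the post above (session id omitted).
my $url = 'http://www.reuters.com/newsArticle.jhtml?type=businessNews&storyID=4512094&section=news';

# Escape the characters that are special in HTML, so "&section" is emitted
# as "&amp;section" and cannot be mistaken for an entity.
my $safe = encode_entities($url, '<>&"');

print qq{<a href="$safe">$safe</a>\n};
```

Another option under the same assumption is to print the list with Content-type: text/plain, in which case no escaping is needed.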
Re: HTML::TokeParser help - parsing headlines
by Popcorn Dave (Abbot) on Mar 07, 2004 at 01:26 UTC
    Take a look at my scratchpad. There's a Perl program there that lets you see exactly what you're getting from HTML::TokeParser. You'll quickly see what tokens are assigned where and what you need to look for in the web source.

    I used it when I was doing something very similar to what you're doing for parsing headlines on multiple web sites and it made the whole process quite easy.

    Hope that helps!

    Update: Thanks to suggestions from b10m and graff I'm including the code here so future monks can find it in a super search.

    #!/usr/bin/perl -w
    # HTML::TokeParser dumper
    #
    # quick & dirty code to print out TokeParser output
    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    print "Content-type: text/html\n\n";

    my $filename = 'temp.html';
    open FH, ">$filename";
    print FH get("http://www.buchanie.co.uk/news.asp");
    close FH;

    my $stream = HTML::TokeParser->new($filename)
        || die "Couldn't read HTML file $filename: $!";

    while(my $token = $stream->get_token) {
        if ($token->[0] eq "S"){
            print "Token:S 1:$token->[1]\n";
            foreach my $key(keys %{$token->[2]}){
                print "Key: $key Value: ${$token->[2]}{$key}\n";
            }
            print "3: @{$token->[3]}\n4: $token->[4]\n\n";
        }
        elsif ($token->[0] eq "E"){
            print "Token:E 1:$token->[1] 2: $token->[2]\n\n";
        }
        elsif ($token->[0] eq "T"){
            print "Token:T 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "C"){
            print "Token:C 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "D"){
            print "Token:D 1:$token->[1]\n\n";
        }
        else {print "Unknown token $token\n\n";}
    }

    There is no emoticon for what I'm feeling now.

      Rather than providing a link to your scratchpad, why not post that code in some more stable wing of the Monastery (or include it in your reply), to make it a stable reference? People are likely to find this thread in a search for tips on HTML parsing at any time over the coming months or years, and you're likely to have put something else on your scratch pad by then...
        Actually b10m suggested the same thing so I'm taking the advice of both of you and updating my node. :)

        There is no emoticon for what I'm feeling now.

Re: HTML::TokeParser help - parsing headlines
by sheep (Chaplain) on Mar 07, 2004 at 01:50 UTC

    Hello,

    One additional thing to what Enlil said:
    URI->new_abs($next[0][2]{'href'}, $filename),
    you are calling it with $filename as the base,
    but the base for your URL is "http://www.reuters.com/", not your temporary file name.

    -Sheep
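
A minimal sketch of the base URL point above, assuming the URI module is installed. The $href value is the link from the original question; $base is the page the script downloads, and the variable names are illustrative:

```perl
use strict;
use warnings;
use URI;

# Relative link as it appears in the page source.
my $href = 'newsArticle.jhtml?type=businessNews&storyID=4511892&section=news';

# The base must be the URL the page was fetched from, not a local file name.
my $base = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

my $abs = URI->new_abs($href, $base);
print "$abs\n";
# prints: http://www.reuters.com/newsArticle.jhtml?type=businessNews&storyID=4511892&section=news
```

Using 'http://www.reuters.com/' as the base gives the same result here, since the link sits at the top level of the site.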
