PerlMonks  

Re: HTML::TokeParser help - parsing headlines

by Popcorn Dave (Abbot)
on Mar 07, 2004 at 01:26 UTC ( [id://334563] )


in reply to HTML::TokeParser help - parsing headlines

Take a look at my scratchpad. There's a Perl program there that lets you see exactly what you're getting from HTML::TokeParser. You'll quickly see which tokens are assigned where and what you need to look for in the web source.

I used it when I was doing something very similar to what you're doing for parsing headlines on multiple web sites and it made the whole process quite easy.

Hope that helps!

Update: Thanks to suggestions from b10m and graff, I'm including the code here so future monks can find it in a Super Search.

    #!/usr/bin/perl -w
    # HTML::TokeParser dumper
    #
    # quick & dirty code to print out TokeParser output

    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    print "Content-type: text/html\n\n";

    # Save the page to a local file so TokeParser can read it
    my $filename = 'temp.html';
    open FH, ">$filename" or die "Couldn't write $filename: $!";
    print FH get("http://www.buchanie.co.uk/news.asp");
    close FH;

    my $stream = HTML::TokeParser->new($filename)
        || die "Couldn't read HTML file $filename: $!";

    while (my $token = $stream->get_token) {
        if ($token->[0] eq "S") {
            # Start tag: [ "S", $tag, \%attr, \@attrseq, $text ]
            print "Token:S 1:$token->[1]\n";
            foreach my $key (keys %{$token->[2]}) {
                print "Key: $key Value: ${$token->[2]}{$key}\n";
            }
            print "3: @{$token->[3]}\n4: $token->[4]\n\n";
        }
        elsif ($token->[0] eq "E") {
            # End tag: [ "E", $tag, $text ]
            print "Token:E 1:$token->[1] 2: $token->[2]\n\n";
        }
        elsif ($token->[0] eq "T") {
            # Text
            print "Token:T 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "C") {
            # Comment
            print "Token:C 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "D") {
            # Declaration
            print "Token:D 1:$token->[1]\n\n";
        }
        else {
            print "Unknown token $token\n\n";
        }
    }
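Once a dump of the tokens shows which tags wrap the headlines, pulling them out could look something like the sketch below. It is only an illustration: the sample markup, the `headline` class and the `@headlines` list are made up for the example, and a real site will almost certainly use different tags. It parses an in-memory string (by passing a scalar reference to `new`) so it runs without fetching anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; a real page will use different tags.
my $html = <<'HTML';
<html><body>
<h2 class="headline"><a href="/story1.html">First story</a></h2>
<p>Some teaser text.</p>
<h2 class="headline"><a href="/story2.html">Second   story</a></h2>
</body></html>
HTML

# A scalar reference tells TokeParser to parse the string itself
# rather than treat it as a file name.
my $stream = HTML::TokeParser->new(\$html)
    or die "Couldn't parse HTML string";

my @headlines;
while (my $tag = $stream->get_tag('h2')) {
    # For a start tag, get_tag returns [ $tag, \%attr, \@attrseq, $text ]
    next unless ($tag->[1]{class} || '') eq 'headline';

    # Skip ahead to the link inside the headline (assumed to exist)
    my $link = $stream->get_tag('a') or last;
    my $url  = $link->[1]{href};

    # Collect the text up to the closing </a>, with whitespace collapsed
    my $title = $stream->get_trimmed_text('/a');

    push @headlines, { url => $url, title => $title };
}

printf "%s => %s\n", $_->{title}, $_->{url} for @headlines;
```

The idea is to use `get_tag` and `get_trimmed_text` to jump straight to the parts the dump told you to look for, instead of walking every token by hand with `get_token`.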

There is no emoticon for what I'm feeling now.

Re: Re: HTML::TokeParser help - parsing headlines
by graff (Chancellor) on Mar 07, 2004 at 05:36 UTC
    Rather than providing a link to your scratchpad, why not post that code in some more stable wing of the Monastery (or include it in your reply), to make it a stable reference? People are likely to find this thread in a search for tips on HTML parsing at any time over the coming months or years, and you're likely to have put something else on your scratch pad by then...
      Actually, b10m suggested the same thing, so I'm taking the advice of both of you and updating my node. :)

      There is no emoticon for what I'm feeling now.
