Take a look at my scratchpad. There's a Perl program there that lets you see exactly what you're getting from HTML::TokeParser. You'll quickly see which tokens are assigned where and what you need to look for in the web source.
I used it when I was doing something very similar to what you're doing for parsing headlines on multiple web sites and it made the whole process quite easy.
Hope that helps!
Update: Thanks to suggestions from b10m and graff I'm including the code here so future monks can find it in a super search.
#!/usr/bin/perl -w
# HTML::TokeParser dumper
#
# quick & dirty code to print out TokeParser output
use strict;
use HTML::TokeParser;
use LWP::Simple;
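# send plain text so a browser shows the raw dump instead of rendering the tags
print "Content-type: text/plain\n\n";
my $filename = 'temp.html';
# fetch the page; LWP::Simple's get() returns undef on failure
my $html = get("http://www.buchanie.co.uk/news.asp");
die "Couldn't fetch the page" unless defined $html;
open my $fh, '>', $filename or die "Couldn't write $filename: $!";
print $fh $html;
close $fh;
my $stream = HTML::TokeParser->new($filename)
    || die "Couldn't read HTML file $filename: $!";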
while (my $token = $stream->get_token) {
    # start tag: [S, tag, attribute hash, attribute order, original text]
    if ($token->[0] eq "S") {
        print "Token:S 1:$token->[1]\n";
        foreach my $key (keys %{$token->[2]}) {
            print "Key: $key Value: ${$token->[2]}{$key}\n";
        }
        print "3: @{$token->[3]}\n4: $token->[4]\n\n";
    }
    # end tag: [E, tag, original text]
    elsif ($token->[0] eq "E") {
        print "Token:E 1:$token->[1] 2: $token->[2]\n\n";
    }
    # text: [T, text, is_data]
    elsif ($token->[0] eq "T") {
        print "Token:T 1:$token->[1]\n\n";
    }
    # comment: [C, text]
    elsif ($token->[0] eq "C") {
        print "Token:C 1:$token->[1]\n\n";
    }
    # declaration: [D, text]
    elsif ($token->[0] eq "D") {
        print "Token:D 1:$token->[1]\n\n";
    }
    # anything else (e.g. processing instructions): show the type code
    else {
        print "Unknown token type $token->[0]\n\n";
    }
}
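Once the dump shows you which tags surround the text you want, pulling it out is usually just a get_tag / get_trimmed_text loop. Here's a minimal sketch of that step. The markup is made up for illustration: I'm assuming the dump showed each headline sitting inside an `<h2 class="headline">` element, and the `headlines` sub and the sample HTML string are hypothetical, not something taken from the site above. Swap in whatever tag and attribute your own dump reveals.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: assume the dump showed headlines inside
# <h2 class="headline"> ... </h2>. Adjust to match your own dump.
my $html = <<'HTML';
<html><body>
<h2 class="headline">First story</h2>
<p>Some text</p>
<h2 class="other">Not a headline</h2>
<h2 class="headline">Second <b>story</b></h2>
</body></html>
HTML

# Return the text of every <h2 class="headline"> in the document.
sub headlines {
    my ($doc) = @_;

    # Passing a reference to a string parses it in memory (no temp file)
    my $p = HTML::TokeParser->new(\$doc)
        or die "Couldn't parse HTML string";

    my @found;
    # get_tag('h2') skips ahead to the next <h2> start tag and returns
    # [tag, attribute hash, attribute order, original text]
    while (my $tag = $p->get_tag('h2')) {
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'headline';

        # Collect all text up to the closing </h2>, ignoring nested tags
        push @found, $p->get_trimmed_text('/h2');
    }
    return @found;
}

print "$_\n" for headlines($html);
```

Parsing from a string reference keeps the example self-contained; with a real page you would pass the filename (as in the dumper above) or the string returned by LWP::Simple's get().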
There is no emoticon for what I'm feeling now.