PerlMonks  

Yes and yes. Like I said above, it was a simple oversight on my part. Here is a more "clear" example ;D The depth 1 nodes are described after __END__. Since my code is specific to the task of extracting depth 1 nodes (now that I have properly ensured that), I like it better than demerphq's. Don't get me wrong, I like his tree; it's more generic and probably more useful, but for this particular task, it's HTML::TokeParser to the rescue:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $url = "http://perlmonks.org/index.pl?node_id=110166";
my $rawHTML = get($url); # attempt to d/l the page to mem
die "LWP::Simple messed up $!" unless $rawHTML;

my ( $tp, %monks );
# don't interpolate $tp in the message: it is undef if new() failed
$tp = HTML::TokeParser->new(\$rawHTML) or die "WTF, TokeParser gone bad: $!";

# And now -- a generic HTML::TokeParser loop
while (my $t = $tp->get_token) {
    if(     ($$t[0] eq "S")
        and ($$t[1] eq "tr")
        and (exists $$t[2]->{bgcolor} and $$t[2]->{bgcolor} eq "eeeeee") )
    {
        # grab the next 14 tokens so the whole row can be checked at once
        my @t = (
            $t,             #  0 <TR BGCOLOR=eeeeee>
            $tp->get_token, #  1 <TD colspan=2>
            $tp->get_token, #  2 <font size=2>
            $tp->get_token, #  3 <A HREF="/index.pl?node_id=110171&lastnode_id=110166">
            $tp->get_token, #  4 Re: Name Space
            $tp->get_token, #  5 </A>
            $tp->get_token, #  6 <BR>
            $tp->get_token, #  7 by
            $tp->get_token, #  8 <A HREF="/index.pl?node_id=1936&lastnode_id=110166">
            $tp->get_token, #  9 japhy
            $tp->get_token, # 10 </A>
            $tp->get_token, # 11 on Sep 04, 2001 at 13:42
            $tp->get_token, # 12 </font>
            $tp->get_token, # 13 </TD>
            $tp->get_token, # 14 </tr>
        );

        if(     ($t[0][0] eq "S" and $t[0][1] eq "tr" and $t[0][2]->{'bgcolor'} eq "eeeeee")
            and ($t[1][0] eq "S" and $t[1][1] eq "td")
            and ($t[2][0] eq "S" and $t[2][1] eq "font")
            and ($t[3][0] eq "S" and $t[3][1] eq "a")    # reply link
            and ($t[4][0] eq "T")                        # reply to original node
            and ($t[5][0] eq "E" and $t[5][1] eq "a")
            and ($t[6][0] eq "S" and $t[6][1] eq "br")
            and ($t[7][0] eq "T" and $t[7][1] =~ /by/ )
            and ($t[8][0] eq "S" and $t[8][1] eq "a")    # userlink
            and ($t[9][0] eq "T" )                       # username
            and ($t[10][0] eq "E" and $t[10][1] eq "a")
            and ($t[11][0] eq "T" and $t[11][1] =~ /on \w{3} \d{2}, \d{4} at/)
            and ($t[12][0] eq "E" and $t[12][1] eq "font")
            and ($t[13][0] eq "E" and $t[13][1] eq "td")
            and ($t[14][0] eq "E" and $t[14][1] eq "tr") )
        {
            print $t[3][4], # a href (original text of the start tag)
                  $t[9][1], # monk name
                  "</A>|\n";
            $monks{$t[9][1]} = "$t[3][4]" . "$t[9][1]</A>";
        }
    }
} # endof while (my $t = $tp->get_token)

undef $rawHTML; # no more raw html
undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)

print "<H1> or sorted </H1>\n";
for my $key (sort keys %monks) {
    print $monks{$key},"|\n";
}
__END__
## one token per line
<TR BGCOLOR=eeeeee>
<TD colspan=2>
<font size=2>
<A HREF="/index.pl?node_id=110171&lastnode_id=110166">
Re: Name Space
</A>
<BR>
by
<A HREF="/index.pl?node_id=1936&lastnode_id=110166">
japhy
</A>
on Sep 04, 2001 at 13:42
</font>
</TD>
</tr>
and the output
japhy| tilly| ichimunki| runrig| demerphq| shotgunefx| Masem| synapse0| agent00013| MrNobo1024| Corion| Zaxo| idnopheq| dragonchild| herveus| wine| TheoPetersen| toadi| dga| mexnix| cadfael| buckaduck| ybiC| {NULE}| theorbtwo| Jouke| gregor42| Guildenstern| sifukurt| CubicSpline| jackdied| suaveant| poqui| mikeB| davis| s173451000| PotPieMan| mr_mischief| earthboundmisfit| kwoff| Arguile| chaoticset| BrentDax| Aighearach| basicdez| brianarn| BooK| riffraff| seanbo| Maestro_007| stefan k| dthacker| Hero Zzyzzx| beretboy| Veachian64| giulienk| blakem| Chmrr|

or sorted

Aighearach| Arguile| BooK| BrentDax| Chmrr| Corion| CubicSpline| Guildenstern| Hero Zzyzzx| Jouke| Maestro_007| Masem| MrNobo1024| PotPieMan| TheoPetersen| Veachian64| Zaxo| agent00013| basicdez| beretboy| blakem| brianarn| buckaduck| cadfael| chaoticset| davis| demerphq| dga| dragonchild| dthacker| earthboundmisfit| giulienk| gregor42| herveus| ichimunki| idnopheq| jackdied| japhy| kwoff| mexnix| mikeB| mr_mischief| poqui| riffraff| runrig| s173451000| seanbo| shotgunefx| sifukurt| stefan k| suaveant| synapse0| theorbtwo| tilly| toadi| wine| ybiC| {NULE}|
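The sorted list above puts every capitalized name ahead of the lowercase ones, because a plain sort compares by character code. If a case-insensitive order were wanted instead, a minimal sketch could look like this. The small %monks hash here is a hypothetical stand-in for the one the script builds, not the real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the %monks hash built by the script above.
my %monks = map { $_ => $_ } qw(japhy Zaxo tilly BooK demerphq Corion);

# Plain sort: character-code order, so uppercase comes before lowercase.
my @plain = sort keys %monks;

# Case-insensitive sort; ties (same name, different case) fall back to plain order.
my @folded = sort { lc($a) cmp lc($b) or $a cmp $b } keys %monks;

print join('| ', @plain),  "|\n";
print join('| ', @folded), "|\n";
```

In the original script this would only mean swapping the `sort keys %monks` in the final loop for the comparator version.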
update:

Riight, but like I said, I'm deliberately matching only replies of depth 1, which all do conform (only 2nd-level replies have the ul bug, and if I were parsing them, I'd just have the improper HTML in there regardless). I saw what you did ;D
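Since the match is a fixed 15-token pattern, the long chain of `and` tests could also be written as a table that gets checked position by position. This is only a sketch of that idea, not what the script above does. The names @spec and matches_spec are made up for illustration, and the tokens are hand-built in the array shapes HTML::TokeParser documents for get_token, so the snippet runs without fetching anything. The bgcolor test on the opening tr is assumed to stay in the outer loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each spec entry: [ type, tag (undef for text), optional regex for text ].
# Token shapes as documented for HTML::TokeParser's get_token:
#   ["S", $tag, \%attr, \@attrseq, $text], ["E", $tag, $text], ["T", $text, $is_data]
my @spec = (
    [ S => 'tr' ], [ S => 'td' ], [ S => 'font' ],
    [ S => 'a'  ], [ T => undef ], [ E => 'a' ],      # reply link
    [ S => 'br' ], [ T => undef, qr/by/ ],
    [ S => 'a'  ], [ T => undef ], [ E => 'a' ],      # user link
    [ T => undef, qr/on \w{3} \d{2}, \d{4} at/ ],
    [ E => 'font' ], [ E => 'td' ], [ E => 'tr' ],
);

# matches_spec(\@tokens): true if the window fits @spec position by position.
sub matches_spec {
    my ($tokens) = @_;
    return 0 unless @$tokens == @spec;
    for my $i (0 .. $#spec) {
        my ($type, $tag, $re) = @{ $spec[$i] };
        my $tok = $tokens->[$i];
        return 0 unless ref($tok) eq 'ARRAY' and $tok->[0] eq $type;
        return 0 if defined $tag and $tok->[1] ne $tag;    # tag name for S/E
        return 0 if defined $re  and $tok->[1] !~ $re;     # text for T
    }
    return 1;
}

# Hand-built tokens for the row shown after __END__ (an assumption standing
# in for fifteen consecutive $tp->get_token calls).
my @window = (
    [ 'S', 'tr',   { bgcolor => 'eeeeee' }, ['bgcolor'], '<TR BGCOLOR=eeeeee>' ],
    [ 'S', 'td',   { colspan => 2 },        ['colspan'], '<TD colspan=2>' ],
    [ 'S', 'font', { size => 2 },           ['size'],    '<font size=2>' ],
    [ 'S', 'a', { href => '/index.pl?node_id=110171&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=110171&lastnode_id=110166">' ],
    [ 'T', 'Re: Name Space', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'S', 'br', {}, [], '<BR>' ],
    [ 'T', ' by ', '' ],
    [ 'S', 'a', { href => '/index.pl?node_id=1936&lastnode_id=110166' },
      ['href'], '<A HREF="/index.pl?node_id=1936&lastnode_id=110166">' ],
    [ 'T', 'japhy', '' ],
    [ 'E', 'a', '</A>' ],
    [ 'T', ' on Sep 04, 2001 at 13:42', '' ],
    [ 'E', 'font', '</font>' ],
    [ 'E', 'td', '</TD>' ],
    [ 'E', 'tr', '</tr>' ],
);

if (matches_spec(\@window)) {
    print "$window[3][4]$window[9][1]</A>|\n";    # same output shape as the script
}
```

The upside is that changing the expected layout means editing one list; the downside is the same as before: any reply whose HTML deviates from the pattern is silently skipped.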

 
___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void

perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


In reply to Re: Re: (crazyinsomniac) Re: Extract info from HTML by crazyinsomniac
in thread Extract info from HTML by George_Sherston
