Yes and yes. Like I said above, it was a simple oversight on my part. Here is a "clearer" example ;D The token sequence of a depth-1 node is listed under __END__. Since my code is specific to the task of extracting depth-1 nodes (now that I have properly ensured that), I like it better than demerphq's. Don't get me wrong, I like his tree; it's more generic and probably more useful, but for this particular task it's HTML::TokeParser to the rescue.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $url ="http://perlmonks.org/index.pl?node_id=110166";

my $rawHTML = get($url); # attempt to d/l the page to mem

# get() returns undef on failure; it does not set $!
die "LWP::Simple could not fetch $url" unless defined $rawHTML;

my ($tp , %monks );
$tp = HTML::TokeParser->new(\$rawHTML)
    or die "HTML::TokeParser->new failed";

# And now -- a generic HTML::TokeParser loop

while (my $t = $tp->get_token)
{
  if(
     ($$t[0] eq "S") and
     ($$t[1] eq "tr") and
     (exists $$t[2]->{bgcolor} and $$t[2]->{bgcolor} eq "eeeeee")
    )
  {
    my @t = (
        $t,             #  0 <TR BGCOLOR=eeeeee>
        $tp->get_token, #  1 <TD colspan=2>
        $tp->get_token, #  2 <font size=2>
        $tp->get_token, #  3 <A HREF="/index.pl?node_id=110171&lastnode_id=110166">
        $tp->get_token, #  4 Re: Name Space
        $tp->get_token, #  5 </A>
        $tp->get_token, #  6 <BR>
        $tp->get_token, #  7  by 
        $tp->get_token, #  8 <A HREF="/index.pl?node_id=1936&lastnode_id=110166">
        $tp->get_token, #  9 japhy
        $tp->get_token, # 10 </A>
        $tp->get_token, # 11  on Sep 04, 2001 at 13:42
        $tp->get_token, # 12 </font>
        $tp->get_token, # 13 </TD>
        $tp->get_token, # 14 </tr>
    );


    if(
       ($t[0][0] eq "S" and $t[0][1] eq "tr"
              and $t[0][2]->{'bgcolor'} eq "eeeeee") and

       ($t[1][0] eq "S" and $t[1][1] eq "td") and
       ($t[2][0] eq "S" and $t[2][1] eq "font") and
       ($t[3][0] eq "S" and $t[3][1] eq "a") and # reply link
       ($t[4][0] eq "T") and # reply to original node
       ($t[5][0] eq "E" and $t[5][1] eq "a") and
       ($t[6][0] eq "S" and $t[6][1] eq "br") and
       ($t[7][0] eq "T" and $t[7][1] =~ /by/ ) and
       ($t[8][0] eq "S" and $t[8][1] eq "a") and # userlink
       ($t[9][0] eq "T" ) and # username
       ($t[10][0] eq "E" and $t[10][1] eq "a") and
       ($t[11][0] eq "T" and $t[11][1] =~ /on \w{3} \d{2}, \d{4} at/) and
       ($t[12][0] eq "E" and $t[12][1] eq "font") and
       ($t[13][0] eq "E" and $t[13][1] eq "td") and
       ($t[14][0] eq "E" and $t[14][1] eq "tr")
      )
    {
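       print $t[3][4], # raw text of the reply link's <A HREF=...> tag
             $t[9][1], # monk name
             "</A>|\n";

       # keyed by monk name, so a later reply by the same monk replaces an earlier one
       $monks{$t[9][1]} = "$t[3][4]$t[9][1]</A>";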
    }
  }
} # endof while (my $t = $tp->get_token)

undef $rawHTML; # no more raw html
undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)

print "<H1> or sorted </H1>\n";

for my $key (sort keys %monks)
{
    print $monks{$key},"|\n";
}


__END__
## one token per line
<TR BGCOLOR=eeeeee>
<TD colspan=2>
<font size=2>
<A HREF="/index.pl?node_id=110171&lastnode_id=110166">
Re: Name Space
</A>
<BR>
 by 
<A HREF="/index.pl?node_id=1936&lastnode_id=110166">
japhy
</A>
 on Sep 04, 2001 at 13:42
</font>
</TD>
</tr>
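
For anyone wondering why the link text comes from `$t[3][4]`, here is a minimal sketch of the token layout, assuming a shortened copy of the row above fed in as an inline string (`$html`, `@tokens` and `$start` are illustrative names, not part of the script). As documented for HTML::TokeParser, a start tag arrives as `["S", $tag, $attr, $attrseq, $text]`, an end tag as `["E", $tag, $text]`, and text as `["T", $text, $is_data]`, with tag and attribute names lowercased.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Shortened inline sample, assumed to mirror the start of the row under __END__.
my $html = '<TR BGCOLOR=eeeeee><TD colspan=2>'
         . '<A HREF="/index.pl?node_id=110171&lastnode_id=110166">'
         . 'Re: Name Space</A></TD></tr>';

my $tp = HTML::TokeParser->new(\$html) or die "could not create parser";

my @tokens;
while (my $t = $tp->get_token) {
    push @tokens, $t;
    # Index 0 is the token type; index 1 is the lowercased tag name, or the text.
    print "$$t[0]\t$$t[1]\n";
}

my $start = $tokens[2];             # the <A HREF=...> start tag
print "attr: $$start[2]{href}\n";   # attribute names are lowercased as well
print "raw:  $$start[4]\n";         # original tag text, which is what $t[3][4] prints
```

Index 4 of a start token is the tag exactly as it appeared in the source, which is why the script can print it and only needs to append the name and a closing `</A>`.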

and the output

japhy| tilly| ichimunki| runrig| demerphq| shotgunefx| Masem| synapse0| agent00013| MrNobo1024| Corion| Zaxo| idnopheq| dragonchild| herveus| wine| TheoPetersen| toadi| dga| mexnix| cadfael| buckaduck| ybiC| {NULE}| theorbtwo| Jouke| gregor42| Guildenstern| sifukurt| CubicSpline| jackdied| suaveant| poqui| mikeB| davis| s173451000| PotPieMan| mr_mischief| earthboundmisfit| kwoff| Arguile| chaoticset| BrentDax| Aighearach| basicdez| brianarn| BooK| riffraff| seanbo| Maestro_007| stefan k| dthacker| Hero Zzyzzx| beretboy| Veachian64| giulienk| blakem| Chmrr|

or sorted

Aighearach| Arguile| BooK| BrentDax| Chmrr| Corion| CubicSpline| Guildenstern| Hero Zzyzzx| Jouke| Maestro_007| Masem| MrNobo1024| PotPieMan| TheoPetersen| Veachian64| Zaxo| agent00013| basicdez| beretboy| blakem| brianarn| buckaduck| cadfael| chaoticset| davis| demerphq| dga| dragonchild| dthacker| earthboundmisfit| giulienk| gregor42| herveus| ichimunki| idnopheq| jackdied| japhy| kwoff| mexnix| mikeB| mr_mischief| poqui| riffraff| runrig| s173451000| seanbo| shotgunefx| sifukurt| stefan k| suaveant| synapse0| theorbtwo| tilly| toadi| wine| ybiC| {NULE}|
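
The sorted list is in plain string order, so capitalised names come before lowercase ones. If a case-insensitive listing is wanted, a minimal sketch follows; the `%monks` contents here are made-up stand-ins for the real hash values, and only the names are taken from the output above.

```perl
use strict;
use warnings;

# Stand-in data: keys are monk names as in %monks; values are placeholders.
my %monks = map { $_ => "<A>$_</A>" } qw(japhy tilly Zaxo BooK demerphq Corion);

# Compare lowercased copies; fall back to plain cmp so that names differing
# only in case still come out in a predictable order.
my @sorted = sort { lc($a) cmp lc($b) or $a cmp $b } keys %monks;

print "$monks{$_}|\n" for @sorted;
```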

update:

riight, but like I said, I'm deliberately matching only depth-1 replies, which all conform (only the second-level replies have the ul bug, and if I were parsing them, I'd just have the improper HTML in there regardless). I saw what you did ;D
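
Because the match is deliberately strict, the fifteen-clause condition can also be written as a table of expected tokens, which makes it easy to see that any row with a different shape is simply skipped. This is a rough sketch, not a replacement for the script: `@pattern`, `row_matches`, `@row` and `@other` are hypothetical names, and the tokens are hand-built and abbreviated to their first two fields so the snippet runs without fetching anything.

```perl
use strict;
use warnings;

# Expected shape of a depth-1 row. For tags the second field is the tag name;
# for text it is a regex to match, or undef when any text is acceptable.
my @pattern = (
    [ S => 'tr' ], [ S => 'td' ], [ S => 'font' ], [ S => 'a' ],
    [ T => undef ],                          # reply title
    [ E => 'a' ], [ S => 'br' ],
    [ T => qr/by/ ],
    [ S => 'a' ],
    [ T => undef ],                          # username
    [ E => 'a' ],
    [ T => qr/on \w{3} \d{2}, \d{4} at/ ],
    [ E => 'font' ], [ E => 'td' ], [ E => 'tr' ],
);

# True if every token has the expected type and the expected tag or text.
sub row_matches {
    my @t = @_;
    return 0 unless @t == @pattern;
    for my $i (0 .. $#pattern) {
        my ($type, $want) = @{ $pattern[$i] };
        return 0 unless defined $t[$i] and $t[$i][0] eq $type;
        if    (ref $want)     { return 0 unless $t[$i][1] =~ $want }
        elsif (defined $want) { return 0 unless $t[$i][1] eq $want }
    }
    return 1;
}

# Hand-built tokens in the [type, tag-or-text] shape, abbreviated for the sketch.
my @row = (
    [ S => 'tr' ], [ S => 'td' ], [ S => 'font' ], [ S => 'a' ],
    [ T => 'Re: Name Space' ], [ E => 'a' ], [ S => 'br' ], [ T => ' by ' ],
    [ S => 'a' ], [ T => 'japhy' ], [ E => 'a' ],
    [ T => ' on Sep 04, 2001 at 13:42' ],
    [ E => 'font' ], [ E => 'td' ], [ E => 'tr' ],
);

my @other = @row;
$other[6] = [ S => 'ul' ];   # hypothetical row with a different tag in slot 6

for my $case ([ 'depth-1 row', \@row ], [ 'altered row', \@other ]) {
    my ($label, $tokens) = @$case;
    my $verdict = row_matches(@$tokens) ? 'match' : 'skip';
    print "$label: $verdict\n";
}
```

In the real loop the fifteen tokens would come from `$t` plus fourteen `get_token` calls, exactly as in the script; only the comparison changes.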

___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

In reply to Re: Re: (crazyinsomniac) Re: Extract info from HTML by crazyinsomniac
in thread Extract info from HTML by George_Sherston
