
Weeel, I started writing code for this about two hours ago, but didn't get much done until I actually started thinking about the problem some 15 minutes ago (when I put down the peanuts and exhausted me votes for the day), and this is what I came up with:

First, I needed to pick a module to use, and HTML::TokeParser sat really well with me. The initial problem was figuring out which bits of the HTML were the ones I wanted, so I did what I always do when diagnosing such a problem: I dumped the entire document token by token, in this case with:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $url ="http://perlmonks.org/index.pl?node_id=110166";

my $rawHTML = get($url); # attempt to d/l the page to mem

# get() returns undef on failure and does not set $!
die "LWP::Simple couldn't fetch $url\n" unless defined $rawHTML;

my $tp = HTML::TokeParser->new(\$rawHTML)
    or die "Couldn't create the HTML::TokeParser object\n";

# And now -- a generic HTML::TokeParser loop

while (my $token = $tp->get_token)
{
    my $ttype = shift @{ $token };
    print "TYPE : $ttype\n####\n";
    printf( join( '',
                  map { "$_:%s\n####\n" } 1..@{$token}
                 )
            ,
            @{$token}
          );
    print "####################################################\n\n";
}
__END__
Which produces something like:
TYPE : D
####
1:<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>
####
####################################################

TYPE : T
####
1: 

####
2:
####
####################################################

TYPE : C
####
1:<!--took this out for IE6ites  "http://www.w3.org/TR/REC-html40/loose.dtd"-->
####
####################################################

TYPE : T
####
1:

####
2:
####
####################################################

TYPE : S
####
1:html
####
2:HASH(0x1afeee0)
####
3:ARRAY(0x1afeef8)
####
4:<HTML>
####
####################################################

TYPE : T
####
1:

####
2:
####
####################################################

Then, after "visualizing" what criteria I could use to pick out the stuff I need (noted after __END__), I crafted me while loop like so:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $url ="http://perlmonks.org/index.pl?node_id=110166";

my $rawHTML = get($url); # attempt to d/l the page to mem

# get() returns undef on failure and does not set $!
die "LWP::Simple couldn't fetch $url\n" unless defined $rawHTML;

my $tp = HTML::TokeParser->new(\$rawHTML)
    or die "Couldn't create the HTML::TokeParser object\n";

# And now -- a generic HTML::TokeParser loop

while (my $token = $tp->get_token)
{
    my $ttype = shift @{ $token };

    if($ttype eq "S" and $token->[0] eq "br")
    {
        my ( @t ) = ( undef, #$tp->get_token, #S  0
                      $tp->get_token, #T  1
                      $tp->get_token, #S  2
                      $tp->get_token, #T  3
                      $tp->get_token, #E  4
                      $tp->get_token, #T  5
                     );

        if( # ($t[0][0] eq "S" and $t[0][1] eq "br") and
            ($t[1][0] eq "T" and $t[1][1] =~ /by/) and
            ($t[2][0] eq "S" and $t[2][1] eq "a") and
            ($t[3][0] eq "T" ) and
            ($t[4][0] eq "E" and $t[4][1] eq "a") and
            ($t[5][0] eq "T" and $t[5][1] =~ /on \w{3} \d{2}, \d{4} at/)
          )
        {
            print $t[2][4], $t[3][1], $t[4][2], " | ";
        }
    }

} # endof while (my $token = $tp->get_token)

undef $rawHTML; # no more raw html
undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)


__END__

######### WITH ADDED NEWLINES FOR READABILITY AT ><

<TR BGCOLOR=eeeeee><TD colspan=2>
<UL>
<font size=2>
<A HREF="/index.pl?node_id=110247&lastnode_id=110166">
Re: Re: Name Space
</A>
<BR>
 by 
<A HREF="/index.pl?node_id=85506&lastnode_id=110166">
Hofmator
</A>
 on Sep 05, 2001 at 02:27
</UL>
</font></TD></tr>


########## BROKEN DOWN BY TOKEN


TYPE : S
####
1:br
####
2:HASH(0x1af8128)
####
3:ARRAY(0x1afeeec)
####
4:<BR>
####
####################################################

TYPE : T
####
1: by 
####
2:
####
####################################################

TYPE : S
####
1:a
####
2:HASH(0x1ab4384)
####
3:ARRAY(0x1ab6324)
####
4:<A HREF="/index.pl?node_id=85506&lastnode_id=110166">
####
####################################################

TYPE : T
####
1:Hofmator
####
2:
####
####################################################

TYPE : E
####
1:a
####
2:</A>
####
####################################################

TYPE : T
####
1: on Sep 05, 2001 at 02:27
####
2:
####
####################################################

Which produced the following list:

japhy | Hofmator | tilly | davorg | scain | ichimunki | runrig | demerphq | merphq | shotgunefx | Masem | cLive ;-) | synapse0 | lo_tech | agent00013 | MrNobo1024 | Corion | demerphq | lo_tech | George_Sherston | Hofmator | Zaxo | idnopheq | dragonchild | herveus | wine | TheoPetersen | toadi | dga | mexnix | ybiC | {NULE} | theorbtwo | George_Sherston | Jouke | George_Sherston | tye | gregor42 | Guildenstern | sifukurt | CubicSpline | scain | zakzebrowski | jackdied | suaveant | poqui | mikeB | davis | s173451000 | blakem | George_Sherston | PotPieMan | mr_mischief | Zecho | earthboundmisfit | kwoff | Arguile | chaoticset | BrentDax | Aighearach | basicdez | brianarn | George_Sherston | BooK | riffraff | seanbo | Maestro_007 | stefan k | dthacker | Hero Zzyzzx | beretboy | Veachian64 | giulienk | blakem | George_Sherston |

The lesson here is: thank god vroom has a consistent format, which made it relatively easy for me to decide what I want (and thank god HTML::TokeParser includes the raw HTML in each token, so I don't have to do much recreating, just repiecing ;D).

Is it elegant? I don't care, it makes sense to me (in practice and in theory).

update: oh yeah, it's not sorted, because I don't actually "collect" the URLs (users/userids) I want; as you can see, I just print them out.
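
If you did want them sorted, here is a minimal sketch of collecting instead of printing. It assumes the names have already been pulled out of the tokens: the @found list is hypothetical sample data borrowed from the output above, and unique_sorted is just a name I made up for the helper, not something from the script.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample: names as the loop would have found them,
# duplicates included.
my @found = qw(japhy Hofmator tilly davorg Hofmator blakem tilly);

# Return each name once, sorted case-insensitively.
# Must be called in list context.
sub unique_sorted {
    my %seen;
    return sort { lc($a) cmp lc($b) } grep { !$seen{$_}++ } @_;
}

my @users = unique_sorted(@found);
print join(" | ", @users), "\n";
```

In the real script you would push onto @found inside the if block instead of printing, then sort once after the while loop is done.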

This may help (a token can look like):

  ["S",  $tag, $attr, $attrseq, $text]
  ["E",  $tag, $text]
  ["T",  $text, $is_data]
  ["C",  $text]
  ["D",  $text]
  ["PI", $token0, $text]
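
Since the raw HTML sits at a different index depending on the token type, a tiny helper can save some counting. This is only a sketch working from the layouts listed above: raw_text is a name I made up (it is not an HTML::TokeParser method), the tokens are hand-built to mirror the dump earlier in this post, and it expects a token as get_token returns it, with the type still at index 0 (my loops above shift the type off first).

```perl
#!/usr/bin/perl -w
use strict;

# Return the original HTML for one unshifted token.
# For "T" tokens the text is element 1 (the last element is the
# $is_data flag); for every other type it is the last element.
sub raw_text {
    my ($token) = @_;
    return $token->[0] eq 'T' ? $token->[1] : $token->[-1];
}

# Hand-built tokens (hypothetical data) in the layouts listed above.
my @tokens = (
    [ 'S', 'a', { href => '/index.pl?node_id=85506' }, ['href'],
      '<A HREF="/index.pl?node_id=85506">' ],
    [ 'T', 'Hofmator', '' ],
    [ 'E', 'a', '</A>' ],
);

# Repiece the link from its three tokens.
print join('', map { raw_text($_) } @tokens), "\n";
```

With something like that, the print in my if block would not need the hard-coded [4], [1] and [2] indexes.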

update: oh, point taken, that's just a simple oversight on my part. All I'd have to do is add a couple more tokens... later ;D

___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

In reply to (crazyinsomniac) Re: Extract info from HTML by crazyinsomniac
in thread Extract info from HTML by George_Sherston
