PerlMonks  

Should I use; Html Parser, table extract, Extractor

by a_non_moose (Initiate)
on Dec 20, 2005 at 22:01 UTC ( [id://518195]=perlquestion )

a_non_moose has asked for the wisdom of the Perl Monks concerning the following question:

I've been having a hard time figuring out perl modules, and have only been trying some simple perl code after a decade of no programming at all.

So, I take the snippet of code from HTML::Parser as listed in the 3rd example, only changing title to table:
use HTML::Parser ();

sub start_handler {
    return if shift ne "table";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "table"; },
                   "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname,self");
$p->parse_file(shift || die) || die $!;
print "\n";
Now, my boss (who has a bit more practical experience with Perl) and I have been trying different things to brute force data extraction, but usually wound up with a ton of tags and other XML garbage printing out.

When I run the code above on an example saved from here, everything comes out fine, except that a lot of the paragraph tags/TR have nbsp's in them, which show up as accented A's under ActivePerl.

So far, neither of us has been able to remove/skip the nbsp's, and/or ignore them so they are not counted as part of the output.
Now the whole point as I understand is to eventually dump this data into an Oracle db, if we can get past this current bump.

And it seems that among the Parser, Extractor, TableExtract there is a bit of everything we need, but I'll be darned if I can figure out what and where it goes after 2 weeks of reading.

If anyone cares to play "Help The Idjit", many thanks.
Please add comments to the above code, if you would be so kind, and help me understand WTH is going on (i.e. talk to me like a bright 5 year old {grin}).

Replies are listed 'Best First'.
Re: Should I use; Html Parser, table extract, Extractor
by GrandFather (Saint) on Dec 20, 2005 at 22:50 UTC

    If what you are trying to do is extract the data from the table, then the following code using HTML::TreeBuilder and HTML::ElementTable may be a good starting point for you:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;
    use HTML::ElementTable;

    # Fetch the page and build an element tree from it
    my $page = get ('http://www.ovt.ncsu.edu/cotton_soy/2004/table_11.html');
    my $root = HTML::TreeBuilder->new_from_content ($page);

    # Locate the first table and wrap it so cells can be addressed by row/column
    my $theTable = $root->find ('table');
    die "Table not found" if ! defined $theTable;
    $theTable = HTML::ElementTable->new_from_tree($theTable);

    for my $row (1..$theTable->maxrow()-2) {
        for (0..$theTable->maxcol()) {
            my $cellText = $theTable->cell ($row, $_)->as_text ();
            print "$cellText ";
        }
        print "\n";
    }

    Note that $theTable->maxrow()-2 ignores the last two rows to avoid a problem with missing cells in those rows and the first row is skipped for the same reason.


    DWIM is Perl's answer to Gödel
Re: Should I use; Html Parser, table extract, Extractor
by ikegami (Patriarch) on Dec 20, 2005 at 22:28 UTC

    The &nbsp; entities are being replaced with the appropriate Unicode character. You're seeing each one as a pair of "random" characters because you're trying to view UTF-8 output as another character set.
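    As a rough illustration of that effect, here is a small sketch using the core Encode module. It assumes the decoded character is U+00A0 and that the display treats the output as Latin-1; misread_as_latin1 is a hypothetical helper name, not part of any module mentioned in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode text as UTF-8, then misinterpret the
# resulting bytes as Latin-1, the way a mismatched console would.
sub misread_as_latin1 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return decode('ISO-8859-1', $bytes);
}

# One non-breaking space (assumed to be U+00A0) goes in...
my $seen = misread_as_latin1("\x{00A0}");

# ...and two characters come out.
printf "U+%04X ", ord($_) for split //, $seen;   # prints: U+00C2 U+00A0
print "\n";
```

    The first of the two characters, U+00C2, is "Â", which would be consistent with the accented A's described in the question.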

    The fix is to find out which character it is, then replace that character with a space. Manually find the position of a non-breaking space in a string and display its character number using:
    printf("nbsp is \\x{%04X}\n", ord(substr($string, $pos, 1)));
    Then you'll know what to use instead of \x{1234} in
    s/\x{1234}/ /g;
    in order to replace non-breaking spaces with normal spaces.
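    Put together, the two steps might look like the sketch below. It assumes the character turns out to be U+00A0 (the usual code point for a non-breaking space); the sample string and the replace_nbsp helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: swap every non-breaking space (assumed U+00A0)
# for an ordinary space.
sub replace_nbsp {
    my ($text) = @_;
    $text =~ s/\x{00A0}/ /g;
    return $text;
}

# Made-up cell text containing one non-breaking space.
my $string = "12.5\x{00A0}*";

# Step 1: find where the character sits and report its code point.
my $pos = index($string, "\x{00A0}");
printf("nbsp is \\x{%04X}\n", ord(substr($string, $pos, 1)));

# Step 2: replace it with a normal space.
print replace_nbsp($string), "\n";
```

    With real data, run step 1 first and substitute whatever code point it reports into the regex.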

Re: Should I use; Html Parser, table extract, Extractor
by mojotoad (Monsignor) on Dec 21, 2005 at 22:23 UTC
    Hi there !moose,

    A couple of observations regarding two of the modules being mentioned, HTML::TableExtract and HTML::ElementTable:

    These play much better together than they used to in times past. So now you can use HTML::TableExtract to automatically return an HTML::ElementTable structure if you want, thereby bypassing the HTML::Parser code if you so desire:

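    use HTML::TableExtract qw(tree);

    my $te = HTML::TableExtract->new();
    # $html_file is assumed to hold the path to the saved page;
    # the parse step was left implicit in the original snippet
    $te->parse_file($html_file);
    my $table = $te->first_table_found();
    # $table is an HTML::ElementTable structure
    # ... maybe edit the tree structure here
    print $table->as_HTML;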

    Also, since you're fairly new to both modules, I'll point out that the normal operation of HTML::TableExtract is to return the raw text, stripped of all html. It is very similar in structure to the above code:

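    use HTML::TableExtract;

    my $te = HTML::TableExtract->new();
    # $html_file is assumed to hold the path to the saved page;
    # the parse step was left implicit in the original snippet
    $te->parse_file($html_file);
    my $table = $te->first_table_found();
    foreach my $row ($table->rows) {
        foreach my $cell (@$row) {
            # ... maybe edit text
        }
    }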

    Alternatively, you can preserve the HTML in each cell as text:

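    use HTML::TableExtract;

    my $te = HTML::TableExtract->new(keep_html => 1);
    # $html_file is assumed to hold the path to the saved page;
    # the parse step was left implicit in the original snippet
    $te->parse_file($html_file);
    my $table = $te->first_table_found();
    foreach my $row ($table->rows) {
        foreach my $cell (@$row) {
            # ... maybe edit html text
        }
    }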

    Another option for H::TE that you might find useful is the 'decode' option for when you're extracting in text mode (without keeping the html). When this is disabled (decode => 0 ... it's enabled by default) then your codes for things such as 'nbsp' are not translated into their actual character -- that might make it easier for search and replace type operations.
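    A minimal sketch of that idea follows, assuming HTML::TableExtract is installed and that with decode => 0 the literal text "&nbsp;" stays in the cell text. The helper name cells_with_plain_spaces and the sample HTML are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: extract the first table with entity decoding
# turned off, then replace the literal "&nbsp;" text with a space.
sub cells_with_plain_spaces {
    my ($html) = @_;
    my $te = HTML::TableExtract->new(decode => 0);
    $te->parse($html);
    $te->eof;
    my $table = $te->first_table_found or return;
    my @rows;
    for my $row ($table->rows) {
        my @cells = map {
            my $c = defined $_ ? $_ : '';
            $c =~ s/&nbsp;/ /g;    # plain text search and replace
            $c;
        } @$row;
        push @rows, \@cells;
    }
    return @rows;
}

# Made-up sample table.
my $html = '<table><tr><td>a&nbsp;b</td><td>&nbsp;</td></tr></table>';
print join('|', @$_), "\n" for cells_with_plain_spaces($html);
```

    The appeal of this route is that the search pattern is ordinary ASCII text, so there is no need to work out which code point the entity decodes to.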

    Cheers,
    Matt

      Thanks mojo (and gu) for the info, as I thought I was missing something (besides experience).

      Was not sure if there was more to the modules, something different with ActivePerl under XP or some other thing I was unaware of.

      Going to digest this over the Xmas holiday after hitting the books, or taking a mental break and starting fresh "next year".
Re: Should I use; Html Parser, table extract, Extractor
by a_non_moose (Initiate) on Dec 21, 2005 at 04:27 UTC
    That's exactly what I'm trying to do, GrandFather and ikegami.

    Thanks to both of you for kicking the brain cells into gear, because I suppose I'm trying to do a little bit of both.

    I recall from previous discussions with my boss that, after getting something that parses the file (which we did, just nowhere near as cleanly as HTML::Parser), the next step is to either skip or strip:
    nbsp's
    the asterisks '*'
    and stop on the line where/when it encounters either "Mean" or a series of "bold" tags.

    ikegami, I see what you're saying, but the only educated guess I had was to do something like this in GrandFather's code:
    for (0..$theTable->maxcol()) {
        my $cellText = $theTable->cell ($row, $_)->as_text ();
        # Next two lines are for searching and replacing nbsp's with regular
        # spaces or evaluate such that if $celltext ='s nbsp or * not
        # to print it
        # printf("nbsp is \\x{%04X}\n", ord(substr($string, $pos, 1)));
        # printf("nbsp is \\x{%04X}\n", ord(substr($cellText, $row, 1)));
        # s/\x{1234}/ /g;
        print "$cellText ";
    }
    The above is an educated/SWA guess, as I don't have my Perl book handy, but it'd be something like:
    if $cellText = nbsp or *, do nothing
    elseif print $cellText.

    Sound about right?
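    One way the skip/strip/stop logic described in this post could be sketched is shown below. It works on plain arrays of cell text rather than on a real table object, assumes the non-breaking space is U+00A0, and uses made-up sample rows; clean_cell and clean_rows are hypothetical helper names.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy the text of one cell.
sub clean_cell {
    my ($text) = @_;
    $text =~ s/\x{00A0}/ /g;    # non-breaking space (assumed U+00A0) -> space
    $text =~ tr/*//d;           # drop asterisks
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

# Hypothetical helper: clean each row, dropping cells that end up empty,
# and stop at the first row whose first cell starts with "Mean".
sub clean_rows {
    my (@rows) = @_;
    my @out;
    for my $row (@rows) {
        my @cells = map { clean_cell($_) } @$row;
        last if @cells && $cells[0] =~ /^Mean\b/;
        push @out, [ grep { length } @cells ];
    }
    return @out;
}

# Made-up rows standing in for the extracted table.
my @rows = (
    [ "Variety A", "12.5\x{00A0}*", "\x{00A0}" ],
    [ "Variety B", "11.0",          "*"        ],
    [ "Mean",      "11.75",         ""         ],
);
print join("\t", @$_), "\n" for clean_rows(@rows);
```

    Cleaning the text first and then deciding whether to print tends to be simpler than testing each cell for "is it an nbsp or a *", since a cell can contain both a value and one of the unwanted characters.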

    Gotta go to bed, but many thanks for helping the noob.
Re: Should I use; Html Parser, table extract, Extractor
by gu (Beadle) on Dec 21, 2005 at 08:21 UTC

Approved by Old_Gray_Bear