
Extracting from HTML tables

by Cody Pendant (Prior)
on Apr 17, 2006 at 08:15 UTC ( [id://543776]=perlquestion )

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm using HTML::TableExtract to scrape a website.

The module either works in html mode or it doesn't, as set by the keep_html option in the constructor.

Trouble is, I want to get at some columns as text and others as HTML (to get at some productIDs in URLs).

The workaround goes like:

  • make the first TableExtract object with keep_html off, go through table rows creating an AoH with the text values.
  • make a second TableExtract object with keep_html on, go through table rows again updating the AoH with the values from HTML.
Obviously I'd be in big trouble if the two parsers didn't find the same data, but that's not a problem.
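For readers following along, the two-pass workaround described above might look roughly like the sketch below. The sample table, the header names, the `productID=` URL pattern and the `table_rows` helper are all assumptions made up for illustration; only `new`, `headers`, `keep_html`, `parse`, `eof`, `first_table_found` and `rows` come from HTML::TableExtract and its HTML::Parser base class.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample table; the real page's headers and URLs will differ.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td><b>Widget</b></td><td><a href="/item?productID=42">view</a></td></tr>
  <tr><td>Gadget</td><td><a href="/item?productID=77">view</a></td></tr>
</table>
HTML

my @headers = ('Name', 'Link');

# Hypothetical helper: parse the sample with the given constructor
# options and return the data rows of the first matching table.
sub table_rows {
    my (%opts) = @_;
    my $te = HTML::TableExtract->new(headers => \@headers, %opts);
    $te->parse($html);
    $te->eof;
    my $table = $te->first_table_found or die "no matching table\n";
    return $table->rows;
}

# Pass 1: keep_html off, build the AoH from the text values.
my @records = map { +{ name => $_->[0] } } table_rows();

# Pass 2: keep_html on, update the AoH with values dug out of the markup.
my @html_rows = table_rows(keep_html => 1);
die "the two passes found different row counts\n"
    unless @html_rows == @records;

for my $i (0 .. $#records) {
    my ($id) = ($html_rows[$i][1] // '') =~ /productID=(\d+)/;
    $records[$i]{product_id} = $id;
}

printf "%s => %s\n", $_->{name}, $_->{product_id} // 'n/a' for @records;
```

The row-count check is the only guard against the two parsers disagreeing, which is the risk mentioned above.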

Is there a smarter way to do this or another table module which would help?

TIA



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Extracting from HTML tables
by tempest (Sexton) on Apr 17, 2006 at 09:18 UTC
    you can use HTML::TagFilter to get rid of unwanted markup once you get it... best i can think of.
Re: Extracting from HTML tables
by mojotoad (Monsignor) on Apr 17, 2006 at 21:47 UTC
    Hi Cody,

    If you extract in 'tree' mode then the returned structure is actually a full-fledged HTML::ElementTable object. Example usage similar to what you seem to want:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # load in 'tree' mode for working with
    # HTML::Element structures. note that in
    # this case, subtables are *not* decoupled
    # from one another.
    use HTML::TableExtract 'tree';

    my $te = HTML::TableExtract->new(
        # extraction parameters here...note that
        # in tree mode, keep_html is irrelevant
    );
    $te->parse_file("./myfile.html");
    my $t = $te->first_table_found or die "oops, no tables.\n";

    # at this point we can work with $t->rows and the
    # cells within, but rather than text or html, the
    # content is now individual element objects/trees

    # for html...
    print "H::TE as html:\n";
    foreach my $row ($t->rows) {
        print join(':', map { $_->as_HTML } @$row), "\n";
    }

    # for text...
    print "H::TE as text:\n";
    foreach my $row ($t->rows) {
        print join(':', map { $_->as_text } @$row), "\n";
    }

    # Alternatively, you could switch entirely over
    # to the HTML::ElementTable structure
    my $et = $t->tree;

    # as html
    print "H::ET as html:\n";
    print $et->as_HTML, "\n";

    # as text
    print "H::ET as text:\n";
    print $et->as_text, "\n";

    Cheers,
    Matt
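Applied to the original question (text from one column, a product ID from a link in another), the tree-mode approach might be sketched as follows. This is an illustration under assumptions, not something from the thread: the sample table, the two-column layout and the `productID=` pattern are invented, and `look_down` and `attr` are the ordinary HTML::Element methods available on each cell.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract 'tree';

# Hypothetical sample table; the real page's layout will differ.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td><b>Widget</b></td><td><a href="/item?productID=42">view</a></td></tr>
  <tr><td>Gadget</td><td><a href="/item?productID=77">view</a></td></tr>
</table>
HTML

my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;
my $table = $te->first_table_found or die "no tables found\n";

my @records;
foreach my $row ($table->rows) {
    # Assumed layout: column 0 holds the name, column 1 holds the link.
    my ($name_cell, $link_cell) = @$row;
    next unless ref $name_cell && ref $link_cell;

    # Rows without a link (such as the header row here) are skipped.
    my $anchor = $link_cell->look_down(_tag => 'a') or next;
    my ($id) = ($anchor->attr('href') // '') =~ /productID=(\d+)/;

    # Text from one cell, a value from the markup of another, in one pass.
    push @records, { name => $name_cell->as_text, product_id => $id };
}

printf "%s => %s\n", $_->{name}, $_->{product_id} // 'n/a' for @records;
```

Because each cell is an element object, the choice between text and markup is made per cell rather than per parser, so a single pass over the table is enough.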

      Very sensible, thanks a lot. That's much more efficient than going over two copies of the same data with two different agents.


      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print
