PerlMonks  

Best module to scrape tabular data from web pages?

by punch_card_don (Curate)
on Mar 10, 2006 at 15:04 UTC ( [id://535718]=perlquestion )

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mango Monks,

I'm looking for a good, simple module that is quick to get up and running, to scrape some data from a table in a web page.

The URL is plain, with no authentication.

The table looks like:

City 1    Cloudy                -5°C
City 2    Cloudy                -10°C
City 3    Light Snow            1°C
City 4    Fog Depositing Ice    -11°C

<table width="100%" border=1 cellspacing="1" cellpadding="1">
<TR valign="top" BGCOLOR=#FFFFFF>
<td align="top"><a href='/forecast/city.html?1'>City 1</a></td><td nowrap align="top">Cloudy</td><td nowrap align="right">-5&deg;C</td></tr>
<TR valign="top" BGCOLOR=#EEF5EE>
<td align="top"><a href='/forecast/city.html?2'>City 2</a></td><td nowrap align="top">Cloudy</td><td nowrap align="right">-10&deg;C</td></tr>
<TR valign="top" BGCOLOR=#FFFFFF>
<td align="top"><a href='/forecast/city.html?3'>City 3</a></td><td nowrap align="top">Light Snow</td><td nowrap align="right">1&deg;C</td></tr>
<TR valign="top" BGCOLOR=#EEF5EE>
<td align="top"><a href='/forecast/city.html?4'>City 4</a></td><td nowrap align="top">Fog Depositing Ice</td><td nowrap align="right">-11&deg;C</td></tr>
</table>

I want to scrape the city name, conditions, and temperature from each row. I can count on the columns always being in the same order.

It's not hard to write a custom parser, but if there's a module out there ideally suited to this kind of thing, that'd be preferable.

Thanks.





Forget that fear of gravity,
Get a little savagery in your life.

Replies are listed 'Best First'.
Re: Best module to scrape tabular data from web pages?
by kwaping (Priest) on Mar 10, 2006 at 15:49 UTC
Re: Best module to scrape tabular data from web pages?
by ptum (Priest) on Mar 10, 2006 at 15:24 UTC

    I've used HTML::TreeBuilder and WWW::Mechanize for such purposes before. But there may be something better ... I make no guarantees. :)
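
For readers who want to see what the HTML::TreeBuilder route might look like, here is a minimal sketch. It is not ptum's code: the `table_rows` helper and the trimmed-down sample markup are assumptions made for illustration, and it parses a string rather than fetching a live page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modelled on the table in the question (attributes trimmed).
my $html = <<'HTML';
<table>
<tr><td><a href='/forecast/city.html?1'>City 1</a></td><td>Cloudy</td><td>-5&deg;C</td></tr>
<tr><td><a href='/forecast/city.html?2'>City 2</a></td><td>Cloudy</td><td>-10&deg;C</td></tr>
</table>
HTML

# Hypothetical helper: returns one array ref of trimmed cell texts per <tr>.
sub table_rows {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @rows;
    for my $tr ($tree->look_down(_tag => 'tr')) {
        # as_trimmed_text drops nested tags such as the <a> around the city name
        my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
        push @rows, \@cells if @cells;
    }
    $tree->delete;    # release the parse tree explicitly
    return @rows;
}

my @rows = table_rows($html);
print join(' | ', @$_), "\n" for @rows;
```

With WWW::Mechanize the string would presumably come from `$mech->get($url)` followed by `$mech->content`. On a real page with several tables, the `look_down` call would need to start from the right `<table>` element rather than from the whole tree.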


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: Best module to scrape tabular data from web pages?
by mojotoad (Monsignor) on Mar 10, 2006 at 17:25 UTC
    For your future projects, do consider HTML::TableExtract.

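    use strict;
    use warnings;
    use HTML::TableExtract;

    my $te = HTML::TableExtract->new;

    # Slurp the HTML from STDIN or from files named on the command line.
    $te->parse(join('', <>));

    # Print each row of the first table found, cells separated by colons.
    foreach my $row ($te->first_table_found->rows) {
        print join(':', @$row), "\n";
    }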

    In reality, given the entire HTML document, you'd probably need to specify a depth/count in the constructor for H::TE.
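
As a rough illustration of the depth/count point, the sketch below picks out one table from a page that contains more than one. The surrounding page, the `weather_rows` helper, and the particular depth/count values are assumptions; on the real page they would have to be found by inspection.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page: a navigation table followed by the weather table.
my $html = <<'HTML';
<html><body>
<table><tr><td>Site navigation</td></tr></table>
<table>
<tr><td>City 1</td><td>Cloudy</td><td>-5&deg;C</td></tr>
<tr><td>City 2</td><td>Cloudy</td><td>-10&deg;C</td></tr>
</table>
</body></html>
HTML

# depth => 0 means top-level (un-nested) tables; count => 1 means the
# second table at that depth. Both are zero-based.
sub weather_rows {
    my ($content) = @_;
    my $te = HTML::TableExtract->new(depth => 0, count => 1);
    $te->parse($content);
    my @rows;
    for my $table ($te->tables) {
        push @rows, [@$_] for $table->rows;
    }
    return @rows;
}

print join(':', @$_), "\n" for weather_rows($html);
```

If the table had a header row, the `headers` constructor option would be another way to select it without counting tables.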

    Cheers,
    Matt

Re: Best module to scrape tabular data from web pages?
by Mutant (Priest) on Mar 10, 2006 at 15:46 UTC
    I've always found HTML::TokeParser::Simple to be the best HTML parser, although I haven't used it for screen scraping per se.
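
A token-based approach with HTML::TokeParser::Simple could look something like the following. This is only a sketch under assumptions: the `table_rows` helper and the sample markup are invented for illustration, and it treats every `<td>` in the document as belonging to the table of interest.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use HTML::Entities qw(decode_entities);

# Sample markup modelled on the table in the question (attributes trimmed).
my $html = <<'HTML';
<table>
<tr><td><a href='/forecast/city.html?1'>City 1</a></td><td>Cloudy</td><td>-5&deg;C</td></tr>
<tr><td><a href='/forecast/city.html?2'>City 2</a></td><td>Cloudy</td><td>-10&deg;C</td></tr>
</table>
HTML

# Hypothetical helper: walks the token stream and collects the text of each cell.
sub table_rows {
    my ($content) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$content);
    my (@rows, @cells);
    my $in_cell = 0;
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('td')) {
            $in_cell = 1;
            push @cells, '';               # start a new, empty cell
        }
        elsif ($token->is_end_tag('td')) {
            $in_cell = 0;
        }
        elsif ($token->is_end_tag('tr')) {
            push @rows, [@cells] if @cells;
            @cells = ();
        }
        elsif ($in_cell && $token->is_text) {
            my $text = $token->as_is;      # raw text, entities not yet decoded
            $cells[-1] .= decode_entities($text);
        }
    }
    return @rows;
}

print join(' | ', @$_), "\n" for table_rows($html);
```

Because the parser only reports tokens, tracking which cell and row you are in is left to the caller, which is the main difference from the tree-based and table-specific modules mentioned elsewhere in the thread.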
Re: Best module to scrape tabular data from web pages?
by punch_card_don (Curate) on Mar 10, 2006 at 17:06 UTC
    My own answer: Thanks for the suggestions. I looked them over, tried implementing a bit, then wondered if writing something might not be faster. In less time than I had already spent investigating the modules, I wrote a 10-line parser.

    Sometimes it's faster to do custom than learn a new module. Ya, I know, I didn't learn anything new - but the job will be done and the client will be happy.

    But they do look like highly useful modules for the not too distant future - like when this project gets more complicated and my 10-liner breaks!
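
The 10-line parser itself was not posted, so the following is purely a hypothetical reconstruction of what a hand-rolled, regex-based version might look like, using core Perl only. The `scrape_rows` name, the patterns, and the handling of `&deg;` are all assumptions.

```perl
use strict;
use warnings;

# Two rows copied from the table in the question.
my $html = <<'HTML';
<table width="100%" border=1 cellspacing="1" cellpadding="1">
<TR valign="top" BGCOLOR=#FFFFFF>
<td align="top"><a href='/forecast/city.html?1'>City 1</a></td><td nowrap align="top">Cloudy</td><td nowrap align="right">-5&deg;C</td></tr>
<TR valign="top" BGCOLOR=#EEF5EE>
<td align="top"><a href='/forecast/city.html?2'>City 2</a></td><td nowrap align="top">Cloudy</td><td nowrap align="right">-10&deg;C</td></tr>
</table>
HTML

# Hypothetical helper: returns [city, conditions, temperature] for each row.
sub scrape_rows {
    my ($content) = @_;
    my @rows;
    # One match per <tr>...</tr>; /i handles <TR>, /s lets . span newlines.
    while ($content =~ m{<tr\b[^>]*>(.*?)</tr\s*>}gis) {
        my $row = $1;
        my @cells;
        while ($row =~ m{<td\b[^>]*>(.*?)</td\s*>}gis) {
            my $cell = $1;
            $cell =~ s/<[^>]+>//g;        # strip nested tags such as <a>
            $cell =~ s/&deg;/\x{B0}/g;    # the only entity in the sample
            $cell =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
            push @cells, $cell;
        }
        push @rows, \@cells if @cells == 3;   # keep only three-column rows
    }
    return @rows;
}

print join(' | ', @$_), "\n" for scrape_rows($html);
```

A sketch like this works only as long as the markup stays this regular. A nested table, a missing `</td>`, an attribute value containing `>`, or any entity other than `&deg;` would break it, which is the risk acknowledged above.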





    Forget that fear of gravity,
    Get a little savagery in your life.
      like when ... my 10-liner breaks!

      As many here will tell you, that is almost inevitable when trying to parse HTML on your own. :)
      ---
      It's all fine and dandy until someone has to look at the code.
Re: Best module to scrape tabular data from web pages?
by xern (Beadle) on Mar 11, 2006 at 23:17 UTC
    You might consider using FEAR::API. The code would look like this:

    use FEAR::API -base;
    file("your_source_file")
      ->document
      ->html_to_xhtml
      ->xpath("/html/body/table/tr");
    doc_filter(use => 'remove_attributes');
    print doc->as_string;
    template(qq!<tr>\n<td><a>[% city %]</a></td>\n<td>[% cond %]</td>\n<td>[% temp %]</td>\n</tr>!);
    extract;
    print Dumper extresult;
Re: Best module to scrape tabular data from web pages?
by zentara (Archbishop) on Mar 10, 2006 at 17:59 UTC
