PerlMonks  

Using Web::Scraper to extract content from an HTML page

by SiteScraper (Initiate)
on Apr 03, 2017 at 22:32 UTC ( [id://1186917]=perlquestion )

SiteScraper has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to extract some data from an HTML page using the Web::Scraper module. The HTML looks as shown below:
    <table class="dextable" align="center">
    <tr>
      <td class="fooevo">ID No.</td>
      <td class="fooevo">Picture</td>
      <td class="fooevo">Pok&eacute;mon Name</td>
      <td class="fooevo">Rarity</td>
      <td class="fooevo">Movement</td>
      <td class="fooevo">Material Cost</td>
    </tr>
    <tr>
      <td class="cen">ID - 26</td>
      <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
      <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
      <td class="cen"><img src="/duel/c.png" /> C</td>
      <td class="cen">3</td>
      <td class="fooinfo"><img src="/duel/material.png" />250</td>
    </tr>
I need to extract the URL in the href attribute of the 2nd td element as well as the text of the title attribute of the same (2nd) td element. The code I have written for this is as shown below:
    #!/usr/bin/perl -w
    use URI;
    use Web::Scraper;
    use Encode;

    # First, create your scraper block
    my $p1 = scraper {
        process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
            # And, in each td,
            # get the URI of "a" element
            process_first "a", uri => '@href';
            # get text inside "u" element
            process_first "a", name => '@title';
        };
    };

    my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

    for my $p (@{$res->{list}}) {
        print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
    }
The code shown above does not work. It prints the URL correctly but not the name. Also, it seems to be picking up other td elements that don't have a nested <a> element in them and for those td elements, it again displays an error. So, to summarize, I guess I'm looking for answers to two questions:
  • How do I get the Web::Scraper module to extract the name attribute?
  • How do I get the Web::Scraper module to ignore those td elements without a nested <a> element in them?
Thank you in advance.

Replies are listed 'Best First'.
Re: Using Web::Scraper to extract content from an HTML page
by tangent (Parson) on Apr 04, 2017 at 01:02 UTC
    As beech points out, the 'title' attribute is on the 'img' tag, not the 'a' tag, so you need to account for that. Also, process_first would only help if there were multiple tags within the cell itself, not within the row. But you can skip the empty ones while looping through the results:
        my $p1 = scraper {
            process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
                process "a", uri => '@href';
                process "img", name => '@title';
            };
        };

        my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

        for my $p (@{$res->{list}}) {
            next unless ($p->{name} and $p->{uri});
            print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
        }
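As an alternative to skipping empty entries in the loop, the filtering can be pushed into the selector itself. The following is a minimal sketch, not part of the original reply. It assumes Web::Scraper's XPath selector form (an expression starting with `/`) and that the underlying XPath engine honors the `[a]` child-existence predicate. The inline `$html` sample and the names `$figures` and `$fig` are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Cut-down copy of the data row from the question (assumed representative).
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
  <td class="cen">ID - 26</td>
  <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
  <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
  <td class="cen"><img src="/duel/c.png" /> C</td>
  <td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # XPath instead of a CSS selector: the trailing [a] keeps only those
    # td.cen cells that have an <a> child, so no empty entries are produced.
    process '//table[@class="dextable"]//td[@class="cen"][a]', 'list[]' => scraper {
        process_first 'a',   uri  => '@href';
        process_first 'img', name => '@title';
    };
};

# Scrape the inline string rather than the live URL.
my $res = $figures->scrape($html);

for my $fig (@{ $res->{list} || [] }) {
    print "$fig->{name}\t$fig->{uri}\n";
}
```

With this selector the `next unless ...` guard in the loop should no longer be needed, although keeping it is harmless if the page layout changes.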
Re: Using Web::Scraper to extract content from an HTML page
by beech (Parson) on Apr 03, 2017 at 22:58 UTC

    Hi

    The key to figuring out matching problems like this is to include in your program a cut-down sample of the HTML, about 20 lines long

    On the url you scrape, in the html, I see nothing that would match  a[@title] , there are no a tags/elements with a title= attribute

      Thank you, beech, for the quick response. I request you to review my original post one more time. I have actually included a snippet of the HTML that I'm trying to match against. That snippet is actually from the URL I am scraping. I got it by doing a 'View Source' on the page. Were you looking for something different?

        Hi,

        It slipped by me I guess :) I looked at the website, and your html matches, so no I wasn't looking for something different

        To clarify, try

        process_first "img", name => '@title';

        a tags cannot be nested
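The cut-down-sample approach suggested earlier in this subthread can be sketched as follows. This is an illustrative example, not code from the thread. It scrapes an inline copy of the HTML instead of the live page and asks for `@title` on both the `a` and the `img` element, so the difference shows up side by side. The hash keys `a_title` and `img_title` and the variable names are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Small inline sample taken from the HTML in the question.
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
  <td class="cen">ID - 26</td>
  <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
  <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
</tr>
</table>
HTML

my $cells = scraper {
    process 'table[class="dextable"] td[class="cen"]', 'list[]' => scraper {
        process_first 'a',   uri       => '@href';
        process_first 'a',   a_title   => '@title';   # the original attempt
        process_first 'img', img_title => '@title';   # the suggested fix
    };
};

# scrape() also accepts an HTML string, so no network access is needed.
my $res = $cells->scrape($html);

for my $cell (@{ $res->{list} || [] }) {
    next unless defined $cell->{uri};   # cells without an <a> yield nothing useful
    printf "uri=%s  a-title=%s  img-title=%s\n",
        $cell->{uri},
        $cell->{a_title}   // '(undef)',
        $cell->{img_title} // '(undef)';
}
```

Running a sample like this locally makes it quick to see which selector and attribute pair actually matches before pointing the scraper at the real URL.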

Re: Using Web::Scraper to extract content from an HTML page
by SiteScraper (Initiate) on Apr 04, 2017 at 21:17 UTC
    Thank you, beech and tangent! I'm embarrassed that I missed noticing that the "title" attribute belonged to the img tag and not the a tag. After I made the recommended changes, the script works like a charm.
