Using Web::Scraper to extract content from an HTML page

SiteScraper has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to extract some data from an HTML page using the Web::Scraper module. The HTML looks as shown below:

<table class="dextable" align="center">
<tr>
    <td class="fooevo">ID No.</td>
    <td class="fooevo">Picture</td>
    <td class="fooevo">Pok&eacute;mon Name</td>
    <td class="fooevo">Rarity</td>
    <td class="fooevo">Movement</td>
    <td class="fooevo">Material Cost</td>
</tr>
<tr>
    <td class="cen">ID - 26</td>
    <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
    <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
    <td class="cen"><img src="/duel/c.png" /> C</td>
    <td class="cen">3</td>
    <td class="fooinfo"><img src="/duel/material.png" />250</td>
</tr>

I need to extract the URL in the href attribute of the 2nd td element as well as the text of the title attribute of the same (2nd) td element. The code I have written for this is as shown below:

#!/usr/bin/perl -w
use URI;
use Web::Scraper;
use Encode;

# First, create your scraper block
my $p1 = scraper {
    process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
      # And, in each td,
      # get the URI of "a" element 
      process_first "a", uri => '@href';
      # get text inside "u" element
      process_first "a", name => '@title';
    };
};

my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

for my $p (@{$res->{list}}) {
    print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
}

The code shown above does not work. It prints the URL correctly but not the name. Also, it seems to be picking up other td elements that don't have a nested <a> element in them and for those td elements, it again displays an error. So, to summarize, I guess I'm looking for answers to two questions:

How do I get the Web::Scraper module to extract the name attribute?
How do I get the Web::Scraper module to ignore those td elements without a nested <a> element in them?

Thank you in advance.


Replies are listed 'Best First'.
Re: Using Web::Scraper to extract content from an HTML page by tangent (Parson) on Apr 04, 2017 at 01:02 UTC
As beech points out, the 'title' attribute is on the 'img' tag, not the 'a' tag, so you need to account for that. Also, `process_first` would only work if there were multiple tags within the cell itself, not within the row. But you can skip the empty ones while looping through the results:

my $p1 = scraper {
    process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
        process "a", uri => '@href';
        process "img", name => '@title';
    };
};

my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

for my $p (@{$res->{list}}) {
    next unless ($p->{name} and $p->{uri});
    print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
}
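As an alternative to skipping empty results in the loop, the selector itself can be narrowed so that only cells containing a link are matched. The sketch below is not from the thread. It assumes Web::Scraper's support for XPath expressions (a selector beginning with `/` is treated as XPath rather than CSS), and it runs against a cut-down copy of the HTML from the original post instead of the live site. The variable names `$html` and `$figures` are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Cut-down sample based on the HTML in the original post (assumption:
# the live page has the same structure for every figure row).
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
    <td class="cen">ID - 26</td>
    <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
    <td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # The trailing [a] predicate keeps only td.cen cells that have
    # an <a> child, so cells like "ID - 26" and "3" never match.
    process '//table[@class="dextable"]//td[@class="cen"][a]', 'list[]' => scraper {
        process_first 'a',   uri  => '@href';   # link target
        process_first 'img', name => '@title';  # title lives on the img
    };
};

# scrape() also accepts an HTML string, so no network access is needed here.
my $res = $figures->scrape($html);

for my $fig (@{ $res->{list} || [] }) {
    print "$fig->{name}\t$fig->{uri}\n";
}
```

With this selector the `next unless` guard in the loop should no longer be needed, though keeping it is harmless if the page layout is not fully known.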
Re: Using Web::Scraper to extract content from an HTML page by beech (Parson) on Apr 03, 2017 at 22:58 UTC
Hi. The key to figuring out matching problems like this is to include a cut-down, roughly 20-line sample of the HTML in your program. On the URL you scrape, I see nothing in the HTML that would match `a[@title]`: there are no a tags/elements with a title= attribute.
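The debugging approach described above can be sketched as follows. This is an illustration rather than code from the thread: it feeds a cut-down sample (the one linked cell from the original post) to Web::Scraper as a string and asks for the title attribute on both the a and the img element. The key names `a_title` and `img_title` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Cut-down sample: the single cell from the original post that holds the link.
my $html = <<'HTML';
<table class="dextable">
<tr>
    <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
</tr>
</table>
HTML

# Probe both candidate elements for a title attribute.
my $probe = scraper {
    process_first 'td.cen a',   a_title   => '@title';
    process_first 'td.cen img', img_title => '@title';
};

# Passing the HTML as a string keeps the test independent of the live site.
my $res = $probe->scrape($html);

for my $key (qw(a_title img_title)) {
    my $value = defined $res->{$key} ? $res->{$key} : '(undef)';
    print "$key: $value\n";
}
```

Run against this sample, the a element should yield no title while the img element yields one, which points at the selector rather than at Web::Scraper.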
Re^2: Using Web::Scraper to extract content from an HTML page by SiteScraper (Initiate) on Apr 03, 2017 at 23:32 UTC
Thank you, beech, for the quick response. I request you to review my original post one more time. I have actually included a snippet of the HTML that I'm trying to match against. That snippet is actually from the URL I am scraping. I got it by doing a 'View Source' on the page. Were you looking for something different?
Re^3: Using Web::Scraper to extract content from an HTML page by beech (Parson) on Apr 04, 2017 at 00:24 UTC
Hi, it slipped by me, I guess :) I looked at the website and your HTML matches, so no, I wasn't looking for something different. To clarify, try `process_first "img", name => '@title';`. a tags cannot be nested.
Re: Using Web::Scraper to extract content from an HTML page by SiteScraper (Initiate) on Apr 04, 2017 at 21:17 UTC
Thank you, beech and tangent! I'm embarrassed that I missed noticing that the "title" attribute belonged to the img tag and not the a tag. After I made the recommended changes, the script works like a charm.

