I am attempting to extract some data from an HTML page using the Web::Scraper module. The HTML looks as shown below:
<table class="dextable" align="center">
<tr>
<td class="fooevo">ID No.</td>
<td class="fooevo">Picture</td>
<td class="fooevo">Pokémon Name</td>
<td class="fooevo">Rarity</td>
<td class="fooevo">Movement</td>
<td class="fooevo">Material Cost</td>
</tr>
<tr>
<td class="cen">ID - 26</td>
<td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
<td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
<td class="cen"><img src="/duel/c.png" /> C</td>
<td class="cen">3</td>
<td class="fooinfo"><img src="/duel/material.png" />250</td>
</tr>
I need to extract two things from the 2nd td element of each data row: the URL in the href attribute of the link it contains, and the text of the title attribute nested inside that same td.
The code I have written for this is as shown below:
#!/usr/bin/perl -w
use URI;
use Web::Scraper;
use Encode;
# First, create your scraper block
my $p1 = scraper {
    process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
        # And, in each td,
        # get the URI of "a" element
        process_first "a", uri => '@href';
        # get the name from the title attribute
        process_first "a", name => '@title';
    };
};
my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );
for my $p (@{$res->{list}}) {
    print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
}
The code shown above does not work: it prints the URL correctly but not the name. It also picks up the other td elements that have no nested <a> element, and for each of those it displays an error. To summarize, I am looking for answers to two questions:
- How do I get the Web::Scraper module to extract the name (the value of the title attribute)?
- How do I get the Web::Scraper module to ignore those td elements without a nested <a> element in them?
Thank you in advance.
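One possible direction, offered as a minimal sketch rather than a confirmed fix: in the sample markup the title attribute sits on the <img> element, not on the <a>, so the name could be read from "img" instead. Web::Scraper also accepts an XPath expression in place of a CSS selector, and a trailing [a] predicate should limit the match to td cells that have an <a> child. The inline $html string and the variable name $figures are assumptions made for illustration. The sketch parses a local copy of the sample rather than fetching the live page, whose structure may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Inline sample mirroring the markup from the question
# (assumption: the live page has the same structure).
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
<td class="fooevo">ID No.</td>
<td class="fooevo">Picture</td>
</tr>
<tr>
<td class="cen">ID - 26</td>
<td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
<td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
<td class="cen"><img src="/duel/c.png" /> C</td>
<td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # XPath instead of CSS: the [a] predicate keeps only
    # td cells that have an <a> element as a direct child.
    process '//table[@class="dextable"]//td[@class="cen"][a]', 'list[]' => scraper {
        # href lives on the <a> element
        process_first 'a',   uri  => '@href';
        # title lives on the <img> nested inside the <a>
        process_first 'img', name => '@title';
    };
};

# Passing a plain string scrapes the HTML directly, with no network access.
my $res = $figures->scrape($html);
for my $p (@{ $res->{list} || [] }) {
    print "$p->{name}\t$p->{uri}\n";
}
```

To run this against the real page, the call would go back to `$figures->scrape( URI->new(...) )` as in the original script, with the output passed through Encode::encode as before.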