Here's my go:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $html = q{
<a name="a"></a>
<h2>A</h2>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tr>
<td>
<table width="100%" cellpadding="5" cellspacing="0" border="1">
<tr>
<td width="33%" valign="top" class="clsTableBody">
<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
Abbott, Evelyn
</a><br />
<span>110136892</span><br />
<a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
Agnew, Thomas
</a><br />
<span>110377660</span><br />
</td>
<td width="34%" valign="top" class="clsTableBodyClear">
<a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
Allison, David
</a><br />
<span>108116112</span><br />
<a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
Allison, Gary Owen
</a><br />
<span>116815754</span><br />
</td>
<td width="33%" valign="top" class="clsTableBody">
<a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
Arsenault, Michael
</a><br />
<span>108318866</span><br />
<a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
Arsenault, Normand A.
</a><br />
<span>113069066</span><br />
</td>
</tr>
</table>
</td>
</tr>
</table>
};
my $p = HTML::TokeParser::Simple->new(\$html);

# skip ahead to the second (inner) <table>
my $table_count = 2;
while (my $t = $p->get_tag('table')) {
    last unless --$table_count;
}

# map each link's href to the number in the <span> that follows it
my (%href, $this_href, $number);
while (my $t = $p->get_token) {
    if ($t->is_start_tag('a')) {
        $this_href = $t->get_attr('href');
        next;
    }
    if ($t->is_start_tag('span')) {
        $number = $p->get_trimmed_text('/span');
        $href{$this_href} = $number;
        next;
    }
    last if $t->is_end_tag('table');    # stop at the end of the inner table
}

for my $key (keys %href) {
    print "$key -> $href{$key}\n";
}
Output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
> Terminated with exit code 0.
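Note that the pairs above come back in hash order, not document order, and a hash would silently overwrite an entry if two links ever shared an href. If order matters, one option is to push the pairs onto an array instead. This is only a minimal sketch, assuming the same pattern of each <a> being followed by its <span>; the @pairs name and the cut-down sample markup (first two entries only, no nested table) are mine for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Cut-down sample: the first two entries of the original markup.
my $html = q{
<table><tr><td>
<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
Abbott, Evelyn
</a><br />
<span>110136892</span><br />
<a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
Agnew, Thomas
</a><br />
<span>110377660</span><br />
</td></tr></table>
};

my $p = HTML::TokeParser::Simple->new(\$html);

# Collect [href, number] pairs in the order they appear in the document.
my (@pairs, $this_href);
while (my $t = $p->get_token) {
    if ($t->is_start_tag('a')) {
        $this_href = $t->get_attr('href');
    }
    elsif ($t->is_start_tag('span') and defined $this_href) {
        push @pairs, [ $this_href, $p->get_trimmed_text('/span') ];
        undef $this_href;    # don't pair a later stray <span> with this link
    }
}

print "$_->[0] -> $_->[1]\n" for @pairs;
```

The defined check also means a <span> that turns up before any link is skipped rather than stored under an undefined key.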