http://qs321.pair.com?node_id=589103


in reply to Process a HTML file to get information from it.

Here's my go:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = q{
<a name="a"></a>
<h2>A</h2>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
  <tr>
    <td>
      <table width="100%" cellpadding="5" cellspacing="0" border="1">
        <tr>
          <td width="33%" valign="top" class="clsTableBody">
            <a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
              Abbott, Evelyn
            </a><br />
            <span>110136892</span><br />
            <a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
              Agnew, Thomas
            </a><br />
            <span>110377660</span><br />
          </td>
          <td width="34%" valign="top" class="clsTableBodyClear">
            <a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
              Allison, David
            </a><br />
            <span>108116112</span><br />
            <a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
              Allison, Gary Owen
            </a><br />
            <span>116815754</span><br />
          </td>
          <td width="33%" valign="top" class="clsTableBody">
            <a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
              Arsenault, Michael
            </a><br />
            <span>108318866</span><br />
            <a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
              Arsenault, Normand A.
            </a><br />
            <span>113069066</span><br />
          </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
};

my $p = HTML::TokeParser::Simple->new(\$html);

# parse until second table
my $table_count = 2;
while (my $t = $p->get_tag('table')){
  last unless --$table_count;
}

my (%href, $this_href, $number);
while (my $t = $p->get_token){
  # remember the most recent link target
  if ($t->is_start_tag('a')){
    $this_href = $t->get_attr('href');
    next;
  }
  # the span that follows a link holds its number
  if ($t->is_start_tag('span')){
    $number = $p->get_trimmed_text('/span');
    $href{$this_href} = $number;
    next;
  }
  # stop at the end of the inner table
  last if $t->is_end_tag('table');
}

for my $key (keys %href){
  print "$key -> $href{$key}\n";
}
output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
> Terminated with exit code 0.

Re^2: Process a HTML file to get information from it.
by Griffler (Novice) on Dec 11, 2006 at 19:18 UTC
    This is great, but how would I modify this to parse a file that repeats that same table structure 25 more times? (Basically, one table for each letter of the alphabet.)
      This should do the trick, assuming each letter is in an H2 tag (and that these are the only H2 tags) and that each structure is identical.

      We collect the data into a HoH (hash of hashes), %href, keyed first by letter and then by href.

      Hope this helps.

      my $p = HTML::TokeParser::Simple->new(\$html);

      my (%href, $this_href, $number, $letter);
      while (my $t = $p->get_token){
        # each H2 starts a new letter section
        if ($t->is_start_tag('h2')){
          $letter = $p->get_trimmed_text('/h2');
          next;
        }
        if ($t->is_start_tag('a')){
          # skip bookmarks
          next if $t->get_attr('name');
          $this_href = $t->get_attr('href');
          next;
        }
        if ($t->is_start_tag('span')){
          $number = $p->get_trimmed_text('/span');
          $href{$letter}{$this_href} = $number;
          next;
        }
      }
      output:
      ---------- Capture Output ----------
      > "C:\Perl\bin\perl.exe" _new.pl
      A
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      B
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      C
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      > Terminated with exit code 0.
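
The snippet in the reply builds the HoH but does not show the loop that printed it. Something along these lines would produce a listing of that kind. This is only a sketch: the sub name report_lines and the sample data are made up for illustration (the hrefs and numbers are borrowed from the first post), and the keys are sorted here, so the order will not match the captured output, which came out in Perl's arbitrary hash order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %href in the reply:
# letter => { href => number }
my %href = (
    A => {
        'pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf' => '110136892',
        'pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf' => '110377660',
    },
    B => {
        'pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf' => '108116112',
    },
);

# Return the report lines: each letter, then its "href -> number" pairs.
# Keys are sorted because Perl hashes do not preserve insertion order.
sub report_lines {
    my ($data) = @_;
    my @lines;
    for my $letter (sort keys %$data) {
        push @lines, $letter;
        for my $link (sort keys %{ $data->{$letter} }) {
            push @lines, "$link -> $data->{$letter}{$link}";
        }
    }
    return @lines;
}

print "$_\n" for report_lines(\%href);
```

With the real data, pass a reference to the %href filled in by the parsing loop instead of the sample hash.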