PerlMonks  

Re: Process a HTML file to get information from it.

by wfsp (Abbot)
on Dec 11, 2006 at 18:00 UTC (#589103)


in reply to Process a HTML file to get information from it.

Here's my go:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = q{
<a name="a"></a>
<h2>A</h2>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
  <tr>
    <td>
      <table width="100%" cellpadding="5" cellspacing="0" border="1">
        <tr>
          <td width="33%" valign="top" class="clsTableBody">
            <a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
              Abbott, Evelyn
            </a><br />
            <span>110136892</span><br />
            <a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
              Agnew, Thomas
            </a><br />
            <span>110377660</span><br />
          </td>
          <td width="34%" valign="top" class="clsTableBodyClear">
            <a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
              Allison, David
            </a><br />
            <span>108116112</span><br />
            <a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
              Allison, Gary Owen
            </a><br />
            <span>116815754</span><br />
          </td>
          <td width="33%" valign="top" class="clsTableBody">
            <a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
              Arsenault, Michael
            </a><br />
            <span>108318866</span><br />
            <a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
              Arsenault, Normand A.
            </a><br />
            <span>113069066</span><br />
          </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
};

my $p = HTML::TokeParser::Simple->new(\$html);

# parse until second table
my $table_count = 2;
while (my $t = $p->get_tag('table')){
  last unless --$table_count;
}

my (%href, $this_href, $number);
while (my $t = $p->get_token){
  if ($t->is_start_tag('a')){
    $this_href = $t->get_attr('href');
    next;
  }
  if ($t->is_start_tag('span')){
    $number = $p->get_trimmed_text('/span');
    $href{$this_href} = $number;
    next;
  }
  last if $t->is_end_tag('table');
}

for my $key (keys %href){
  print "$key -> $href{$key}\n";
}
output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
> Terminated with exit code 0.

Re^2: Process a HTML file to get information from it.
by Griffler (Novice) on Dec 11, 2006 at 19:18 UTC
    This is great, but how would I modify this to parse a file that repeats that same table structure 25 more times (basically, one table for each letter of the alphabet)?
      Assuming each letter is in an H2 tag (and that these are the only H2 tags) and that each structure is identical, this should do the trick. We collect the data into a HoH (%href), keyed first by letter and then by href.

      Hope this helps.

      my $p = HTML::TokeParser::Simple->new(\$html);

      my (%href, $this_href, $number, $letter);
      while (my $t = $p->get_token){
        if ($t->is_start_tag('h2')){
          $letter = $p->get_trimmed_text('/h2');
          next;
        }
        if ($t->is_start_tag('a')){
          # skip bookmarks
          next if $t->get_attr('name');
          $this_href = $t->get_attr('href');
          next;
        }
        if ($t->is_start_tag('span')){
          $number = $p->get_trimmed_text('/span');
          $href{$letter}{$this_href} = $number;
          next;
        }
      }
      output:
      ---------- Capture Output ----------
      > "C:\Perl\bin\perl.exe" _new.pl
      A
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      B
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      C
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
      > Terminated with exit code 0.
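
The parsing loop in this reply fills %href but does not include the loop that printed the output. A minimal sketch of one way to walk a hash of hashes and print it in that layout is below. The helper name report_lines, the small sample hash, and the choice to sort letters and hrefs are assumptions made for illustration; the original output appears in plain hash order, so the ordering here may differ from it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data shaped like the %href hash of hashes built by the parser:
# letter => { href => number }. Only a few entries are shown here.
my %href = (
    A => {
        'pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf' => '110136892',
        'pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf' => '110377660',
    },
    B => {
        'pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf' => '108116112',
    },
);

# Hypothetical helper (not from the original post): returns the report
# as a list of lines, one heading line per letter followed by its
# "href -> number" pairs. Letters and hrefs are sorted so the result
# is repeatable from run to run.
sub report_lines {
    my ($data) = @_;
    my @lines;
    for my $letter (sort keys %$data) {
        push @lines, $letter;
        for my $key (sort keys %{ $data->{$letter} }) {
            push @lines, "$key -> $data->{$letter}{$key}";
        }
    }
    return @lines;
}

print "$_\n" for report_lines(\%href);
```

Returning the lines instead of printing inside the loop keeps the formatting separate from the output step, so the same helper could feed a file or a test as easily as STDOUT.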
