PerlMonks  

Re^2: Process a HTML file to get information from it.

by Griffler (Novice)
on Dec 11, 2006 at 19:18 UTC (#589122)


in reply to Re: Process a HTML file to get information from it.
in thread Process a HTML file to get information from it.

This is great, but how would I modify this to parse a file that repeats that same table structure 25 more times? (Basically, one table for each letter of the alphabet.)

Replies are listed 'Best First'.
Re^3: Process a HTML file to get information from it.
by wfsp (Abbot) on Dec 12, 2006 at 07:52 UTC
    This assumes that each letter is in an H2 tag (and that these are the only H2 tags) and that each table structure is identical.

    This should do the trick. We collect the data into a hash of hashes (HoH), %href, keyed first by letter and then by link href.

    Hope this helps.

    use HTML::TokeParser::Simple;

    # $html holds the page source as a string
    my $p = HTML::TokeParser::Simple->new(\$html);
    my (%href, $this_href, $number, $letter);
    while (my $t = $p->get_token){
        # each H2 starts a new letter section
        if ($t->is_start_tag('h2')){
            $letter = $p->get_trimmed_text('/h2');
            next;
        }
        if ($t->is_start_tag('a')){
            # skip bookmarks
            next if $t->get_attr('name');
            $this_href = $t->get_attr('href');
            next;
        }
        # the span text is the number belonging to the last href seen
        if ($t->is_start_tag('span')){
            $number = $p->get_trimmed_text('/span');
            $href{$letter}{$this_href} = $number;
            next;
        }
    }
    Output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    A
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    B
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    C
      pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
      pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
      pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
      pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
      pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
      pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    > Terminated with exit code 0.
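    The post does not show the code that printed the report, so here is a minimal sketch of one way to walk a hash of hashes like %href. The sample data is a small subset of the values in the output, and format_report is a hypothetical helper name, not something from the original script. Keys are sorted only to make the order predictable, so this does not necessarily reproduce the ordering in the captured output.

```perl
use strict;
use warnings;

# Assumed sample of the %href structure: letter => { href => number }.
# The values are a subset of those shown in the post's output.
my %href = (
    A => {
        'pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf' => 110377660,
        'pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf' => 110136892,
    },
    B => {
        'pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf' => 110377660,
        'pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf' => 110136892,
    },
);

# Hypothetical helper: one heading line per letter, followed by
# indented "href -> number" lines for that letter.
sub format_report {
    my ($data) = @_;
    my @lines;
    for my $letter (sort keys %$data) {
        push @lines, $letter;
        for my $link (sort keys %{ $data->{$letter} }) {
            push @lines, "  $link -> $data->{$letter}{$link}";
        }
    }
    return join "\n", @lines;
}

print format_report(\%href), "\n";
```

    In the real script, the same loop would run over the %href filled in by the token loop instead of the hard-coded sample.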
