Process a HTML file to get information from it.

Griffler has asked for the wisdom of the Perl Monks concerning the following question:

Re: Process a HTML file to get information from it. by JediWizard (Deacon) on Dec 11, 2006 at 17:10 UTC
Try using `<code>` tags around your html to prevent it from rendering. Otherwise, I think you need something like this:

    m/<a[^>]*                  # an anchor tag
      href=                    # the href in the anchor
      (["'])((?:(?!\1).)*)\1   # the value in the href
      [^>]*>                   # anything to the end of the anchor
      [^<>]*                   # the content in the anchor tag
      <\/a>                    # the end of the anchor
      (?:\s|<[^>]*>)+          # any whitespace or html tags
      (\d{9})                  # the 9 digit number
     /isxm;
    my $href   = $2;
    my $number = $3;

Update:

    my $re = qr/<a[^>]*              # an anchor tag
        href=                        # the href in the anchor
        (["'])((?:(?!\1).)*)\1       # the value in the href
        [^>]*>                       # anything to the end of the anchor
        [^<>]*                       # the content in the anchor tag
        <\/a>                        # the end of the anchor
        (?:\s|<[^>]*>)+              # any whitespace or html tags
        (\d{9})                      # the 9 digit number
        /isxm;

    my $string = do{local $/; <DATA>};

    while($string =~ m/$re/g){
        my $href   = $2;
        my $number = $3;
        print "$number - $href\n";
    }

    __DATA__
    <a name="a"></a>
    <h2>A</h2>
    <table border="0" cellpadding="0" cellspacing="0" width="100%">
    <tr>
    <td>
    <table id="a" border="1" bordercolor="#333366" cellpadding="5" cellspacing="0" width="100%">
    <tr>
    <td width="33%" class="clsTableBody" valign="top" id="firstCol"><a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>110136892</span><br/><a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>110377660</span><br/></td>
    <td width="34%" class="clsTableBodyClear" valign="top" id="secondCol"><a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">Allison, David</a><br/><span>108116112</span><br/><a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">Allison, Gary Owen</a><br/><span>116815754</span><br/></td>
    <td width="33%" class="clsTableBody" valign="top" id="thirdCol"><a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">Arsenault, Michael</a><br/><span>108318866</span><br/><a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">Arsenault, Normand A.</a><br/><span>113069066</span><br/></td>
    </tr>
    </table>
    </td>
    </tr>
    </table>

Output:

    110136892 - pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf
    110377660 - pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf
    108116112 - pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf
    116815754 - pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf
    108318866 - pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf
    113069066 - pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf

They say that time changes things, but you actually have to change them yourself. - Andy Warhol
Re^2: Process a HTML file to get information from it. by Griffler (Novice) on Dec 11, 2006 at 19:11 UTC
Thanks for the hints!!!
Re: Process a HTML file to get information from it. by wfsp (Abbot) on Dec 11, 2006 at 18:00 UTC
Here's my go:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{
    <a name="a"></a>
    <h2>A</h2>
    <table width="100%" cellpadding="0" cellspacing="0" border="0">
      <tr>
        <td>
          <table width="100%" cellpadding="5" cellspacing="0" border="1">
            <tr>
              <td width="33%" valign="top" class="clsTableBody">
                <a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
                  Abbott, Evelyn
                </a><br />
                <span>110136892</span><br />
                <a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
                  Agnew, Thomas
                </a><br />
                <span>110377660</span><br />
              </td>
              <td width="34%" valign="top" class="clsTableBodyClear">
                <a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
                  Allison, David
                </a><br />
                <span>108116112</span><br />
                <a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
                  Allison, Gary Owen
                </a><br />
                <span>116815754</span><br />
              </td>
              <td width="33%" valign="top" class="clsTableBody">
                <a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
                  Arsenault, Michael
                </a><br />
                <span>108318866</span><br />
                <a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
                  Arsenault, Normand A.
                </a><br />
                <span>113069066</span><br />
              </td>
            </tr>
          </table>
        </td>
      </tr>
    </table>
    };

    my $p = HTML::TokeParser::Simple->new(\$html);

    # parse until second table
    my $table_count = 2;
    while (my $t = $p->get_tag('table')){
      last unless --$table_count;
    }

    my (%href, $this_href, $number);
    while (my $t = $p->get_token){
      if ($t->is_start_tag('a')){
        $this_href = $t->get_attr('href');
        next;
      }
      if ($t->is_start_tag('span')){
        $number = $p->get_trimmed_text('/span');
        $href{$this_href} = $number;
        next;
      }
      last if $t->is_end_tag('table');
    }

    for my $key (keys %href){
      print "$key -> $href{$key}\n";
    }

Output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    > Terminated with exit code 0.
Re^2: Process a HTML file to get information from it. by Griffler (Novice) on Dec 11, 2006 at 19:18 UTC
This is great, but how would I modify this to parse a file that has that same table structure 25 more times? (Basically one table for each letter of the alphabet.)
Re^3: Process a HTML file to get information from it. by wfsp (Abbot) on Dec 12, 2006 at 07:52 UTC
Assuming each letter is in an H2 tag (and that these are the only H2 tags) and that each structure is identical, this should do the trick. We collect the data into a HoH (%href). Hope this helps.

    my $p = HTML::TokeParser::Simple->new(\$html);

    my (%href, $this_href, $number, $letter);
    while (my $t = $p->get_token){
      if ($t->is_start_tag('h2')){
        $letter = $p->get_trimmed_text('/h2');
        next;
      }
      if ($t->is_start_tag('a')){
        # skip bookmarks
        next if $t->get_attr('name');
        $this_href = $t->get_attr('href');
        next;
      }
      if ($t->is_start_tag('span')){
        $number = $p->get_trimmed_text('/span');
        $href{$letter}{$this_href} = $number;
        next;
      }
    }

Output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    A
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    B
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    C
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    > Terminated with exit code 0.
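The snippet above fills %href but does not show the loop that printed the per-letter output. A minimal sketch of how such a hash of hashes might be walked follows. The helper name `format_hoh` and the sample file names are made up for illustration and are not from the thread, and the keys are sorted here, whereas the posted output is in hash order.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turns a hash of hashes
# shaped like letter => { href => number } into printable lines.
sub format_hoh {
    my ($hoh) = @_;
    my @lines;
    for my $letter (sort keys %$hoh) {
        push @lines, $letter;
        for my $file (sort keys %{ $hoh->{$letter} }) {
            push @lines, "$file -> $hoh->{$letter}{$file}";
        }
    }
    return @lines;
}

# Made-up sample data standing in for the %href built by the parser loop.
my %href = (
    A => { 'pdf\abbott.pdf' => '110136892', 'pdf\agnew.pdf' => '110377660' },
    B => { 'pdf\baker.pdf'  => '108116112' },
);

print "$_\n" for format_hoh(\%href);
```

Sorting is only a convenience for stable output; dropping both `sort` calls gives the unordered listing seen in the thread.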
Re: Process a HTML file to get information from it. by GrandFather (Saint) on Dec 11, 2006 at 18:49 UTC
HTML::TreeBuilder is a pretty good tool for this sort of work, especially if the format of the HTML is consistent for the data you need to extract. Consider:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new ();

    $root->parse_file (*DATA);

    for ($root->look_down ('_tag', 'a')) {
        my $href = $_->attr ('href');

        next if ! $href;

        my $sib = $_->right ()->right ();
        my $number = $sib->as_text ();

        print "$href: $number\n";
    }

    __DATA__

(The data as provided by the OP, about 3 kB, follows `__DATA__`.)

Prints:

    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf: 110136892
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf: 110377660
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf: 108116112
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf: 116815754
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf: 108318866
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf: 113069066

DWIM is Perl's answer to Gödel
Re^2: Process a HTML file to get information from it. by Griffler (Novice) on Dec 11, 2006 at 19:06 UTC
I tried your code and I got the following error:

    Can't call method "right" without a package or object reference at C:\Change\2539\testit2.pl line 21.

I modified the code to look like this:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new ();

    $root->parse_file ("c:\\change\\2539\\index.html");

    for ($root->look_down ('_tag', 'a')) {
        my $href = $_->attr ('href');

        next if ! $href;

        my $sib = $_->right ()->right ();
        my $number = $sib->as_text ();

        print "$href: $number\n";
    }
Re^3: Process a HTML file to get information from it. by chanio (Priest) on Dec 12, 2006 at 16:02 UTC
...got the following error.... Can't call method "right" without a package or object reference at C:\Change\2539\testit2.pl line 21.

    my $sib = $_->right ()->right ();

...should be...

    my $sib = ($_->right ())->right (); ## I GUESS

But I don't get any output. Sorry!

Landlords production is only eaten by landlords... Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established
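A possible explanation for the error, offered as a sketch and not a confirmed diagnosis of the OP's file: Perl gives this particular message when a method is called on a defined value that is not an object, and HTML::Element hands back text siblings as plain strings. So if some anchor in the real file is followed by text and not by `<br/>`, the first `right()` returns a string and the second call fails. The extra parentheses do not change how the expression is evaluated. The helper name `href_number_pairs` and the sample HTML below are made up for illustration; the idea is to pick the first `<span>` element among the right siblings and skip anything that is not an object.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(blessed);

# Hypothetical helper (not from the thread): returns [href, number]
# pairs found in an HTML string.
sub href_number_pairs {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);
    my @pairs;
    for my $anchor ($root->look_down(_tag => 'a')) {
        my $href = $anchor->attr('href') or next;    # skip <a name="...">
        # In list context right() returns every right sibling; text
        # siblings are plain strings, so keep only element objects.
        my ($span) = grep { blessed($_) && $_->tag eq 'span' } $anchor->right;
        next unless $span;
        push @pairs, [ $href, $span->as_text ];
    }
    $root->delete;    # free the tree
    return @pairs;
}

# Made-up sample: stray text follows the first anchor.
my $html = q{<p><a href="pdf\a.pdf">Abbott</a> (new)<br><span>110136892</span><br>}
         . q{<a href="pdf\b.pdf">Agnew</a><br><span>110377660</span></p>};

print "$_->[0]: $_->[1]\n" for href_number_pairs($html);
```

This assumes the number always sits in the first `<span>` that follows the anchor inside the same parent element, as it does in the posted sample.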
Re: Process a HTML file to get information from it. by andyford (Curate) on Dec 11, 2006 at 17:13 UTC
Depending on the surrounding HTML and how static your source is, you might be able to get by without the Parser. Perhaps you could just use a regular expression in quick-but-dirty fashion like this: `/pdf.+?>(.+?)<.+span>(\d{9})<\/span>/;` Then your data might be in $1 and $2. What have you tried?

non-Perl: Andy Ford
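One thing to watch with the regex above: the greedy `.+` before `span>` runs to the last matching span on the line, so when two entries share a line (as in the posted HTML) the first name gets paired with the last number. A small sketch of a lazy variant in a `/g` loop follows. The helper name `name_number_pairs` and the shortened file names are made up for illustration, and note that $1 here is the link text, not the href.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [name, number]
# pairs found in $html.
sub name_number_pairs {
    my ($html) = @_;
    my @pairs;
    # Same shape as the regex above, but with a lazy ".+?" before
    # "span>" so each name is paired with the nearest number.
    while ($html =~ /pdf.+?>(.+?)<.+?span>(\d{9})<\/span>/g) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# Two entries on one line, with made-up short file names.
my $line = q{<a href="pdf\a.pdf" target="_blank">Abbott, Evelyn</a><br/><span>110136892</span><br/>}
         . q{<a href="pdf\b.pdf" target="_blank">Agnew, Thomas</a><br/><span>110377660</span><br/>};

print "$_->[0]: $_->[1]\n" for name_number_pairs($line);
# prints:
#   Abbott, Evelyn: 110136892
#   Agnew, Thomas: 110377660
```

Without the `/s` flag `.` does not cross newlines, so this only works while each anchor and its span stay on the same line.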
Re^2: Process a HTML file to get information from it. by Griffler (Novice) on Dec 11, 2006 at 17:22 UTC
I was using the code sample from the HTML::Parser module, and it parsed out all the hrefs, but I could not figure out how to get the 9 digit number that comes after each one. Here is the code I was using:

    use HTML::Parser;

    my $p = HTML::Parser->new(api_version => 3,
         start_h => [\&a_start_handler, "self,tagname,attr"],
         report_tags => [qw(a img)],
        );
    $p->parse_file(shift || die) || die $!;

    sub a_start_handler
    {
        my($self, $tag, $attr) = @_;
        return unless $tag eq "a";
        return unless exists $attr->{href};
        print "A $attr->{href}\n";

        $self->handler(text  => [], '@{dtext}' );
        $self->handler(start => \&img_handler);
        $self->handler(end   => \&a_end_handler, "self,tagname");
    }

    sub img_handler
    {
        my($self, $tag, $attr) = @_;
        return unless $tag eq "img";
        push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
    }

    sub a_end_handler
    {
        my($self, $tag) = @_;
        my $text = join("", @{$self->handler("text")});
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        $text =~ s/\s+/ /g;
        print "T $text\n";

        $self->handler("text", undef);
        $self->handler("start", \&a_start_handler);
        $self->handler("end", undef);
    }

The file has a ton of other stuff in it, but what I posted is the main guts.
Re: Process a HTML file to get information from it. by Popcorn Dave (Abbot) on Dec 11, 2006 at 18:32 UTC
You might have a look at HTML::TokeParser, as you should be able to pull out the information as tokens. I wrote a small quick and dirty program to dump HTML to tokens using HTML::TokeParser that you can find in this node: Re: HTML::TokeParser help - parsing headlines. HTH!

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.
Re: Process a HTML file to get information from it. by Griffler (Novice) on Dec 11, 2006 at 19:52 UTC
Thanks to all who posted. This was a great help!

