PerlMonks  

Parsing HTML

by deadpickle (Pilgrim)
on Jun 16, 2009 at 07:37 UTC ([id://771917]=perlquestion)

deadpickle has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a very crazy (IMHO) HTML website. I tried to use XML::LibXML, but it's confusing how to get out what I want. I want to remove the href at the value Level III. I have no idea how to do that. I could use a push in the right direction. Any ideas?
<TABLE width=745 BORDER=0 CELLPADDING=2 cellspacing=0>
<TR>
<TD colspan=4 bgcolor=green align=center cellpadding=0>
<font color=green>.</font>
</TD>
</TR>
<TR>
<TD width=25% rowspan=4 BGCOLOR="#DEB887" align=center>
<p id="entrypor"><font color=#127d37 face=Helvetica size=3><B>Radar&nbsp;</font></B><a href="#more-radar" onclick="moreradar()" title="Click to show products and output type selction options.">[-]</a></p>
</TD>
<TD bgcolor="white" width=35% valign=middle align=left>
&nbsp;<span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=6500">NEXRAD Level II</A></span>
</TD>
<TD bgcolor="white" width=20% valign=middle align=left>
<span id="accesslinkage"><i><A HREF="HAS.FileAppSelect?datasetname=6500">access</A> | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/surface-doc.html#6500" target="infowin">info</A> | inv | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/index.php?choice=dsi&searchstring=6500&submitted=1&submitted=Search" target="docwin">docs</A></I></span>
</TD>
<TD bgcolor="white" width=20% valign=middle align=right>
<span id="entrypor">06/05/1991 - 06/15/2009</span>
</TD>
</tr>
<TR>
<TD bgcolor="#eaeaea" width=35% valign=middle align=left>
&nbsp;<span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=7000">NEXRAD Level III</A></span>
</TD>
<TD bgcolor="#eaeaea" width=20% valign=middle align=left>
<span id="accesslinkage"><i><A HREF="HAS.FileAppSelect?datasetname=7000">access</A> | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/surface-doc.html#7000" target="infowin">info</A> | inv | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/index.php?choice=dsi&searchstring=7000&submitted=1&submitted=Search" target="docwin">docs</A></I></span>
</TD>
<TD bgcolor="#eaeaea" width=20% valign=middle align=right>
<span id="entrypor">05/07/1992 - 06/14/2009</span>
</TD>
</tr>
<TR>
</TR>
</table>

Replies are listed 'Best First'.
Re: Parsing HTML
by wfsp (Abbot) on Jun 16, 2009 at 08:24 UTC
    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $t = HTML::TreeBuilder->new_from_file(*DATA)
      or die qq{cant build tree: $!\n};
    my @anchors = $t->look_down(_tag => q{a});
    for my $anchor (@anchors){
      if ($anchor->as_text eq q{NEXRAD Level III}){
        $anchor->replace_with_content;
      }
    }
    print $t->as_HTML(
      undef,
      q{ },
      {},
    );
    $t->delete;
    __DATA__
    <TABLE width=745 BORDER=0 CELLPADDING=2 cellspacing=0>
    <TR>
    <TD colspan=4 bgcolor=green align=center cellpadding=0>
    <font color=green>.</font>
    </TD>
    </TR>
    <TR>
    <TD width=25% rowspan=4 BGCOLOR="#DEB887" align=center>
    <p id="entrypor"><font color=#127d37 face=Helvetica size=3><B>Radar&nbsp;</font></B><a href="#more-radar" onclick="moreradar()" title="Click to show products and output type selction options.">[-]</a></p>
    </TD>
    <TD bgcolor="white" width=35% valign=middle align=left>
    &nbsp;<span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=6500">NEXRAD Level II</A></span>
    </TD>
    <TD bgcolor="white" width=20% valign=middle align=left>
    <span id="accesslinkage"><i><A HREF="HAS.FileAppSelect?datasetname=6500">access</A> | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/surface-doc.html#6500" target="infowin">info</A> | inv | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/index.php?choice=dsi&searchstring=6500&submitted=1&submitted=Search" target="docwin">docs</A></I></span>
    </TD>
    <TD bgcolor="white" width=20% valign=middle align=right>
    <span id="entrypor">06/05/1991 - 06/15/2009</span>
    </TD>
    </tr>
    <TR>
    <TD bgcolor="#eaeaea" width=35% valign=middle align=left>
    &nbsp;<span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=7000">NEXRAD Level III</A></span>
    </TD>
    <TD bgcolor="#eaeaea" width=20% valign=middle align=left>
    <span id="accesslinkage"><i><A HREF="HAS.FileAppSelect?datasetname=7000">access</A> | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/surface-doc.html#7000" target="infowin">info</A> | inv | <A HREF="http://www.ncdc.noaa.gov/oa/documentlibrary/index.php?choice=dsi&searchstring=7000&submitted=1&submitted=Search" target="docwin">docs</A></I></span>
    </TD>
    <TD bgcolor="#eaeaea" width=20% valign=middle align=right>
    <span id="entrypor">05/07/1992 - 06/14/2009</span>
    </TD>
    </tr>
    <TR>
    </TR>
    </table>
    output (extract plus whitespace)
    <td align="left" bgcolor="#eaeaea" valign="middle" width="35%"> &nbsp; <span id="entrylinks">NEXRAD Level III</span> </td>
      Thanks for the replies. I do appreciate it. Here is what I have so far:
      #!/usr/bin/perl -w
      use strict;
      use HTTP::Lite;
      use HTML::TreeBuilder;

      my $http = new HTTP::Lite;
      my $req = $http->request("http://has.ncdc.noaa.gov/pls/plhas/has.dsselect")
        or die "Unable to get document: $!";
      die "Request failed ($req): ".$http->status_message() if $req ne "200";
      my $body = $http->body();

      my $t = HTML::TreeBuilder->new_from_content($body)
        or die qq{cant build tree: $!\n};
      my @anchors = $t->look_down(_tag => q{a});
      for my $anchor (@anchors){
        if ($anchor->as_text eq q{NEXRAD Level III}){
          $anchor->replace_with_content;
        }
      }
      print $t->as_HTML(
        undef,
        q{ },
        {},
      );
      $t->delete;
      This, of course, is not much different from the above code. I just don't understand this HTML stuff, and can't seem to find any pages that really help. I want to take the text contained in the href when the link text is Level III and put it into a variable. Is href a node, namespace, or what?
        my @anchors = $t->look_down(_tag => q{a});
        my @hrefs;
        for my $anchor (@anchors){
          if ($anchor->as_text eq q{NEXRAD Level III}){
            push @hrefs, $anchor->attr(q{href});
            $anchor->replace_with_content;
          }
        }
        print qq{$_\n} for @hrefs;
        HAS.FileAppSelect?datasetname=7000
        An href is an attribute of an HTML element, hence $anchor->attr(q{href}) :-)

        Have a look at the HTML::Element docs to see what the look_down, as_text, attr and replace_with_content methods do.
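
To make the element/attribute distinction concrete, here is a minimal sketch, assuming HTML::TreeBuilder is installed. The one-line HTML fragment is cut down from the page in the question (the `target` attribute is borrowed from a neighbouring link purely for illustration), and the variable names are made up for this example. It prints the tag name, the text, and every attribute of a single anchor.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative fragment only: one anchor with two attributes.
my $tree = HTML::TreeBuilder->new_from_content(
    '<p><A HREF="HAS.FileAppSelect?datasetname=7000" target="infowin">NEXRAD Level III</A></p>'
);

# In scalar context look_down returns the first matching element.
my $anchor = $tree->look_down(_tag => 'a');

# all_external_attr returns name/value pairs, skipping internal
# attributes such as _tag and _parent. Names come back lowercased.
my %attr = $anchor->all_external_attr;

print 'tag:  ', $anchor->tag,     "\n";
print 'text: ', $anchor->as_text, "\n";
print "attr: $_ = $attr{$_}\n" for sort keys %attr;

$tree->delete;
```

The anchor is the element (a node in the tree); `href` and `target` are just name/value pairs stored on it, which is why `attr` is the accessor to reach for.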

        update

        heh! Looking back at your original question I saw

        I want to remove the href...
        and I read that as removing the anchor tag from the HTML. :-)

        Did you mean you wanted to get/extract the hrefs and store them in an array? If so you don't need the

        $anchor->replace_with_content;
        line. I'll get there in the end. :-)
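
If the goal is only to collect the hrefs and leave the HTML untouched, the loop can be reduced to a single `look_down` call. The following is a sketch under stated assumptions: `hrefs_for_label` is a hypothetical helper name, the HTML is a trimmed stand-in for the page in the question, and HTML::TreeBuilder must be installed. It relies on `look_down` accepting a code reference as an extra criterion alongside `_tag`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the href of every <a> whose visible
# text is exactly $label. The tree is not modified.
sub hrefs_for_label {
    my ($html, $label) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @hrefs =
        map  { $_->attr('href') }
        grep { defined $_->attr('href') }
        $tree->look_down(_tag => 'a', sub { $_[0]->as_text eq $label });
    $tree->delete;
    return @hrefs;
}

# Trimmed stand-in for the table in the question.
my $html = <<'HTML';
<table><tr>
<td><span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=6500">NEXRAD Level II</A></span></td>
<td><span id="entrylinks"><A HREF="HAS.FileAppSelect?datasetname=7000">NEXRAD Level III</A></span></td>
</tr></table>
HTML

my @hrefs = hrefs_for_label($html, 'NEXRAD Level III');
print "$_\n" for @hrefs;
```

Because the comparison uses `eq` on the full link text, "NEXRAD Level II" does not match even though it is a prefix of the wanted label. The href comes back exactly as written in the page, so a relative value like this one still has to be resolved against the page URL before it can be fetched.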
Re: Parsing HTML
by Anonymous Monk on Jun 16, 2009 at 07:44 UTC
