How to extract untouched content of html tag with HTML::Parser

by Lana (Beadle)
on Nov 28, 2010 at 15:54 UTC

Lana has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! I am trying to use HTML::Parser to extract the content of a specified HTML tag:
use strict;
use HTML::Parser;

my $content = <<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Some title goes here</title>
</head>
<body bgcolor="#FFFFFF">
<div id="leftcol">
menu column
</div>
<div id="body">
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><img src="img.gif" /> image here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
</div>
<div id="rightcol">
news column
</div>
</body>
</html>
EOF

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;

sub start_handler {
    my $self    = shift;
    my $tagname = shift;
    my $attr    = shift;
    return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );
    $self->handler( text => sub { print shift }, "dtext" );
    $self->handler( end  => sub {
        shift->eof if shift eq $tagname;
    }, "tagname,self" );
}
In this simplified example it strips the HTML inside <div id="body">...</div> and prints only the text, but I need all the HTML formatting left untouched. How can I achieve this? Thanks! :)

Replies are listed 'Best First'.
Re: How to extract untouched content of html tag with HTML::Parser
by ig (Vicar) on Nov 28, 2010 at 17:20 UTC
    I need all html formatting to be untouched

    Maybe including start and end tags within the div would give you what you want.

    sub start_handler {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;
        my $text    = shift;
        return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );

        # Once inside the target div, print nested start tags and text verbatim.
        $self->handler( start => sub { print shift }, "text" );
        $self->handler( text  => sub { print shift }, "text" );

        # Stop at the closing tag of the div; print any other end tag as-is.
        $self->handler( end => sub {
            my ($endtagname, $self, $text) = @_;
            if ( $endtagname eq $tagname ) {
                $self->eof;
            }
            else {
                print $text;
            }
        }, "tagname,self,text" );
    }
      Yeah! Thank you, it worked! I see my mistake :)
        FYI, shift is not required; you can print @_
      How do you save the output to a variable so it can be used later?

        That depends what you mean by later. Perhaps something like the following would work for you?

        sub start_handler {
            my $self     = shift;
            my $tagname  = shift;
            my $attr     = shift;
            my $text     = shift;
            my $variable = '';
            return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );
            $self->handler( start => sub { $variable .= shift }, "text" );
            $self->handler( text  => sub { $variable .= shift }, "text" );
            $self->handler( end => sub {
                my ($endtagname, $self, $text) = @_;
                if ( $endtagname eq $tagname ) {
                    later($variable);
                    $self->eof;
                }
                else {
                    $variable .= $text;
                }
            }, "tagname,self,text" );
        }

        sub later {
            my ($variable) = @_;
            ## do something with $variable
        }
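One caveat with the handlers above: they stop at the first closing tag with the same name, so a <div> nested inside the target <div> would end the capture early. The sample document has no nested divs, so this does not bite there. Below is a minimal sketch of one way to handle nesting with a depth counter. The helper name inner_html and its arguments are made up for illustration, and it assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): returns the raw inner HTML
# of the first <$tag id="$id"> element, or undef if no such element closes.
sub inner_html {
    my ($html, $tag, $id) = @_;
    my ($depth, $buffer, $done) = (0, '', 0);

    my $p = HTML::Parser->new( api_version => 3 );

    $p->handler( start => sub {
        my ($tagname, $attr, $text) = @_;
        if ($depth) {
            # Already inside the target: keep the tag exactly as written.
            $buffer .= $text;
            $depth++ if $tagname eq $tag;    # nested element of the same name
        }
        elsif ( $tagname eq $tag and ($attr->{id} // '') eq $id ) {
            $depth = 1;                      # entering the target element
        }
    }, "tagname,attr,text" );

    $p->handler( end => sub {
        my ($tagname, $text, $self) = @_;
        return unless $depth;
        if ( $tagname eq $tag and --$depth == 0 ) {
            $done = 1;                       # matching close tag reached
            $self->eof;                      # abort parsing, as in the code above
            return;
        }
        $buffer .= $text;
    }, "tagname,text,self" );

    # Text, comments and anything else without a specific handler.
    $p->handler( default => sub { $buffer .= shift if $depth }, "text" );

    $p->parse($html);
    $p->eof unless $done;
    return $done ? $buffer : undef;
}

print inner_html( '<div id="body"><p>one</p><div>two</div></div>', 'div', 'body' ), "\n";
```

The returned string can then be used anywhere after the call, which avoids having to call later() from inside the end handler.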
Re: How to extract untouched content of html tag with HTML::Parser
by roboticus (Chancellor) on Nov 28, 2010 at 16:09 UTC

    Lana:

    I've not used it in a while, but as I read the documentation, I'd suggest passing "text" rather than "dtext" to the handler specification so it can print the original text rather than the decoded text.

    ...roboticus
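To make the "text" versus "dtext" distinction concrete, here is a small sketch. The collect_text helper is a made-up name for illustration; the argspec strings "text" and "dtext" are the ones discussed above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collects everything the text handler receives,
# using the given argspec ("text" or "dtext").
sub collect_text {
    my ($html, $argspec) = @_;
    my $out = '';
    my $p = HTML::Parser->new( api_version => 3 );
    $p->handler( text => sub { $out .= shift }, $argspec );
    $p->parse($html);
    $p->eof;
    return $out;
}

my $html = '<p>Tom &amp; Jerry</p>';
print collect_text( $html, 'text' ),  "\n";    # raw source:  Tom &amp; Jerry
print collect_text( $html, 'dtext' ), "\n";    # decoded:     Tom & Jerry
```

Note that the <p> tags are missing from both results: a text handler only ever receives text events, whichever argspec it is given.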

      I wish it was that simple :) But it isn't :(
        It is that easy. You have a logic error. Your start handler, which you call start_handler, does no printing. Your text handler does printing but, as documented, the text handler handles text, not start tags. Also, your end handler does no printing.
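One way to see which handler receives what is to log every event with a default handler. This is only a sketch; trace_events is a made-up helper name, and events whose text is empty (such as the document start and end) are filtered out.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a list of [event, text] pairs for a snippet.
sub trace_events {
    my ($html) = @_;
    my @events;
    my $p = HTML::Parser->new( api_version => 3 );
    # With no specific handlers installed, every event falls through to default.
    $p->handler( default => sub { push @events, [@_] }, "event,text" );
    $p->parse($html);
    $p->eof;
    return grep { length $_->[1] } @events;
}

printf "%-6s %s\n", @$_ for trace_events('<p>hi<br /></p>');
```

Start tags, text and end tags arrive as three different event types, so printing in the text handler alone can never reproduce the tags.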

        OK, then, did you look at the hstrip example in the distribution? The documentation (at the end of the EXAMPLES section) indicates that you can modify it to do what you want:

        More examples are found in the eg/ directory of the HTML-Parser distribution: the program hrefsub shows how you can edit all links found in a document; the program htextsub shows how to edit the text only; the program hstrip shows how you can strip out certain tags/elements and/or attributes; and the program htext show how to obtain the plain text, but not any script/style content.

        ...roboticus
