How to extract untouched content of html tag with HTML::Parser

Lana has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!! I am trying to use HTML::Parser to extract the content of a specified HTML tag:

use strict;
use HTML::Parser;

my $content=<<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Some title goes here</title>
</head>
<body bgcolor="#FFFFFF">
  <div id="leftcol">
    menu column
  </div>
  <div id="body">
    <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
    <p><strong>some header</strong></p>
    <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
    <p><strong>some header</strong></p>
    <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
  </div>
  <div id="rightcol">
    news column
  </div>
</body>
</html>
EOF

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;


sub start_handler {
    my $self = shift;
    my $tagname  = shift;
    my $attr = shift;
    return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );
    $self->handler( text => sub { print shift }, "dtext" );
    $self->handler( end => sub { shift->eof if shift eq $tagname; }, "tagname,self" );
}

In this simplified example it strips the HTML inside the <div id="body">...</div> and prints out just the text, but I need all the HTML formatting left untouched. How can I achieve this? Thanks! :)


Replies are listed 'Best First'.
Re: How to extract untouched content of html tag with HTML::Parser by ig (Vicar) on Nov 28, 2010 at 17:20 UTC
"I need all html formatting to be untouched"

Maybe including the start and end tags within the div would give you what you want:

sub start_handler {
    my $self = shift;
    my $tagname = shift;
    my $attr = shift;
    my $text = shift;
    return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );
    $self->handler( start => sub { print shift }, "text" );
    $self->handler( text => sub { print shift }, "text" );
    $self->handler( end => sub {
        my ($endtagname, $self, $text) = @_;
        if ($endtagname eq $tagname) {
            $self->eof;
        } else {
            print $text;
        }
    }, "tagname,self,text" );
}
Re^2: How to extract untouched content of html tag with HTML::Parser by Lana (Beadle) on Nov 28, 2010 at 17:35 UTC
Yeah!! Thank you! It worked! I see my mistake :)
Re^3: How to extract untouched content of html tag with HTML::Parser by Anonymous Monk on Nov 28, 2010 at 20:53 UTC
FYI, the shift is not required; you can `print @_`.
Re^2: How to extract untouched content of html tag with HTML::Parser by SneakZa (Initiate) on May 28, 2013 at 16:34 UTC
How do you save the output to a variable so it can be used later?
Re^3: How to extract untouched content of html tag with HTML::Parser by ig (Vicar) on Jul 26, 2013 at 17:15 UTC
That depends on what you mean by later. Perhaps something like the following would work for you?

sub start_handler {
    my $self = shift;
    my $tagname = shift;
    my $attr = shift;
    my $text = shift;
    my $variable = '';
    return unless ( $tagname eq 'div' and $attr->{id} eq 'body' );
    $self->handler( start => sub { $variable .= shift }, "text" );
    $self->handler( text => sub { $variable .= shift }, "text" );
    $self->handler( end => sub {
        my ($endtagname, $self, $text) = @_;
        if ($endtagname eq $tagname) {
            later($variable);
            $self->eof;
        } else {
            $variable .= $text;
        }
    }, "tagname,self,text" );
}

sub later {
    my ($variable) = @_;
    ## do something with $variable
}
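If "later" means after the parse has finished, one option is to wrap the parser in a function and return the collected markup. The following is only a sketch under a few assumptions: the name `inner_html_of_div` is made up for illustration, nested divs are not handled, and it uses flags rather than `$self->eof` to stop collecting, so the rest of the document is still parsed but ignored.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): returns the markup found
# between <div id="$id"> and the first </div> that follows it.
# Assumption: the target div contains no nested <div> elements.
sub inner_html_of_div {
    my ($html, $id) = @_;
    my ($inner, $inside, $done) = ('', 0, 0);

    my $p = HTML::Parser->new( api_version => 3 );

    # Start tags: either open the target div or, once inside, keep the raw tag.
    $p->handler( start => sub {
        my ($tagname, $attr, $text) = @_;
        return if $done;
        if ($inside) { $inner .= $text; return }
        $inside = 1 if $tagname eq 'div' and ($attr->{id} // '') eq $id;
    }, "tagname,attr,text" );

    # Text: "text" is the original source text, "dtext" would be decoded.
    $p->handler( text => sub { $inner .= $_[0] if $inside }, "text" );

    # End tags: the first </div> closes the target, anything else is kept.
    $p->handler( end => sub {
        my ($tagname, $text) = @_;
        return unless $inside;
        if ($tagname eq 'div') { $inside = 0; $done = 1 }
        else                   { $inner .= $text }
    }, "tagname,text" );

    $p->parse($html);
    $p->eof;
    return $inner;
}

my $html = '<div id="a">x</div><div id="body"><p>one<br />two</p></div><div id="c">y</div>';
my $saved = inner_html_of_div($html, 'body');
print "$saved\n";
```

The returned string can then be used anywhere after the call, which avoids having to do the follow-up work from inside the end handler.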
Re: How to extract untouched content of html tag with HTML::Parser by roboticus (Chancellor) on Nov 28, 2010 at 16:09 UTC
Lana: I've not used it in a while, but as I read the documentation, I'd suggest passing "text" rather than "dtext" to the handler specification so it can print the original text rather than the decoded text.

...roboticus
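To illustrate the difference between the two argspec names, here is a minimal sketch that asks for both "text" and "dtext" in the same text handler. The sample input and the variable names `@raw` and `@decoded` are made up for the illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my (@raw, @decoded);    # illustrative names: original vs. entity-decoded text

my $p = HTML::Parser->new( api_version => 3 );

# With the argspec "text,dtext" the handler receives the source text first
# and the entity-decoded text second.
$p->handler( text => sub {
    my ($text, $dtext) = @_;
    push @raw,     $text;
    push @decoded, $dtext;
}, "text,dtext" );

$p->parse('<p>fish &amp; chips</p>');
$p->eof;

print "text:  ", join('', @raw),     "\n";
print "dtext: ", join('', @decoded), "\n";
```

Note that this only affects how text is reported; the tags themselves never reach a text handler either way.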
Re^2: How to extract untouched content of html tag with HTML::Parser by Lana (Beadle) on Nov 28, 2010 at 16:11 UTC
I wish it was that simple :) But it isn't :(
Re^3: How to extract untouched content of html tag with HTML::Parser by Anonymous Monk on Nov 28, 2010 at 17:26 UTC
It is that easy. You have a logic error. Your start handler, which you call start_handler, does no printing. Your text handler does the printing, but as documented, the text handler handles text, not start tags. Also, your end handler does no printing.
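One way to see which handler receives which piece of the document is to log every event. This is a small sketch, with a made-up sample string and an illustrative `@events` array, that records the event kind together with the source text passed to each handler.

```perl
use strict;
use warnings;
use HTML::Parser;

my $source = '<p>some <strong>header</strong></p>';
my @events;    # illustrative: list of [ kind, source text ] pairs

my $p = HTML::Parser->new( api_version => 3 );
$p->unbroken_text(1);    # report each run of text as a single event

# Every handler asks for "text", i.e. the original source of that event.
$p->handler( start => sub { push @events, [ start => $_[0] ] }, "text" );
$p->handler( text  => sub { push @events, [ text  => $_[0] ] }, "text" );
$p->handler( end   => sub { push @events, [ end   => $_[0] ] }, "text" );

$p->parse($source);
$p->eof;

printf "%-5s %s\n", @$_ for @events;
```

The start and end tags show up only in the start and end events, so a program that prints from the text handler alone will drop all the markup, while printing from all three reproduces the input.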
Re^4: How to extract untouched content of html tag with HTML::Parser by Lana (Beadle) on Nov 28, 2010 at 17:33 UTC
Re^5: How to extract untouched content of html tag with HTML::Parser by Anonymous Monk on Nov 28, 2010 at 17:36 UTC
Re^3: How to extract untouched content of html tag with HTML::Parser by roboticus (Chancellor) on Nov 28, 2010 at 16:40 UTC
OK, then, did you look at the `hstrip` example in the distribution? The documentation (at the end of the EXAMPLES section) indicates that you can modify it to do what you want:

"More examples are found in the eg/ directory of the HTML-Parser distribution: the program hrefsub shows how you can edit all links found in a document; the program htextsub shows how to edit the text only; the program hstrip shows how you can strip out certain tags/elements and/or attributes; and the program htext show how to obtain the plain text, but not any script/style content."

...roboticus
Re^4: How to extract untouched content of html tag with HTML::Parser by Lana (Beadle) on Nov 28, 2010 at 17:22 UTC

