How to avoid addition of tags by HTML::TreeBuilder

phoenix007 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to modify HTML using HTML::TreeBuilder, but the output of as_HTML contains additional tags that are not present in the input file. I want the input HTML to come back exactly as it is, except where I modify it. The code below shows how TreeBuilder changes the input HTML. Can anyone suggest an option that prevents any modification of the input HTML other than the changes I make explicitly?

use strict;
use warnings;
use HTML::TreeBuilder;

my $row_html = '<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>';

my $html = HTML::TreeBuilder->new;
$html->ignore_ignorable_whitespace(0);
$html->no_space_compacting(1);
$html->store_comments(1);
$html->parse($row_html);
$html->eof;    # signal end of input so the tree is finalized
# I will do some modifications to the HTML here
my $output_html = $html->as_HTML(undef,undef,{});
print $output_html;

Current output, with HTML and HEAD tags added by TreeBuilder:

<!DOCTYPE html>
<html><head></head><body>
<p>test https://www.google.com</p>
</body>
</html>

I was expecting the following output (identical to the input, since I am not changing anything in the HTML yet):

<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>


Replies are listed 'Best First'.
Re: How to avoid addition of tags by HTML::TreeBuilder by tangent (Parson) on Apr 20, 2019 at 01:35 UTC
One solution might be to go back to basics and use HTML::Parser, which HTML::TreeBuilder itself uses under the hood. Although it is not well documented, you can subclass this module and get fine grained control over the parsing, and it will not attempt to make any changes to the document unless you specifically tell it to. Here is an example which shows how you can combine the parsing with your own transformations:

my $row_html = '<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>';

my $parser = MyParser->new();
$parser->parse($row_html);
$parser->eof;
print $parser->out;

package MyParser;
use parent qw(HTML::Parser);

sub declaration {
    my ($self, $decl) = @_;
    $self->{'out'} .= "<!$decl>";
}
sub start {
    my ($self, $tag, $attr, $attrseq, $text) = @_;
    if ($tag eq 'p') {
        $self->{'in_para'} = 1;
    }
    $self->{'out'} .= $text;
}
sub text {
    my ($self, $text) = @_;
    if ($self->{'in_para'}) {
        $text =~ s/google/duckduckgo/;
    }
    $self->{'out'} .= $text;
}
sub end {
    my ($self, $tag, $text) = @_;
    if ($tag eq 'p') {
        $self->{'in_para'} = 0;
    }
    $self->{'out'} .= $text;
}
sub out {
    my ($self) = @_;
    return $self->{'out'};
}
1;

Output:

<!DOCTYPE html>
<body>
<p>test https://www.duckduckgo.com</p>
</body>
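The same pass-through idea can also be sketched without a subclass, using the handler-based interface (api_version 3) of HTML::Parser. This is only a sketch: the function name rewrite_html and the google/duckduckgo substitution are illustrative, and it assumes that the "text" argspec hands each handler the original source text of the event, which is my reading of the HTML::Parser documentation.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: copies the input through unchanged, except for
# text found inside <p> elements.
sub rewrite_html {
    my ($source) = @_;
    my $out     = '';
    my $in_para = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_para++ if $tag eq 'p';
            $out .= $text;                 # original start tag, untouched
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_para-- if $tag eq 'p' && $in_para;
            $out .= $text;                 # original end tag, untouched
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/google/duckduckgo/g if $in_para;
            $out .= $text;
        }, 'text' ],
        # Declarations, comments and anything else are copied verbatim.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text run in one piece
    $parser->parse($source);
    $parser->eof;
    return $out;
}

my $row_html = '<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>';

print rewrite_html($row_html), "\n";
```

Because every event is appended from its source text, markup that the handlers do not touch should come back as it went in, including missing html and head tags.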
Re: How to avoid addition of tags by HTML::TreeBuilder by marto (Cardinal) on Apr 19, 2019 at 08:13 UTC
You should take a look at Mojo::DOM for manipulating the HTML.
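A minimal sketch of what that might look like, assuming Mojolicious is installed. The function name rewrite_with_mojo and the substitution are made up for illustration. Mojo::DOM does not add html or head elements to a fragment, but it may normalise other markup details, so byte-for-byte identical output is not something to rely on.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: edit the content of every <p> and serialise again.
sub rewrite_with_mojo {
    my ($source) = @_;
    my $dom = Mojo::DOM->new($source);

    $dom->find('p')->each(sub {
        my ($p) = @_;
        my $inner = $p->content;          # inner markup of the element
        $inner =~ s/google/duckduckgo/g;
        $p->content($inner);              # replace the element's content
    });

    return $dom->to_string;
}

my $row_html = '<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>';

print rewrite_with_mojo($row_html), "\n";
```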
Re: How to avoid addition of tags by HTML::TreeBuilder by Veltro (Hermit) on Apr 19, 2019 at 08:16 UTC
It seems it adds the html and head tags for convenience. You can just get the body like this:

my $body = $html->find('body');
print $body->as_HTML;

edit: Although, now it seems that the closing `</p>` tags go missing and I am not sure why. Maybe the module is not perfect, try this as a workaround:

my @foo = $html->look_down(
    _tag => "body",
);
foreach (@foo) {
    my $output_html = $_->as_HTML(undef,undef,{});
    print $output_html;
    last;
}
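A likely explanation for the missing closing tags: the third argument of as_HTML names the elements whose end tags may be omitted, and when it is left out HTML::Element falls back to a default set that, as far as I can tell from its documentation, includes p. Passing an empty hash reference, as the workaround does, treats no end tag as optional. A small sketch with made-up input to compare the two calls:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(1);
$tree->parse("<body>\n<p>one</p>\n<p>two</p>\n</body>");
$tree->eof;

# In scalar context look_down returns the first matching element.
my $body = $tree->look_down(_tag => 'body');

# Default third argument: end tags listed as optional may be left out.
my $terse = $body->as_HTML;

# Empty hash ref: no end tag is treated as optional.
my $full = $body->as_HTML(undef, undef, {});

print "default:  $terse\n";
print "explicit: $full\n";

$tree->delete;    # release the tree's memory
```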
Re: How to avoid addition of tags by HTML::TreeBuilder by tobyink (Canon) on Apr 19, 2019 at 21:53 UTC
I mean, from a specification point of view, it's technically not changing the document. Just as `<p foo="1" bar="2"></p>` and `<p bar="2" foo="1"></p>` are considered exactly equivalent by the HTML specs, the expected output and actual output are exactly equivalent. Given that the expected output and actual output are equivalent and will be treated identically by any browsers and other software that conforms to the HTML specs, I feel like we're missing some bit of information… what about the current output is unacceptable to you? toby döt ink
Re: How to avoid addition of tags by HTML::TreeBuilder by Anonymous Monk on Apr 19, 2019 at 07:13 UTC
$tree->ignore_unknown(0);
$tree->implicit_tags(0);
$tree->no_expand_entities(1);
$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(1);
$tree->store_comments(1);
$tree->store_pis(1);
Re^2: How to avoid addition of tags by HTML::TreeBuilder by phoenix007 (Sexton) on Apr 19, 2019 at 07:23 UTC
Not working. I tried setting the options you provided. Output after setting your options:

<!DOCTYPE html>
<html><head></head><body></body>
<body>
<p>test https://www.google.com</p>
</body></html>

Expected output (same as input):

<!DOCTYPE html>
<body>
<p>test https://www.google.com</p>
</body>
Re^3: How to avoid addition of tags by HTML::TreeBuilder by Your Mother (Archbishop) on Apr 19, 2019 at 14:33 UTC
The expected output is illegal HTML; in fact so is the TreeBuilder version. HTML5 requires the title. Getting tools to produce incorrect output is usually outside their scope. If you always have the same template but differing bodies, you could just use the tree to print the body content into your template. Otherwise there might be a limited number of cases you could convert into a heuristic tree with matching template pieces to get what you want.
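One way the template idea could be sketched, assuming the wrapper around the body is fixed and known in advance. The $template string and the helper name body_inner_html are hypothetical, and whitespace inside the body may not survive exactly as written.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

# Hypothetical fixed template; %s marks where the body content goes.
my $template = "<!DOCTYPE html>\n<body>%s</body>";

# Hypothetical helper: serialise only the children of <body>.
sub body_inner_html {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_ignorable_whitespace(0);
    $tree->no_space_compacting(1);
    $tree->store_comments(1);
    $tree->parse($source);
    $tree->eof;

    my $body  = $tree->look_down(_tag => 'body');
    my $inner = '';
    for my $node ($body ? $body->content_list : ()) {
        # content_list yields element objects and plain text strings.
        $inner .= ref $node
            ? $node->as_HTML(undef, undef, {})
            : encode_entities($node);
    }
    $tree->delete;
    return $inner;
}

my $page = sprintf $template,
    body_inner_html('<body><p>test https://www.google.com</p></body>');
print $page, "\n";
```

The html and head elements that TreeBuilder adds never reach the output, because only the children of body are serialised.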
Re^4: How to avoid addition of tags by HTML::TreeBuilder by Anonymous Monk on Apr 19, 2019 at 16:55 UTC
Re^5: How to avoid addition of tags by HTML::TreeBuilder by Your Mother (Archbishop) on Apr 19, 2019 at 17:03 UTC
Re^5: How to avoid addition of tags by HTML::TreeBuilder by karlgoethebier (Abbot) on Apr 20, 2019 at 14:17 UTC
Some notes below your chosen depth have not been shown here
Re^3: How to avoid addition of tags by HTML::TreeBuilder by Anonymous Monk on Apr 19, 2019 at 07:45 UTC
That's as good as it gets with TreeBuilder.

