PerlMonks  

How to avoid addition of tags by HTML::TreeBuilder

by phoenix007 (Sexton)
on Apr 19, 2019 at 07:04 UTC ( [id://1232789]=perlquestion )

phoenix007 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to modify HTML using HTML::TreeBuilder, but the output of as_HTML contains additional tags that are not present in the input file. I want the input HTML left exactly as it is, except where I modify it. The code below shows how TreeBuilder changes the input HTML. Can anyone suggest an option that avoids any modification of the input HTML other than what I do explicitly?

use HTML::TreeBuilder;

my $row_html = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
my $html = HTML::TreeBuilder->new;
$html->ignore_ignorable_whitespace(0);
$html->no_space_compacting(1);
$html->store_comments(1);
$html->parse($row_html);
# i will do some modifications to HTML here
my $output_html = $html->as_HTML(undef,undef,{});
print $output_html;

Current output, with HTML and HEAD tags added by TreeBuilder:

<!DOCTYPE html> <html><head></head><body> <p>test https://www.google.com</p> </body> </html>

I was expecting the following output (the same as the input, as I'm not changing anything in the HTML yet):

<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>

Replies are listed 'Best First'.
Re: How to avoid addition of tags by HTML::TreeBuilder
by tangent (Parson) on Apr 20, 2019 at 01:35 UTC
    One solution might be to go back to basics and use HTML::Parser, which HTML::TreeBuilder itself uses under the hood.

    Although it is not well documented, you can subclass this module and get fine grained control over the parsing, and it will not attempt to make any changes to the document unless you specifically tell it to.

    Here is an example which shows how you can combine the parsing with your own transformations:

    my $row_html = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
    my $parser = MyParser->new();
    $parser->parse($row_html);
    $parser->eof;
    print $parser->out;

    package MyParser;
    use parent qw(HTML::Parser);

    sub declaration {
        my ($self, $decl) = @_;
        $self->{'out'} .= "<!$decl>";
    }

    sub start {
        my ($self, $tag, $attr, $attrseq, $text) = @_;
        if ($tag eq 'p') {
            $self->{'in_para'} = 1;
        }
        $self->{'out'} .= $text;
    }

    sub text {
        my ($self, $text) = @_;
        if ($self->{'in_para'}) {
            $text =~ s/google/duckduckgo/;
        }
        $self->{'out'} .= $text;
    }

    sub end {
        my ($self, $tag, $text) = @_;
        if ($tag eq 'p') {
            $self->{'in_para'} = 0;
        }
        $self->{'out'} .= $text;
    }

    sub out {
        my ($self) = @_;
        return $self->{'out'};
    }

    1;

    Output:
    <!DOCTYPE html> <body> <p>test https://www.duckduckgo.com</p> </body>
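The same pass-through idea can also be sketched without a subclass, using the version 3 handler interface of HTML::Parser. This is only a minimal illustration, not the method used in the reply above: a single default handler receives the original source text of every event and appends it to a string. The variable names `$input` and `$out` are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

my $input = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
my $out   = '';

# With api_version 3 and only a default handler, every event
# (declaration, start tag, text, end tag, comment) is passed to the
# same callback. The 'text' argspec supplies the original source text.
my $parser = HTML::Parser->new(
    api_version => 3,
    default_h   => [ sub { $out .= shift }, 'text' ],
);
$parser->parse($input);
$parser->eof;    # flush any buffered text

print $out, "\n";
```

To change something on the way through, a more specific handler (for example `text_h` or `start_h`) can be registered next to `default_h`; events without a specific handler still fall through to the default one.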
Re: How to avoid addition of tags by HTML::TreeBuilder
by marto (Cardinal) on Apr 19, 2019 at 08:13 UTC

    You should take a look at Mojo::DOM for manipulating the HTML.
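A minimal sketch of what that could look like, assuming Mojolicious is installed. It parses the markup from the question, rewrites the text of each p element, and serializes the result. The substitution is only an illustration, and `content` expects HTML, so text containing `<` or `&` would need escaping first.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $input = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
my $dom   = Mojo::DOM->new($input);

# Visit every <p> element and replace its content with the edited text.
for my $p ($dom->find('p')->each) {
    (my $text = $p->text) =~ s/google/duckduckgo/;
    $p->content($text);
}

my $out = $dom->to_string;
print $out, "\n";
```

Mojo::DOM does not insert implied html or head elements when parsing a fragment like this, which is the behaviour the question is asking for.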

Re: How to avoid addition of tags by HTML::TreeBuilder
by Veltro (Hermit) on Apr 19, 2019 at 08:16 UTC

    It seems it adds the html and head tags for convenience; you can just get the body like this:

    # $html is the HTML::TreeBuilder object from the question
    my $body = $html->find('body');
    print $body->as_HTML;

    Edit: although now it seems that the closing </p> tags go missing, and I am not sure why. Maybe the module is not perfect; try this as a workaround:

    my @foo = $html->look_down( _tag => "body", );
    foreach (@foo) {
        my $output_html = $_->as_HTML(undef,undef,{});
        print $output_html;
        last;
    }

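A likely explanation for the missing closing tags is the third argument of `as_HTML`, which is a hash reference naming the elements whose end tags may be omitted. When it is left out, HTML::Element falls back to its built-in table of optional end tags, which includes `p`. Passing an empty hash, as the workaround above does, asks for every end tag to be written. The sketch below compares the two calls on a small made-up document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<body> <p>one</p><p>two</p> </body>');
$tree->eof;

my $body = $tree->find('body');

# No third argument: the default optional-end-tag table applies.
my $default_out = $body->as_HTML;

# Empty hash: no end tag is treated as optional.
my $explicit_out = $body->as_HTML(undef, undef, {});

print "default:  $default_out\n";
print "explicit: $explicit_out\n";

$tree->delete;    # free the tree
```

So `find('body')` and `look_down(_tag => "body")` select the same element here; the difference in output comes from the `{}` argument.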
Re: How to avoid addition of tags by HTML::TreeBuilder
by tobyink (Canon) on Apr 19, 2019 at 21:53 UTC

    I mean, from a specification point of view, it's technically not changing the document. Just as <p foo="1" bar="2"></p> and <p bar="2" foo="1"></p> are considered exactly equivalent by the HTML specs, the expected output and actual output are exactly equivalent.

    Given that the expected output and actual output are equivalent and will be treated identically by any browsers and other software that conforms to the HTML specs, I feel like we're missing some bit of information… what about the current output is unacceptable to you?

Re: How to avoid addition of tags by HTML::TreeBuilder
by Anonymous Monk on Apr 19, 2019 at 07:13 UTC
    $tree->ignore_unknown(0);
    $tree->implicit_tags(0);
    $tree->no_expand_entities(1);
    $tree->ignore_ignorable_whitespace(0);
    $tree->no_space_compacting(1);
    $tree->store_comments(1);
    $tree->store_pis(1);

      Not working: I tried setting the options you provided.

      Output after setting your options:

      <!DOCTYPE html> <html><head></head><body></body> <body> <p>test https://www.google.com</p> </body></html>

      Expected output (same as the input):

      <!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>

        The expected output is invalid HTML; in fact, so is the TreeBuilder version, since HTML5 requires a title element. Getting tools to produce incorrect output is usually outside their scope.

        If you always have the same template but differing bodies, you could just use the tree to print the body content into your template. Otherwise there might be a limited number of cases you could convert into a heuristic tree with matching template pieces to get what you want.

        That's as good as it gets with TreeBuilder.
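The template idea from this reply could be sketched roughly as follows. It is an illustration under assumptions, not a complete solution: the template string is hypothetical and hard-coded to match the question's input, and text nodes are re-encoded by hand because HTML::TreeBuilder stores them as plain strings.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

my $input = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';

my $tree = HTML::TreeBuilder->new;
$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(1);
$tree->parse($input);
$tree->eof;

# Serialize only the children of <body>: elements via as_HTML,
# plain text nodes via encode_entities.
my $body  = $tree->find('body');
my $inner = join '', map {
    ref $_ ? $_->as_HTML(undef, undef, {}) : encode_entities($_, '<>&')
} $body->content_list;

# Hypothetical template: only the body content comes from the tree.
my $out = "<!DOCTYPE html> <body>$inner</body>";
print $out, "\n";

$tree->delete;
```

Because the wrapper is supplied by the template rather than by `as_HTML`, the implied html and head elements never reach the output.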

Node Type: perlquestion [id://1232789]
Approved by Athanasius
Front-paged by Corion