http://qs321.pair.com?node_id=1232829


in reply to How to avoid addition of tags by HTML::TreeBuilder

One solution might be to go back to basics and use HTML::Parser, which HTML::TreeBuilder itself uses under the hood.

Although it is not well documented, you can subclass this module to get fine-grained control over the parsing. It will not attempt to make any changes to the document unless you specifically tell it to.
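To illustrate the "no changes unless you ask" point, here is a minimal sketch that copies a document through the parser untouched. It uses HTML::Parser's handler-based interface (api_version 3) rather than subclassing, and the variable names ($html, $copy, $p) are just placeholders for this illustration:

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
my $copy = '';

# default_h receives every event that has no more specific handler.
# The "text" argspec hands over the original source text of each event,
# so appending it rebuilds the input as it was written.
my $p = HTML::Parser->new(
    api_version => 3,
    default_h   => [ sub { $copy .= shift }, 'text' ],
);
$p->parse($html);
$p->eof;

print $copy eq $html ? "unchanged\n" : "changed\n";
```

Nothing like the missing html or head elements gets added here, because the parser only reports what it sees and leaves the output entirely up to the handlers.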

Here is an example which shows how you can combine the parsing with your own transformations:

use strict;
use warnings;

my $row_html = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';

my $parser = MyParser->new();
$parser->parse($row_html);
$parser->eof;
print $parser->out;

package MyParser;
use parent qw(HTML::Parser);

# Declarations arrive without the surrounding "<!" and ">", so put them back.
sub declaration {
    my ($self, $decl) = @_;
    $self->{'out'} .= "<!$decl>";
}

# Start tag: note when we enter a paragraph, then copy the tag as written.
sub start {
    my ($self, $tag, $attr, $attrseq, $text) = @_;
    if ($tag eq 'p') {
        $self->{'in_para'} = 1;
    }
    $self->{'out'} .= $text;
}

# Text: apply the transformation only inside a paragraph.
sub text {
    my ($self, $text) = @_;
    if ($self->{'in_para'}) {
        $text =~ s/google/duckduckgo/;
    }
    $self->{'out'} .= $text;
}

# End tag: note when we leave a paragraph, then copy the tag as written.
sub end {
    my ($self, $tag, $text) = @_;
    if ($tag eq 'p') {
        $self->{'in_para'} = 0;
    }
    $self->{'out'} .= $text;
}

# Accessor for the accumulated output.
sub out {
    my ($self) = @_;
    return $self->{'out'};
}

1;

Output:
<!DOCTYPE html> <body> <p>test https://www.duckduckgo.com</p> </body>
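
If you would rather not create a subclass, the same transformation can probably be written with the handler-based interface (api_version 3) instead. This is only a sketch of that alternative, not a replacement for the code above; $html, $out and $in_para are names chosen for the illustration:

```perl
use strict;
use warnings;
use HTML::Parser;

my $html    = '<!DOCTYPE html> <body> <p>test https://www.google.com</p> </body>';
my $out     = '';
my $in_para = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    # "tagname" is the lower-cased element name, "text" the source as written.
    start_h => [ sub {
        my ($tag, $text) = @_;
        $in_para = 1 if $tag eq 'p';
        $out .= $text;
    }, 'tagname, text' ],
    end_h => [ sub {
        my ($tag, $text) = @_;
        $in_para = 0 if $tag eq 'p';
        $out .= $text;
    }, 'tagname, text' ],
    # Rewrite text only while inside a paragraph.
    text_h => [ sub {
        my ($text) = @_;
        $text =~ s/google/duckduckgo/ if $in_para;
        $out .= $text;
    }, 'text' ],
    # Everything else (declarations, comments, ...) is copied through as is.
    default_h => [ sub { $out .= shift }, 'text' ],
);
$p->parse($html);
$p->eof;

print $out, "\n";
```

The state lives in closed-over lexicals here instead of in the parser object, which is fine for a one-off script; the subclass keeps it neatly inside the object if you need several parsers at once.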