Re: How to avoid addition of tags by HTML::TreeBuilder

by tangent (Vicar)
on Apr 20, 2019 at 01:35 UTC (#1232829)

in reply to How to avoid addition of tags by HTML::TreeBuilder

One solution might be to go back to basics and use HTML::Parser, which HTML::TreeBuilder itself uses under the hood.

Although it is not well documented, you can subclass this module to get fine-grained control over the parsing, and it will not change the document unless you specifically tell it to.

Here is an example which shows how you can combine the parsing with your own transformations:

    my $row_html = '<!DOCTYPE html> <body> <p>test</p> </body>';

    my $parser = MyParser->new();
    $parser->parse($row_html);
    $parser->eof;
    print $parser->out;

    package MyParser;
    use parent qw(HTML::Parser);

    sub declaration {
        my ($self, $decl) = @_;
        $self->{'out'} .= "<!$decl>";
    }

    sub start {
        my ($self, $tag, $attr, $attrseq, $text) = @_;
        if ($tag eq 'p') {
            $self->{'in_para'} = 1;
        }
        $self->{'out'} .= $text;
    }

    sub text {
        my ($self, $text) = @_;
        if ($self->{'in_para'}) {
            $text =~ s/google/duckduckgo/;
        }
        $self->{'out'} .= $text;
    }

    sub end {
        my ($self, $tag, $text) = @_;
        if ($tag eq 'p') {
            $self->{'in_para'} = 0;
        }
        $self->{'out'} .= $text;
    }

    sub out {
        my ($self) = @_;
        return $self->{'out'};
    }

    1;

Output:

    <!DOCTYPE html> <body> <p>test</p> </body>
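
If you would rather not subclass, much the same thing can be sketched with the handler interface that HTML::Parser offers when you pass api_version => 3. This is only a rough variant under my own assumptions: the input string and the names $html, $out and $in_para are made up for illustration, and I put "google" in both a paragraph and a div so the effect of the in_para flag is visible.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: no <html> or <head>, and "google" appears both
# inside and outside a paragraph.
my $html = '<!DOCTYPE html> <body> <p>try google</p> <div>google</div> </body>';

my $out     = '';
my $in_para = 0;

my $parser = HTML::Parser->new(
    api_version => 3,

    # Remember when we enter a <p>, and echo the tag exactly as written.
    start_h => [ sub {
        my ($tag, $text) = @_;
        $in_para = 1 if $tag eq 'p';
        $out .= $text;
    }, 'tagname, text' ],

    # Remember when we leave a <p>, and echo the tag exactly as written.
    end_h => [ sub {
        my ($tag, $text) = @_;
        $in_para = 0 if $tag eq 'p';
        $out .= $text;
    }, 'tagname, text' ],

    # Rewrite text only while inside a paragraph.
    text_h => [ sub {
        my ($text) = @_;
        $text =~ s/google/duckduckgo/g if $in_para;
        $out .= $text;
    }, 'text' ],

    # Anything without its own handler (declarations, comments, ...)
    # is copied through verbatim.
    default_h => [ sub {
        my ($text) = @_;
        $out .= $text if defined $text;
    }, 'text' ],
);

$parser->parse($html);
$parser->eof;

print "$out\n";
```

With that input, only the text inside the paragraph should be rewritten; the doctype, the div and the missing html/head tags are left exactly as they were. Whether you prefer this or the subclass is mostly a matter of taste: the subclass keeps its state in the object, while this version keeps it in closures.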