Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Thanks for the previous.

I have a question about either HTML::TreeBuilder::XPath or HTML::Element, and the interaction between them. I would like to manipulate the content of an element while leaving all its children in place. I have not found a way to do that, because replace_with() appears to escape the < and > signs automatically and unavoidably. The example below uses ~literal, but I've also tried creating a new element. Either way, the child elements within the selected element get escaped despite my best efforts. How would it be possible to do something like the following (using a different workflow if necessary) such that the tags for the child elements remain intact and unescaped?

#!/usr/bin/perl

use HTML::TreeBuilder::XPath;
use HTML::Element;
use warnings;
use strict;

my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);

$xhtml->parse_file(\*DATA)
        or die("Could not parse file handle for 'DATA' : $!\n");

for my $item ($xhtml->findnodes('//div/ul/li')) {
    my $li = $item->as_XML;

    $li =~ s/^\s+//;
    # ... omitting rest of the stuff which happens to $li ...
    
    my $new = HTML::Element->new('~literal', 'text' => $li);
    $item->replace_with($new);
}

print $xhtml->as_XML_indented;
$xhtml->delete;

exit(0);

__DATA__

<html>
 <head>
  <title>Foo Bar</title>
 </head>
 <body>

  <div><a href=" http://foo.example.com/ ">Foo Bar</a>
   <ul>
    <li> foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
</li></ul></div>

  <div><a href=" http://bar.example.com/ ">Bar Foo</a>
   <ul>
    <li> foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
     <ul>
      <li>alpha</li>
      <li>b<em>et</em>a</li>
      <li>gamma</li>
     </ul>
    </li></ul></div>

 </body>
</html>

The output I get is as follows:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul>&lt;li&gt; foo foo foo
 foo &lt;em&gt;bar&lt;/em&gt; foo
foo foo foo foo
&lt;/li&gt;

      </ul>
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul>&lt;li&gt; foo foo foo
 foo &lt;em&gt;bar&lt;/em&gt; foo
foo foo foo foo
     &lt;ul&gt;&lt;li&gt;alpha&lt;/li&gt;&lt;li&gt;b&lt;em&gt;et&lt;/em&gt;a&lt;/li&gt;&lt;li&gt;gamma&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;

      </ul>
    </div>
  </body>
</html>

The output I would like to get instead would look like this:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul><li>foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
</li>

      </ul>HTML::TreeBuilder::XPath
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul><li>foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
     <ul><li>alpha</li><li>b<em>et</em>a</li><li>gamma</li></ul></li>

      </ul>
    </div>
  </body>
</html>

I'm not sure if HTML::TreeBuilder::XPath can be made to work like that. If it can, what has to change?

Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by haukex (Archbishop) on Nov 15, 2021 at 10:39 UTC
`# ... omitting rest of the stuff which happens to $li ...` This is actually the important bit. A DOM tree is a tree of objects, so in your call to `$item->replace_with($new);`, `$new` needs to be a tree of objects representing the HTML that you want to insert, not just a single text node. One would normally do this by directly manipulating the objects in the tree, or building a new subtree to replace the old one. But you haven't told us what manipulations you wish to do, so it's difficult to make a more specific recommendation. Your expected output is identical to your input except for whitespace changes (and the insertion of "`HTML::TreeBuilder::XPath`", which I am guessing might be a mistake), but because whitespace is insignificant in many places in HTML/XML, I can't tell what manipulations you might want to make here.
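
As a rough illustration of "building a new subtree to replace the old one", the sketch below swaps an `li` for a freshly built element tree instead of a `~literal` text node. It is only a sketch under assumptions: the input HTML and the replacement content are made up, and it uses plain HTML::TreeBuilder with `look_down` rather than the XPath subclass from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Made-up input, just enough to have an <li> to replace.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul><li>old <em>item</em></li></ul>');
$tree->eof;

for my $old ( $tree->look_down( _tag => 'li' ) ) {
    # Build the replacement as real element objects (list-of-lists form),
    # so the <em> child is a node, not text that would later be escaped.
    my $new = HTML::Element->new_from_lol(
        [ 'li', 'foo ', [ 'em', 'bar' ], ' foo' ]
    );

    # replace_with returns the element that was replaced; free it.
    $old->replace_with($new)->delete;
}

# Serialize only the list, with all end tags written out.
my $out = $tree->look_down( _tag => 'ul' )->as_HTML( '<>&', undef, {} );
print "$out\n";

$tree->delete;
```

Because the replacement consists of element objects, serializing the tree writes the `<em>` tags as markup rather than as `&lt;em&gt;`.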
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by marto (Cardinal) on Nov 15, 2021 at 10:44 UTC
Apart from adding the text 'HTML::TreeBuilder::XPath' and in places some whitespace (no impact), what are you looking to change? Regardless, Mojo::DOM is my go-to for DOM manipulation. See your previous question Data structure question from XML::XPath::XMLParser.
Re^2: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by mldvx4 (Friar) on Nov 15, 2021 at 12:39 UTC
Thanks haukex and marto. In this case, I just want to trim the unnecessary whitespace from the start and end of a few elements and attributes. The attributes are easy to work with, so that is solved. However, I am not sure how to apply a substitution, `s///`, to an element containing more than just text.
Re^3: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by haukex (Archbishop) on Nov 15, 2021 at 12:50 UTC
> However, I am not sure how to apply a substitution, s///, to an element containing more than just text.

The documentation of `HTML::Element`'s `content_refs_list` gives an example of how to modify text nodes contained in an element, and the documentation of HTML::Element::traverse shows how to use a recursive function to walk the tree. Putting those together:

sub html_trim {
    my $elem = shift;
    for my $itemref ($elem->content_refs_list) {
        if ( ref $$itemref )
            { html_trim($$itemref) }  # remove this for non-recursive
        else { $$itemref =~ s/^\s+|\s+$//g }
    }
}
for my $elem ($xhtml->findnodes('//div/ul/li'))
    { html_trim($elem) }
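
One thing to watch with the approach above: it trims every text node, so a space that sits directly next to an inline child such as `<em>` is removed as well. If the goal is only to trim the start and end of the element as a whole, a variant along the following lines might be closer. This is a minimal sketch, not a tested drop-in: `trim_edges` is a hypothetical helper name, the sample HTML is made up, and it uses plain HTML::TreeBuilder with `look_down` instead of the XPath subclass.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: trim whitespace only at the outer edges of an
# element, leaving whitespace next to inline children untouched.
sub trim_edges {
    my ($elem) = @_;
    my @refs = $elem->content_refs_list;
    return unless @refs;

    # Only plain-text children are scalars; element children are refs.
    ${ $refs[0] }  =~ s/^\s+// unless ref ${ $refs[0] };
    ${ $refs[-1] } =~ s/\s+$// unless ref ${ $refs[-1] };
    return;
}

my $tree = HTML::TreeBuilder->new;
$tree->no_space_compacting(1);    # keep the original whitespace when parsing
$tree->parse('<ul><li>  foo <em>bar</em> baz  </li></ul>');
$tree->eof;

trim_edges($_) for $tree->look_down( _tag => 'li' );

# Serialize only the list, with all end tags written out.
my $out = $tree->look_down( _tag => 'ul' )->as_HTML( '<>&', undef, {} );
print "$out\n";

$tree->delete;
```

The spaces on either side of `<em>bar</em>` are kept because only the first and last children of the `li` are touched, and only when they are text.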
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by tangent (Parson) on Nov 15, 2021 at 14:19 UTC
Many of the tools used to parse HTML use HTML::Parser under the hood, and it is worthwhile knowing how it works. This script gathers up all the content of each list item, including other elements, into a variable. When it meets the closing list item tag, you can do what you need to the content before printing it out.

use HTML::Parser;

my $inside_li = 0;
my $list_item = '';

sub start {
    my ($tag, $text) = @_;
    if ($inside_li) {
        $list_item .= $text;
        return;
    }
    if ($tag eq 'li') {
        $inside_li = 1;
    }
    print $text;
};
sub text {
    my ($text) = @_;
    if ($inside_li) {
        $list_item .= $text;
        return;
    }
    print $text;
};
sub end {
    my ($tag, $text) = @_;
    if ($tag eq 'li') {
        $inside_li = 0;
        # do things to <li> content
        $list_item =~ s/^\s+//;
        print $list_item;
        $list_item = '';
    }
    if ($inside_li) {
        $list_item .= $text;
        return;
    }
    print $text;
};

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h   => [\&start, "tagname, text"],
    text_h    => [\&text,  "text"],
    end_h     => [\&end,   "tagname, text"],
    default_h => [\&text,  "text"],
);
$parser->parse_file(\*DATA);
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by Anonymous Monk on Nov 15, 2021 at 12:45 UTC
~literal is HTML::Element stuff. as_XML_indented is HTML::TreeBuilder::XPath stuff. Try https://metacpan.org/pod/HTML::Element#as_XML.
