Thanks for the previous help.

I have a question about either HTML::TreeBuilder::XPath or HTML::Element, and the interaction between them. I would like to manipulate the content of an element while leaving all of its children in place. I have not found a way to do that, because it appears that replace_with() automatically and unavoidably escapes the < and > signs. The example below uses a ~literal pseudo-element, but I have also tried creating a new element. Either way, the child elements within the selected element end up escaped. How could I do something like the following, using a different workflow if necessary, so that the tags for the child elements remain intact and unescaped?

#!/usr/bin/perl

use HTML::TreeBuilder::XPath;
use HTML::Element;
use warnings;
use strict;

my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);

$xhtml->parse_file(\*DATA)
        or die("Could not parse file handle for 'DATA' : $!\n");

for my $item ($xhtml->findnodes('//div/ul/li')) {
    my $li = $item->as_XML;

    $li =~ s/^\s+//;
    # ... omitting rest of the stuff which happens to $li ...
    
    my $new = HTML::Element->new('~literal', 'text' => $li);
    $item->replace_with($new);
}

print $xhtml->as_XML_indented;
$xhtml->delete;

exit(0);

__DATA__

<html>
 <head>
  <title>Foo Bar</title>
 </head>
 <body>

  <div><a href=" http://foo.example.com/ ">Foo Bar</a>
   <ul>
    <li> foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
</li></ul></div>

  <div><a href=" http://bar.example.com/ ">Bar Foo</a>
   <ul>
    <li> foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
     <ul>
      <li>alpha</li>
      <li>b<em>et</em>a</li>
      <li>gamma</li>
     </ul>
    </li></ul></div>

 </body>
</html>

The output I get is as follows:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul>&lt;li&gt; foo foo foo
 foo &lt;em&gt;bar&lt;/em&gt; foo
foo foo foo foo
&lt;/li&gt;

      </ul>
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul>&lt;li&gt; foo foo foo
 foo &lt;em&gt;bar&lt;/em&gt; foo
foo foo foo foo
     &lt;ul&gt;&lt;li&gt;alpha&lt;/li&gt;&lt;li&gt;b&lt;em&gt;et&lt;/em&gt;a&lt;/li&gt;&lt;li&gt;gamma&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;

      </ul>
    </div>
  </body>
</html>

The output I would like to get instead would look like this:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul><li>foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
</li>

      </ul>
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul><li>foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
     <ul><li>alpha</li><li>b<em>et</em>a</li><li>gamma</li></ul></li>

      </ul>
    </div>
  </body>
</html>

I'm not sure if HTML::TreeBuilder::XPath can be made to work like that. If it can, what has to change?
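For comparison, here is a minimal sketch of a different workflow that never serializes the element at all. It assumes the changes only need to touch the text directly inside each li, and it uses content_refs_list from HTML::Element to edit those text children in place. The shortened input, the variable names, and the single whitespace-trimming substitution are illustrative assumptions, not the full processing from the script above.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened stand-in for the DATA section above (an assumption for brevity).
my $html = <<'END_HTML';
<html><head><title>Foo Bar</title></head>
<body><div><ul><li> foo foo foo
 foo <em>bar</em> foo
foo foo foo foo
</li></ul></div></body></html>
END_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->implicit_tags(1);
$tree->no_space_compacting(1);
$tree->parse($html);
$tree->eof;

for my $li ($tree->findnodes('//div/ul/li')) {
    # content_refs_list returns one reference per child. For a text
    # child, $$ref is a plain string; for a child element, $$ref is
    # an HTML::Element object.
    my ($first) = $li->content_refs_list;
    next unless defined $first;
    next if ref $$first;        # first child is an element, not text

    # Edit the text in place; child elements such as <em> are untouched.
    $$first =~ s/^\s+//;
}

my $xml = $tree->as_XML;
$tree->delete;

print $xml;
```

Because the child elements are never turned into a string and back, there is nothing for the serializer to escape. Whether this fits depends on how much of the omitted processing needs to see the markup rather than just the text.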

In reply to Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by mldvx4
