PerlMonks  

Seemingly Valid HTML which crashes HTML::TreeBuilder::XPath

by mldvx4 (Friar)
on Nov 10, 2023 at 10:54 UTC

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings.

Perhaps this is a bug in HTML Tidy. One of my scripts was fed some HTML today which had passed through HTML Tidy but nonetheless crashed HTML::TreeBuilder::XPath. Below is a stripped-down sample. The following Perl script produces the error message "Can't locate object method "as_XML_indented" via package " trololo " (perhaps you forgot to load " trololo "?) at ./script.pl line 12." and does not proceed through the rest of the script.

    #!/usr/bin/perl

    use HTML::TreeBuilder::XPath;

    use strict;
    use warnings;

    my $tree = HTML::TreeBuilder::XPath->new_from_file(\*DATA);

    for my $body ($tree->findnodes('//body')) {
        for my $element ($body->detach_content) {
            print $element->as_XML_indented;
        }
    }

    print "\n";
    print "OK\n";

    exit(0);

    __DATA__
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="generator" content=
    "HTML Tidy for HTML5 for Linux version 5.6.0" />
    <title></title>
    </head>
    <body>
    <p>foo</p>
    <p>bar</p>
    trololo
    </body>
    </html>

Since the HTML seems to be valid, having just passed through HTML Tidy, I would have expected as_XML_indented to have just plowed through it, either rendering it as XML or at least not stopping. A work-around has been to wrap it in an eval,

    #!/usr/bin/perl

    use HTML::TreeBuilder::XPath;

    use strict;
    use warnings;

    my $tree = HTML::TreeBuilder::XPath->new_from_file(\*DATA);

    for my $body ($tree->findnodes('//body')) {
        for my $element ($body->detach_content) {
            # Trap the failure on plain-text children instead of dying
            eval {
                print $element->as_XML_indented;
            };
            if ($@) {
                print STDERR qq(\n),$@,qq(\n);
                print STDERR qq(Failed HTML.\n);
            }
        }
    }

    print "\n";
    print "OK\n";

    exit(0);

    __DATA__
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="generator" content=
    "HTML Tidy for HTML5 for Linux version 5.6.0" />
    <title></title>
    </head>
    <body>
    <p>foo</p>
    <p>bar</p>
    trololo
    </body>
    </html>

I'm not sure how to interpret the HTML5 spec. However, the HTML4 spec seems to indicate that the loose text ought to have been wrapped in a block element of some kind.

So if I may tap your collective wisdom,

  1. Is this a bug with HTML::TreeBuilder::XPath or with HTML Tidy?
  2. For error trapping:
    1. Is there a better way to do eval?
    2. Or should try / catch be used?

General comments and advice also welcome.
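On the error-trapping question, a commonly used form of eval checks the return value of the block rather than testing $@ afterwards, since $@ can be cleared or overwritten between the failure and the test. The following is only a minimal sketch: try_render is a hypothetical helper name, not something from the original script, and the plain string 'trololo' stands in for the text child that caused the crash. Perl 5.34 and later also offer try / catch syntax through "use feature 'try'", and Try::Tiny on CPAN provides similar behaviour for older versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): try to
# serialize one node. Returns (output, undef) on success and
# (undef, error message) on failure.
sub try_render {
    my ($node) = @_;
    my $out;
    my $ok = eval {
        $out = $node->as_XML_indented;    # dies when $node is a plain string
        1;                                # reached only if nothing died
    };
    return ($out, undef) if $ok;

    # Copy $@ immediately, before anything else can overwrite it.
    my $err = $@ || 'unknown error';
    return (undef, $err);
}

# 'trololo' is a plain string, so the method call fails and is trapped.
my ($out, $err) = try_render('trololo');
print STDERR "Failed HTML: $err" unless defined $out;
```

Trapping the error keeps the script running, but it still discards the text node, so it may be more of a diagnostic aid than a fix.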

Re: Seemingly Valid HTML which crashes HTML::TreeBuilder::XPath
by choroba (Cardinal) on Nov 10, 2023 at 11:40 UTC
    It's not a bug, but I'd say it was a poor design decision in HTML::Element to represent text nodes as plain strings instead of objects (as, for example, XML::LibXML does via XML::LibXML::Text). It can be worked around by calling
    $body->objectify_text;
    before messing with its contents. See objectify_text.
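One way the objectify_text suggestion might be applied is sketched below. It assumes the behaviour described in the HTML::Element documentation: after objectify_text, each text segment becomes a pseudo-element whose tag is "~text" and whose content sits in its "text" attribute, and deobjectify_text reverses this for a subtree. The helper name body_children_as_xml and the inline sample HTML are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: serialize the children of <body>, using
# objectify_text so that every child can take a method call.
sub body_children_as_xml {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my $out  = '';

    for my $body ($tree->findnodes('//body')) {
        $body->objectify_text;    # text segments become "~text" pseudo-elements
        for my $node ($body->detach_content) {
            if ($node->tag eq '~text') {
                # Loose text: emitted as-is here, with no XML escaping.
                $out .= $node->attr('text');
            }
            else {
                # Turn nested text back into strings before serializing.
                $node->deobjectify_text;
                $out .= $node->as_XML_indented;
            }
        }
    }

    $tree->delete;
    return $out;
}

print body_children_as_xml(
    '<html><body><p>foo</p><p>bar</p> trololo </body></html>'
), "\n";
```

The deobjectify_text call is there because objectify_text works on the whole subtree, so the text inside each p element is converted as well.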

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      The objectify_text call just seems to invert the problem. Though I can be rather obtuse and may not see the right way to use it.

      I might be able to fit XML::LibXML into the full script and replace HTML::TreeBuilder::XPath. Here is my sketch,

          #!/usr/bin/perl

          use XML::LibXML;

          use strict;
          use warnings;

          my $tree = XML::LibXML->load_xml(IO => \*DATA);

          my $dtd = XML::LibXML::XPathContext->new( $tree->documentElement() );
          $dtd->registerNs( 'u' => 'http://www.w3.org/1999/xhtml' );

          for my $body ($dtd->findnodes('//u:body')) {
              # print $body->toString;
              for my $n ($body->childNodes()) {
                  print $n->toString;
              }
          }

          print "\n";
          print "OK\n";

          exit(0);

          __DATA__
          <!DOCTYPE html>
          <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <meta name="generator" content=
          "HTML Tidy for HTML5 for Linux version 5.6.0" />
          <title></title>
          </head>
          <body>
          <p>foo</p>
          <p>bar</p>
          trololo
          </body>
          </html>

        Your code as provided runs fine for me:

            $ perl 11155543.pl
            <p>foo</p>
            <p>bar</p>
            trololo
            OK
            $

        If that isn't what you want/expect then you will need to show what you do expect also.


        🦛

Re: Seemingly Valid HTML which crashes HTML::TreeBuilder::XPath
by Corion (Patriarch) on Nov 10, 2023 at 11:15 UTC

    I think it's not a bug in either HTML::TreeBuilder, HTML::TreeBuilder::XPath or HTML::Tidy, but more in your expectations of what ->detach_content returns, especially for nodes that only contain text, like the string trololo.

    I can't find documentation in HTML::TreeBuilder as to how it represents text nodes, but my guess is that if you receive a non-reference like the string trololo, then you should not call any methods on that.
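The check described above could look something like the following sketch, which tests each child with ref before calling a method on it. The helper name serialize_body_children and the inline sample HTML are illustrative assumptions. Escaping the plain text with encode_entities from HTML::Entities is also a choice made here, not something the thread prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: serialize the children of <body>, treating
# element objects and plain-text strings differently.
sub serialize_body_children {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my $out  = '';

    for my $body ($tree->findnodes('//body')) {
        for my $node ($body->detach_content) {
            if (ref $node) {
                # An element object: safe to call methods on it.
                $out .= $node->as_XML_indented;
            }
            else {
                # A plain string (text node): escape only the XML-unsafe characters.
                $out .= encode_entities($node, '<>&');
            }
        }
    }

    $tree->delete;
    return $out;
}

print serialize_body_children(
    '<html><body><p>foo</p><p>bar</p> trololo </body></html>'
), "\n";
```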

      Thanks.

      Would there be a better way of handling the character data (CDATA) found in odd places in the HTML?

      What I am trying to do is lift the contents out of an element. Specifically, HTML Tidy adds a body element around any elements and CDATA in a document so,

      <body> <p>foo</p> <p>bar</p> </body>

      would then after processing become

      <p>foo</p> <p>bar</p>

      This is so the block which had been body can be inserted into another document without that new document ending up with multiple body elements. One alternative would have been to change the body to a div, but then multiple passes through the workflow would produce multiple unnecessary nested div elements. Therefore it seems the only option is to remove the element completely and leave just its contents.
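If the aim is to move the contents of one body into another document, one possible approach is to hand the list returned by detach_content straight to push_content on the target element. push_content accepts both element objects and plain strings, so loose text such as trololo needs no special handling. This is a sketch under assumptions: graft_body is a hypothetical helper name, both documents are given as strings, and the contents are appended to the end of the target's body.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: append the children of the fragment's <body>
# to the target document's <body> and return the combined HTML.
sub graft_body {
    my ($fragment_html, $target_html) = @_;

    my $fragment = HTML::TreeBuilder->new_from_content($fragment_html);
    my $target   = HTML::TreeBuilder->new_from_content($target_html);

    my $source_body = $fragment->find_by_tag_name('body');
    my $target_body = $target->find_by_tag_name('body');

    # detach_content returns elements and text strings alike;
    # push_content accepts both kinds.
    $target_body->push_content($source_body->detach_content);

    # The empty hashref asks as_HTML not to omit optional end tags such as </p>.
    my $result = $target->as_HTML(undef, undef, {});

    $_->delete for $fragment, $target;
    return $result;
}

print graft_body(
    '<html><body><p>foo</p><p>bar</p> trololo </body></html>',
    '<html><head><title>Target</title></head><body><h1>Existing</h1></body></html>',
), "\n";
```

Because only the children are moved, the target keeps a single body element however many times the workflow is repeated.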