PerlMonks  

Re^2: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

by bobn (Chaplain)
on Aug 09, 2020 at 00:11 UTC ( [id://11120507]=note )


in reply to Re: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
in thread XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?
Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element. Printing out specifically the p element's text should not include it. Sure as hell, printing the text for <html> should not print out all the text of all the descendant nodes.
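A minimal sketch of the two readings being debated, assuming XML::LibXML is installed. The one-line document is the example quoted above; the variable names are illustrative only. textContent gives the text of all descendants, while the XPath ./text() gives only the text nodes that are direct children.

```perl
use warnings;
use strict;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my ($p) = $doc->findnodes('/p');

# textContent concatenates the text of every descendant of <p>.
my $all = $p->textContent;

# ./text() selects only the text nodes that are direct children of <p>.
my $own = join '', map { $_->data } $p->findnodes('./text()');

print "all: $all\n";   # all: Hello, World!
print "own: $own\n";   # own: Hello, !
```

Which of the two is "the paragraph's text" depends on the question being asked; the DOM exposes both.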

I've tried 3 other pieces of code - HTML::Parser in Perl and, in Python, lxml.etree (bindings to libxml2, like XML::LibXML) and xml.parsers.expat, which is comparable to HTML::Parser. They all agree that text belongs to the innermost containing element, and no other. (Well, except lxml.etree, which thinks that elements mixed in with the text of a parent element somehow suck up the text after them in something known as "tail text" - I never heard of it before and it's really hard to find anything about it on the internet that *isn't* associated with lxml and Python. I think they just made that crap up.) So that's where I am: 3 other pieces of software disagree with this one, and I can't see that I've done anything incorrectly.

--Bob Niederman, http://bob-n.com

All code given here is UNTESTED unless otherwise stated.

Re^3: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
by haukex (Archbishop) on Aug 09, 2020 at 08:27 UTC
    Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element.

    Sure, that's up to you. I can't speak to how other modules implemented it, but I'd refer you to the libxml2 documentation, and the Document Object Model Specification for all the "official" details.

    Anyway, I described two ways you can get the text nodes of the current node. Using the XPath expression I showed is probably easiest. I can't really say more since you haven't described what it is you're trying to do with the document.

    use warnings;
    use strict;
    use XML::LibXML;

    my $doc = XML::LibXML->load_xml( string => <<'EOT' );
    <html>
      <head>
        <title>Title_Text</title>
      </head>
      <body>
        <p>paragraph_text</p>
        <div>
          <div>
            innnermost_text
          </div>
        </div>
      </body>
    </html>
    EOT

    for my $node ($doc->findnodes('//*')) {
        print "<<<", $node->nodeName, ">>>\n";
        my @texts = map { $_->data } $node->findnodes('./text()');
        use Data::Dump; dd @texts;  # Debug
    }
    __END__
    <<<html>>>
    (" \n ", " \n ", " ")
    <<<head>>>
    (" ", " ")
    <<<title>>>
    "Title_Text"
    <<<body>>>
    (" \n ", "\n ", " \n ")
    <<<p>>>
    "paragraph_text"
    <<<div>>>
    (" \n ", "\n ")
    <<<div>>>
    " \n innnermost_text\n "

    You could also use XML::LibXML::SAX to get an event-based parser.
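A rough sketch of the event-based route, assuming XML::LibXML::SAX and XML::SAX::Base are installed. The handler class TextByElement and its seen/stack fields are hypothetical names made up for this illustration. Character data is credited to whichever element is open when it arrives, so each element ends up with only its own direct text.

```perl
use warnings;
use strict;

# Hypothetical handler: records each element with the text that
# appears directly inside it (not inside its children).
package TextByElement {
    use parent 'XML::SAX::Base';

    sub start_document {
        my ($self) = @_;
        $self->{stack} = [];   # currently open elements
        $self->{seen}  = [];   # all elements, in document order
    }

    sub start_element {
        my ($self, $el) = @_;
        my $rec = { name => $el->{Name}, text => '' };
        push @{ $self->{seen} },  $rec;
        push @{ $self->{stack} }, $rec;
    }

    sub characters {
        my ($self, $chars) = @_;
        my $current = $self->{stack}[-1] or return;
        # Text may arrive in several chunks, so append.
        $current->{text} .= $chars->{Data};
    }

    sub end_element {
        my ($self) = @_;
        pop @{ $self->{stack} };
    }
}

package main;
use XML::LibXML::SAX;

my $handler = TextByElement->new;
my $parser  = XML::LibXML::SAX->new( Handler => $handler );
$parser->parse_string('<p>Hello, <b>World</b>!</p>');

printf "%s: '%s'\n", $_->{name}, $_->{text} for @{ $handler->{seen} };
# p: 'Hello, !'
# b: 'World'
```

This gives the same per-element view as the ./text() XPath, at the cost of managing the element stack by hand.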

      $node->string_value();
      Note this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your $nodes are XML::LibXML::Elements); you should use textContent instead.
      Yes, that's true. I can't reconstruct exactly what happened when I wrote this code; I think I got into the documentation for an apparently unrelated module, where string_value was documented. I'm tempted to erase the whole thing.

      However, this code of yours: my @texts = map { $_->data } $node->findnodes('./text()');

      actually shows *exactly* what I'm talking about: the "innermost_text" is ONLY appearing in the output for its innermost containing element, which is the last <div> element/node/whatever that you found with $doc->findnodes('//*'). It's not in every element that it is inside of, like <body> or <html>. That's what I was looking for! Thank you!!!

      What I was working on: I've been doing some Python XHTML parsing, and over there, it was talking about "tail text". It's really weird - it says that text that follows an element's closing tag belongs to *that* element as "tail text" - NOT to the element that it is inside of. If you care, go to https://lxml.de/tutorial.html and search on "document-style". Anyhow, I was testing in Perl to see if it had anything like that, which I don't see.
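For comparison, a small sketch of how the same situation looks in the DOM that XML::LibXML exposes, assuming the module is installed; the variable names are illustrative only. The text that lxml reports as the tail of <b> is, in the DOM, an ordinary text node: the next sibling of <b> and a child of <p>.

```perl
use warnings;
use strict;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my ($bold) = $doc->findnodes('/p/b');

# The node right after <b> is a text node holding "!".
my $after = $bold->nextSibling;
print "is text node\n" if $after->isa('XML::LibXML::Text');
print "data:   ", $after->data, "\n";                  # data:   !

# Its parent is <p>, not <b>: the DOM has no "tail" concept.
print "parent: ", $after->parentNode->nodeName, "\n";  # parent: p
```

So both models hold the same information; they differ only in which element they hang the trailing text on.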

      As far as using SAX parsers, I've used something similar: HTML::Parser and XML::Parser work the same way, I think, in that you create callbacks for events that happen during parsing. Having discovered XPath, the event-driven parser now seems to me like a crude, primitive approach. I'm sure there are still places it applies.

      --Bob Niederman,

