PerlMonks  

Re: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

by haukex (Archbishop)
on Aug 08, 2020 at 07:25 UTC ( [id://11120496] )


in reply to XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

I get a nodeset, start walking through it and getting text out, but when it comes out, for each node I get the text contained in the node's element AND the text of all of its descendants (contained elements).

This makes sense to me. Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?

Otherwise, what question are you asking? If it's "what are the children of this <p> element that are text nodes", you'll have to code that explicitly, and you may get any number of text nodes (in the aforementioned example, it's two, but consider that any whitespace like newlines and indentation are text nodes too, e.g. the <body> in your example has three text children, all whitespace). There are two ways to do that: iterate over the childNodes of a node, checking their nodeType for XML_TEXT_NODE and XML_CDATA_SECTION_NODE, or use an XPath expression like '//p/child::text()'.

OTOH, event-based parsers will return nodes as they encounter them. Perhaps you could explain what you're trying to do and what your expected output is?
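To make the difference concrete, here is a minimal sketch that applies textContent and both of the "direct text children only" approaches to the <p>Hello, <b>World</b>!</p> example. The variable names are only for illustration, and it assumes the node-type constants are exported by a plain "use XML::LibXML;".

```perl
use strict;
use warnings;
use XML::LibXML;    # assumed to export XML_TEXT_NODE etc. by default

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my ($p) = $doc->findnodes('/p');

# Whole subtree: the element's own text plus the text of all descendants.
my $all = $p->textContent;

# Way 1: walk the direct children and keep only text and CDATA nodes.
my @own_dom = map  { $_->data }
              grep { $_->nodeType == XML_TEXT_NODE
                  || $_->nodeType == XML_CDATA_SECTION_NODE }
              $p->childNodes;

# Way 2: let XPath select the direct text children.
my @own_xpath = map { $_->data } $p->findnodes('./text()');

print "all:   $all\n";                          # all:   Hello, World!
print "dom:   ", join( '|', @own_dom ),   "\n"; # dom:   Hello, |!
print "xpath: ", join( '|', @own_xpath ), "\n"; # xpath: Hello, |!
```

Both "own text" variants return a list, not a single string, because the <b> element splits the paragraph's text into two separate text nodes.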

$node->string_value();

Note that this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your $nodes are XML::LibXML::Elements), so you should use textContent instead.

Note that you don't need XML::LibXML::XPathContext unless the document you're parsing contains namespaces; the regular XML::LibXML::Node has a findnodes too.
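As a rough illustration of when XML::LibXML::XPathContext becomes necessary, the sketch below runs the same query against a document without namespaces and against one with a default namespace. The prefix "x" and the sample documents are made up for this example; only the XHTML namespace URI is the real one.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# No namespaces: findnodes works directly on the document (or any node).
my $plain = XML::LibXML->load_xml(
    string => '<html><body><p>plain</p></body></html>' );
my @plain_p = $plain->findnodes('//p');

# Default namespace, as in XHTML: an unprefixed name in XPath matches
# only elements in no namespace, so '//p' finds nothing here.
my $xhtml = XML::LibXML->load_xml( string =>
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>namespaced</p></body></html>' );
my @missed = $xhtml->findnodes('//p');

# Register a prefix (the name "x" is arbitrary) and use it in the expression.
my $xpc = XML::LibXML::XPathContext->new($xhtml);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );
my @found = $xpc->findnodes('//x:p');

printf "plain: %d, without prefix: %d, with prefix: %d\n",
    scalar @plain_p, scalar @missed, scalar @found;
```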

Minor edits.

Re^2: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
by bobn (Chaplain) on Aug 09, 2020 at 00:11 UTC

    Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?
    Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element. Printing out specifically the p element's text should not include it. Sure as hell, printing the text for <html> should not print out all the text of all the descendant nodes.

    I've tried 3 other pieces of code: HTML::Parser and, in Python, lxml.etree (bindings to libxml2, as is XML::LibXML) and xml.parsers.expat, which is comparable to HTML::Parser. They all agree that text belongs to the innermost containing element, and no other. (Well, except lxml.etree, which thinks that elements mixed in with the text of a parent element somehow suck up the text after them in something known as "tail text" - I never heard of it before and it's really hard to find anything about it on the internet that *isn't* associated with lxml and Python. I think they just made that crap up.) So that's where I am: 3 other pieces of software disagree with this one, and I can't see that I've done anything incorrectly.

    --Bob Niederman, http://bob-n.com

    All code given here is UNTESTED unless otherwise stated.

      Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element.

      Sure, that's up to you. I can't speak to how other modules implemented it, but I'd refer you to the libxml2 documentation, and the Document Object Model Specification for all the "official" details.

      Anyway, I described two ways you can get the text nodes of the current node. Using the XPath expression I showed is probably easiest. I can't really say more since you haven't described what it is you're trying to do with the document.

      use warnings;
      use strict;
      use XML::LibXML;

      my $doc = XML::LibXML->load_xml( string => <<'EOT' );
      <html>
        <head>
          <title>Title_Text</title>
        </head>
        <body>
          <p>paragraph_text</p>
          <div>
            <div>
              innnermost_text
            </div>
          </div>
        </body>
      </html>
      EOT

      for my $node ($doc->findnodes('//*')) {
          print "<<<", $node->nodeName, ">>>\n";
          my @texts = map { $_->data } $node->findnodes('./text()');
          use Data::Dump; dd @texts;  # Debug
      }

      __END__

      <<<html>>>
      (" \n ", " \n ", " ")
      <<<head>>>
      (" ", " ")
      <<<title>>>
      "Title_Text"
      <<<body>>>
      (" \n ", "\n ", " \n ")
      <<<p>>>
      "paragraph_text"
      <<<div>>>
      (" \n ", "\n ")
      <<<div>>>
      " \n innnermost_text\n "

      You could also use XML::LibXML::SAX to get an event-based parser.
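For what the event-based route might look like, here is a minimal sketch of a SAX handler that credits each chunk of character data to the innermost open element. The TextCollector class and its "stack" and "seen" fields are invented for this example; it assumes XML::SAX::Base (from the XML::SAX distribution) is installed alongside XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML::SAX;

{
    package TextCollector;    # hypothetical handler class for this sketch
    use parent 'XML::SAX::Base';

    # A new element opens: remember it and make it the current one.
    sub start_element {
        my ( $self, $el ) = @_;
        my $rec = { name => $el->{Name}, text => '' };
        push @{ $self->{stack} }, $rec;
        push @{ $self->{seen} },  $rec;
    }

    # Character data goes to whichever element is innermost right now.
    # The parser may deliver one text run in several calls, so append.
    sub characters {
        my ( $self, $chars ) = @_;
        my $top = ( $self->{stack} || [] )->[-1] or return;
        $top->{text} .= $chars->{Data};
    }

    # The element closes: its parent becomes the current one again.
    sub end_element {
        my ($self) = @_;
        pop @{ $self->{stack} };
    }
}

my $handler = TextCollector->new;
my $parser  = XML::LibXML::SAX->new( Handler => $handler );
$parser->parse_string('<p>Hello, <b>World</b>!</p>');

print "$_->{name}: '$_->{text}'\n" for @{ $handler->{seen} };
```

With this input the handler should record "Hello, !" for <p> and "World" for <b>, i.e. the "text belongs to the innermost element" view, in contrast to what textContent returns for <p>.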

        $node->string_value();
        Note this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your $nodes are XML::LibXML::Elements), you should use textContent instead.
        Yes, that's true. I can't reconstruct exactly what happened when I wrote this code; I got into the documentation for an apparently unrelated module, where string_value was documented. I'm tempted to erase the whole thing.

        However, this code of yours: my @texts = map { $_->data } $node->findnodes('./text()');

        actually shows *exactly* what I'm talking about: the "innermost_text" is ONLY appearing in the output for its innermost containing element, which is the last <div> element/node/whatever that you found with $doc->findnodes('//*'). It's not in every element that it is inside of, like <body> or <html>. That's what I was looking for! Thank you!!!

        What I was working on: I've been doing some Python XHTML parsing, and over there, it was talking about "tail text". It's really weird - it says that text that follows an element's closing tag belongs to *that* element as "tail text" - NOT to the element that it is inside of. If you care, go to https://lxml.de/tutorial.html and search on "document-style". Anyhow, I was testing in Perl to see if it had anything like that, which I don't see.
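For comparison, XML::LibXML's DOM has nothing named "tail", but the same text is reachable as the sibling that follows the element. The sketch below is only an approximation of the lxml idea, not a port of it; tail_text is a made-up helper name.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: collect the text and CDATA nodes that directly
# follow an element, stopping at the first sibling that is anything else.
sub tail_text {
    my ($elem) = @_;
    my $tail = '';
    for ( my $n = $elem->nextSibling ; $n ; $n = $n->nextSibling ) {
        my $type = $n->nodeType;
        last unless $type == XML_TEXT_NODE
                 || $type == XML_CDATA_SECTION_NODE;
        $tail .= $n->data;
    }
    return $tail;
}

my $doc = XML::LibXML->load_xml( string => '<p>Hello, <b>World</b>!</p>' );
my ($bold) = $doc->findnodes('//b');

print "tail of <b>: '", tail_text($bold), "'\n";    # tail of <b>: '!'
```

So in the DOM the "!" is a child of <p> that happens to come right after <b>; lxml files that same text under <b> as its tail.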

        As far as using SAX parsers goes, I've used something similar: HTML::Parser and XML::Parser work the same way, I think, in that you create callbacks for events that happen during parsing. Having discovered XPath, the event-driven parser now seems to me like a crude, primitive approach. I'm sure there are still places it applies.

        --Bob Niederman,

        All code given here is UNTESTED unless otherwise stated.
