Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^3: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

by haukex (Archbishop)
on Aug 09, 2020 at 08:27 UTC ( [id://11120517]=note: print w/replies, xml ) Need Help??


in reply to Re^2: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
in thread XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?

Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element.

Sure, that's up to you. I can't speak to how other modules implemented it, but I'd refer you to the libxml2 documentation, and the Document Object Model Specification for all the "official" details.

Anyway, I described two ways you can get the text nodes of the current node. Using the XPath expression I showed is probably easiest. I can't really say more since you haven't described what it is you're trying to do with the document.

use warnings; use strict; use XML::LibXML; my $doc = XML::LibXML->load_xml( string => <<'EOT' ); <html> <head> <title>Title_Text</title> </head> <body> <p>paragraph_text</p> <div> <div> innnermost_text </div> </div> </body> </html> EOT for my $node ($doc->findnodes('//*')) { print "<<<", $node->nodeName, ">>>\n"; my @texts = map { $_->data } $node->findnodes('./text()'); use Data::Dump; dd @texts; # Debug } __END__ <<<html>>> (" \n ", " \n ", " ") <<<head>>> (" ", " ") <<<title>>> "Title_Text" <<<body>>> (" \n ", "\n ", " \n ") <<<p>>> "paragraph_text" <<<div>>> (" \n ", "\n ") <<<div>>> " \n innnermost_text\n "

You could also use XML::LibXML::SAX to get an event-based parser.

  • Comment on Re^3: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
  • Download Code

Replies are listed 'Best First'.
Re^4: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
by bobn (Chaplain) on Aug 10, 2020 at 06:01 UTC

    $node->string_value();
    Note this method is undocumented (there's a method with that name in XML::LibXML::NodeList, but your $nodes are XML::LibXML::Elements), you should use textContent instead.
    Yes, that's true. I can't reconstruct exactly what happened when I made this code, I got into the documentation for an apparently unrelated module, where string_value was documented. I'm tempted to erase the whole thing.

    However, this code of yours: my @texts = map { $_->data } node->findnodes('./text()');

    actually shows *exactly* what I'm talking about: the "innermost_text" is ONLY appearing in the output for it's innermost containing element, which is the last <div> element/node/whatever that you found with $doc->findnodes('//*'). It's not in every element that it is inside of, like <body> or <html> That's what I was looking for! Thank you!!!

    What I was working on: I've been doing some Python XHTML parsing, and over there, it was talking about "tail text". It's really weird - it says that text that follows an element's closing tag belongs to *that* element as "tail text" - NOT to the element that it is inside of. If you care, go to https://lxml.de/tutorial.html and search on "document-style". Anyhow, I was testing in Perl to see if it had anything like that, which I don't see.

    As far as using SAX parsers, I've used somewhat similar - HTML:: Parser or XML::Parser are similar, I think, you create callbacks for events that happen during parsing. Having discovered XPath, the event-driven parser now seems to me like a crude, primitive approach. I'm sure there are still places it applies.

    --Bob Niederman,

    All code given here is UNTESTED unless otherwise stated.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11120517]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2024-04-25 13:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found