Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

LiBXML: New markup while preserving earlier tags?

by Samantabhadra (Acolyte)
on Mar 28, 2018 at 19:16 UTC ( [id://1211941]=perlquestion: print w/replies, xml ) Need Help??

Samantabhadra has asked for the wisdom of the Perl Monks concerning the following question:

I want to enrich an xml document with new elements and preserve earlier markup.

The script below achieves what I want, but I doubt that my method is safe and proper, as I circumvent LibXML's DOM in the decisive parts of the script.

Could you guide me to a better method? Is an xpath-expression with findnodes the way out?

In the example I look for, and markup, animals and their attributes.
"quick brown fox" is in a tight tag situation and I cannot but indicate it with milestones.
"lazy dog" can be tagged regularly.

Any comment on any part of the script will be very much appreciated. Thank you.

#!/usr/bin/perl # Catch the canid in # "The quick brown fox jumps over the lazy dog." use strict; use 5.010; use XML::LibXML; use List::MoreUtils qw(uniq); my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>"; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; my @queried = ("quick brown fox","lazy dog"); $xml = &doubtful_method($xml,$_) foreach @queried; my $dom = XML::LibXML->load_xml( string => $xml); say $dom->to_literal; my @canids = uniq map { &catch_milestone($_) } ($dom->findnodes('//canid')); say "Canids addressed: ", join ", ", @canids; sub doubtful_method { my $xml = $_[0]; my $queried = $_[1]; my $rex = join "(<.+?>)?", split //, $queried; $xml =~ s/($rex)/&evaluate_match($1)/gem; return $xml; } sub evaluate_match { my $match = $_[0]; my $out; eval { XML::LibXML->load_xml(string => "<test>".$match."</test>") }; if (ref($@)){ my $milestone = qq|<$new_element time="| . time . qq |"/>|; $out = $milestone.$match.$milestone; } else { $out = "<$new_element>$match</$new_element>"; } return $out; } sub catch_milestone { my $node = $_[0]; my ($out); if ($node->hasAttribute("time")) { my $milestone = qq|<$new_element(.+?)?time="|; $milestone.= $node->getAttribute("time") . qq|"(.+?)?>|; my $rex = $milestone . "(?<new>.+?)" . $milestone; $out = $+{new} if $dom->documentElement->toString =~ /($rex)/; $out =~ s/<.+?>//gm; } else { $out = $node->to_literal; } return $out; }

Replies are listed 'Best First'.
Re: LiBXML: New markup while preserving earlier tags?
by choroba (Cardinal) on Mar 29, 2018 at 15:14 UTC
    This is indeed ugly.

    The "cleaner" way is to find in what text() elements and at what positions inside them you need to insert the tags, then replace each text() element with the text before the position, the tag, and the text after the position:

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use XML::LibXML; use List::Util qw{ sum }; sub insert_tag { my ($text, $pos, $tag_name, $query) = @_; my $before = substr $text, 0, $pos; my $after = substr $text, $pos; my $parent = $text->parentNode; $parent->insertBefore('XML::LibXML::Text'->new($before), $text); $parent->insertAfter('XML::LibXML::Text'->new($after), $text); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>"; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; my @queried = ("lazy dog", "quick brown fox",); my $dom = 'XML::LibXML'->load_xml(string => $xml); for my $query (@queried) { my @texts = $dom->findnodes('//text()'); my ($from, $to) = (0, 0); my $found; OUTER: while ($to <= $#texts) { my $subtext = join "", @texts[ $from .. $to ]; for my $length (1 .. length $query) { my $subquery = substr $query, 0, $length; $found = index $subtext, $subquery; ++$to, next OUTER if -1 != $found && $length < length $query && $length == length($subtext) - $found; $to = ++$from, next OUTER if -1 == $found; } my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $found + - length $query); insert_tag($texts[$to], $last_pos, 'end', $query); insert_tag($texts[$from], $found, 'start', $query); last OUTER; } } print $dom;
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thank you very much indeed, choroba, for this answer. It took me a while to get back to it and I learned a lot from analyzing your ingenious approach, although the core techniques your are using are still beyond my grasp.

      That is why I am not able to mend an unexpected outcome of your script: If more than one matches are found, only one is tagged, and in part only.

      If for example @queried = ("he"), the last occurence in the source string (-> tHE lazy dog) is not tagged, and the first occurence (-> tHE ... fox) is only furnished with the closing demarcation "<end query="he"/>".

      I keep on trying, but if the necessary correction of the script pops into your mind immediately, I would be grateful for a short hint, i.e. more indebted than I am already are.

        Congratulations, you've found a bug!

        The problem is that if the start and end tags both belong to the same text() node, insertion of the end tag creates a new text() nodes that replace the old one, so inserting the start tag into the old text() node doesn't insert it to the newly created one. Easily fixed by the following patch:

        @@ -12,18 +12,20 @@ my $after = substr $text, $pos; my $parent = $text->parentNode; - $parent->insertBefore('XML::LibXML::Text'->new($before), $text); + my $preceding + = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); $parent->insertAfter('XML::LibXML::Text'->new($after), $text); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; + return $preceding } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox",); +my @queried = ("lazy dog", "quick brown fox", "the"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -48,8 +50,10 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - insert_tag($texts[$to], $last_pos, 'end', $query); - insert_tag($texts[$from], $found, 'start', $query); + my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + my $start_text = $from == $to ? $preceding : $texts[$from]; + insert_tag($start_text, $found, 'start', $query); last OUTER; }

        i.e. the insert_tag subroutine returns the newly created text() node preceding the tag, and it's used as the target text() when $from == $to, i.e. when both the elements belong to the same text().

        The code as written doesn't handle multiple occurrences of the query. Again, the fix is easy:

        @@ -25,7 +25,7 @@ $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox", "the"); +my @queried = ("lazy dog", "quick brown fox", "he", "e"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -55,7 +55,10 @@ my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - last OUTER; + @texts = $dom->findnodes('//text()'); + $from += $from == $to ? 1 : 2; + + last OUTER if $from > @texts; } } print $dom;

        i.e. after the replacement, reload the text() nodes to search (this could probably be optimized*) to only replace the split one by the ones it's been split to), and start searching from the text where the end tag was inserted (when both the tags were inserted into the same text() node, the node was split into three text() nodes, if they belong to different text() nodes, each of them was split into two nodes, so there are four new nodes).

        *) Update: Here's the optimization:

        @@ -14,11 +14,12 @@ my $parent = $text->parentNode; my $preceding = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); - $parent->insertAfter('XML::LibXML::Text'->new($after), $text); + my $following + = $parent->insertAfter('XML::LibXML::Text'->new($after), $tex +t); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; - return $preceding + return $preceding, $following } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; @@ -50,12 +51,14 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + my @new_texts = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + splice @texts, $to, 1, @new_texts; + + my $start_text = $from == $to ? $new_texts[0] : $texts[$from] +; - my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - @texts = $dom->findnodes('//text()'); $from += $from == $to ? 1 : 2; last OUTER if $from > @texts;

        Note that it's not needed to splice the texts around the start tag, because that part has already been searched for the current query.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: LiBXML: New markup while preserving earlier tags?
by Anonymous Monk on Mar 29, 2018 at 01:13 UTC
    What immediately pops into my mind is that this is exactly the sort of thing that you can do with XSLT, which LibXML also fully supports. And in any case, no, you should not "circumvent" anything. Do your transformations within the DOM. Or, consider an alternate package such as XML::Twig which takes a fundamentally different approach to the problem.

      Ah, you got one under the radar; garnered some upvotes or else I'd let it go. Shame on both of us. And me without my coffee.

      What immediately pops into my mind, is a code phrase for you. If you want to be stealthy, try indicating that you perhaps thought about a problem instead of coughed up whatever hair ball of buzzwords wouldn't pass. You've been recommending XSLT lately. Don't do it unless you're ready to show code. It's difficult and unpleasant and recommending it to beginners without any support is uncool. You also recommended XML::Twig without code. Unhelpful.

      Look at the effort between your reply and choroba's reply. The care. The knowledge. The willingness to spend some time understanding a question and really try to help.

      libxml does not "support" libxslt. They are separate packages. A perusal of the XML::LibXML distribution shows not one XSLT package included either. There is a separate distribution for that; XML::LibXSLT.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1211941]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (6)
As of 2024-03-28 18:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found