http://qs321.pair.com?node_id=936813

mertserger has asked for the wisdom of the Perl Monks concerning the following question:

As I have posted before, I help maintain a set of Perl scripts which use XML::Twig to run validation checks on dictionary entries written in XML. The checks are looking for things that a DTD would not spot.

In Node XML::Twig prev_sibling I asked for help with a piece of code which decided whether a sense in the entry was to be considered rare or not.

Now I have a further problem relating to this issue: there is another piece of text which might mean that something labelled "rare" in an la element is actually not rare:
<la>rare</la> before the 20th century.
<la>rare</la> after the 18th century.
In both these cases, for checking purposes, the sense or entry should be treated as "not rare".

I have been able to modify the code I posted in the previous question to handle this, because that code handles the parent element of the <la> element. It is called at the sense level within the XML.

However, there is other, fairly similar code to handle things for the entry as a whole:

    sub la_obs_handler {
        my ($t, $elt) = @_;
        my $la    = $elt->text;
        my $isNow = 0;
        if (   ( $prev && $prev->text =~ m/[nN]ow $/ )
            || $elt->parent->text =~ m/[nN]ow .* and rare/
            || $elt->parent->text =~ m/rare after/
            || $elt->parent->text =~ m/rare before/ )
        {
            $isNow = 1;
        }
        if ( $la eq "rare" && !$isNow ) {
            $is_entry_rare = 1;
        }
    }

This should set the variable $is_entry_rare to 1 if the <la> element being handled contains the text "rare", but not if it is also preceded by "Now" or "Now <something> and". This works. However, it should also not set the variable to 1 if the <la> element is followed by sibling text saying "after ...." or "before ....". I have tried testing the parent node's text content, as shown in my code, but it does not seem to work. I think this is because XML::Twig has not yet accessed anything following the <la> element at the point where it is handling that element.

Is there any way of getting at the text following the <la> element? I know I could do this by handling the parent element instead, but I don't think that is an option with my legacy code. Are any other solutions possible?

Replies are listed 'Best First'.
Re: XML:: Twig - can you check for text following the element being handled?
by ambrus (Abbot) on Nov 08, 2011 at 17:09 UTC
    If you are using a handler on the la elements, that will be called as soon as the end tag for the la element is parsed. At that point, the text after that element is not yet in the document tree, so there is no way to access it.

    If your XML document isn't too large, then the easiest way to extract data from it is to parse it with no handlers and then find elements in the resulting tree. This way, you have access to the whole document, including the text after the la element.
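    The whole-document approach described above could look roughly like the following minimal sketch. The element names entry and sense, the sample data, and the helper text_after are assumptions made for illustration; only the la element and the "before"/"after" wording come from the question. The sketch looks only at the text node directly following each la element and ignores the "Now" checks.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data: <entry> and <sense> are assumed names.
my $xml = <<'XML';
<entry>
<sense><la>rare</la> before the 20th century.</sense>
<sense><la>rare</la> after the 18th century.</sense>
<sense><la>rare</la>.</sense>
<sense>Chiefly <la>rare</la></sense>
</entry>
XML

# Hypothetical helper: return the text node that directly follows
# $elt, or the empty string if the next sibling is missing or is
# not a text node.
sub text_after {
    my ($elt) = @_;
    my $next = $elt->next_sibling;
    return ( $next && $next->is_text ) ? $next->text : '';
}

# No handlers: the whole tree is built before we look at anything.
my $twig = XML::Twig->new;
$twig->parse($xml);

# One flag per "rare" label: 1 if it is qualified by a following
# "before ..." or "after ...", 0 otherwise.
my @results;
for my $la ( $twig->root->descendants('la') ) {
    next unless $la->text eq 'rare';
    my $after     = text_after($la);
    my $qualified = $after =~ m/^\s*(?:after|before)\b/ ? 1 : 0;
    push @results, $qualified;
    print "following text: '$after' => qualified: $qualified\n";
}
```

    Because nothing is inspected until parse has returned, the following text is always available, at the cost of holding the whole document in memory.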

    If your document is too large for this, it's still worth using as few handlers as possible. For example, use a single handler for whatever element represents a whole headword entry, and in this handler, iterate over the la elements in this element. When that handler is executed, the whole entry has already been parsed, including the text after the la element, so you can access it.
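    As a rough sketch of the single-handler idea above: the element names dictionary and entry, the id attribute, the sample data, and the name entry_handler are all assumptions, not taken from the original script. The "Now <something> and rare" case from the question is left out to keep the example short.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data: <dictionary>, <entry> and the id
# attribute are assumed for illustration.
my $xml = <<'XML';
<dictionary>
<entry id="e1"><la>rare</la> after the 18th century.</entry>
<entry id="e2">Now <la>rare</la>.</entry>
<entry id="e3"><la>rare</la>.</entry>
</dictionary>
XML

my %is_entry_rare;

# Called once per entry, after the entry's end tag has been parsed,
# so every la element already has its following siblings in the tree.
sub entry_handler {
    my ( $t, $entry ) = @_;
    my $rare = 0;
    for my $la ( $entry->descendants('la') ) {
        next unless $la->text eq 'rare';
        my $prev   = $la->prev_sibling;
        my $next   = $la->next_sibling;
        my $before = ( $prev && $prev->is_text ) ? $prev->text : '';
        my $after  = ( $next && $next->is_text ) ? $next->text : '';
        next if $before =~ m/[nN]ow $/;                  # "Now rare"
        next if $after  =~ m/^\s*(?:after|before)\b/;    # "rare before/after ..."
        $rare = 1;
    }
    $is_entry_rare{ $entry->att('id') } = $rare;
    $t->purge;    # free the entries parsed so far
}

my $twig = XML::Twig->new( twig_handlers => { entry => \&entry_handler } );
$twig->parse($xml);

print "$_: $is_entry_rare{$_}\n" for sort keys %is_entry_rare;
```

    Calling purge at the end of the handler keeps memory use bounded by the size of one entry, which is the reason to prefer this over parsing the whole file when the document is large.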

      Thanks Ambrus, I thought that was the case but I wanted to make sure I hadn't overlooked anything.

      As I said, this is a legacy script I have inherited. Luckily, this particular form of data does not occur often, so the user has agreed it is not worth a major rewrite of the code to accommodate it. As it is, the validation script will occasionally raise a few warnings where the data is actually OK, but we can live with that.

Re: XML:: Twig - can you check for text following the element being handled?
by Anonymous Monk on Nov 08, 2011 at 16:19 UTC

      I think this is working as designed: the la handler is called before the handler for its parent snot, and before the following siblings are parsed and added to the parent, so naturally next_sibling_text returns an empty string until the parent snot handler is called.

      So the best and simplest solution is to do this testing in a handler for the parent.

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use XML::Twig;

      my $xml = <<'__XML__';
      <?xml version="1.0" encoding="UTF-8"?>
      <root>
      <snot>the <la>snot</la> balls are made of snot </snot>
      <snot>the <la>snot</la> bells are made of snot </snot>
      <snot>the <la>snot</la> bowls are made of snot </snot>
      </root>
      __XML__

      #~ Handlers are triggered in fixed order, sorted by their type
      #~ (xpath expressions first, then regexps, then level), then by
      #~ whether they specify a full path (starting at the root element)
      #~ or not, then by by number of steps in the expression , then
      #~ number of predicates, then number of tests in predicates.
      #~ Handlers where the last step does not specify a step
      #~ ("foo/bar/*") are triggered after other XPath handlers. Finally
      #~ "_all_" handlers are triggered last.

      {
          my @snot;
          my $t = XML::Twig->new(
              twig_handlers => {
                  'snot' => sub {
                      warn $_->path, "\n";
                      push @snot, $_->text;
                      return !!1;
                  },
                  ## la , triggered before snot
                  'la' => sub {
                      warn $_->path, "\n";
                      push @snot, [
                          $_->text ,
                          ## doesn't contain next_sibling_text because not parsed yet, as expected
                          $_->parent->text ,
                      ];
                      return !!1;
                  },
              },
          );
          $t->parse($xml);
          undef $t;
          use Data::Dumper();
          print Data::Dumper->new([ \@snot ])->Indent(1)->Dump;
      }
      __END__
      /root/snot/la
      /root/snot
      /root/snot/la
      /root/snot
      /root/snot/la
      /root/snot
      $VAR1 = [
        [
          'snot',
          'the snot'
        ],
        'the snot balls are made of snot ',
        [
          'snot',
          'the snot'
        ],
        'the snot bells are made of snot ',
        [
          'snot',
          'the snot'
        ],
        'the snot bowls are made of snot '
      ];