If you install a handler on the la elements, it is called as soon as the end tag of each la element is parsed. At that point, the text after that element is not yet in the document tree, so the handler has no way to access it.
For example, take this XML
use warnings; use 5.014;
use XML::Twig;
our $doc = q{<?xml version="1.0" encoding="iso-8859-2" ?>
<!-- Arany János: Toldi. negyedik ének, részlet. -->
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la> szemére szállni még sokáig,</line>
<line>Szinte a pirosló hajnal hasadtáig.</line>
<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>
<line>Jobban a nádasnak csörtető vadától,</line>
<line><la>Félt</la> az üldözőknek távoli zajától,</line>
<line>De legis-legjobban Toldi nagy bajától.</line>
</verse>
};
(Sorry for the wavy ő. You can't use PerlMonks code tags with non-iso-8859-1 text currently.)
and see what happens if you parse it with handlers for la elements installed:
our %xmlopt = (
keep_spaces => 1, comments => "drop",
);
binmode STDOUT, ":encoding(iso-8859-2)";
if (1) {
my $n;
my $tw;
my $la_handler = sub {
my($tw1, $la) = @_;
if ($n++ < 2) {
print "In the handler for la elements. So far, the document tree contains this: (((\n" .
$tw->sprint .
"\n)))\n";
}
1;
};
$tw = XML::Twig->new(
twig_handlers => {"la" => $la_handler},
%xmlopt,
);
$tw->parse($doc);
}
=begin output
In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la></line></verse>
)))
In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la></line></verse>
)))
=end output
=cut
If your XML document isn't too large, the easiest way to extract data from it is to parse it with no handlers and then search the finished tree for the elements you need. This way you have access to the whole document, including the text after each element.
For example:
if (1) {
my $tw = XML::Twig->new(%xmlopt);
$tw->parse($doc);
for my $la ($tw->findnodes("//la")) {
my $t = $la->text;
my $ta = $la->next_sibling_text;
print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
}
}
=begin output
Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).
=end output
=cut
If your document is too large for this, it's still worth using as few handlers as possible. For example, use a single handler for whatever element represents a whole headword entry, and in this handler, iterate over the la elements inside that element. When that handler is executed, the whole entry has already been parsed, including the text after each la element, so you can access it. The example below uses the line elements of the verse in place of entries.
if (1) {
my $line_handler = sub {
my($tw1, $li) = @_;
print "In the line handler. Full line is (((" . $li->sprint . ")))\n";
# search only inside the current line element
for my $la ($li->descendants("la")) {
my $t = $la->text;
my $ta = $la->next_sibling_text;
print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
}
$tw1->purge;
};
my $tw = XML::Twig->new(
twig_handlers => {"line" => $line_handler},
%xmlopt
);
$tw->parse($doc);
}
=begin output
In the line handler. Full line is (((<line>Majd az édes álom pillangó képében</line>)))
In the line handler. Full line is (((<line><la>Elvetődött</la> arra tarka köntösében,</line>)))
Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
In the line handler. Full line is (((<line>De nem <la>mert</la> szemére szállni még sokáig,</line>)))
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
In the line handler. Full line is (((<line>Szinte a pirosló hajnal hasadtáig.</line>)))
In the line handler. Full line is (((<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>)))
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
In the line handler. Full line is (((<line>Jobban a nádasnak csörtető vadától,</line>)))
In the line handler. Full line is (((<line><la>Félt</la> az üldözőknek távoli zajától,</line>)))
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).
In the line handler. Full line is (((<line>De legis-legjobban Toldi nagy bajától.</line>)))
=end output
=cut
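If you want to keep the data rather than print it, the same one-handler-per-entry idea might look roughly like the sketch below. I don't know what your dictionary markup looks like, so the dict, entry and hw element names and the @pairs array are made up for illustration; only la comes from your question. The sample is kept ASCII-only to sidestep encoding questions.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical dictionary fragment: <dict>, <entry> and <hw> are
# invented names, not taken from the real data.
my $dict = q{<dict>
<entry><hw>alma</hw> <la>malum</la> apple; <la>pomum</la> fruit</entry>
<entry><hw>haz</hw> <la>domus</la> house</entry>
</dict>};

# Each item is [ text of the la element, text immediately after it ].
my @pairs;

my $tw = XML::Twig->new(
    keep_spaces   => 1,
    twig_handlers => {
        # Called once the whole entry, including the text that
        # follows each la element, has been parsed.
        entry => sub {
            my ($t, $entry) = @_;
            for my $la ($entry->descendants("la")) {
                push @pairs, [ $la->text, $la->next_sibling_text ];
            }
            # Only plain strings were saved, so the parsed part of
            # the tree can be released.
            $t->purge;
        },
    },
);
$tw->parse($dict);

printf "la: (((%s))) followed by: (((%s)))\n", @$_ for @pairs;
```

Because only strings are stored in @pairs, purging after each entry does not invalidate anything you collected, and memory use stays bounded by the size of the largest entry.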