If you install a handler on the la elements, it is called as soon as the end tag of each la element is parsed. At that point, the text after that element is not yet in the document tree, so the handler has no way to access it.

E.g., take this XML:
use warnings;
use 5.014;
use XML::Twig;

our $doc = q{<?xml version="1.0" encoding="iso-8859-2" ?>
<!-- Arany János: Toldi. negyedik ének, részlet. -->
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la> szemére szállni még sokáig,</line>
<line>Szinte a pirosló hajnal hasadtáig.</line>
<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>
<line>Jobban a nádasnak csörtető vadától,</line>
<line><la>Félt</la> az üldözőknek távoli zajától,</line>
<line>De legis-legjobban Toldi nagy bajától.</line>
</verse>
};
(Sorry for the wavy ő. You can't use perlmonks code tags with non-iso-8859-1 text currently.)

and see what happens if you parse it with handlers for la elements installed:

our %xmlopt = (
    keep_spaces => 1,
    comments => "drop",
);
binmode STDOUT, ":encoding(iso-8859-2)";

if (1) {
    my $n;
    my $tw;
    # Called each time the end tag of an la element has been parsed.
    my $la_handler = sub {
        my($tw1, $la) = @_;
        if ($n++ < 2) {
            print "In the handler for la elements. So far, the document tree contains this: (((\n" . $tw->sprint . "\n)))\n";
        }
        1;
    };
    $tw = XML::Twig->new(
        twig_handlers => {"la" => $la_handler},
        %xmlopt,
    );
    $tw->parse($doc);
}
=begin output

In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la></line></verse>
)))
In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la></line></verse>
)))

=end output

=cut

If your XML document isn't too large, then the easiest way to extract data from it is to parse it with no handlers, then find the elements in the finished tree. This way, you have access to the whole document, including the text after each element.

E.g.:

if (1) {
    # No handlers: parse the whole document first, then search the tree.
    my $tw = XML::Twig->new(%xmlopt);
    $tw->parse($doc);
    for my $la ($tw->findnodes("//la")) {
        my $t = $la->text;
        my $ta = $la->next_sibling_text;
        print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
    }
}
=begin output

Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).

=end output

=cut

If your document is too large for this, it's still worth using as few handlers as possible. For example, use a single handler for whatever element represents a whole headword entry, and in this handler, iterate over the la elements inside that element. When that handler is executed, the whole entry has already been parsed, including the text after each la element, so you can access it.

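if (1) {
    # Called once per line element, after the whole line has been parsed.
    my $line_handler = sub {
        my($tw1, $li) = @_;
        print "In the line handler. Full line is (((" . $li->sprint . ")))\n";
        # descendants() restricts the search to this line element,
        # regardless of what else is still in the tree.
        for my $la ($li->descendants("la")) {
            my $t = $la->text;
            my $ta = $la->next_sibling_text;
            print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
        }
        # Free the part of the tree that has already been processed.
        $tw1->purge;
    };
    my $tw = XML::Twig->new(
        twig_handlers => {"line" => $line_handler},
        %xmlopt
    );
    $tw->parse($doc);
}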
=begin output

In the line handler. Full line is (((<line>Majd az édes álom pillangó képében</line>)))
In the line handler. Full line is (((<line><la>Elvetődött</la> arra tarka köntösében,</line>)))
Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
In the line handler. Full line is (((<line>De nem <la>mert</la> szemére szállni még sokáig,</line>)))
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
In the line handler. Full line is (((<line>Szinte a pirosló hajnal hasadtáig.</line>)))
In the line handler. Full line is (((<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>)))
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
In the line handler. Full line is (((<line>Jobban a nádasnak csörtető vadától,</line>)))
In the line handler. Full line is (((<line><la>Félt</la> az üldözőknek távoli zajától,</line>)))
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).
In the line handler. Full line is (((<line>De legis-legjobban Toldi nagy bajától.</line>)))

=end output

=cut
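
If you want to keep the data rather than print it, the same per-line handler idea can collect the pairs into an array. This is only a minimal sketch under my own assumptions: the ASCII-only sample document and the names $xml and @pairs are made up for illustration, and the la elements are assumed to sit somewhere inside line elements as in the verse above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical ASCII-only sample, so the sketch does not depend on encodings.
my $xml = <<'XML';
<verse>
<line>first <la>one</la> after one, <la>two</la> after two</line>
<line>no la element here</line>
<line><la>three</la> after three</line>
</verse>
XML

my @pairs;    # each entry: [ text of the la element, text right after it ]

my $twig = XML::Twig->new(
    keep_spaces   => 1,
    twig_handlers => {
        line => sub {
            my ($t, $line) = @_;
            # The whole line has been parsed by now, so the text
            # following each la element is already in the tree.
            for my $la ($line->descendants('la')) {
                push @pairs, [ $la->text, $la->next_sibling_text ];
            }
            # Only plain strings were stored, so the processed part
            # of the tree can be freed safely.
            $t->purge;
        },
    },
);
$twig->parse($xml);

printf "%s => '%s'\n", @$_ for @pairs;
```

Because only strings are stored in @pairs, nothing refers to the purged elements afterwards. Note that next_sibling_text takes whatever sibling comes next, so if two la elements are adjacent with no text between them, you would get the text of the second la element.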

In reply to Re: XML:: Twig - can you check for text following the element being handled? by ambrus
in thread XML:: Twig - can you check for text following the element being handled? by mertserger
