XML::Twig handlers weirdness?

by gjb (Vicar)
on Feb 20, 2006 at 15:09 UTC

gjb has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks, I have to turn to you for a piece of advice on the following problem.

I'm using XML::Twig to parse an XML file. The output should simply be the path of each element in the DOM tree. I've written a handler that is associated with all elements (through the '_all_' twig handler) and that does precisely that. Since I don't want the leading '/', I strip it using substr. No problem so far. However, I also want to have the XML tags in lowercase, and now things start to get interesting.

I've included two Perl programs, one that parses an actual XML file, the other simulating the behavior of the handler on ordinary text data to try and isolate the problem. The output of the latter seems fine, while the output of the former is clearly incorrect.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {'_all_' => \&start_tag}
    );
    $twig->parse(*DATA);

    sub start_tag {
        my ($t, $e) = @_;
        my $str = $e->path();
        print substr(lc($str), 1), "\n";
        print lc(substr($str, 1)), "\n\n";
    }

    __DATA__
    <A>
      <a>
        <B>blah blah</B>
        <b>blah blah blah</b>
        <b>blah <a/> blah</b>
      </a>
      <b/>
    </A>

The output produced is:

    a/a/b
    a/a/b

    a/a/b
    a/a/b

    a/a/b
    a/a/b/a

    a/a/b
    a/a/b

    a/a
    a/a

    a/b
    a/b

    a
    a

Note the third group which doesn't yield the expected output. Below is the attempt to reproduce this outside the context of XML parsing:

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp($_);
        print_str($_);
    }

    sub print_str {
        my ($str) = @_;
        print substr(lc($str), 1), "\n";
        print lc(substr($str, 1)), "\n\n";
    }

    __DATA__
    /A/a/B
    /A/a/b
    /A/a/b/a
    /A/a/b
    /A/a
    /A/b
    /A

which produces the expected results below:

    a/a/b
    a/a/b

    a/a/b
    a/a/b

    a/a/b/a
    a/a/b/a

    a/a/b
    a/a/b

    a/a
    a/a

    a/b
    a/b

    a
    a

It would seem that within the XML handler something very weird happens, as if a variable with a fixed length (that which it has in the first invocation) is reused between calls to the handler.

I'd be grateful if someone could shed some light on this. Thanks in advance, -gjb-

Update: given that this seems to be a version specific issue, I should mention the results above have been obtained using XML::Twig 3.23 (i.e. the latest version) on Perl 5.8.7 built for cygwin-thread-multi-64int (i.e. the standard version that can be installed using Cygwin's installer).

Replies are listed 'Best First'.
Re: XML::Twig handlers weirdness?
by mirod (Canon) on Feb 20, 2006 at 15:51 UTC


    • the bug shows up in 5.8.8 on my machine (linux),
    • the code runs properly in 5.8.0,
    • in all cases the path itself is correct if you print it, as are lc($str) and substr($str, 1),
    • you need a fairly specific test case: you can remove the 'B' element or the 'b' and you still get the bug, but if you remove both then you get the proper output,
    • if you only print the data for the problematic element... everything is fine.

    This looks extremely weird. I can only guess that the usual suspect, unicode, is involved... but how?

    Any help on this one would be appreciated.

Re: XML::Twig handlers weirdness?
by acid06 (Friar) on Feb 21, 2006 at 00:39 UTC
    The bug also happens here running under Win32, ActivePerl 5.8.7 (build 815). XML::Twig version 3.21.

    So, I tried updating XML::Twig to the newest version (3.23) and the bug's still present.

    I guess you should report this bug.

    perl -e "print pack('h*', 16369646), scalar reverse $="
Re: XML::Twig handlers weirdness?
by Corion (Patriarch) on Feb 20, 2006 at 16:12 UTC

    As another data point, I get the same on This is perl, v5.8.2 built for MSWin32-x86-multi-thread:

        Q:\>perl -w
        a/a/b
        a/a/b

        a/a/b
        a/a/b

        a/a/b
        a/a/b/a

        a/a/b
        a/a/b

        a/a
        a/a

        a/b
        a/b

        a
        a

    Trying to force a copy by adding $str = "$str" . "" or other variations didn't prove fruitful...

Re: XML::Twig handlers weirdness?
by muntfish (Chaplain) on Feb 20, 2006 at 15:21 UTC

    This probably isn't all that much help to you, but when I run your first sample code I get the expected output:

        a/a/b
        a/a/b

        a/a/b
        a/a/b

        a/a/b/a
        a/a/b/a

        a/a/b
        a/a/b

        a/a
        a/a

        a/b
        a/b

        a
        a

    Perl 5.8.0 on HP-UX; XML::Twig v3.15.

Re: XML::Twig handlers weirdness?
by benizi (Hermit) on Feb 21, 2006 at 18:59 UTC

    I call perl bug. The following also demonstrates the odd behavior. Interestingly, the substr(lc($str),0) in the second iteration is limited to the length of the (correct) substr(lc($str),0) in the first iteration.

    e.g. with an argument of 'a:bc', the 'bc' is cut down to 'b' (the length of 'a'). For 'ab:cde' or 'ab:cdefgh', the 'cde' and 'cdefgh' are cut down to 'cd' (the length of 'ab').

        #!/usr/bin/perl -l
        use strict;
        use warnings;
        use Encode qw/_utf8_on/;

        for my $str (split /:/, shift||'a:bc') {
            _utf8_on($str);
            print "$str\t", substr(lc($str), 0);
            # use Devel::Peek; Dump substr(lc($str),0);
        }

    For someone familiar w/ perlguts (not me), uncomment the Devel::Peek line.

    UPDATE: Expected output for input of x:yz is:

        x	x
        yz	yz

    but due to bugginess, it's:

        x	x
        yz	y

    Also, the problem presents in v5.8.7 linux, but not in v5.8.0 solaris, if those are helpful data points.
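To check whether a particular perl is affected, the two orderings from the original post can be compared side by side on UTF-8-flagged strings. This is only a sketch built on the snippet above: `both_orderings` is a hypothetical helper name, and it is an assumption that capturing the results in variables (rather than printing them directly) still triggers the truncation on an affected build.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/_utf8_on/;

# Hypothetical helper: returns substr(lc(...)) and lc(substr(...)) for a
# copy of the input that has its UTF-8 flag switched on.
sub both_orderings {
    my ($str) = @_;
    _utf8_on($str);                    # flag the copy, as in the snippet above
    my $outer = substr(lc($str), 0);   # ordering reported as truncated
    my $inner = lc(substr($str, 0));   # ordering reported as correct
    return ($outer, $inner);
}

# Feed strings of increasing length, since the truncation was reported to
# depend on the length seen in the first call.
for my $s (qw(a bc ab cdefgh)) {
    my ($outer, $inner) = both_orderings($s);
    printf "%-8s %-8s %-8s %s\n", $s, $outer, $inner,
        ($outer eq $inner ? 'ok' : 'MISMATCH');
}
```

On an unaffected perl every row should report ok; a MISMATCH row would point at the behavior described in this thread.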


      So indeed it looks like something linked to unicode. The strings that compose the path in XML::Twig come directly from XML::Parser, so they have been utf-8'ed somewhere in expat or XML::Parser, hence the bug shows its ugly head. It's weird to get problems with basic ascii characters though.

      Incidentally, 5.8.0 and 5.8.1-8 are fairly different in their unicode support, so I am not surprised that they behave differently.

      In any case, I think I'm off the hook for this one, so thanks! :--)
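Going by the output in the original post, only the substr(lc(...)) ordering was truncated, while lc(substr(...)) printed the full path every time. A minimal sketch of a helper that sticks to the ordering that worked there; `normalize_path` is a hypothetical name, not part of XML::Twig, and this sidesteps the problem rather than explaining it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip the leading '/' first, then lowercase,
# i.e. the lc(substr(...)) ordering from the original post.
sub normalize_path {
    my ($path) = @_;
    return lc(substr($path, 1));
}

# Inside a twig handler this would be called along the lines of
#   print normalize_path($e->path), "\n";
print normalize_path($_), "\n" for qw(/A/a/B /A/a/b/a /A);
```

Whether this ordering is safe on every affected perl is not established in the thread; it is simply the variant that produced the expected output in the reported runs.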
