
XML::Twig and whitespaces

by DJpumps (Novice)
on Aug 23, 2007 at 14:28 UTC ( [id://634644]=perlquestion )

DJpumps has asked for the wisdom of the Perl Monks concerning the following question:

I'm using XML::Twig and have noticed that it replaces whitespace characters with spaces, as if it applied something like s/\s/ /g to all attribute and element values.

See example code:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new();
my $xml  = qq{<root t="space tab\tnewline\nend"/>};
$twig->xparse($xml);
my $text = $twig->root->att('t');
print "|$text|\n";

The expected output (the contents of the scalar $text) is the word "space" followed by a single space character, the word "tab" followed by a single tab character, the word "newline" followed by a single newline character, and then the word "end".

Instead, the tab and the newline characters were replaced by a single space character each.

  1. Why?
  2. How can I force XML::Twig to preserve whitespaces and not alter the text in any form?

This relates to XML::Twig 3.29 and XML::Twig 3.30. I did not check this behavior in other versions of XML::Twig.

Thanks.

-- DJpumps

Replies are listed 'Best First'.
Re: XML::Twig and whitespaces
by mirod (Canon) on Aug 23, 2007 at 15:06 UTC

    In attributes, the behaviour shown is normal. In fact, it is required by the XML spec: see Attribute Value Normalization.
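To illustrate the normalization rule, here is a minimal sketch comparing a literal tab and newline in an attribute value with the same characters written as character references (&#9; and &#10;). Per the spec's normalization rules, literal whitespace becomes a space while character references are replaced by the character they name. The helper att_t is a made-up name for this example, and the sketch assumes you control how the XML is written.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse a document and return the root's "t" attribute.
sub att_t {
    my ($xml) = @_;
    my $twig = XML::Twig->new();
    $twig->parse($xml);
    return $twig->root->att('t');
}

# Literal tab and newline in the attribute value (interpolated by qq{}).
my $literal = qq{<root t="space tab\tnewline\nend"/>};

# The same characters written as character references (q{} keeps them verbatim).
my $escaped = q{<root t="space tab&#9;newline&#10;end"/>};

my $from_literal = att_t($literal);   # tab and newline normalized to spaces
my $from_escaped = att_t($escaped);   # tab and newline survive

print "|$from_literal|\n";
print "|$from_escaped|\n";
```

So if the tab and newline really matter in an attribute, the portable fix is on the writing side: emit them as character references.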

    I don't think the module does that in elements, except that it discards line returns followed by spaces between two tags (getting rid of non-significant whitespace, as far as it can tell). You can turn this off using the keep_spaces option when you create the twig.
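A small sketch of what keep_spaces changes, assuming the documented defaults (whitespace-only text containing a line feed is discarded between tags unless keep_spaces is set). The sample document and variable names are invented for illustration. Text inside an element should come back unchanged either way; only the whitespace-only runs between tags differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Sample document: indentation between tags, plus a tab and a newline
# inside the text of <item>.
my $xml = "<root>\n  <item>tab\there\nnext line</item>\n</root>";

my $default = XML::Twig->new();
$default->parse($xml);

my $kept = XML::Twig->new(keep_spaces => 1);
$kept->parse($xml);

# Element text: expected to be identical in both twigs.
my $default_text = $default->root->first_child('item')->text;
my $kept_text    = $kept->root->first_child('item')->text;

# Children of <root>: with keep_spaces, the whitespace-only runs around
# <item> are stored as text children instead of being dropped.
my @default_children = $default->root->children;
my @kept_children    = $kept->root->children;

printf "default: %d child(ren), keep_spaces: %d child(ren)\n",
    scalar @default_children, scalar @kept_children;
print "item text identical\n" if $default_text eq $kept_text;
```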

    You could avoid normalizing attribute values by using the keep_encoding option and writing your own start tag parser (based on XML::Twig's own parser in _parse_start_tag), then passing it in through the parse_start_tag option. It's not really simple, but you are trying to do non-XML processing with an XML processor here.

      Hello, mirod, and thanks for your quick reply. You are right (and I was wrong) with regard to Attribute Value Normalization (see the updated reference in XML 1.0, 4th edition, at http://www.w3.org/TR/REC-xml/#AVNormalize). However, this behavior should not apply to element values, and with XML::Twig it does. I want to be able to read element values "as is", without any manipulation applied to them, at least whitespace-wise. So is there a way to preserve whitespace, and if so, what is it? Thanks.
      -- DJpumps

        Did you try "using the keep_spaces option when you create the twig" as indicated in my previous answer? Did it not do what you want?

Re: XML::Twig and whitespaces
by monkey_boy (Priest) on Aug 23, 2007 at 14:49 UTC
    I modified your code below to see how XML::Simple handles this. It seems to do the same thing; I suspect one of the underlying modules, possibly XML::Parser, is causing this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use XML::Simple;

    my $xml = XMLin(qq{<root t="space tab\tnewline\nend"/>});
    print Dumper $xml;

    output:
    $VAR1 = { 't' => 'space tab newline end' };
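One way to check the suspicion about the underlying parser is to call XML::Parser directly with a Start handler and look at the attribute it hands over. This is only a sketch; the %seen hash is a name made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Attributes of the last start tag seen (illustrative storage only).
my %seen;

my $parser = XML::Parser->new(
    Handlers => {
        # The Start handler receives the expat object, the element name,
        # and then the attributes as name/value pairs.
        Start => sub {
            my ($expat, $element, %atts) = @_;
            %seen = %atts;
        },
    },
);

$parser->parse(qq{<root t="space tab\tnewline\nend"/>});

print "|$seen{t}|\n";
```

If the value is already normalized at this level, the modules built on top (XML::Twig, and XML::Simple when it uses XML::Parser) never see the original tab and newline.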


    This is not a Signature...
