PerlMonks  

UTF-8 and XML::LibXML

by davies (Prior)
on Nov 26, 2019 at 11:53 UTC ( [id://11109243]=perlquestion )

davies has asked for the wisdom of the Perl Monks concerning the following question:

XML::LibXML seems to be doing strange things to UTF-8 encoded strings.

use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

my $uchar = chr(195) . chr(154);
my $xml = '<?xml version="1.0" encoding="UTF-8"?> <container><node>' . $uchar . '</node></container>';
output($uchar);

my $dom = XML::LibXML->load_xml(string => $xml);
my $node = $dom->findnodes('/container/node')->to_literal;
output($node);

my $encoded = encode('UTF-8', $node);
output($encoded);

sub output {
    my $str = shift;
    print "$str\n";
    for (1..length($str)) {
        print ord(substr($str, $_-1)), ': ';
    }
    print "\n";
}

Some of my output is below. I have removed the lines printing the characters as that would involve more rendering issues.

195: 154:
218:
195: 154:

My real case is reading files, but I am getting the issue demonstrated in this example. The character I have chosen is one that is causing problems (a U with an acute accent), but other characters are being transformed as well.

Given that the XML is flagged as being UTF-8, I cannot see anything in the docs indicating why this transformation should take place. What have I missed, please?

Regards,

John Davies

Re: UTF-8 and XML::LibXML
by choroba (Cardinal) on Nov 26, 2019 at 12:11 UTC
    The behaviour is documented in Encodings Support in XML::LibXML. I don't see what's strange about it: you create a byte string that you later interpret as XML. The document declares its encoding as UTF-8, so that is how the bytes are decoded, and $node holds the resulting character. If you encode the value, you get back the original bytes.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      I have read that, but I am obviously missing the point somewhere. The point where it says "most functions of XML::LibXML that work with in-memory trees accept and return data as character strings (i.e. UTF-8 encoded with the UTF8 flag on)" made me think I would get the same encoded data out as I put in. This is repeated in the second of the basic rules and principles. I'm afraid I can't see anything indicating how to avoid the behaviour I see.

      Regards,

      John Davies

        The "UTF-8 flag" is a misnomer. Strings with this flag are Perl's internal Unicode character strings, not UTF-8 encoded byte strings. When creating/loading XML, use bytes. When supplying values to methods, use Unicode strings (e.g. $element->appendText( $unicode );).
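The rule above (bytes when parsing or serialising, character strings when talking to the tree) can be sketched roughly as follows. This is only an illustration built on the example from the question; the variable names and the appended EURO SIGN are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Parsing: hand the parser bytes, UTF-8 encoded as the declaration says.
my $xml_bytes = '<?xml version="1.0" encoding="UTF-8"?>'
              . '<container><node>' . chr(195) . chr(154) . '</node></container>';
my $dom = XML::LibXML->load_xml(string => $xml_bytes);

# Tree methods return character strings: one character, code point 218.
my $text = $dom->findvalue('/container/node');
printf "text: length %d, first ord %d\n", length($text), ord($text);

# Tree methods also accept character strings.
my ($node) = $dom->findnodes('/container/node');
$node->appendText(chr(0x20AC));    # EURO SIGN, passed as a single character

# Serialising the whole document yields bytes in the declared encoding.
my $out_bytes = $dom->toString;
```

The point of the sketch is that no explicit encode or decode call is needed between parsing and serialising, as long as everything passed to or read from the tree is treated as characters.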

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        "UTF-8 encoded with the UTF8 flag on" means decoded strings (strings of Unicode Code Points), not strings encoded using UTF-8 (strings of bytes). This is the right thing to do.

        As such, $node contains decode("UTF-8", chr(195) . chr(154)), which is chr(218) ("\N{U+00DA}").
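The round trip described here can be checked with Encode alone, without XML::LibXML. This is a minimal sketch; the helper name ords is made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: list the ordinal of every character in a string.
sub ords { return join ' ', map { ord } split //, shift }

# Two bytes: the UTF-8 encoding of U+00DA.
my $bytes = chr(195) . chr(154);

# Decoding turns the two bytes into a single character.
my $chars = decode('UTF-8', $bytes);

# Encoding the character string gives back the original two bytes.
my $round_trip = encode('UTF-8', $chars);

print 'bytes:      ', ords($bytes),      "\n";   # 195 154
print 'characters: ', ords($chars),      "\n";   # 218
print 'round trip: ', ords($round_trip), "\n";   # 195 154
```

These are the same three lines of ordinals as in the output posted in the question, which suggests the parser is doing what decode does here.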

Node Type: perlquestion [id://11109243]
Approved by marto
Front-paged by haukex