Re: UTF-8 and XML::LibXML

by choroba (Cardinal)
on Nov 26, 2019 at 12:11 UTC


in reply to UTF-8 and XML::LibXML

The behaviour is documented in Encodings Support in XML::LibXML. I don't see what's strange about it: you create a byte string that you later parse as XML. The document declares its encoding as UTF-8, so that is how the content of $node is interpreted: the bytes are decoded into characters. If you encode the value as UTF-8, you get back the original bytes.
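
A minimal sketch of that round trip. The original post's code is not reproduced in this thread, so the document below and the names $bytes, $doc, $text and $back are assumptions made for illustration only:

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

# A byte string holding an XML document. The text content is the two
# bytes 0xC3 0x9A, which is U+00DA encoded as UTF-8.
my $bytes = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>}
          . chr(195) . chr(154)
          . qq{</root>};

# The parser decodes the bytes according to the declared encoding.
my $doc  = XML::LibXML->load_xml(string => $bytes);
my $node = $doc->documentElement;

# textContent returns a character string: one character, U+00DA.
my $text = $node->textContent;

# Encoding the character string gives back the original two bytes.
my $back = encode('UTF-8', $text);

printf "characters: %d, bytes after encode: %d\n",
    length($text), length($back);
```

Here length($text) is 1 and length($back) is 2, so encoding the value explicitly is what recovers the bytes that went in.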

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^2: UTF-8 and XML::LibXML
by davies (Prior) on Nov 26, 2019 at 12:17 UTC

    I have read that, but I am obviously missing the point somewhere. The point where it says "most functions of XML::LibXML that work with in-memory trees accept and return data as character strings (i.e. UTF-8 encoded with the UTF8 flag on)" made me think I would get the same encoded data out as I put in. This is repeated in the second of the basic rules and principles. I'm afraid I can't see anything indicating how to avoid the behaviour I see.

    Regards,

    John Davies

      The name "UTF-8 flag" is a misnomer. Strings with this flag are in Perl's internal Unicode representation, not UTF-8. When creating/loading XML, use bytes. When supplying values to methods, use Unicode strings (e.g. $element->appendText( $unicode );).
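
      A small sketch of that division of labour, assuming a freshly built document rather than the code from the original post (the names $doc, $root, $unicode and $xml are illustrative only). Values passed to methods are character strings; the serialised document comes out as bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);
use XML::LibXML;

# Build a document that declares UTF-8 as its encoding.
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('root');
$doc->setDocumentElement($root);

# A character string: the single character U+00DA, obtained here by
# decoding its two-byte UTF-8 form.
my $unicode = decode('UTF-8', chr(195) . chr(154));

# Methods working on the in-memory tree take character strings.
$root->appendText($unicode);

# Serialising the whole document yields bytes in the declared encoding,
# so the character appears again as 0xC3 0x9A.
my $xml = $doc->toString;

print index($xml, chr(195) . chr(154)) >= 0
    ? "serialised form contains the UTF-8 bytes\n"
    : "bytes not found\n";
```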


        I don't think I'm getting confused by the flag as I'm not trying to read or write it. It's the text that talks about character strings that are UTF-8 encoded that may be confusing me, since the output is decoded. I thought I was creating the XML using bytes in the 6th line of my code, but if I'm getting that wrong, I would be interested. But that's not the real problem as I'm getting the same two bytes in my code and in the real files. The only method to which I believe I'm supplying values is the parser. I believe that I am putting encoded data in and getting decoded data back. That is the problem I am trying to solve - I can't see from the docs how to get encoded data back.

        Regards,

        John Davies

      "UTF-8 encoded with the UTF8 flag on" means decoded strings (strings of Unicode Code Points), not strings encoded using UTF-8 (strings of bytes). This is the right thing to do.

      As such, $node contains decode("UTF-8", chr(195) . chr(154)), which is chr(218) ("\N{U+00DA}").
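
      The same equivalence can be checked with Encode alone, without involving the parser. This is only a sketch; the variable names $bytes and $chars are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes that appear in the file.
my $bytes = chr(195) . chr(154);

# Decoding turns them into one Unicode code point.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";                               # 2
print length($chars), "\n";                               # 1
print $chars eq chr(218)        ? "same\n" : "differ\n";  # same
print $chars eq "\N{U+00DA}"    ? "same\n" : "differ\n";  # same

# Encoding reverses the step and gives the original bytes.
print encode('UTF-8', $chars) eq $bytes ? "round trip ok\n" : "mismatch\n";
```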
