Re^4: UTF-8 and XML::LibXML

I don't think I'm getting confused by the flag as I'm not trying to read or write it. It's the text that talks about character strings that are UTF-8 encoded that may be confusing me, since the output is decoded. I thought I was creating the XML using bytes in the 6th line of my code, but if I'm getting that wrong, I would be interested. But that's not the real problem as I'm getting the same two bytes in my code and in the real files. The only method to which I believe I'm supplying values is the parser. I believe that I am putting encoded data in and getting decoded data back. That is the problem I am trying to solve - I can't see from the docs how to get encoded data back.

Regards,

John Davies

Re^5: UTF-8 and XML::LibXML by ikegami (Patriarch) on Nov 26, 2019 at 20:37 UTC
"I can't see from the docs how to get encoded data back"

You might not be able to. The whole point of the parser is to extract the information represented by the XML document, no matter how it's encoded using XML. You shouldn't have to care whether "Ú" is stored as `Ú`, bytes `C3 9A` (in an XML document that uses UTF-8), or byte `DA` (in an XML document that uses cp1252). Nor should you want to know.
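A minimal sketch of that point, using only the core Encode module rather than XML::LibXML itself. The variable names are illustrative, and the byte values are the ones mentioned above. Decoding either byte sequence with the matching encoding should give the same one-character string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, U+00DA, as it might arrive in two differently
# encoded documents.
my $utf8_bytes   = "\xC3\x9A";   # U+00DA encoded as UTF-8
my $cp1252_bytes = "\xDA";       # U+00DA encoded as cp1252

# Decoding turns bytes into a character string; the original encoding
# is no longer visible afterwards.
my $from_utf8   = decode('UTF-8',  $utf8_bytes);
my $from_cp1252 = decode('cp1252', $cp1252_bytes);

printf "UTF-8 input  -> U+%04X\n", ord $from_utf8;
printf "cp1252 input -> U+%04X\n", ord $from_cp1252;
print  "Same character: ", ($from_utf8 eq $from_cp1252 ? "yes" : "no"), "\n";
```

Once both inputs compare equal as character strings, nothing in the result says which encoding the document used, which is why the parser has nothing "encoded" to hand back.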
Re^5: UTF-8 and XML::LibXML by choroba (Cardinal) on Nov 26, 2019 at 12:56 UTC
You're not putting any data in. You're creating the XML using bytes, and you're getting decoded data back from a method, exactly as documented. This is what putting decoded data in means:

`my ($container) = $dom->findnodes('/container'); my $n2 = $container->appendChild('XML::LibXML::Element'->new('node2')); $n2->appendText("\N{LATIN CAPITAL LETTER U WITH ACUTE}"); binmode STDOUT, ':encoding(UTF-8)'; print $dom;`
Re^5: UTF-8 and XML::LibXML by haj (Vicar) on Nov 26, 2019 at 13:33 UTC
"I can't see from the docs how to get encoded data back"

I didn't find a way either, but this is probably intentional, because you shouldn't. It is bad practice. As soon as Perl has parsed your document into a tree, it is entitled to forget which encoding it was delivered in. If you want encoded data back, then you get to choose the encoding and do the encoding yourself.

I also think that lots of Perl module documentation should be revisited with regard to the ominous "UTF-8 flag". The parenthetical "(UTF-8 encoded with UTF8 flag on)" is at least misleading and should best be eradicated: the relevant thing is "character string", as opposed to "binary" string ("bytes" and "encoded" strings are binary for that purpose). For the user of any module, it isn't relevant which encoding Perl uses to store character strings internally.
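A rough sketch of "choose the encoding and encode it yourself", assuming the text has already come back from the parser as a character string. Here `$text` is a hypothetical stand-in for such a value, and only the core Encode module is used:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a character string returned by the parser:
# one character, U+00DA (LATIN CAPITAL LETTER U WITH ACUTE).
my $text = "\x{DA}";

# The caller picks the encoding; the character string itself has none.
my $utf8   = encode('UTF-8',  $text);
my $cp1252 = encode('cp1252', $text);

printf "characters:   %d\n", length $text;
printf "UTF-8 bytes:  %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8;
printf "cp1252 bytes: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $cp1252;
```

The same character string yields two bytes in UTF-8 and one in cp1252, so the choice of output encoding belongs to whoever writes the data out, not to the parser.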
Re^6: UTF-8 and XML::LibXML by davies (Prior) on Nov 26, 2019 at 13:49 UTC
Thanks. If the intention of the module is to decode data, I can react accordingly. I would also agree that the documentation could be improved, but I do have a tendency to rant about documentation and don't think that it would help solve my problem.

Regards,

John Davies

