PerlMonks
Re: XML and entities, what am I doing wrong?
by gildir (Pilgrim)
on Jun 08, 2001 at 17:19 UTC ( [id://86894]=note )
The real problem is Unicode.

Most Perl XML modules are built on top of expat, or on XML::Parser, which is an interface to expat. Expat is an XML parser: it takes your XML (XHTML) document and processes its tags and so on. But because XML is fundamentally based on Unicode, expat converts all of your characters to Unicode and hands them back as UTF-8. For this conversion to work properly, you need a valid encoding declared in the XML header:

<?xml version='1.0' encoding='iso-8859-2'?>

This is the primary reason for the odd characters you are seeing. They are the UTF-8 representation of your non-English characters, where each such character takes more than one byte.

You probably want to avoid this conversion. I had a similar problem about a year ago, but found no useful solution. XML::Parser has an original_string method that returns character data in the original encoding, but it won't expand entities. And there is no way to get attributes in the original encoding. The best workaround is to use Unicode::Map8 to map all Unicode strings back to their original encoding, but this is terribly slow for frequent use. So I wrote my own poor man's XML parser based on Perl patterns. That is not a solution, though, only a hack.

If you plan to use XML, you had better move to Unicode completely.

PS: I wonder how XML::Twig implements its keep_encoding option. By forcing expat to behave reasonably, or by converting back to the original charset?
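The conversion described above is easy to reproduce outside Perl. The following is only a minimal sketch, written in Python because its standard-library ElementTree parser is built on the same expat library; it is not how XML::Parser or XML::Twig do it. The sample document, the `roundtrip` helper, and the choice of the byte 0xB9 ("š" in ISO-8859-2) are assumptions made for illustration. It shows that the parser returns Unicode text for both character data and attributes, that the "odd characters" are simply the UTF-8 bytes of that text, and that encoding the text again gives back the original bytes.

```python
# Sketch: an expat-based parser decodes ISO-8859-2 input to Unicode;
# the caller can re-encode the result to get the original bytes back.
import xml.etree.ElementTree as ET  # ElementTree's XML parser wraps expat

# Hypothetical sample document, held as raw bytes.
# 0xB9 is the letter "š" (U+0161) in ISO-8859-2.
RAW = (b"<?xml version='1.0' encoding='iso-8859-2'?>"
       b"<p title='\xb9'>\xb9 &amp; \xb9</p>")


def roundtrip(raw, encoding):
    """Parse raw XML bytes and return (text, attribute, text as UTF-8 bytes,
    text re-encoded in the original charset)."""
    root = ET.fromstring(raw)          # the parser reads the encoding declaration
    text = root.text                   # character data, entities already expanded
    attr = root.get("title")           # attribute value, also Unicode
    as_utf8 = text.encode("utf-8")     # what you see if this is printed as raw bytes
    restored = text.encode(encoding)   # back-conversion to the original charset
    return text, attr, as_utf8, restored


if __name__ == "__main__":
    text, attr, as_utf8, restored = roundtrip(RAW, "iso-8859-2")
    print(repr(text), repr(attr))
    print(as_utf8)
    print(restored)
```

In this sketch the single byte 0xB9 becomes the two bytes 0xC5 0xA1 in UTF-8, which is what shows up as garbage on a terminal or page that still expects ISO-8859-2. Re-encoding after parsing keeps entity expansion and works for attributes too, at the cost of one extra conversion per string.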