Well, actually this is not quite the whole story (sorry davorg and princepawn ;--).
- XML does _not_ specify the encoding of the characters in a document,
- it strongly encourages the use of UTF-8 or UTF-16 (which are 2 ways of encoding Unicode characters), in fact XML parsers are only required to recognized those 2 encodings,
- if the encoding is _not_ UTF-8 or UTF-16 the the XML declaration must specify the encoding of the document, which hopefully the parser will understand,
- XML::Parser only understands UTF-8, UTF-16 and ISO-8859-1 (latin-1, the encoding commonly used in Western Europe),
- US-ASCII (non accented ASCII characters, all characters (but not control characters) under 127 is a subset of UTF-8. Which means that if you only have to deal with US/English XML data you don't have to bother about it (for now),
- XML::Encodings adds support to a whole lot of common encodings (I think the only one really missing is one of the chinese encodings),
- XML::Parser converts all characters to UTF-8 before passing them to the calling application,
- the cleanest way to go back from UTF-8 to whatever encoding your system likes is to use the Text::Iconv module, provided your system has the iconv library installed,
- a dirty (but sometimes useful) hack is to use the original_string method to get the... original string (pre-UTF-8 conversion), but then you will have to parse start and end tags to extract tag names and attributes,
- if you are converting your XML to HTML you might also want to have a look at HTML::Entities.
One last info: UTF-8 support is now pretty good in Perl but you will have to wait for 5.8 to get UTF-8 hash keys (important for attribute names) and full regexp support.