PerlMonks
Re: Unicode nightmare
by Thelonius (Priest) on Jul 28, 2006 at 03:04 UTC ( [id://564267]=note )
(1) Make sure you don't lose any metadata that comes with the text (e.g. the charset parameter in a MIME Content-Type header).
(2) If your text includes the ESCAPE character, it may contain ISO-2022 shift sequences that identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course, "\x24\x2A\x45" are the characters $ * E.) You don't have to understand G0, G1 and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know whether Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
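To make point (2) concrete, here is a rough sketch, not a complete ISO-2022 parser. It assumes the registry's column/row notation maps to a byte as column * 16 + row (so 2/4 is 0x24), and that a designation sequence is ESC, one or more intermediate bytes in 0x20-0x2F, then a final byte in 0x30-0x7E. The function names iso_ir_to_bytes and find_escape_sequences are made up for this illustration; they don't come from Encode or any other module.

```perl
use strict;
use warnings;

# Convert ISO-IR "column/row" notation (e.g. "ESC 2/4 2/10 4/5") to raw bytes.
# Each column/row pair is one byte with value column * 16 + row.
sub iso_ir_to_bytes {
    my ($notation) = @_;
    my $bytes = '';
    for my $tok (split ' ', $notation) {
        if ($tok eq 'ESC') {
            $bytes .= "\e";
        }
        elsif ($tok =~ m{^(\d+)/(\d+)$}) {
            $bytes .= chr($1 * 16 + $2);
        }
        else {
            die "Unrecognized token '$tok'";
        }
    }
    return $bytes;
}

# Return every candidate designation sequence found in a byte string:
# ESC, one or more intermediate bytes (0x20-0x2F), one final byte (0x30-0x7E).
sub find_escape_sequences {
    my ($data) = @_;
    my @found;
    while ($data =~ /(\e[\x20-\x2F]+[\x30-\x7E])/g) {
        push @found, $1;
    }
    return @found;
}

# Usage: show the bytes behind the registry #165 notation quoted above.
my $seq = iso_ir_to_bytes('ESC 2/4 2/10 4/5');
print join(' ', map { sprintf '%02X', ord } split //, $seq), "\n";   # 1B 24 2A 45
```

Running find_escape_sequences over your raw data (as bytes, before any decoding) gives you the sequences to look up in the registry PDFs. It only recognizes them; it does not track G0/G1/G2 state or convert anything to Unicode.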
Some general character set links: