
(1) Make sure you don't lose any metadata that comes with the text (e.g. charset parameter in MIME Content-type)
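As a rough illustration of point (1), here is a minimal sketch in Python (not Perl, purely for illustration) that keeps the declared charset together with the raw bytes instead of throwing the header away. The function name `text_and_charset` and the `us-ascii` fallback are my own assumptions, not anything from the original thread.

```python
from email import message_from_bytes


def text_and_charset(raw_message, default="us-ascii"):
    """Return (body_bytes, charset) for a single-part MIME message.

    The charset comes from the Content-Type header's charset parameter.
    `default` is an assumed fallback used when no charset is declared.
    """
    msg = message_from_bytes(raw_message)
    # get_content_charset() returns the charset lowercased, or None.
    charset = msg.get_content_charset() or default
    # decode=True undoes any Content-Transfer-Encoding and yields bytes.
    body = msg.get_payload(decode=True)
    return body, charset


raw = (b"Content-Type: text/plain; charset=ISO-2022-JP\r\n"
       b"\r\n"
       b"\x1b$BF|K\\8l\x1b(B")
body, charset = text_and_charset(raw)
print(charset)
```

The point is only that the bytes and their label travel together; decoding can then happen later with whatever tool you trust.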

(2) If your text includes the ESCAPE character, it may have ISO-2022 shift sequences in it which identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! The registry writes each byte as column/row, so the byte value is 16 * column + row. When ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course "\x24\x2A\x45" are the characters $ * E.)
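The column/row notation can be converted mechanically. The following is a small sketch in Python (for illustration only); the helper name `iso_ir_to_bytes` is made up, and it assumes the sequence is written as whitespace-separated tokens, each either `ESC` or `column/row`.

```python
def iso_ir_to_bytes(notation):
    """Convert registry notation such as 'ESC 2/4 2/10 4/5' to raw bytes."""
    out = bytearray()
    for token in notation.split():
        if token.upper() == "ESC":
            out.append(0x1B)  # the ESCAPE character itself
        else:
            column, row = token.split("/")
            # Each byte is written column/row: value = 16 * column + row.
            out.append(int(column) * 16 + int(row))
    return bytes(out)


print(iso_ir_to_bytes("ESC 2/4 2/10 4/5"))  # b'\x1b$*E'
```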

You don't have to understand G0, G1, and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
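To show what "recognizing" can look like without tracking G0/G1/G2 state, here is a hedged Python sketch that just locates escape sequences in a byte string. It relies on the general Ecma-35 shape of an escape sequence: ESC, then zero or more intermediate bytes (2/0 to 2/15), then one final byte (3/0 to 7/14). The `KNOWN` table is a tiny hand-picked sample I added for illustration, not a comprehensive list; the ISO-IR 165 entry is the one quoted above.

```python
import re

# ESC, intermediate bytes 0x20-0x2F, final byte 0x30-0x7E.
ESCAPE_SEQ = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")

# Illustrative sample only; the real registry is far larger.
KNOWN = {
    b"\x1b(B": "ISO-IR 6 (ASCII) in G0",
    b"\x1b$B": "ISO-IR 87 (JIS X 0208-1983) in G0",
    b"\x1b$*E": "ISO-IR 165 in G2",
}


def find_escape_sequences(data):
    """Return (offset, sequence, label) for each escape sequence found."""
    return [(m.start(), m.group(), KNOWN.get(m.group(), "unknown"))
            for m in ESCAPE_SEQ.finditer(data)]


# Python's own iso2022_jp codec is used here only to produce sample bytes.
sample = "日本語".encode("iso2022_jp")
for offset, seq, label in find_escape_sequences(sample):
    print(offset, seq, label)
```

Anything reported as "unknown" would still need to be looked up in the registry by hand.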

Some general character set links:

ICU - International Components for Unicode in C and Java. It has extremely useful data (in the "data" subdirectory) even if you are not planning to use the code.
http://oss.software.ibm.com/icu/
ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences *groan*
http://www.itscj.ipsj.or.jp/ISO-IR/
IANA character set registry
http://www.iana.org/assignments/character-sets
Ecma Standards. E.g. Ecma-35 is the same standard as ISO-2022, but it's free!
http://www.ecma-international.org/publications/standards/Standard.htm
Character Model for the World Wide Web
http://www.w3.org/TR/charmod/
Unicode
http://www.unicode.org/
especially
http://www.unicode.org/Public/UNIDATA/
Roman Czyborra's informative web site
http://czyborra.com/

In reply to Re: Unicode nightmare by Thelonius
in thread Unicode nightmare by perlmonkey2
