http://qs321.pair.com?node_id=564168

perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

I deal with a large amount of text from random sources all over the world. The text is often in a legacy encoding. Is there a way to automagically determine the encoding, to help ease the pain of converting it to UTF-8?

Replies are listed 'Best First'.
Re: Unicode nightmare
by rhesa (Vicar) on Jul 27, 2006 at 17:22 UTC
    Encode::Guess might help you on your way.

    In general, differentiating between various 8-bit character sets is a hairy problem. If you have nothing else to go on besides the text files, I suspect you're going to need clever heuristics. But try Encode::Guess first; it might be enough.
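
    A minimal sketch of the Encode::Guess approach (the sample bytes and the candidate list here are made up for illustration; tune the suspects to what your sources actually produce):

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical input: "café" as Latin-1 bytes (0xE9 = é).
my $bytes = "caf\xE9";

# guess_encoding() checks ASCII, UTF-8, and BOM'd UTF-16/32 by default;
# extra 8-bit suspects must be listed explicitly. Keep the list short:
# with many similar candidates it returns an ambiguity error string.
my $enc = guess_encoding($bytes, qw(latin1));
die "Can't guess: $enc" unless ref $enc;

my $text = $enc->decode($bytes);    # now a Perl character string
my $utf8 = encode('UTF-8', $text);  # re-encoded as UTF-8 bytes
print $enc->name, "\n";
```

    On failure guess_encoding() returns an error message string instead of an encoding object, hence the ref() check.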

      ++. Exactly!

Re: Unicode nightmare
by Thelonius (Priest) on Jul 28, 2006 at 03:04 UTC
    (1) Make sure you don't lose any metadata that comes with the text (e.g. charset parameter in MIME Content-type)

    (2) If your text includes the ESCAPE character, it may contain ISO-2022 shift sequences that identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file; there doesn't seem to be a comprehensive table anywhere on the internet. Note that when ISO registry #165 says the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course, "\x24\x2A\x45" are the characters "$*E".)

    You don't have to understand G0, G1, and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know whether Encode handles ISO-2022 encodings generally; ICU handles the more commonly used parts.
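
    The recognition step can be sketched like this; the lookup table is a small, made-up sample (the full set of designation sequences lives in the ISO-IR registry PDFs mentioned above):

```perl
use strict;
use warnings;

# A few ISO-2022 designation sequences; this table is deliberately
# incomplete and only illustrative.
my %designation = (
    "\e\x28\x42"     => 'US-ASCII to G0 (ESC 2/8 4/2)',
    "\e\x24\x42"     => 'JIS X 0208-1983 to G0 (ESC 2/4 4/2)',
    "\e\x24\x29\x43" => 'KS C 5601 to G1 (ESC 2/4 2/9 4/3)',
    "\e\x24\x2A\x45" => 'ISO-IR 165 to G2 (ESC 2/4 2/10 4/5)',
);

sub iso2022_designations {
    my ($bytes) = @_;
    my @found;
    # An ISO-2022 escape sequence is ESC, one or more intermediate
    # bytes (0x20-0x2F), then a final byte (0x40-0x7E).
    while ($bytes =~ /(\e[\x20-\x2F]+[\x40-\x7E])/g) {
        push @found, $designation{$1} // sprintf 'unknown: %v02X', $1;
    }
    return @found;
}

# Sample: designate JIS X 0208, two bytes of text, back to ASCII.
print "$_\n" for iso2022_designations("\e\x24\x42\x30\x21\e\x28\x42");
```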

    Some general character set links:

Re: Unicode nightmare
by ikegami (Patriarch) on Jul 27, 2006 at 17:32 UTC

    No. The encoding is what determines that 65 66 67 should be displayed as ABC (or something else). There's nothing attached to "65" that would indicate US-ASCII should be used.

    However, there are ways of determining the probable encoding.

    • Searching for the BOM of the Unicode encodings.
    • Eliminating character sets based on the presence of non-printable or undefined characters.
    • Dictionary validation: check whether the text becomes readable when treated as a particular encoding.
    • Statistical approaches such as character-frequency analysis.
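
    The first bullet is the easiest to sketch; here's one way to check a buffer's leading bytes for a Unicode BOM (note the longer UTF-32 patterns must be tested before UTF-16, since "\xFF\xFE" is a prefix of the UTF-32LE BOM):

```perl
use strict;
use warnings;

# Return the encoding implied by a leading BOM, or undef if there
# is none (in which case fall back to the other heuristics).
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return undef;
}

print bom_encoding("\xEF\xBB\xBFhello"), "\n";  # UTF-8
```

    Of course, plenty of real-world UTF-8 has no BOM at all, so a miss here proves nothing.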

    Good luck!

Re: Unicode nightmare
by allolex (Curate) on Jul 27, 2006 at 17:19 UTC

    You might want to have a look at the Encode:: namespace on CPAN.

      You might want to make a proper link to a concrete module next time.

      Encode-Detect
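
      Encode::Detect wraps Mozilla's universal charset detector; it's a CPAN module, not core, so this sketch loads it at run time in case it isn't installed:

```perl
use strict;
use warnings;

# Encode::Detect is from CPAN, so probe for it rather than "use" it.
if (eval { require Encode::Detect::Detector; 1 }) {
    # detect() returns a charset name (e.g. "UTF-8") or undef when
    # the detector has no opinion.
    my $charset = Encode::Detect::Detector::detect("\xEF\xBB\xBFsome text");
    print defined $charset ? $charset : 'no guess', "\n";
}
else {
    print "Encode::Detect not installed\n";
}
```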

Re: Unicode nightmare
by perlmonkey2 (Beadle) on Jul 27, 2006 at 17:57 UTC
    Rhesa, thanks for the Encode::Guess module. I thought I'd pored over the whole Encode namespace, but I missed that one. ikegami, thanks for letting me know what I'm in for. I had guessed that, in the worst case, I'd be doing a lot of evals, catching the errors thrown when the wrong encoding is used. And since no human will be part of the process and I don't have any statistics, distinguishing a Latin capital A with diaeresis from a Greek capital delta will probably be impossible. The best I can hope for is to minimize the unintelligible characters.
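
    That eval-per-encoding worst case can be sketched like this (the candidate list and sample bytes are made up; order the list by likelihood for your sources):

```perl
use strict;
use warnings;
use Encode ();

# Try each candidate encoding in turn; keep the first that decodes
# without throwing an error.
sub first_clean_decode {
    my ($bytes, @candidates) = @_;
    for my $enc (@candidates) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from consuming the source buffer.
        my $text = eval {
            Encode::decode($enc, $bytes,
                           Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return ($enc, $text) if defined $text;
    }
    return;  # nothing decoded cleanly
}

my ($enc, $text) = first_clean_decode("caf\xE9", 'UTF-8', 'ISO-8859-1');
print "$enc\n";  # a lone 0xE9 is invalid UTF-8, so ISO-8859-1 wins
```

    The catch, as noted above, is that almost any byte sequence decodes "cleanly" as some single-byte encoding, so this only rules encodings out; it can't pick between, say, Latin-1 and Greek without statistics or a human.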