Encode::Guess might help you on your way.
In general, differentiating between various 8-bit character sets is a hairy problem. If you have nothing else to go on besides the text files, I suspect you're going to need clever heuristics. But try Encode::Guess first; it might be enough.
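For what it's worth, here is a minimal sketch of what calling Encode::Guess looks like. The sample bytes and the list of suspect encodings are my own assumptions for illustration; substitute whatever encodings you actually expect. Note that the module's documentation warns that the ISO-8859 series and other single-byte encodings do not work well as suspects, so this is more of a first filter than an answer.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: five hiragana characters encoded as EUC-JP bytes.
my $bytes = "\xA4\xB3\xA4\xF3\xA4\xCB\xA4\xC1\xA4\xCF";

# guess_encoding() returns an encoding object on success. On failure
# (no suspect fits, or several fit) it returns a plain string that
# describes the problem, so test the result with ref().
my $enc = guess_encoding($bytes, qw/euc-jp shiftjis 7bit-jis/);

if (ref $enc) {
    my $text = $enc->decode($bytes);
    printf "Guessed %s, %d characters\n", $enc->name, length $text;
}
else {
    print "Could not guess: $enc\n";
}
```

When more than one suspect decodes the data cleanly, the returned string lists the names that remain, which is exactly the ambiguity you will see with 8-bit character sets.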
(1) Make sure you don't lose any metadata that comes with the text (e.g. charset parameter in MIME Content-type)
(2) If your text includes the ESCAPE character, it may have ISO-2022 shift sequences in it that identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course "\x24\x2A\x45" are the characters $ * E.)
You don't have to understand G0, G1 and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
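A rough sketch of spotting those sequences, assuming the usual ISO-2022 shape of ESC, then zero or more intermediate bytes in columns 2/0 to 2/15, then one final byte in 3/0 to 7/14. The sub name iso2022_escapes is made up for this example, and it only finds the sequences; mapping them to character set names still requires the registry.

```perl
use strict;
use warnings;

# Return the bytes following each ESC that fit the ISO-2022 pattern:
# intermediates \x20-\x2F (possibly none), then one final byte \x30-\x7E.
sub iso2022_escapes {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /\e([\x20-\x2F]*[\x30-\x7E])/g) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical input containing the registry #165 sequence from above
# and ESC 2/8 4/2.
my $sample = "plain \e\x24\x2A\x45 more \e\x28\x42 text";

for my $seq (iso2022_escapes($sample)) {
    # Print each byte in the registry's column/row notation.
    my @pos = map { sprintf '%d/%d', ord($_) >> 4, ord($_) & 0x0F }
              split //, $seq;
    print "ESC @pos\n";
}
```

The printed column/row pairs can be compared directly against the escape sequences given in the registry PDFs.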
Some general character set links:
No. The encoding is what determines that 65 66 67 should be displayed as ABC (or something else). There's nothing attached to "65" that would indicate US-ASCII should be used.
However, there are ways of determining the probable encoding.
- Searching for the BOM of Unicode encodings.
- Eliminating character sets based on the presence of non-printable or undefined characters.
- Dictionary validation. Check if the text becomes readable when treated as a particular encoding.
- Statistical approaches such as frequency analysis.
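The first two approaches are easy to sketch with the core Encode module. The sub names bom_encoding and plausible_encodings are hypothetical, and the candidate list is whatever you choose to pass in. Treat this as a starting point, not a finished detector.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte-order marks, longest first so that UTF-32LE is not mistaken
# for UTF-16LE (its BOM starts with the same two bytes).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Return the encoding name implied by a leading BOM, or nothing.
sub bom_encoding {
    my ($bytes) = @_;
    for my $pair (@BOMS) {
        my ($bom, $name) = @$pair;
        return $name if index($bytes, $bom) == 0;
    }
    return;
}

# Keep only the candidates that decode without error and yield no
# control characters other than tab, LF and CR.
sub plausible_encodings {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        my $text = eval {
            decode($name, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        next unless defined $text;             # undefined or malformed input
        next if $text =~ /[^\P{Cc}\t\n\r]/;    # stray control character
        push @ok, $name;
    }
    return @ok;
}

print join(', ', plausible_encodings("caf\xE9", qw(UTF-8 iso-8859-1 cp1252))), "\n";
```

In that last call UTF-8 is eliminated because a lone \xE9 is malformed, but both single-byte candidates survive. Elimination narrows the field; it rarely picks a winner among 8-bit sets.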
Good luck!
You might want to make a proper link to a concrete module next time.
Encode-Detect
Rhesa, thanks for the Encode::Guess module. I thought I'd pored over all of the Encode namespace, but I missed that one.
ikegami, thanks for letting me know what I'm in for. I had guessed that, worst case, I was going to be doing a lot of evals to catch the errors thrown when using the wrong encoding. And since no human will be part of the process and I don't have any statistics, telling a Latin capital A with diaeresis from a Greek capital delta will probably be impossible. The best I can hope for is to minimize the unintelligible characters.
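Roughly what I have in mind for minimizing the unintelligible characters, as an untested-in-production sketch: decode with each candidate, count the replacement characters and stray control characters, and keep the candidate with the lowest count. The sub names badness and least_bad are my own, and ties simply go to whichever candidate is listed first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count U+FFFD replacement characters plus control characters other
# than tab, LF and CR in the decoded text.
sub badness {
    my ($bytes, $name) = @_;
    # With no CHECK argument, decode() substitutes U+FFFD for anything
    # it cannot map instead of dying.
    my $text = decode($name, $bytes);
    my $bad  = () = $text =~ /\x{FFFD}|[^\P{Cc}\t\n\r]/g;
    return $bad;
}

# Return the candidate with the fewest bad characters.
sub least_bad {
    my ($bytes, @candidates) = @_;
    my ($best, $best_score);
    for my $name (@candidates) {
        my $score = badness($bytes, $name);
        ($best, $best_score) = ($name, $score)
            if !defined $best_score || $score < $best_score;
    }
    return $best;
}

print least_bad("caf\xE9", qw(UTF-8 iso-8859-1)), "\n";
```

This still can't separate ISO-8859-1 from ISO-8859-7 for a byte like \xC4, since both decode it to a perfectly valid letter and both score zero.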