Composite Charset Data to UTF8?

AlexTape has asked for the wisdom of the Perl Monks concerning the following question:

Dear omniscient monks,

I am trying to convert big text files with mixed charsets to one consistent UTF-8 encoding.

However, my investigation into this topic has run into a black hole of ignorance. What is the best way to do it, especially with Perl?

Perhaps you can give me hints or some "simple" explanations of how you would do it? I know that there are CPAN modules to identify "non-UTF-8" characters, but on which level do they work? Is it sensible to work on the binary level, or to make a comparison on the hexadecimal level?

This is the first time I have really ventured into the whole charset jungle with Perl.

I'm still mind-mapping ;))

kindly, perlig

---- UPDATE ----

OK. Suppose the input looks like this:

A text file with 100 characters:
40 of them are Italian (it) in iso-8859-1 or windows-1252
20 of them are Greek (el) in iso-8859-7
all others are UTF-8

(see e.g. http://www.w3.org/International/O-charset-lang.html)

Now I want to process this data, but my parser can only read UTF-8. So I have to convert these 60 "non-UTF-8" characters to UTF-8 in some way.

got it? :)

I'm nearly overwhelmed :P Can you maybe tell me something about the existing guessing modules?!

kindly perlig

$perlig =~ s/pec/cep/g if 'errors expected';

Re: Composite Charset Data to UTF8? by Corion (Patriarch) on Jun 18, 2013 at 10:28 UTC
What do you mean by "composite charset"?

The only sane approach is to `Encode::decode` all data as you read it into Perl, and to `Encode::encode` the data to the intended target encoding as you write it. If you don't know the input encoding yet, you have to either use the existing guessing modules or come up with a way of your own to find the "best" possible input encoding of your file(s). For example, if you have a dictionary of your source language, you can guess the encoding of a document by finding certain byte sequences that correspond to a word or phrase in that source language.

There is very little we can do here without further information.

Update: According to your update, you have not exactly mojibake but still a horrible mess of encodings. Maybe you can still employ the approach of having well-known words or phrases to determine where a new encoding starts, but it will be much, much uglier and harder.
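A minimal sketch of the decode-on-input, encode-on-output idea for mixed data: try a strict UTF-8 decode on each chunk first and fall back to one assumed legacy encoding when that fails. The helper names `decode_chunk` and `to_utf8` are hypothetical, and the choice of fallback encoding is an assumption the caller has to supply; picking it reliably is exactly the guessing problem described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode one chunk of raw bytes into Perl characters.
# Strict UTF-8 is tried first; if the bytes are not valid UTF-8, the chunk
# is decoded with the caller-supplied fallback encoding (an assumption).
sub decode_chunk {
    my ($bytes, $fallback) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    return decode($fallback, $bytes);
}

# Hypothetical helper: normalise one chunk to UTF-8 bytes for output.
sub to_utf8 {
    my ($bytes, $fallback) = @_;
    return encode('UTF-8', decode_chunk($bytes, $fallback));
}

my $already_utf8 = "caf\xC3\xA9";    # "cafe" with e-acute, as UTF-8 bytes
my $greek_bytes  = "\xE1\xE2";       # alpha, beta as ISO-8859-7 bytes

for my $chunk ($already_utf8, $greek_bytes) {
    my $out = to_utf8($chunk, 'iso-8859-7');
    print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $out), "\n";
}
```

The weak spot is the fallback: a chunk of iso-8859-1 bytes decodes without error as iso-8859-7 too, so the sketch only separates "valid UTF-8" from "something else".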
Re^2: Composite Charset Data to UTF8? by AlexTape (Monk) on Jun 18, 2013 at 14:27 UTC
Topic updated :)

$perlig =~ s/pec/cep/g if 'errors expected';
Re: Composite Charset Data to UTF8? by Khen1950fx (Canon) on Jun 18, 2013 at 15:02 UTC
Encode::StdIO is what you're looking for. For example: `#!/usr/bin/perl use strict; use warnings; use Encode::StdIO encoding => 'utf-8';`

Your STDOUT and STDERR will automatically be encoded in UTF-8. Also, note that the author recommends Term::Encoding, so I would install that first, then Encode::StdIO.
Re^2: Composite Charset Data to UTF8? by AlexTape (Monk) on Jun 19, 2013 at 11:56 UTC
OK, that's like my first approach: `use utf8; use open ':std', ':encoding(UTF-8)'; use open IO => ':encoding(UTF-8)';`

But then I get internal errors like this:

utf8 "\xA9" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 780.
utf8 "\xAE" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 782.

Anyway, that is not really the core of the problem. Does anybody have a quick way to test a file for a consistent charset, e.g. true/false for whether the file is UTF-8 or not?! Can I say that the file is UTF-8 after `utf8::decode($_) or die "Input is not valid UTF-8";`, just to tell whether there is more than one charset in the file or not? Or is that part of the problem?!

kindly perlig

$perlig =~ s/pec/cep/g if 'errors expected';
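A rough sketch of the true/false test asked about here, built on the same `utf8::decode` call: read the file as raw bytes and collect the numbers of the lines that fail to decode. The helper name `non_utf8_lines` and the in-memory sample data are made up for illustration. An empty result only says that the bytes form valid UTF-8, not that the author intended UTF-8; a non-empty result at least shows where the other encodings are hiding.

```perl
use strict;
use warnings;

# Hypothetical helper: return the line numbers that are not valid UTF-8.
# The filehandle must deliver raw bytes (no :encoding layer).
sub non_utf8_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        my $copy = $line;                       # utf8::decode works in place
        push @bad, $. unless utf8::decode($copy);
    }
    return @bad;
}

# Demo on an in-memory "file": line 3 holds a lone 0xA9 byte
# (the copyright sign in iso-8859-1), which is not valid UTF-8.
my $data = "plain ascii\ncaf\xC3\xA9\ncopyright \xA9 sign\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @bad = non_utf8_lines($fh);
close $fh;

print @bad
    ? "Not pure UTF-8, check line(s): @bad\n"
    : "Every line is valid UTF-8\n";
```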
Re^3: Composite Charset Data to UTF8? by Corion (Patriarch) on Jun 19, 2013 at 12:07 UTC
Have a look at the encoding rules of UTF-8. A valid UTF-8 sequence starts either with `0b0xxxxxxx` or with `0b11xxxxxx`. So any sequence starting with an octet of the form `0b10xxxxxx` is invalid UTF-8, and both of your bytes have that form:

`> perl -wle "print sprintf '%08b', $_ for (0xa9,0xae)"`
`10101001`
`10101110`

An untested easy check could be to match your string against `/[\x80-\xBF]/`, which are the hex representations of the bit patterns we've identified:

`perl -wle "print sprintf '%08b - %02x', $_,$_ for (0b10000000,0b10111111)"`
`10000000 - 80`
`10111111 - bf`

Keep in mind that these octets are legal as continuation bytes after a `0b11xxxxxx` lead byte, so the pattern also matches valid multi-byte UTF-8. It only proves a problem where such an octet appears without a lead byte in front of it.
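The bit patterns above can be turned into a small structural check that walks the octets and tracks how many continuation bytes are still expected. This is only a sketch: the name `looks_like_utf8` is hypothetical, and the check looks at the leading bits alone, so it does not reject overlong forms or surrogates the way a strict decoder would.

```perl
use strict;
use warnings;

# Hypothetical helper: check UTF-8 structure using the leading bits only.
# Returns 1 if every lead byte is followed by the right number of
# 10xxxxxx continuation bytes, and 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $pending = 0;    # continuation bytes still expected
    for my $octet (unpack 'C*', $bytes) {
        if ($pending) {
            return 0 unless ($octet & 0xC0) == 0x80;       # must be 10xxxxxx
            $pending--;
        }
        elsif (($octet & 0x80) == 0x00) { }                # 0xxxxxxx: ASCII
        elsif (($octet & 0xE0) == 0xC0) { $pending = 1 }   # 110xxxxx: 2-byte lead
        elsif (($octet & 0xF0) == 0xE0) { $pending = 2 }   # 1110xxxx: 3-byte lead
        elsif (($octet & 0xF8) == 0xF0) { $pending = 3 }   # 11110xxx: 4-byte lead
        else                            { return 0 }       # stray 10xxxxxx or 11111xxx
    }
    return $pending == 0 ? 1 : 0;    # a truncated sequence at the end is invalid
}

# The two bytes from the error messages, plus a valid two-byte sequence.
for my $sample ("\xA9", "\xAE", "caf\xC3\xA9") {
    printf "%-12s %s\n",
        join(' ', map { sprintf '%02x', $_ } unpack 'C*', $sample),
        looks_like_utf8($sample) ? 'ok' : 'not UTF-8';
}
```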

