in reply to how to check the encoding of a file
To elaborate on the first reply (which is basically correct, provided you really are only deciding between Latin-1 and utf8): see the section on "Handling Malformed Data" in the Encode man page. Basically, use the "decode" function like this:
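#!/usr/bin/perl
use strict;
use warnings;
use Encode;

die "Usage: $0 latin1.file > utf8.file"
    unless ( @ARGV == 1 and -f $ARGV[0] );

# Slurp the whole file as raw bytes.
open( my $fh, "<:raw", $ARGV[0] ) or die "$ARGV[0]: $!";
my $bytes = do { local $/; <$fh> };
close $fh;

my $utf8;
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# $bytes untouched so the fallback below sees the original data.
eval { $utf8 = decode( "utf8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
if ( $@ ) {
    # input was not utf8
    $utf8 = decode( "iso-8859-1", $bytes, Encode::FB_WARN );
}
binmode STDOUT, ":utf8";
print $utf8;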
If you have to decide between utf8, Latin-1, Latin-2, Cyrillic, Greek, etc., then you have a harder job: the other single-byte encodings use the same 8-bit range as Latin-1, so Encode will happily pretend that they are all Latin-1, thereby converting them all to the wrong set of utf8 characters.
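To see why decoding alone can't settle it, here is a minimal sketch that decodes one and the same high-bit byte under four single-byte encodings. The helper name codepoints_for and the choice of byte 0xE6 are mine, purely for illustration; every one of these decodes succeeds without complaint, just with a different character each time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decode the same byte string
# under each named encoding and return encoding => code point of the
# first decoded character.
sub codepoints_for {
    my ( $bytes, @encodings ) = @_;
    my %seen;
    for my $enc (@encodings) {
        my $copy = $bytes;    # work on a copy so the input is never altered
        $seen{$enc} = ord( decode( $enc, $copy ) );
    }
    return %seen;
}

# One byte, four equally "valid" interpretations.
my %cp = codepoints_for( "\xE6", qw(iso-8859-1 iso-8859-2 iso-8859-5 iso-8859-7) );
printf "%-10s U+%04X\n", $_, $cp{$_} for sort keys %cp;
```

None of these calls fails, so there is no error to catch in an eval the way there is with malformed utf8.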
Encode::Guess probably won't help you in that case -- you need to train up some bigram character models for each language... (well, maybe the Lingua:: namespace on CPAN has something to handle this by now).
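As a rough illustration of that limitation, the sketch below hands guess_encoding (exported by Encode::Guess) a byte that is legal in two single-byte suspects. The wrapper name describe_guess and the sample byte are assumptions of mine. When the guess succeeds the return value is an encoding object; otherwise it is a plain string describing the failure, which is what I'd expect for ambiguous input like this.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical wrapper: return the guessed encoding's name, or the
# message Encode::Guess gives when it cannot pick a single suspect.
sub describe_guess {
    my ( $bytes, @suspects ) = @_;
    my $result = guess_encoding( $bytes, @suspects );
    return ref($result)
        ? "guessed: " . $result->name
        : "no unique guess: $result";
}

# 0xE6 is a valid character in both Latin-1 and Latin-2.
print describe_guess( "\xE6", qw(iso-8859-1 iso-8859-2) ), "\n";
```

The Encode::Guess documentation itself warns that the ISO-8859 series and other single-byte encodings do not work well unless one of them is the only suspect, which is why some statistical model of the language is needed to go further.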