Find encoding that should have been used

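Trying to debug an encoding problem? Given the text you expected and the garbled text you actually got, the following tries every pair of encodings to figure out which encoding was wrongly used to decode the text and which encoding should have been used.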

use 5.008;

use strict;
use warnings;

use Encode qw( encode decode );

use charnames qw( :full );

# The text that should have been displayed.
my $expected
    = "\N{LATIN CAPITAL LETTER I WITH CIRCUMFLEX}";

# The garbled text that was displayed instead.
my $got
    = "\N{LATIN CAPITAL LETTER A WITH TILDE}"
    . "\N{LATIN CAPITAL LETTER Z WITH CARON}";

my @encs = (
    'US-ASCII',
    ( map "UTF-$_",      qw( 7 8 16be 16le 32be 32le ) ),
    ( map "UCS-$_",      qw( 2be 2le 4be 4le         ) ),
    ( map "iso-8859-$_", 1..11, 13..16                 ),
    ( map "Windows-$_",  437, 737, 775, 850, 852, 855,    # OEM pages
                         857, 858, 860, 861, 862, 863,
                         865, 866, 869,
                         874, 932, 936, 949, 950,         # ANSI pages
                         1250..1258,                   ),
);

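for my $enc_for_enc (@encs) {
    # Encode the expected text with each candidate encoding...
    my $encoded = encode($enc_for_enc, $expected);

    for my $enc_for_dec (@encs) {
        # ...then decode the resulting bytes with every candidate.
        my $decoded = decode($enc_for_dec, $encoded);

        # Report only the pairs that reproduce the garbled text.
        next if $decoded ne $got;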

        print("$enc_for_enc as $enc_for_dec:\n");

        for ($decoded =~ /./sg) {
            my $code = ord;
            my $name = charnames::viacode($code);
            printf("(U+%04X) %s\n", $code, $name);
        }

        print("\n");
    }
}
Output:

UTF-8 as Windows-1252:
(U+00C3) LATIN CAPITAL LETTER A WITH TILDE
(U+017D) LATIN CAPITAL LETTER Z WITH CARON
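
To see why that pair matches, it can help to look at the bytes in between. The following is a minimal sketch rather than part of the program above; the helpers hex_bytes and code_points are names made up for this illustration. It encodes the expected character as UTF-8 and then misreads the resulting bytes as Windows-1252.

```perl
use strict;
use warnings;

use Encode qw( encode decode );

# Hypothetical helper: format each byte of a string as two hex digits.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

# Hypothetical helper: format each character as a U+XXXX code point.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# U+00CE is LATIN CAPITAL LETTER I WITH CIRCUMFLEX.
my $octets = encode('UTF-8', "\x{CE}");
print hex_bytes($octets), "\n";                            # C3 8E

# Misreading those two bytes as Windows-1252 yields two characters.
print code_points(decode('Windows-1252', $octets)), "\n";  # U+00C3 U+017D
```

One character became two because UTF-8 needs two bytes for U+00CE, and Windows-1252 maps every byte to its own character.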

Known bugs and limitations:

Doesn't provide a means to specify input without modifying the program.
Doesn't handle different codepoints that produce similar graphemes.
Should display nearest matches if there aren't any exact matches.
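
One possible way around the first limitation is to accept the two strings as code points on the command line, which sidesteps any question about the terminal's own encoding. This is only a sketch: chars_from_codepoints is a hypothetical helper, and the "U+XXXX" argument format is an assumption, not something the program above supports.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a spec such as "U+00C3 U+017D" (or "C3 17D")
# into the corresponding string of characters.
sub chars_from_codepoints {
    my ($spec) = @_;
    my @hex = $spec =~ /(?:U\+)?([0-9A-Fa-f]+)/g;
    return join '', map { chr(hex($_)) } @hex;
}

# Assumed usage: first argument is the expected text, second is what
# was actually displayed, both given as code points.
if (@ARGV == 2) {
    my ($expected, $got) = map { chars_from_codepoints($_) } @ARGV;
    printf("expected %d character(s), got %d character(s)\n",
        length($expected), length($got));
}
```

Assuming the script were saved as find_enc.pl (a made-up name), it could be called as: perl find_enc.pl U+00CE "U+00C3 U+017D". The resulting $expected and $got would then replace the hard-coded values.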
