The following tool spares you from doing all the hard, error-prone work by hand.
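    use strict;
    use warnings;
    use Encode qw( encode decode );

    {
        # Every character cp1252 can represent (undefined bytes decode to U+FFFD).
        my @charset =
            grep $_ ne "\x{FFFD}",
            map decode('cp1252', chr($_)),
            0x00..0xFF;

        # Map each damaged byte sequence back to the character(s) producing it.
        my %map;
        for my $dec (@charset) {
            my $enc =
                encode 'UTF-8',
                decode 'cp1252',
                encode 'UTF-8',
                $dec;
            push @{ $map{$enc} }, $dec;
        }

        # Report sequences with more than one origin, then keep the first one.
        for (values(%map)) {
            warn(sprintf("Ambiguous: %v04X\n", join '', @$_)) if @$_ > 1;
            $_ = $_->[0];
        }

        # Longest sequences first, in case one is a prefix of another.
        my $pat =
            join '|',
            map quotemeta,
            sort { length($b) <=> length($a) || $a cmp $b }
            keys %map;
        my $re = qr/$pat/;

        # Emit the repaired text as UTF-8.
        binmode(STDOUT, ':encoding(UTF-8)');
        while (<>) {
            s/\G(?:($re)|(.))/
                if (defined($1)) {    # a plain "if ($1)" would reject the digit "0"
                    $map{$1}
                } else {
                    die("Unrecognized sequence starting at pos ", $-[2], "\n");
                }
            /seg;
            print;
        }
    }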
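To get a feel for what ends up in %map, here is a minimal sketch that runs the same encode/decode chain on two sample characters and prints the damaged bytes next to the original code point. The helper names mangle and hex_dump are made up for this illustration; they are not part of the tool.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: the same transformation the tool applies to every
# cp1252 character (UTF-8 bytes misread as cp1252, then encoded again).
sub mangle {
    my ($str) = @_;
    return encode('UTF-8', decode('cp1252', encode('UTF-8', $str)));
}

# Hypothetical helper: render a byte string as dot-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return sprintf '%v02X', $bytes;
}

# Build a two-entry version of the tool's lookup table.
my @sample = ("\x{E9}", "\x{20AC}");    # e with acute, euro sign
my %map    = map { mangle($_) => $_ } @sample;

printf "U+%04X <= %s\n", ord($map{$_}), hex_dump($_) for sort keys %map;
```

Each damaged sequence is longer than the UTF-8 it came from, which is why the tool matches whole sequences rather than single bytes.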
The tool also finds that you have a problem: you can't tell the following characters apart once they've gone through your encoding-decoding gauntlet:
- U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
- U+00CD LATIN CAPITAL LETTER I WITH ACUTE
- U+00CF LATIN CAPITAL LETTER I WITH DIAERESIS
- U+00D0 LATIN CAPITAL LETTER ETH
- U+00DD LATIN CAPITAL LETTER Y WITH ACUTE
Verification:
    $ perl -MEncode -E'
        say sprintf "%v02X",
            encode "UTF-8", decode "cp1252", encode "UTF-8", chr
                for 0x00C1, 0x00CD, 0x00CF, 0x00D0, 0x00DD;
    '
    C3.83.EF.BF.BD
    C3.83.EF.BF.BD
    C3.83.EF.BF.BD
    C3.83.EF.BF.BD
    C3.83.EF.BF.BD
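The collision can be traced to the second UTF-8 byte of each of those characters. A minimal sketch, assuming (as the output above suggests) that Encode's cp1252 decodes bytes it has no mapping for to U+FFFD:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Bytes that cp1252 leaves undefined, found with the same U+FFFD check
# the tool uses to build its character set.
my @undefined = grep { decode('cp1252', chr($_)) eq "\x{FFFD}" } 0x80..0xFF;

# Second UTF-8 byte of each ambiguous character.
my @second = map { ord(substr(encode('UTF-8', chr($_)), 1, 1)) }
    0x00C1, 0x00CD, 0x00CF, 0x00D0, 0x00DD;

printf "undefined in cp1252: %s\n", join ' ', map { sprintf '%02X', $_ } @undefined;
printf "second UTF-8 bytes:  %s\n", join ' ', map { sprintf '%02X', $_ } @second;
```

All five characters start with byte C3, and their second bytes are exactly the ones cp1252 cannot decode, so the information that distinguished them is gone after the first bad decode.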
Note: I didn't have the tool check whether one messed-up sequence can be a substring of another messed-up sequence. Sorting by descending length is there to handle that case if it exists. Update: No such case exists.
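For anyone who wants to confirm that claim themselves, here is a rough sketch of such a check. It rebuilds the set of damaged sequences the same way the tool does and counts the pairs where one sequence occurs inside a different one. The variable names are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Rebuild the set of damaged sequences the same way the tool does.
my %seen;
for my $byte (0x00..0xFF) {
    my $dec = decode('cp1252', chr($byte));
    next if $dec eq "\x{FFFD}";    # byte is undefined in cp1252
    my $enc = encode('UTF-8', decode('cp1252', encode('UTF-8', $dec)));
    $seen{$enc} = 1;
}
my @keys = keys %seen;

# Collect every pair where one sequence occurs inside a different one.
my @overlaps;
for my $short (@keys) {
    for my $long (@keys) {
        next if $short eq $long;
        push @overlaps, [ $short, $long ] if index($long, $short) >= 0;
    }
}

printf "%d overlapping pair(s)\n", scalar @overlaps;
```

A substring test is stricter than the alternation needs (only prefixes matter there), so an empty result covers the prefix case as well.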