http://qs321.pair.com?node_id=1096817


in reply to Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
in thread Why does Encode::Repair only correctly fix one of these two tandem characters?

You're spot on, Bethany. Thanks.

The byte \x9D is being converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD) upstream. So the question now is:  What's special about \x9D that isn't special about \x9C?* Hmm…

I added statements to the demonstration script to display a hex dump of the UTF-8 double-encoded bytes:

use strict;
use warnings;
use feature qw( say );
use charnames qw( :full );
use Encode qw( encode decode );
use Encode::Repair qw( repair_double );

binmode STDOUT, ':encoding(UTF-8)';

my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";

$ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
$rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));

say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $ldqm;
say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $rdqm;

say repair_double($ldqm, { via => 'Windows-1252' });
say repair_double($rdqm, { via => 'Windows-1252' });

__END__
C3 A2 E2 82 AC C5 93
C3 A2 E2 82 AC EF BF BD
“
��?

*UPDATE:  The short answer to the question is that 9C is a defined character in the Windows-1252 character encoding ('œ') and 9D is not.
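The same asymmetry can be checked outside Perl. The following is a hypothetical Python sketch (my illustration, not the demonstration script above); Python's cp1252 codec with errors='replace' stands in for Perl's Encode defaults, and it reproduces both hex dumps above:

```python
# Illustrative sketch in Python (not the original Perl): 0x9C is a
# defined Windows-1252 byte, 0x9D is not, so a lenient decode turns
# 0x9D into U+FFFD REPLACEMENT CHARACTER.

def cp1252_char(byte):
    """Decode one byte as Windows-1252; return None if the byte is undefined."""
    try:
        return bytes([byte]).decode('cp1252')
    except UnicodeDecodeError:
        return None

def double_encode(ch):
    """Model the damage: UTF-8 encode, mis-decode as Windows-1252, UTF-8 encode again."""
    return ch.encode('utf-8').decode('cp1252', errors='replace').encode('utf-8')

print(cp1252_char(0x9C))                          # œ (defined)
print(cp1252_char(0x9D))                          # None (undefined)
print(double_encode('\u201c').hex(' ').upper())   # C3 A2 E2 82 AC C5 93
print(double_encode('\u201d').hex(' ').upper())   # C3 A2 E2 82 AC EF BF BD
```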

Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters?
by Your Mother (Archbishop) on Aug 09, 2014 at 04:54 UTC

    No solution, just more on the “special.”

    my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
    $rdqm = decode('Windows-1252', encode('UTF-8', $rdqm), Encode::FB_CROAK);
    __END__
    cp1252 "\x9D" does not map to Unicode at .../Encode.pm line 176.

    I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables. Don’t know if this particular case is already covered somewhere.

      I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables.

      In the specific case of the corpus of documents I need to repair (which, by the way, is a very common case), all of the mojibake comes from characters in the Windows-1252 character encoding in the range 0x80 through 0x9F. So I can repair the damage with a small lookup table and a regular expression that matches the damaged substrings.

      Here's the script I'll use to repair the many text files with the UTF-8/Windows-1252 character encoding damage in them:

      #!perl
      use strict;
      use warnings;
      use open qw( :encoding(UTF-8) :std );
      use English qw( -no_match_vars );
      use File::Glob qw( bsd_glob );

      @ARGV or die "Usage: perl $PROGRAM_NAME file ...\n";
      local @ARGV = map { bsd_glob($ARG) } @ARGV;
      local $INPLACE_EDIT = '.bak';

      # See http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
      my %mojibake_replace = (
          "\x{00E2}\x{201A}\x{00AC}" => "\x{20AC}", # 0x80 EURO SIGN
          "\x{00C2}\x{0081}"         => "\x{0081}", # 0x81 UNDEFINED
          "\x{00E2}\x{20AC}\x{0161}" => "\x{201A}", # 0x82 SINGLE LOW-9 QUOTATION MARK
          "\x{00C6}\x{2019}"         => "\x{0192}", # 0x83 LATIN SMALL LETTER F WITH HOOK
          "\x{00E2}\x{20AC}\x{017E}" => "\x{201E}", # 0x84 DOUBLE LOW-9 QUOTATION MARK
          "\x{00E2}\x{20AC}\x{00A6}" => "\x{2026}", # 0x85 HORIZONTAL ELLIPSIS
          "\x{00E2}\x{20AC}\x{00A0}" => "\x{2020}", # 0x86 DAGGER
          "\x{00E2}\x{20AC}\x{00A1}" => "\x{2021}", # 0x87 DOUBLE DAGGER
          "\x{00CB}\x{2020}"         => "\x{02C6}", # 0x88 MODIFIER LETTER CIRCUMFLEX ACCENT
          "\x{00E2}\x{20AC}\x{00B0}" => "\x{2030}", # 0x89 PER MILLE SIGN
          "\x{00C5}\x{00A0}"         => "\x{0160}", # 0x8A LATIN CAPITAL LETTER S WITH CARON
          "\x{00E2}\x{20AC}\x{00B9}" => "\x{2039}", # 0x8B SINGLE LEFT-POINTING ANGLE QUOTATION MARK
          "\x{00C5}\x{2019}"         => "\x{0152}", # 0x8C LATIN CAPITAL LIGATURE OE
          "\x{00C2}\x{008D}"         => "\x{008D}", # 0x8D UNDEFINED
          "\x{00C5}\x{00BD}"         => "\x{017D}", # 0x8E LATIN CAPITAL LETTER Z WITH CARON
          "\x{00C2}\x{008F}"         => "\x{008F}", # 0x8F UNDEFINED
          "\x{00C2}\x{0090}"         => "\x{0090}", # 0x90 UNDEFINED
          "\x{00E2}\x{20AC}\x{02DC}" => "\x{2018}", # 0x91 LEFT SINGLE QUOTATION MARK
          "\x{00E2}\x{20AC}\x{2122}" => "\x{2019}", # 0x92 RIGHT SINGLE QUOTATION MARK
          "\x{00E2}\x{20AC}\x{0153}" => "\x{201C}", # 0x93 LEFT DOUBLE QUOTATION MARK
          "\x{00E2}\x{20AC}\x{009D}" => "\x{201D}", # 0x94 RIGHT DOUBLE QUOTATION MARK
          "\x{00E2}\x{20AC}\x{00A2}" => "\x{2022}", # 0x95 BULLET
          "\x{00E2}\x{20AC}\x{201C}" => "\x{2013}", # 0x96 EN DASH
          "\x{00E2}\x{20AC}\x{201D}" => "\x{2014}", # 0x97 EM DASH
          "\x{00CB}\x{0153}"         => "\x{02DC}", # 0x98 SMALL TILDE
          "\x{00E2}\x{201E}\x{00A2}" => "\x{2122}", # 0x99 TRADE MARK SIGN
          "\x{00C5}\x{00A1}"         => "\x{0161}", # 0x9A LATIN SMALL LETTER S WITH CARON
          "\x{00E2}\x{20AC}\x{00BA}" => "\x{203A}", # 0x9B SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
          "\x{00C5}\x{201C}"         => "\x{0153}", # 0x9C LATIN SMALL LIGATURE OE
          "\x{00C2}\x{009D}"         => "\x{009D}", # 0x9D UNDEFINED
          "\x{00C5}\x{00BE}"         => "\x{017E}", # 0x9E LATIN SMALL LETTER Z WITH CARON
          "\x{00C5}\x{00B8}"         => "\x{0178}", # 0x9F LATIN CAPITAL LETTER Y WITH DIAERESIS
      );

      my $mojibake_regex = qr{
          (
              \x{00C2}[\x{0081}\x{008D}\x{008F}\x{0090}\x{009D}]
          |   \x{00C5}[\x{00A0}\x{00A1}\x{00B8}\x{00BD}\x{00BE}\x{2019}\x{201C}]
          |   \x{00C6}\x{2019}
          |   \x{00CB}[\x{0153}\x{2020}]
          |   \x{00E2}\x{20AC}[\x{009D}\x{00A0}\x{00A1}\x{00A2}\x{00A6}\x{00B0}\x{00B9}\x{00BA}\x{0153}\x{0161}\x{017E}\x{02DC}\x{201C}\x{201D}\x{2122}]
          |   \x{00E2}\x{201A}\x{00AC}
          |   \x{00E2}\x{201E}\x{00A2}
          )
      }x;

      while (<ARGV>) {
          s/$mojibake_regex/$mojibake_replace{$1}/g;
          print;
      }

      exit 0;
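      As a cross-check, the lookup table doesn't have to be typed by hand. Here's a hypothetical Python sketch (my own illustration, not part of the script above) that derives the mapping for the defined Windows-1252 characters in 0x80-0x9F and applies it in one regex pass:

```python
# Hypothetical sketch: derive the mojibake -> original mapping for the
# defined Windows-1252 characters 0x80-0x9F, then repair with one regex pass.
import re

repair = {}
for b in range(0x80, 0xA0):
    try:
        original = bytes([b]).decode('cp1252')    # e.g. 0x93 -> U+201C
    except UnicodeDecodeError:
        continue    # 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in cp1252
    # Model the damage: the character's UTF-8 bytes mis-read as Windows-1252 text.
    mangled = original.encode('utf-8').decode('cp1252', errors='replace')
    repair[mangled] = original

# Longest alternatives first, so three-character mojibake wins over two.
pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(repair, key=len, reverse=True)))

def fix(text):
    return pattern.sub(lambda m: repair[m.group(0)], text)

print(fix('\u00e2\u20ac\u0153Hello\u00e2\u20ac\ufffd'))   # “Hello”
```

      Note that this sketch skips the undefined bytes, which the Perl script above handles separately via its \x{00C2} pairs.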

      (As always, constructive criticism and earnest suggestions for improvement are welcome and appreciated.)

      Don’t know if this particular case is already covered somewhere.

      I'm a little surprised by this blind spot in Encode::Repair because, in my experience, this is by far the most common kind of mojibake in Latin script text (i.e., text in Western European languages). In fairness to its author, moritz, the documentation includes the following Development section:

        Development
            The source code is stored in a public git repository at
            <http://github.com/moritz/Encode-Repair>. If you find any bugs, please
            use the issue tracker linked from this site.
      
            If you find a case of messed-up encodings that can be repaired
            deterministically and that's not covered by this module, please contact
            the author, providing a hex dump of both input and output, and as much
            information of the encoding and decoding process as you have.
      
            Patches are also very welcome.
      
        The following tool takes the hard, error-prone work of building such a table off your hands.
        use strict;
        use warnings;
        use Encode qw( encode decode );

        {
            my @charset =
                grep $_ ne "\x{FFFD}",
                map decode('cp1252', chr($_)),
                0x00 .. 0xFF;

            my %map;
            for my $dec (@charset) {
                my $enc = encode 'UTF-8', decode 'cp1252', encode 'UTF-8', $dec;
                push @{ $map{$enc} }, $dec;
            }

            for (values(%map)) {
                warn(sprintf("Ambiguous: %v04X\n", join '', @$_)) if @$_ > 1;
                $_ = $_->[0];
            }

            my $pat =
                join '|',
                map quotemeta,
                sort { length($b) <=> length($a) || $a cmp $b }
                keys %map;
            my $re = qr/$pat/;

            while (<>) {
                s/\G(?:($re)|(.))/
                    if ($1) {
                        $map{$1}
                    } else {
                        die("Unrecognized sequence starting at pos", $-[2]);
                    }
                /seg;
            }
        }

        It also reveals a problem: you can't tell the difference between the following cp1252 characters after they've gone through your encoding-decoding gauntlet:

        • U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
        • U+00CD LATIN CAPITAL LETTER I WITH ACUTE
        • U+00CF LATIN CAPITAL LETTER I WITH DIAERESIS
        • U+00D0 LATIN CAPITAL LETTER ETH
        • U+00DD LATIN CAPITAL LETTER Y WITH ACUTE

        Verification:

        $ perl -MEncode -E'
            say sprintf "%v02X", encode "UTF-8", decode "cp1252", encode "UTF-8", chr
                for 0x00C1, 0x00CD, 0x00CF, 0x00D0, 0x00DD;
        '
        C3.83.EF.BF.BD
        C3.83.EF.BF.BD
        C3.83.EF.BF.BD
        C3.83.EF.BF.BD
        C3.83.EF.BF.BD

        Note: I didn't have the tool check whether one messed-up sequence can be a substring of another messed-up sequence. The sorting by descending length is there to try to handle that case if it exists. Update: No such case exists.
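        The ambiguity check translates readily; here's a hypothetical Python sketch (an illustration, not the Perl tool above) that groups every defined cp1252 character by its double-encoded byte sequence:

```python
# Group each defined Windows-1252 character by its double-encoded byte
# sequence; any group with more than one member is irreparably ambiguous.
from collections import defaultdict

def double_encode(ch):
    return ch.encode('utf-8').decode('cp1252', errors='replace').encode('utf-8')

groups = defaultdict(list)
for b in range(0x100):
    ch = bytes([b]).decode('cp1252', errors='replace')
    if ch == '\ufffd':
        continue    # skip the five undefined bytes
    groups[double_encode(ch)].append(ch)

ambiguous = [chars for chars in groups.values() if len(chars) > 1]
print(ambiguous)    # [['Á', 'Í', 'Ï', 'Ð', 'Ý']]
```

        The single ambiguous group matches the five characters listed above: their UTF-8 continuation bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are all undefined in cp1252, so each collapses to the same Ã + U+FFFD pair.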

        The most common garbage from Perl code is mixed UTF-8 and latin-1. It happens when you forget to specify the output encoding.

        print "\N{LATIN CAPITAL LETTER E WITH ACUTE}";
        print "\N{BLACK SPADE SUIT}";

        The first string consists entirely of bytes, so Perl doesn't know you did something wrong. The second string makes no sense, so Perl guesses you meant to encode it using UTF-8. You end up with a mix of code points (effectively latin-1) and UTF-8.

        This can be fixed using Encoding::FixLatin.
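        For reference, the strategy behind Encoding::FixLatin can be sketched like this (a rough Python illustration of the idea, not that module's actual code): treat any well-formed multi-byte UTF-8 sequence as UTF-8, and fall back to Latin-1 for every other byte.

```python
import re

# Well-formed multi-byte UTF-8 sequences (simplified: overlongs and
# surrogates are caught by the decode attempt below instead).
UTF8_SEQ = re.compile(
    rb'[\xC2-\xDF][\x80-\xBF]'          # 2-byte sequence
    rb'|[\xE0-\xEF][\x80-\xBF]{2}'      # 3-byte sequence
    rb'|[\xF0-\xF4][\x80-\xBF]{3}'      # 4-byte sequence
)

def fix_mixed(raw):
    out = []
    i = 0
    while i < len(raw):
        m = UTF8_SEQ.match(raw, i)
        if m:
            try:
                out.append(m.group().decode('utf-8'))
                i = m.end()
                continue
            except UnicodeDecodeError:
                pass    # overlong or surrogate: treat the bytes as Latin-1
        out.append(raw[i:i+1].decode('latin-1'))
        i += 1
    return ''.join(out)

# One character printed as Latin-1 bytes, one as UTF-8, in the same stream:
mixed = '\u00c9'.encode('latin-1') + '\u2660'.encode('utf-8')
print(fix_mixed(mixed))    # É♠
```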

Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters?
by Bethany (Scribe) on Aug 09, 2014 at 00:44 UTC

    Nice detective work there. Just the sort of thing that's probably bitten me in the past except I never did figure out the cause. Well solved.

      Nice detective work there. … Well solved.

      Well, together, we've solved the riddle of why the damage-repair cycle doesn't round-trip the way we'd expect. But we haven't yet solved my real problem of how to repair the damaged characters. I'm back to the drawing board.

        True. I'm sure some monk or other will have ideas better than my vague "that looks odd". :-}