http://qs321.pair.com?node_id=1096815

Jim has asked for the wisdom of the Perl Monks concerning the following question:

The function Encode::Repair::repair_double fixes the character U+201C LEFT DOUBLE QUOTATION MARK when double-encoded but not its companion character U+201D RIGHT DOUBLE QUOTATION MARK when double-encoded. Is there a bug in the module or a defect in my expectations? Or is something else wrong?

Here's a script that demonstrates the problem:

use v5.14;
use strict;
use warnings;
use charnames qw( :full );
use Encode qw( encode decode );
use Encode::Repair qw( repair_double );

binmode STDOUT, ':encoding(UTF-8)';

my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";

$ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
$rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));

say repair_double($ldqm, { via => 'Windows-1252' });
say repair_double($rdqm, { via => 'Windows-1252' });

__END__
“
��?

Here's the output of the script piped through od:

C:\>perl demo.pl | od -h
0000000000 E2 80 9C 0D 0A EF BF BD EF BF BD 3F 0D 0A
0000000016
C:\>

E2 80 9C is the correct UTF-8 encoding of the Unicode character U+201C LEFT DOUBLE QUOTATION MARK.

EF BF BD is U+FFFD REPLACEMENT CHARACTER and 3F is U+003F QUESTION MARK. I expect the output to be the single Unicode character U+201D RIGHT DOUBLE QUOTATION MARK instead.

Replies are listed 'Best First'.
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by ikegami (Patriarch) on Aug 09, 2014 at 05:32 UTC
    $ldqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $ldqm;

    $ldqm                 => 201C
    encode 'UTF-8'        => E2 80 9C
    decode 'Windows-1252' => 00E2 20AC 0153
    encode 'UTF-8'        => C3 A2 E2 82 AC C5 93

    $rdqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $rdqm;

    $rdqm                 => 201D
    encode 'UTF-8'        => E2 80 9D
    decode 'Windows-1252' => 00E2 20AC ????
    [error handling]      => 00E2 20AC FFFD
    encode 'UTF-8'        => C3 A2 E2 82 AC EF BF BD

    Windows-1252 doesn't have a character defined for 9D, so when you decode('Windows-1252', "\x9D"), you do something irreversible. The following all result in C3 A2 E2 82 AC EF BF BD.

    • U+2001 EM QUAD
    • U+200D ZERO WIDTH JOINER
    • U+200F RIGHT-TO-LEFT MARK
    • U+2010 HYPHEN
    • U+201D RIGHT DOUBLE QUOTATION MARK
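
A minimal sketch of that collision, assuming the same encode/decode/encode pipeline as Jim's script. The `%seen` hash is just an illustrative way to count distinct results; it is not part of the original post.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# The five code points from the list above.
my @code_points = (0x2001, 0x200D, 0x200F, 0x2010, 0x201D);

my %seen;    # damaged byte sequence (as hex) => number of code points producing it
for my $cp (@code_points) {
    # Same damage pipeline as in the question: UTF-8 bytes misread as Windows-1252.
    my $damaged = encode('UTF-8', decode('Windows-1252', encode('UTF-8', chr $cp)));
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $damaged;
    printf "U+%04X => %s\n", $cp, $hex;
    $seen{$hex}++;
}

printf "%d distinct damaged form(s)\n", scalar keys %seen;
```

If all five lines show the same bytes, no repair routine can tell from the damaged text alone which character was originally there.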
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by Bethany (Scribe) on Aug 08, 2014 at 23:37 UTC

    Converting from one encoding to another still gives me trouble, and I've not tried what you're using here. (In other words, take this with a grain of salt.) However, I notice that using Data::Dumper to show the values of the two vars after the encode-decode-encode calls gives values that look "too different" to me:

    $VAR1 = '“';
    $VAR1 = '�';

    My terminal is set up to use UTF-8, so instead of the numeric HTML entities you see above I get the equivalent hex-digits-in-a-box characters. Same deal. Here's what I see in Emacs:

    $VAR1 = 'ââ\202¬Å\223';
    $VAR1 = 'ââ\202¬ï¿½';

    So with the caveat that this is a W.A.G., it looks to me as if maybe the mangling occurs before the calls to repair_double().

      You're spot on, Bethany. Thanks.

      The byte \x9D is being converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD) upstream. So the question now is:  What's special about \x9D that isn't special about \x9C?* Hmm…

      I added statements to the demonstration script to display a hex dump of the UTF-8 double-encoded bytes:

      use v5.14;
      use strict;
      use warnings;
      use charnames qw( :full );
      use Encode qw( encode decode );
      use Encode::Repair qw( repair_double );

      binmode STDOUT, ':encoding(UTF-8)';

      my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
      my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";

      $ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
      $rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));

      # Hex dump of the double-encoded bytes
      say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $ldqm;
      say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $rdqm;

      say repair_double($ldqm, { via => 'Windows-1252' });
      say repair_double($rdqm, { via => 'Windows-1252' });

      __END__
      C3 A2 E2 82 AC C5 93
      C3 A2 E2 82 AC EF BF BD
      “
      ��?
      

      *UPDATE:  The short answer to the question is that 9C is a defined character in the Windows-1252 character encoding ('œ') and 9D is not.
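
One way to see which bytes are affected is to ask Encode directly. This is only a sketch: it probes the 0x80 to 0x9F range with `Encode::FB_CROAK` (plus `Encode::LEAVE_SRC` so the input is not modified) and collects the bytes that fail to decode. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

my @undefined;    # hex values of bytes that Windows-1252 cannot decode
for my $byte (0x80 .. 0x9F) {
    my $octet = chr $byte;
    # FB_CROAK makes decode() die instead of substituting U+FFFD.
    my $ok = eval {
        decode('Windows-1252', $octet, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    push @undefined, sprintf('%02X', $byte) unless $ok;
}

print "Undefined in Windows-1252: @undefined\n";
```

Any character whose UTF-8 encoding contains one of the reported bytes will be damaged irreversibly by the pipeline above.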

        No solution, just more on the “special.”

        my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
        $rdqm = decode('Windows-1252', encode('UTF-8', $rdqm), Encode::FB_CROAK);
        __END__
        cp1252 "\x9D" does not map to Unicode at .../Encode.pm line 176.

        I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables. Don’t know if this particular case is already covered somewhere.
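
Short of a full lookup table, custom code can at least flag text that has already lost information before a repair is attempted. The sketch below uses a hypothetical helper, `has_irreversible_damage`, which simply looks for U+FFFD in the decoded text. It assumes the original text never contained U+FFFD legitimately, which may not hold for every data set.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of Encode or Encode::Repair):
# true if the double-encoded bytes already contain U+FFFD,
# meaning some original byte was lost during the wrong decode.
sub has_irreversible_damage {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes) =~ /\x{FFFD}/ ? 1 : 0;
}

# The two damaged byte strings from the hex dump earlier in the thread.
my $ldqm_damaged = "\xC3\xA2\xE2\x82\xAC\xC5\x93";
my $rdqm_damaged = "\xC3\xA2\xE2\x82\xAC\xEF\xBF\xBD";

printf "left quote lossy?  %d\n", has_irreversible_damage($ldqm_damaged);
printf "right quote lossy? %d\n", has_irreversible_damage($rdqm_damaged);
```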

        Nice detective work there. Just the sort of thing that's probably bitten me in the past except I never did figure out the cause. Well solved.

Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by zentara (Archbishop) on Aug 09, 2014 at 15:20 UTC
      The C template does characters and the U does the UTF-8. That they exist doesn’t mean that you should use them
      I like that! :)
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by remiah (Hermit) on Aug 10, 2014 at 11:38 UTC

    Hello Jim.

    I wonder whether this is an intentional emulation or, in case you haven't noticed...
    The second decode is really strange. I would like to use the terms "internal char" and "bytes".
    The second decode expects Windows-1252 bytes, but it receives UTF-8 bytes.

    $ldqm = encode('UTF-8',            # internal char to UTF-8 bytes
        decode('Windows-1252',         # expects Windows-1252 bytes, but receives UTF-8 bytes from the inner encode
            encode('UTF-8', $ldqm)));  # here internal char to UTF-8 bytes
    So, how about using from_to, a bytes-to-bytes conversion?
    use Encode qw( encode decode from_to );
    my $buff = encode('UTF-8', $ldqm);         # internal char to UTF-8 bytes
    from_to($buff, 'UTF-8', 'Windows-1252');   # now buff is converted into 1252 bytes
    $buff = decode('Windows-1252', $buff);     # 1252 bytes converted into internal char
    print 'ret=' . encode('UTF-8', $buff);     # encode into UTF-8 bytes and print
    regards

      TMTOWTDI.

      I think the right-to-left pipeline I used to damage the characters for the purpose of the demonstration…

      #      <-- 3           <-- 2                  <-- 1
      $foo = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $foo)));

      …more clearly emulates what actually happens in the wild:  text is encoded in UTF-8, then wrongly decoded as if it were encoded in Windows-1252, then encoded again in UTF-8. I'm not sure what using the in-place convenience function Encode::from_to() lends to the clarity and effectiveness of the demonstration of the sequence of events.

      FWIW, Encode::Repair uses Encode::encode() and Encode::decode(), not Encode::from_to().
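
For illustration, the three steps can be undone by hand with the same two functions. This is a sketch only, not Encode::Repair's actual implementation; `undo_double_encoding` is a hypothetical name, and the byte strings are the ones from the hex dump earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: reverse one round of "UTF-8 misread as Windows-1252".
sub undo_double_encoding {
    my ($damaged_bytes) = @_;
    my $mojibake       = decode('UTF-8', $damaged_bytes);     # undo step 3
    my $original_bytes = encode('Windows-1252', $mojibake);   # undo step 2
    return decode('UTF-8', $original_bytes);                  # undo step 1
}

my $ldqm_fixed = undo_double_encoding("\xC3\xA2\xE2\x82\xAC\xC5\x93");
my $rdqm_fixed = undo_double_encoding("\xC3\xA2\xE2\x82\xAC\xEF\xBF\xBD");

# Print the resulting code points rather than the characters themselves.
printf "left:  %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $ldqm_fixed;
printf "right: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $rdqm_fixed;
```

The left quote comes back as U+201C. The right quote does not come back as U+201D, because U+FFFD has no Windows-1252 byte to map back to, so the second step has to substitute something else. That would be consistent with the replacement characters and question mark in the original output.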

        I am so slow to understand the situation...

        I thought a proper byte conversion from UTF-8 to cp1252 would solve the whole problem (between 1 and 2 of the pipeline).
        But the wrongly encoded text is already out there in the wild, and there is no way back, right?

        Then I have no good idea...