Why does Encode::Repair only correctly fix one of these two tandem characters?

Jim has asked for the wisdom of the Perl Monks concerning the following question:

The function Encode::Repair::repair_double fixes the character U+201C LEFT DOUBLE QUOTATION MARK when double-encoded but not its companion character U+201D RIGHT DOUBLE QUOTATION MARK when double-encoded. Is there a bug in the module or a defect in my expectations? Or is something else wrong?

Here's a script that demonstrates the problem:

use v5.14;
use strict;
use warnings;

use charnames qw( :full );
use Encode qw( encode decode );
use Encode::Repair qw( repair_double );

binmode STDOUT, ':encoding(UTF-8)';

my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";

$ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
$rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));

say repair_double($ldqm, { via => 'Windows-1252' });
say repair_double($rdqm, { via => 'Windows-1252' });

__END__

“
��?

Here's the output of the script piped through od:

C:\>perl demo.pl | od -h
0000000000    E2  80  9C  0D  0A  EF  BF  BD  EF  BF  BD  3F  0D  0A
0000000016

C:\>

E2 80 9C is the correct UTF-8 encoding of the Unicode character U+201C LEFT DOUBLE QUOTATION MARK.

EF BF BD is U+FFFD REPLACEMENT CHARACTER and 3F is U+003F QUESTION MARK. I expect the output to be the single Unicode character U+201D RIGHT DOUBLE QUOTATION MARK instead.


Re: Why does Encode::Repair only correctly fix one of these two tandem characters? by ikegami (Patriarch) on Aug 09, 2014 at 05:32 UTC
    $ldqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $ldqm;

    $ldqm                    => 201C
    encode 'UTF-8'           => E2 80 9C
    decode 'Windows-1252'    => 00E2 20AC 0153
    encode 'UTF-8'           => C3 A2 E2 82 AC C5 93

    $rdqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $rdqm;

    $rdqm                    => 201D
    encode 'UTF-8'           => E2 80 9D
    decode 'Windows-1252'    => 00E2 20AC ????
    [error handling]         => 00E2 20AC FFFD
    encode 'UTF-8'           => C3 A2 E2 82 AC EF BF BD

Windows-1252 doesn't have a character defined for `9D`, so when you `decode('Windows-1252', "\x9D")`, you do something irreversible. The following all result in `C3 A2 E2 82 AC EF BF BD`.

U+2001 EM QUAD
U+200D ZERO WIDTH JOINER
U+200F RIGHT-TO-LEFT MARK
U+2010 HYPHEN
U+201D RIGHT DOUBLE QUOTATION MARK
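The collapse described above can be checked with a short script. This is only a sketch: `double_encode` and `hex_bytes` are hypothetical helper names, and it assumes Encode's default error handling (substituting U+FFFD for bytes that have no mapping).

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Simulate the damage: UTF-8 bytes misread as Windows-1252, then
# encoded as UTF-8 a second time. (Hypothetical helper.)
sub double_encode {
    my ($char) = @_;
    return encode('UTF-8', decode('Windows-1252', encode('UTF-8', $char)));
}

# Format a byte string as space-separated hex pairs. (Hypothetical helper.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The five code points from the list above, plus U+201C for contrast.
for my $cp (0x2001, 0x200D, 0x200F, 0x2010, 0x201D, 0x201C) {
    printf "U+%04X => %s\n", $cp, hex_bytes(double_encode(chr $cp));
}
```

If the five listed code points all print the same byte sequence, nothing in the damaged data distinguishes them, so no repair routine can pick the right one.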
Re: Why does Encode::Repair only correctly fix one of these two tandem characters? by Bethany (Scribe) on Aug 08, 2014 at 23:37 UTC
Converting from one encoding to another still gives me trouble, and I've not tried what you're using here. (In other words, take this with a grain of salt.) However, I notice that using Data::Dumper to show the values of the two vars after the encode-decode-encode calls gives values that look "too different" to me:

    $VAR1 = 'Ã¢â¬Å';
    $VAR1 = 'Ã¢â¬ï¿½';

My terminal is set up to use UTF-8, so instead of the numeric HTML entities you see above I get the equivalent hex-digits-in-a-box characters. Same deal. Here's what I see in Emacs:

    $VAR1 = 'Ã¢â\202¬Å\223';
    $VAR1 = 'Ã¢â\202¬ï¿½';

So with the caveat that this is a W.A.G., it looks to me as if maybe the mangling occurs before the calls to repair_double().
Re^2: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 09, 2014 at 00:08 UTC
You're spot on, Bethany. Thanks. The byte `\x9D` is being converted to the Unicode character `U+FFFD REPLACEMENT CHARACTER (EF BF BD)` upstream. So the question now is: What's special about `\x9D` that isn't special about `\x9C`?*

Hmm… I added statements to the demonstration script to display a hex dump of the UTF-8 double-encoded bytes:

    use v5.14;
    use strict;
    use warnings;

    use charnames qw( :full );
    use Encode qw( encode decode );
    use Encode::Repair qw( repair_double );

    binmode STDOUT, ':encoding(UTF-8)';

    my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
    my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";

    $ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
    $rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));

    # 'C*' unpacks every byte; a bare 'C' would unpack only the first one
    say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $ldqm;
    say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $rdqm;

    say repair_double($ldqm, { via => 'Windows-1252' });
    say repair_double($rdqm, { via => 'Windows-1252' });

    __END__

    C3 A2 E2 82 AC C5 93
    C3 A2 E2 82 AC EF BF BD
    “
    ��?

*UPDATE: The short answer to the question is that 9C is a defined character in the Windows-1252 character encoding ('œ') and 9D is not.
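To see which other bytes behave like `9D`, one could probe the 0x80..0x9F range directly. A minimal sketch, assuming Encode's default behaviour of substituting U+FFFD for unmapped bytes; `undefined_cp1252_bytes` is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Return the byte values in [$from, $to] that Windows-1252 cannot decode.
# With the default CHECK value, decode() yields U+FFFD for such bytes.
# (Hypothetical helper.)
sub undefined_cp1252_bytes {
    my ($from, $to) = @_;
    return grep { decode('Windows-1252', chr $_) eq "\x{FFFD}" } $from .. $to;
}

# Print the unmapped bytes in the C1 range as hex.
print join(' ', map { sprintf '%02X', $_ } undefined_cp1252_bytes(0x80, 0x9F)), "\n";
```

Any byte reported here marks a family of characters that double-encoding through Windows-1252 destroys rather than merely disguises.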
Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters? by Your Mother (Archbishop) on Aug 09, 2014 at 04:54 UTC
No solution, just more on the “special.”

    my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
    $rdqm = decode('Windows-1252', encode('UTF-8', $rdqm), Encode::FB_CROAK);
    __END__
    cp1252 "\x9D" does not map to Unicode at .../Encode.pm line 176.

I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables. Don’t know if this particular case is already covered somewhere.
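One possible piece of such custom code is a pre-flight check that counts U+FFFD in already-damaged data before attempting a repair. This is a sketch only: `count_lost_chars` is a hypothetical helper, and it assumes the original text never contained U+FFFD legitimately.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Count U+FFFD REPLACEMENT CHARACTERs in double-encoded UTF-8 bytes.
# Each one marks a character that was lost during the bad decode.
# (Hypothetical helper; assumes U+FFFD was not in the original text.)
sub count_lost_chars {
    my ($damaged_bytes) = @_;
    my $text  = decode('UTF-8', $damaged_bytes);
    my $count = () = $text =~ /\x{FFFD}/g;
    return $count;
}

# Damage both quotation marks the same way the demonstration script does.
my $left  = encode('UTF-8', decode('Windows-1252', encode('UTF-8', "\x{201C}")));
my $right = encode('UTF-8', decode('Windows-1252', encode('UTF-8', "\x{201D}")));

printf "left quote:  %d lost\n", count_lost_chars($left);
printf "right quote: %d lost\n", count_lost_chars($right);
```

A non-zero count would mean the data needs a lookup table or human judgement, since the lost byte could have been any of the undefined Windows-1252 values.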
Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 09, 2014 at 21:04 UTC
Re^5: Why does Encode::Repair only correctly fix one of these two tandem characters? by ikegami (Patriarch) on Aug 11, 2014 at 02:02 UTC
Re^5: Why does Encode::Repair only correctly fix one of these two tandem characters? by ikegami (Patriarch) on Aug 11, 2014 at 01:49 UTC
Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters? by Bethany (Scribe) on Aug 09, 2014 at 00:44 UTC
Nice detective work there. Just the sort of thing that's probably bitten me in the past except I never did figure out the cause. Well solved.
Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 09, 2014 at 00:56 UTC
Re^5: Why does Encode::Repair only correctly fix one of these two tandem characters? by Bethany (Scribe) on Aug 09, 2014 at 03:12 UTC
Re: Why does Encode::Repair only correctly fix one of these two tandem characters? by zentara (Archbishop) on Aug 09, 2014 at 15:20 UTC
The Perl Weekly Newsletter has an interesting, related article concerning this: pack's c0 and u0

I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh
Re^2: Why does Encode::Repair only correctly fix one of these two tandem characters? by Anonymous Monk on Aug 09, 2014 at 22:17 UTC
The C template does characters and the U does the UTF-8.

That they exist doesn’t mean that you should use them

I like that! :)
Re: Why does Encode::Repair only correctly fix one of these two tandem characters? by remiah (Hermit) on Aug 10, 2014 at 11:38 UTC
Hello Jim.

I wonder whether this is an intentional emulation or, in case you didn't notice... The second decode is really strange. I would like to use the terms "internal char" and "bytes". The second decode expects Windows-1252 bytes, but it receives UTF-8 bytes.

    $ldqm = encode('UTF-8',              # internal char to UTF-8 bytes
        decode('Windows-1252',           # this expects Windows-1252 bytes, but UTF-8 bytes are passed from the inner encode
            encode('UTF-8', $ldqm)));    # here internal char to UTF-8 bytes

So, how about using from_to, a bytes-to-bytes conversion?

    use Encode qw( encode decode from_to );

    my $buff = encode('UTF-8', $ldqm);          # internal char to UTF-8 bytes
    from_to($buff, 'UTF-8', 'Windows-1252');    # now $buff is converted into 1252 bytes
    $buff = decode('Windows-1252', $buff);      # 1252 bytes converted into internal char
    print 'ret=' . encode('UTF-8', $buff);      # encode into UTF-8 bytes and print

regards
Re^2: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 10, 2014 at 16:24 UTC
TMTOWTDI. I think the right-to-left pipeline I used to damage the characters for the purpose of the demonstration…

    #      <-- 3           <-- 2                  <-- 1
    $foo = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $foo)));

…more clearly emulates what actually happens in the wild: text is encoded in UTF-8, then wrongly decoded as if it were encoded in Windows-1252, then encoded again in UTF-8.

I'm not sure what using the in-place convenience function `Encode::from_to()` lends to the clarity and effectiveness of the demonstration of the sequence of events. FWIW, Encode::Repair uses `Encode::encode()` and `Encode::decode()`, not `Encode::from_to()`.
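Running that pipeline backwards by hand shows where the repair breaks down. This is a hand-rolled sketch and not Encode::Repair's actual code; `damage` and `undo_damage` are hypothetical names, and default Encode error handling is assumed throughout.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Steps 1, 2, 3 of the pipeline above. (Hypothetical helper.)
sub damage {
    my ($text) = @_;
    return encode('UTF-8', decode('Windows-1252', encode('UTF-8', $text)));
}

# The same steps reversed: undo 3, then 2, then 1. (Hypothetical helper.)
sub undo_damage {
    my ($bytes) = @_;
    my $misread        = decode('UTF-8', $bytes);              # undo step 3
    my $original_bytes = encode('Windows-1252', $misread);     # undo step 2
    return decode('UTF-8', $original_bytes);                   # undo step 1
}

for my $cp (0x201C, 0x201D) {
    my $result = undo_damage(damage(chr $cp));
    printf "U+%04X: %s\n", $cp, $result eq chr($cp) ? 'recovered' : 'not recovered';
}
```

For U+201D, undoing step 2 has to encode U+FFFD as Windows-1252, which has no such character, so Encode falls back to a substitute. That is presumably where the trailing `?` in the original output comes from, with the now-malformed UTF-8 in front of it decoding to replacement characters.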
Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters? by remiah (Hermit) on Aug 11, 2014 at 02:05 UTC
I am so slow to understand the situation... I thought a proper byte conversion from UTF-8 to cp1252 would solve the whole problem (between 1 and 2 of the pipeline). But the wrongly encoded text is already out there in the wild, and there is no way back, right? Then I have no good idea...
Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 11, 2014 at 02:26 UTC

