Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by ikegami (Patriarch) on Aug 09, 2014 at 05:32 UTC
$ldqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $ldqm;
$ldqm => 201C
encode 'UTF-8' => E2 80 9C
decode 'Windows-1252' => 00E2 20AC 0153
encode 'UTF-8' => C3 A2 E2 82 AC C5 93
$rdqm = encode 'UTF-8', decode 'Windows-1252', encode 'UTF-8', $rdqm;
$rdqm => 201D
encode 'UTF-8' => E2 80 9D
decode 'Windows-1252' => 00E2 20AC ????
[error handling] => 00E2 20AC FFFD
encode 'UTF-8' => C3 A2 E2 82 AC EF BF BD
Windows-1252 doesn't have a character defined for 9D, so when you decode('Windows-1252', "\x9D"), you do something irreversible. The following all result in C3 A2 E2 82 AC EF BF BD.
- U+2001 EM QUAD
- U+200D ZERO WIDTH JOINER
- U+200F RIGHT-TO-LEFT MARK
- U+2010 HYPHEN
- U+201D RIGHT DOUBLE QUOTATION MARK
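The list above can be checked with a short script. This is only a sketch: the helper name double_encode_hex is made up for illustration, and the result depends on Encode's own Windows-1252 table leaving bytes 81, 8D, 8F, 90 and 9D undefined.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: simulate the mojibake round trip (UTF-8 bytes
# misread as Windows-1252, then encoded as UTF-8 again) and return
# the resulting bytes as a hex string.
sub double_encode_hex {
    my ($char) = @_;
    my $bytes = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $char)));
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The five code points from the list: each ends in a UTF-8 byte that
# Windows-1252 does not define, so each collapses to the same output.
for my $cp (0x2001, 0x200D, 0x200F, 0x2010, 0x201D) {
    printf "U+%04X => %s\n", $cp, double_encode_hex(chr $cp);
}
```

Once the third byte has been replaced by U+FFFD, nothing in the output distinguishes these five characters from one another.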
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by Bethany (Scribe) on Aug 08, 2014 at 23:37 UTC
Converting from one encoding to another still gives me trouble, and I've not tried what you're using here. (In other words, take this with a grain of salt.) However, I notice that using Data::Dumper to show the values of the two vars after the encode-decode-encode calls gives values that look "too different" to me:
$VAR1 = 'â€œ';
$VAR1 = 'â€�';
My terminal is set up to use UTF-8, so instead of the numeric HTML entities you see above I get the equivalent hex-digits-in-a-box characters. Same deal. Here's what I see in Emacs:
$VAR1 = 'Ã¢â\202¬Å\223';
$VAR1 = 'Ã¢â\202¬ï¿½';
So with the caveat that this is a W.A.G., it looks to me as if maybe the mangling occurs before the calls to repair_double().
You're spot on, Bethany. Thanks.
The byte \x9D is being converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD) upstream. So the question now is: What's special about \x9D that isn't special about \x9C?* Hmm…
I added statements to the demonstration script to display a hex dump of the UTF-8 double-encoded bytes:
use strict;
use warnings;
use feature 'say';
use charnames qw( :full );
use Encode qw( encode decode );
use Encode::Repair qw( repair_double );
binmode STDOUT, ':encoding(UTF-8)';
my $ldqm = "\N{LEFT DOUBLE QUOTATION MARK}";
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
$ldqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $ldqm)));
$rdqm = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $rdqm)));
say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $ldqm;
say join ' ', map { sprintf '%02X', $_ } unpack 'C*', $rdqm;
say repair_double($ldqm, { via => 'Windows-1252' });
say repair_double($rdqm, { via => 'Windows-1252' });
__END__
C3 A2 E2 82 AC C5 93
C3 A2 E2 82 AC EF BF BD
“
��?
*UPDATE: The short answer to the question is that 9C is a defined character in the Windows-1252 character encoding ('œ') and 9D is not.
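To see which other bytes behave like 9D, one could probe the whole 80..9F range. A minimal sketch, assuming Encode's cp1252 table is the one in play; the sub name undefined_cp1252_bytes is invented for this example.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: return the bytes in 0x80..0x9F that Encode's
# Windows-1252 table cannot decode. With FB_CROAK, decode dies on an
# unmapped byte instead of quietly substituting U+FFFD.
sub undefined_cp1252_bytes {
    my @undefined;
    for my $byte (0x80 .. 0x9F) {
        my $octets = chr $byte;
        my $ok = eval {
            decode('Windows-1252', $octets,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };
        push @undefined, $byte unless $ok;
    }
    return @undefined;
}

print join(' ', map { sprintf '%02X', $_ } undefined_cp1252_bytes()), "\n";
```

Any byte this prints is one that turns into U+FFFD under the default decode, so any character whose UTF-8 form contains it cannot be recovered by repair_double().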
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
$rdqm = decode('Windows-1252', encode('UTF-8', $rdqm), Encode::FB_CROAK);
__END__
cp1252 "\x9D" does not map to Unicode at .../Encode.pm line 176.
I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables. Don’t know if this particular case is already covered somewhere.
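As one idea of what such custom code might look like, here is a rough sketch of a best-guess substitution. It is a heuristic and not a real repair: the sub name guess_repair is hypothetical, and mapping the damaged sequence to U+201D is purely an assumption, since several different characters collapse to the same bytes.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical best-guess repair. The damaged sequence U+00E2 U+20AC U+FFFD
# could have been any character whose UTF-8 form is E2 80 followed by a byte
# that Windows-1252 leaves undefined. This sketch ASSUMES it was
# U+201D RIGHT DOUBLE QUOTATION MARK, the likeliest candidate in prose.
sub guess_repair {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);    # double-encoded bytes to characters
    $text =~ s/\x{E2}\x{20AC}\x{FFFD}/\x{201D}/g;
    return $text;
}

my $damaged  = "\xC3\xA2\xE2\x82\xAC\xEF\xBF\xBD";
my $repaired = guess_repair($damaged);
printf "U+%04X\n", ord $repaired;
```

A guess like this is only reasonable when the text is known to be ordinary prose, where a zero width joiner or an em quad is far less likely than a closing quotation mark.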
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by zentara (Archbishop) on Aug 09, 2014 at 15:20 UTC
The Perl Weekly Newsletter has an interesting related article on this: pack's c0 and u0
The C template does characters and the U does the UTF-8. That they exist doesn’t mean that you should use them
I like that! :)
Re: Why does Encode::Repair only correctly fix one of these two tandem characters?
by remiah (Hermit) on Aug 10, 2014 at 11:38 UTC
Hello Jim.
I wonder whether this is an intentional emulation or, in case you haven't noticed...
The second decode is really strange. I would like to use the terms "internal chars" and "bytes".
The second decode expects Windows-1252 bytes, but it receives UTF-8 bytes.
$ldqm = encode('UTF-8',              # internal chars to UTF-8 bytes
          decode('Windows-1252',     # expects Windows-1252 bytes, but gets UTF-8 bytes from the inner encode
            encode('UTF-8', $ldqm))); # here: internal chars to UTF-8 bytes
So, how about using from_to, a bytes-to-bytes conversion?
use Encode qw( encode decode from_to );      # from_to is not exported by default
my $buff = encode('UTF-8', $ldqm);           # internal chars to UTF-8 bytes
from_to($buff, 'UTF-8', 'Windows-1252');     # $buff now holds Windows-1252 bytes
$buff = decode('Windows-1252', $buff);       # Windows-1252 bytes to internal chars
print 'ret=' . encode('UTF-8', $buff);       # encode into UTF-8 bytes and print
regards
#      <-- 3           <-- 2                  <-- 1
$foo = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $foo)));
…more clearly emulates what actually happens in the wild: text is encoded in UTF-8, then wrongly decoded as if it were encoded in Windows-1252, then encoded again in UTF-8. I'm not sure what using the in-place convenience function Encode::from_to() lends to the clarity and effectiveness of the demonstration of the sequence of events.
FWIW, Encode::Repair uses Encode::encode() and Encode::decode(), not Encode::from_to().
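For illustration, the three steps can be undone by hand in reverse order using only encode() and decode(). This is a sketch of the idea, not necessarily how Encode::Repair implements repair_double(); the sub name undo_double_encoding is made up.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: reverse the three-step pipeline.
sub undo_double_encoding {
    my ($octets) = @_;
    my $mojibake = decode('UTF-8', $octets);             # undo step 3
    my $utf8     = encode('Windows-1252', $mojibake);    # undo step 2
    return decode('UTF-8', $utf8);                       # undo step 1
}

my $left  = undo_double_encoding("\xC3\xA2\xE2\x82\xAC\xC5\x93");
my $right = undo_double_encoding("\xC3\xA2\xE2\x82\xAC\xEF\xBF\xBD");

print $left  eq "\x{201C}" ? "left quote recovered\n"  : "left quote lost\n";
print $right eq "\x{201D}" ? "right quote recovered\n" : "right quote lost\n";
```

The left quote survives because every byte of E2 80 9C maps to a Windows-1252 character and back. The right quote does not, because U+FFFD has no Windows-1252 encoding, so the byte 9D is never restored.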
I am so slow to understand the situation...
I thought a proper byte conversion from UTF-8 to cp1252 would solve the whole problem (between steps 1 and 2 of the pipeline).
But the wrongly encoded text is already out there in the wild, and there is no way back, right?
Then I have no good idea...