Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I have two strings:
- the first is already properly decoded UTF-8;
- the second displays correctly only after decoding it from UTF-8. But when I decode it, the first string gets decoded too, which it must not be.
So my question is: how do I avoid decoding string 1 and decode only string 2 from UTF-8?
Is there any way to check whether a string still needs UTF-8 decoding?
Thank you.
Re: How to avoid decoding string to utf-8.
by choroba (Cardinal) on Oct 09, 2020 at 19:04 UTC
A string needs decoding from UTF-8 if it's encoded in UTF-8 and you want to access its Unicode characters rather than octets (bytes).
Without seeing the strings and the code ("this one works" -- what does that mean?), we can only guess what you're trying to solve. The problem has no general solution: an input that's valid UTF-8 might also be meaningful without decoding. For example, the bytes c3 83 c2 a5 correspond either to four bytes, or to the two characters Ã¥. That unusual combination usually comes from a doubly encoded letter å, but what it really is depends on what you want to do with it.
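The ambiguity can be demonstrated with a minimal sketch using the byte values above; both decodes succeed, and nothing in the bytes themselves says which one is right:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes from the example above.
my $bytes = "\xC3\x83\xC2\xA5";

# Decoded once, they are the two characters U+00C3 U+00A5 ("Ã¥") ...
my $once = decode('UTF-8', $bytes);
printf "once:  %vX\n", $once;    # C3.A5

# ... which in turn are the UTF-8 encoding of U+00E5 ("å"), so a
# second decode also succeeds. The bytes alone cannot tell you
# which interpretation was intended.
my $twice = decode('UTF-8', $once);
printf "twice: %vX\n", $twice;   # E5
```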
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: How to avoid decoding string to utf-8.
by haukex (Archbishop) on Oct 09, 2020 at 18:53 UTC
Sorry, but I think this isn't enough information to give a good answer. Please see my post here with some tips on how to provide us with more information to help answer your question.
Re: How to avoid decoding string to utf-8.
by ikegami (Patriarch) on Oct 11, 2020 at 05:40 UTC
is there any way we can check if string needs decoding in utf-8.
Put more clearly, you are asking if there's a way to tell whether a string has already been decoded, or whether it's still encoded using UTF-8.
There's no way to tell for sure.
You could use something like the following:
sub is_valid_utf8 {
    return $_[0] =~ /
        ^
        (?: [\x00-\x7F]
          | [\xC0-\xDF] [\x80-\xBF]
          | [\xE0-\xEF] [\x80-\xBF]{2}
          | [\xF0-\xF7] [\x80-\xBF]{3}
        )*+
        \z
    /x;
}
utf8::decode($string)
if is_valid_utf8($string);
This simplifies to
utf8::decode($string);
This won't always work: certain already-decoded strings are also valid UTF-8. That said, such strings would likely be nonsense. I think the conditions I listed here would apply, so the above is actually quite reliable.
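The simplification works because utf8::decode() performs the validity check itself, a small sketch:

```perl
use strict;
use warnings;

# utf8::decode() returns true on success; on invalid UTF-8 it
# returns false and leaves the string untouched, so guarding it
# with is_valid_utf8() adds nothing.
my $valid   = "\xC3\xA5";   # UTF-8 bytes for "å"
my $invalid = "\xC3";       # truncated sequence

print utf8::decode($valid)   ? "valid: decoded\n"   : "valid: left alone\n";
print utf8::decode($invalid) ? "invalid: decoded\n" : "invalid: left alone\n";
```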
1. àáâä     # This one doesn't need any extra processing
2. Ã Ã¡Ã¢Ã¤   # This one needs decoding, i.e. decode('utf-8', $str),
            # which gives the correct result àáâä
The problem is that decode works fine for the second string, but with it, the first string gets double-decoded and turns into black-diamond-with-question-mark symbols.
So, is there any way to differentiate these two strings and apply decode only where needed?
I hope the issue is clearer this time.
Thank you
$ perl -CS -e'
my $s = "\xC3\xA0\xC3\xA1\xC3\xA2\xC3\xA4";
printf("Before first decode: %1\$vX [%1\$s]\n", $s);
utf8::decode($s) or warn("First decode failed.\n");
printf("After first decode: %1\$vX [%1\$s]\n", $s);
utf8::decode($s) or warn("Second decode failed.\n");
printf("After second decode: %1\$vX [%1\$s]\n", $s);
'
Before first decode: C3.A0.C3.A1.C3.A2.C3.A4 [Ã Ã¡Ã¢Ã¤]
After first decode: E0.E1.E2.E4 [àáâä]
Second decode failed.
After second decode: E0.E1.E2.E4 [àáâä]
That black diamond with a question mark is the "Unicode replacement character" (U+FFFD). Your terminal or editor displays it in place of invalid UTF-8; it is not actually part of your string.
If you know that each string is either encoded from beginning to end or already decoded from beginning to end, then you can apply ikegami's regular expression to every string and decode when it matches, or just call plain utf8::decode and hope for the best.
If the borders between encoded and decoded parts within one string aren't clear, you can apply the regular expression repeatedly, so that each application removes one to four bytes from your string. You are still likely to get "correct" results for normal text, though the probability of ambiguities is somewhat higher than when you can operate on the string as a whole.
use strict;
use warnings;
use 5.020;
use utf8;
use Encode qw/encode decode/;

# Heuristically "fix" a broken string
my $chars = 'àáâä';
my $bytes = encode('UTF-8', $chars);
my $mixed_pickles = "$chars $bytes $chars $bytes";
say "Before: ", encode('UTF-8', $mixed_pickles);

my $utf8_decodable_regex =
    qr/[\xC0-\xDF][\x80-\xBF]    | # 2-byte UTF-8 sequence
       [\xE0-\xEF][\x80-\xBF]{2} | # 3-byte UTF-8 sequence
       [\xF0-\xFF][\x80-\xBF]{3}/x;
$mixed_pickles =~ s/($utf8_decodable_regex)/
    decode('UTF-8', $1, Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
say "After: ", encode('UTF-8', $mixed_pickles);
Some notes about the flags I've used:
- Encode::FB_CROAK is a safeguard against byte sequences which can be transformed according to the UTF-8 rules but don't represent valid Unicode characters. An example of such an invalid sequence is "\xEF\xBF\xBE", which transforms to the invalid code point FFFE.
- Encode::LEAVE_SRC prevents the decoding from changing the input string, which can be a mysterious source of errors. For a similar example, utf8::decode($string); does not return the decoded string, but converts in-place.
Edit: Fixed a somewhat inaccurate description of LEAVE_SRC.
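A minimal sketch of the LEAVE_SRC behavior described above: with a true check value such as FB_CROAK and no LEAVE_SRC, decode() removes the processed octets from its source argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $src = "\xC3\xA5";    # the UTF-8 bytes of "å"

# With LEAVE_SRC, the source string survives the decode intact.
my $out = decode('UTF-8', $src, Encode::FB_CROAK | Encode::LEAVE_SRC);
printf "source: %vX\n", $src;   # C3.A5 - unchanged
printf "result: %vX\n", $out;   # E5

# Without LEAVE_SRC, the processed octets are removed from the source.
my $src2 = "\xC3\xA5";
decode('UTF-8', $src2, Encode::FB_CROAK);
printf "source now has %d bytes left\n", length $src2;
```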
Re: How to avoid decoding string to utf-8.
by Anonymous Monk on Oct 12, 2020 at 16:52 UTC
Hi haj, ikegami, choroba, Corion, and all monks who have replied to this thread.
haj, ikegami, the code below worked for me, so the regex approach does the job. Thank you.
my $utf8_decodable_regex =
    qr/[\xC0-\xDF][\x80-\xBF]    | # 2-byte UTF-8 sequence
       [\xE0-\xEF][\x80-\xBF]{2} | # 3-byte UTF-8 sequence
       [\xF0-\xFF][\x80-\xBF]{3}/x;
$testStr = decode('utf-8', $testStr);
$testStr =~ s/($utf8_decodable_regex)/decode('utf-8', $1)/gex;
$testStr = encode('utf-8', $testStr);
Though it's working, it would be great if you could explain why it's working.
Cheers!!!
Since you still didn't reveal what you did to control encoding at the database or web level, I can only guess.
- It looks like the database content is hosed and contains strings in different encodings. You can't reliably SELECT records from these data.
- It seems that you either did not tell your database driver to handle UTF-8, or you failed to decode content from your web form and wrote doubly encoded data to your database. In either case, you need that first decoding step after reading from the database.
- The regular expression takes care of data which was inserted with a second level of UTF-8 encoding. Whenever the substitution succeeds, you have found bad data in your database, inserted either by your new code or by the legacy application. You can capture the return value of the substitution to check whether anything was replaced, which identifies records that need to be fixed in your database. With ikegami's suggestion to use utf8::decode you can achieve the same goal: a true return value from utf8::decode indicates broken data.
- The final encoding step is required if you print a web response with a charset of UTF-8 without specifying an I/O layer for that encoding. Again, without knowing what your code does, I can't say for sure.
Finally, if that code only seems to work, be sure to write a test suite with Unicode data, preferably also including strings with characters which cannot be encoded in one byte. Also check the contents of your database with some "non-Perl" tool, like the psql command line tool for PostgreSQL or whatever your database engine provides. Without that, your database operations will always be guesswork, and the next migration will most likely go wrong as well.
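The point about capturing the substitution's return value can be sketched as follows; $row is a hypothetical stand-in for a value already read (and decoded once) from the database:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_decodable_regex =
    qr/[\xC0-\xDF][\x80-\xBF]    | # 2-byte UTF-8 sequence
       [\xE0-\xEF][\x80-\xBF]{2} | # 3-byte UTF-8 sequence
       [\xF0-\xFF][\x80-\xBF]{3}/x;

# A clean part ("àá") followed by a doubly encoded part ("Ã Ã¡").
my $row = "\x{E0}\x{E1} \x{C3}\x{A0}\x{C3}\x{A1}";

# s///g returns the number of substitutions made, so a true value
# flags a record that contained doubly encoded data.
my $fixes = $row =~ s/($utf8_decodable_regex)/decode('UTF-8', $1)/ge;
print $fixes ? "record needed fixing ($fixes repairs)\n"
             : "record was clean\n";
```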