PerlMonks  

Re: How to avoid decoding string to utf-8.

by ikegami (Patriarch)
on Oct 11, 2020 at 05:40 UTC


in reply to How to avoid decoding string to utf-8.

is there any way we can check if string needs decoding in utf-8.

Put more clearly, you are asking if there's a way to tell whether a string has already been decoded, or whether it's still encoded using UTF-8.

There's no way to tell for sure.

You could use something like the following:

    sub is_valid_utf8 {
        return $_[0] =~ /
            ^
            (?: [\x00-\x7F]
            |   [\xC0-\xDF] [\x80-\xBF]
            |   [\xE0-\xEF] [\x80-\xBF]{2}
            |   [\xF0-\xF7] [\x80-\xBF]{3}
            )*+
            \z
        /x
    }

    utf8::decode($string)
        if is_valid_utf8($string);

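Since utf8::decode leaves the string unchanged (and returns false) when the string isn't valid UTF-8, this simplifies to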

utf8::decode($string);

This won't always work. Certain decoded strings also happen to be valid UTF-8, and those would be decoded a second time. That said, these strings would likely be nonsense. I think the conditions I listed here would apply. So the above is actually quite reliable.
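To make the ambiguity concrete, here is a minimal sketch of the three cases. The helper name decode_if_utf8 and the sample strings are assumptions made for illustration; they are not from the original post. It relies only on utf8::decode working in place and leaving the string alone when it fails.

```perl
use strict;
use warnings;

# Hypothetical helper: return a decoded copy if the input is valid UTF-8,
# otherwise return the input unchanged.
sub decode_if_utf8 {
    my ($s) = @_;
    utf8::decode($s);    # works in place on our copy; no change on failure
    return $s;
}

# UTF-8 bytes for "àáâä": decoded to four characters.
my $encoded = "\xC3\xA0\xC3\xA1\xC3\xA2\xC3\xA4";

# Already-decoded "àáâä": not valid UTF-8, so it is left alone.
my $decoded = "\xE0\xE1\xE2\xE4";

# Already-decoded "Ã©" (U+00C3 U+00A9): this is also valid UTF-8 for "é",
# so it gets decoded again. This is the ambiguous case.
my $ambiguous = "\xC3\xA9";

# Print each string's code points before and after.
printf "%vX -> %vX\n", $_, decode_if_utf8($_)
    for $encoded, $decoded, $ambiguous;
```

Only the last case goes wrong, and a decoded string consisting of "Ã©" is exactly the kind of text that rarely occurs on purpose.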

Replies are listed 'Best First'.
Re^2: How to avoid decoding string to utf-8.
by Anonymous Monk on Oct 11, 2020 at 15:26 UTC
    Hi all, Thank you for the reply.

    Please note that this seems to be a case of mixed encoding.

    I have the two strings below:

    1. àáâä #This one doesn't need any extra processing
    2. àáâä #This one needs to decode .i.e decode('utf-8',$str) which gives correct result as àáâä

    The problem is that decode works fine for the second string, but the First string gets double decoded and is converted into some black diamond with question marks symbols.

    So, is there any way to differentiate these two strings and apply decode accordingly?

    I hope the issue is clearer this time.

    Thank you

      First string gets double decoded

      You are mistaken. utf8::decode won't do anything if the string is already decoded (except in the very specific and unusual cases I mentioned in my earlier post).

      $ perl -CS -e'
         my $s = "\xC3\xA0\xC3\xA1\xC3\xA2\xC3\xA4";

         printf("Before first decode: %1\$vX [%1\$s]\n", $s);

         utf8::decode($s)
            or warn("First decode failed.\n");

         printf("After first decode: %1\$vX [%1\$s]\n", $s);

         utf8::decode($s)
            or warn("Second decode failed.\n");

         printf("After second decode: %1\$vX [%1\$s]\n", $s);
      '
      Before first decode: C3.A0.C3.A1.C3.A2.C3.A4 [àáâä]
      After first decode: E0.E1.E2.E4 [àáâä]
      Second decode failed.
      After second decode: E0.E1.E2.E4 [àáâä]

      That black diamond with question marks symbol is the "Unicode replacement character", which is displayed by your terminal / editor for "invalid UTF-8" but is not actually part of your string.
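      One way to check whether the replacement character is really in the data, rather than only on the screen, is to look at the code points instead of the rendered text. The following is a minimal sketch; the helper names code_points and has_replacement_char and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string as dot-separated hex code points.
sub code_points {
    my ($s) = @_;
    return sprintf '%vX', $s;
}

# Hypothetical helper: true if the string itself contains U+FFFD.
sub has_replacement_char {
    my ($s) = @_;
    return index($s, "\x{FFFD}") >= 0;
}

my $latin1_chars = "\xE0\xE1\xE2\xE4";    # decoded "àáâä"
my $with_fffd    = "caf\x{FFFD}";         # U+FFFD really is in the data

for my $s ($latin1_chars, $with_fffd) {
    printf "%s  contains U+FFFD: %s\n",
        code_points($s), has_replacement_char($s) ? 'yes' : 'no';
}
```

      If the dump shows E0.E1.E2.E4 but the page still shows diamonds, the string is fine and the output is probably being sent without being encoded to UTF-8.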

      If you know that each string is either encoded from beginning to end or already decoded from beginning to end, then you can just apply ikegami's regular expression to every string and decode if there's a match, or use plain utf8::decode and hope for the best. If the borders between encoded and decoded parts within one string aren't clear, you can apply the regular expression repeatedly, so that each application removes one to four bytes from your string. It is still likely that you get "correct" results for normal text, though the probability of ambiguities is a bit higher than if you can operate on a string as a whole.

      use 5.020;
      # Heuristically "fix" a broken string
      use strict;
      use warnings;
      use utf8;
      use Encode qw/encode decode/;

      my $chars = 'àáâä';
      my $bytes = encode('UTF-8',$chars);
      my $mixed_pickles = "$chars $bytes $chars $bytes";
      say "Before: ", encode('UTF-8',$mixed_pickles);

      my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF] |         # 2 bytes unicode char
                                    [\xE0-\xEF][\x80-\xBF]{2} |      # 3 bytes unicode char
                                    [\xF0-\xFF][\x80-\xBF]{3}/x;

      $mixed_pickles =~ s/($utf8_decodable_regex)/
                          decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
      say "After:  ", encode('UTF-8',$mixed_pickles);

      Some notes about the flags I've used:
      • Encode::FB_CROAK is a safeguard against byte sequences which can be transformed according to the UTF-8 rules, but don't represent valid Unicode characters. An example for such an invalid sequence is "\xEF\xBF\xBE", which transforms to the invalid code point FFFE.
      • Encode::LEAVE_SRC prevents the decoding from changing the input string, which can be a mysterious source of errors. For a similar example, utf8::decode($string); does not return the decoded string, but converts in-place.
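      The effect of the two flags can be seen in isolation with a short sketch. The variable names and sample byte strings below are made up for illustration; the only calls used are Encode's decode and the FB_CROAK and LEAVE_SRC constants mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $malformed = "\xE0\xE1";    # not valid UTF-8

# Without a CHECK argument, malformed bytes are replaced by U+FFFD.
my $lenient = decode('UTF-8', $malformed);

# With FB_CROAK, the same input raises an exception instead.
my $strict = eval {
    decode('UTF-8', $malformed, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $croaked = defined $strict ? 0 : 1;

# With LEAVE_SRC, the source variable survives a checked decode.
my $kept = "\xC3\xA0";
my $char = decode('UTF-8', $kept, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Without LEAVE_SRC, a checked decode modifies the source variable.
my $consumed = "\xC3\xA0";
decode('UTF-8', $consumed, Encode::FB_CROAK);

printf "lenient result:     %vX\n", $lenient;
printf "FB_CROAK croaked:   %s\n",  $croaked ? 'yes' : 'no';
printf "with LEAVE_SRC:     source %s\n",
    $kept eq "\xC3\xA0" ? 'intact' : 'modified';
printf "without LEAVE_SRC:  source %s\n",
    $consumed eq "\xC3\xA0" ? 'intact' : 'modified';
```

      The first case is also a plausible origin for replacement characters that really are in a string: calling decode without a CHECK argument on text that is already decoded does not fail, it substitutes.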

      Edit: Fixed a somewhat inaccurate description of LEAVE_SRC.

        Hi, haj, ikegami, Thank you for the reply.

        I tried the regex provided. Unfortunately it does not seem to work, and it returns the same result.

        Please note that I am seeing this result in a web application.

        Below is what I have tried,

        my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF] |         # 2 bytes unicode char
                                      [\xE0-\xEF][\x80-\xBF]{2} |      # 3 bytes unicode char
                                      [\xF0-\xFF][\x80-\xBF]{3}/x;
        $testStr=~ s/($utf8_decodable_regex)/decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
        #$testStr = decode('utf-8',$testStr) if $testStr=~/$utf8_decodable_regex/;

        Any breakthrough would be appreciated, while I am trying to get around this issue.

        Thank you for the efforts.
