
Re^3: How to avoid decoding string to utf-8.

by haj (Curate)
on Oct 11, 2020 at 17:43 UTC

in reply to Re^2: How to avoid decoding string to utf-8.
in thread How to avoid decoding string to utf-8.

That black diamond with a question mark is the "Unicode replacement character" (U+FFFD), which your terminal or editor displays in place of "invalid UTF-8". It is not actually part of your string.

If you know that each string is either encoded from beginning to end or already decoded from beginning to end, then you can just apply ikegami's regular expression to every string and decode if there's a match, or simply call utf8::decode and hope for the best. If the borders between encoded and decoded parts within one string aren't clear, you can apply the regular expression repeatedly, so that each application replaces two to four bytes of your string with one character. It is still likely that you get "correct" results for normal text, though the probability of ambiguities is a bit higher than when you can operate on a string as a whole.
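For the first case (each string is consistently encoded or consistently decoded), here is a minimal sketch of the "plain utf8::decode" route. The helper name decode_if_utf8 is made up for illustration and is not from the thread. It relies on utf8::decode returning false and leaving the string alone when the content is not well-formed UTF-8 or already contains characters above 0xFF.

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Hypothetical helper (not from the thread): return a decoded copy if the
# whole string is well-formed UTF-8, otherwise return it unchanged.
sub decode_if_utf8 {
    my ($string) = @_;        # copy, so the caller's variable is untouched
    utf8::decode($string);    # converts in place; returns false on failure
    return $string;
}

my $chars = "\x{5D0}\x{5D1}\x{5D2}\x{5D4}";   # already decoded characters
my $bytes = encode('UTF-8', $chars);          # the same text as UTF-8 octets

for my $input ($chars, $bytes) {
    my $fixed = decode_if_utf8($input);
    printf "%d -> %d characters\n", length($input), length($fixed);
}
```

The ambiguity remains: a decoded string whose characters all lie in the range 0x80 to 0xFF and happen to form valid UTF-8 will be decoded a second time, so this is only a heuristic.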

    # Heuristically "fix" a broken string
    use 5.020;
    use strict;
    use warnings;
    use utf8;
    use Encode qw/encode decode/;

    my $chars = 'אבגה';
    my $bytes = encode('UTF-8',$chars);
    my $mixed_pickles = "$chars $bytes $chars $bytes";
    say "Before: ", encode('UTF-8',$mixed_pickles);

    my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF]    | # 2 bytes unicode char
                                  [\xE0-\xEF][\x80-\xBF]{2} | # 3 bytes unicode char
                                  [\xF0-\xFF][\x80-\xBF]{3}/x;

    $mixed_pickles =~ s/($utf8_decodable_regex)/
        decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
    say "After: ", encode('UTF-8',$mixed_pickles);

Some notes about the flags I've used:
  • Encode::FB_CROAK is a safeguard against byte sequences which can be transformed according to the UTF-8 rules, but don't represent valid Unicode characters. An example of such an invalid sequence is "\xEF\xBF\xBE", which transforms to the invalid code point U+FFFE.
  • Encode::LEAVE_SRC prevents the decoding from changing the input string, which can be a mysterious source of errors. For a similar example, utf8::decode($string); does not return the decoded string, but converts in-place.
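
To make the two flags a bit more tangible, here is a small sketch (my own illustration, not code from the thread). The sample byte strings are chosen for the demo: "\xD7\x90" is the UTF-8 encoding of U+05D0, and "\xC0\x80" is an overlong sequence that fits the two-byte pattern but is malformed, so strict 'UTF-8' decoding with FB_CROAK should die on it.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# LEAVE_SRC: the source octets survive the decode call.
my $octets = "\xD7\x90";                       # UTF-8 encoding of U+05D0
my $char   = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
printf "decoded U+%04X, source still has %d bytes\n", ord($char), length($octets);

# utf8::decode converts in place and returns a success flag, not the string.
my $in_place = "\xD7\x90";
my $ok = utf8::decode($in_place);
printf "ok=%d, string now has %d character(s)\n", $ok, length($in_place);

# FB_CROAK: the malformed sequence makes decode die; eval catches that.
my $bad    = "\xC0\x80";
my $result = eval { decode('UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC) };
print "croaked: ", (defined $result ? "no" : "yes"), "\n";
```

Inside a substitution like the one above, an uncaught croak aborts the whole s///, so wrapping the decode call in eval and falling back to the original bytes is one possible way to keep going.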

Edit: Fixed a somewhat inaccurate description of LEAVE_SRC.

Replies are listed 'Best First'.
Re^4: How to avoid decoding string to utf-8.
by Anonymous Monk on Oct 12, 2020 at 09:03 UTC

    Hi haj, ikegami, thank you for the replies.
    I tried the regex provided; unfortunately it does not seem to work, and I get the same result.
    Please note that I am seeing this result in a web application.
    Below is what I have tried:

    my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF]    | # 2 bytes unicode char
                                  [\xE0-\xEF][\x80-\xBF]{2} | # 3 bytes unicode char
                                  [\xF0-\xFF][\x80-\xBF]{3}/x;
    $testStr =~ s/($utf8_decodable_regex)/decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
    #$testStr = decode('utf-8',$testStr) if $testStr =~ /$utf8_decodable_regex/;
    Any breakthrough would be appreciated, while I am trying to get around this issue.
    Thank you for the efforts.

      If the data comes from a web application, consider that at least for form submissions, the browser sends you the character encoding in a header. If the web application sends the data by JavaScript, tell the web developers that they need to make sure that their data is always UTF-8.

        Hi Corion,
        The charset seems to be set to utf-8:
        <META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=utf-8">

        Thank you.

      I'm sorry, but it is unclear to me what "seeing this result on web application" actually means. Where do the data come from? Is your Perl code running as part of the web application, or did you write a web client and are trying to decode a response? How did you build $testStr, and how is it different from the example in my code? Where did you insert the code we suggested?

      In particular, my code example does not return anything, so I can't connect to "returning the same result". Without context, I can't offer any more.

        Hi Haj, Thank you for the reply.

        1. The data comes from the database, and it looks the same there as it does in the web application.
        2. Yes, the Perl code is running as part of the web application.
        3. $testStr comes from the database, where it was inserted when a form was submitted from the application itself. The issue occurs when this string is displayed in the web application.
        As I said earlier, I have strings with mixed encodings, meaning that some strings are encoded differently from others because the application was upgraded from a legacy application.

        Thank you.
