http://qs321.pair.com?node_id=11122708


in reply to Re^2: How to avoid decoding string to utf-8.
in thread How to avoid decoding string to utf-8.

That black diamond with a question mark is the "Unicode replacement character" (U+FFFD). Your terminal or editor displays it in place of invalid UTF-8; it is not actually part of your string.

If you know that each string is either encoded from beginning to end or already decoded from beginning to end, you can just apply ikegami's regular expression to every string and decode if there's a match, or simply call utf8::decode and hope for the best. If the boundaries between encoded and decoded parts within one string aren't clear, you can apply the regular expression repeatedly, so that each match replaces one two- to four-byte sequence with the character it encodes. You are still likely to get "correct" results for normal text, though the probability of ambiguities is a bit higher than when you can operate on the string as a whole.
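For the whole-string case, a minimal sketch of the "plain utf8::decode" variant could look like the following. The helper name decode_if_encoded is made up for illustration, and the sketch relies on utf8::decode leaving a string untouched when it is not well-formed UTF-8 (which includes strings that already contain characters above U+00FF).

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Hypothetical helper (not from the thread): decode the string if, taken
# as a whole, it is well-formed UTF-8; otherwise return it unchanged.
sub decode_if_encoded {
    my ($string) = @_;        # work on a copy
    utf8::decode($string);    # in-place attempt; no change on failure
    return $string;
}

my $chars = "\x{430}\x{431}\x{432}\x{434}";   # the four Cyrillic letters
my $bytes = encode('UTF-8', $chars);

print "bytes were decoded\n"       if decode_if_encoded($bytes) eq $chars;
print "characters left alone\n"    if decode_if_encoded($chars) eq $chars;

# The "hope for the best" part: two Latin-1 characters that happen to
# look like a UTF-8 sequence are decoded as well.
print "ambiguous input decoded\n"  if decode_if_encoded("\xC3\xA9") eq "\x{E9}";
```

The last line shows the ambiguity mentioned above: a string that was already decoded but only contains characters below U+0100 cannot be told apart from encoded bytes.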

use 5.020;

# Heuristically "fix" a broken string
use strict;
use warnings;
use utf8;
use Encode qw/encode decode/;

my $chars = 'абвд';
my $bytes = encode('UTF-8',$chars);
my $mixed_pickles = "$chars $bytes $chars $bytes";
say "Before: ", encode('UTF-8',$mixed_pickles);

my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF] | # 2 bytes unicode char
                              [\xE0-\xEF][\x80-\xBF]{2} | # 3 bytes unicode char
                              [\xF0-\xFF][\x80-\xBF]{3}/x;
$mixed_pickles =~ s/($utf8_decodable_regex)/
                    decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
say "After:  ", encode('UTF-8',$mixed_pickles);

Some notes about the flags I've used:
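
A small sketch of how the two flags behave, based on my reading of the Encode documentation rather than on anything specific to this thread. The variable names and the sample input are only for illustration.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

my $bytes = encode('UTF-8', "\x{430}\x{431}");

# FB_CROAK: malformed input makes decode() die instead of quietly
# substituting the replacement character.
my $survived = eval {
    decode('UTF-8', "abc\xFFdef", Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print "malformed input croaked\n" unless $survived;

# LEAVE_SRC: the source string is left as it was.
my $kept = $bytes;
decode('UTF-8', $kept, Encode::FB_CROAK | Encode::LEAVE_SRC);
print "source kept\n" if $kept eq $bytes;

# Without LEAVE_SRC, a check flag lets decode() overwrite its source.
my $consumed = $bytes;
decode('UTF-8', $consumed, Encode::FB_CROAK);
print "source was modified\n" if $consumed ne $bytes;
```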

Edit: Fixed a somewhat inaccurate description of LEAVE_SRC.