How to Encode/Decode double encoded string.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I have the following string:

this is a test international рстф

In one instance, however, I have the following two encoded versions of that string:

this is a test international У УЁУЂУЄ
This one comes out correctly as UTF-8 when I use the Encode module's utf8 function.
this is a test international яПНяПНяПНяПН
This one shows diamond-shaped question-mark characters when I use the same Encode function.

It seems that in this instance the string is getting double encoded. How can we handle this so that it works in both instances?

Thank you.

Replies are listed 'Best First'.
Re: How to Encode/Decode double encoded string. by haj (Vicar) on Sep 21, 2020 at 15:29 UTC
To decode characters properly, you need to understand how they were encoded. If all you have are the plain strings, then some ... guesswork ... can't be avoided.

Also note that copy-pasting encoded UTF-8 strings doesn't work well: the UTF-8 encoding of `рстф` contains bytes which are non-printable control characters. Here, on PerlMonks, these are converted to spaces, so I can't even work with your example string.

Your second test string does not look like it is doubly UTF-8 encoded. Try to find out how you created that string, and then we can work from that. If I doubly encode `рстф`, I get: `УТ УТЁУТЂУТЄ`.

Nitpick: your question would be easier to read with a little formatting. You can edit your post to add HTML for lists, and code (or examples) is better wrapped between `<code>` and `</code>` tags.
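As a rough illustration of what "doubly encoded" means here, the following Perl sketch encodes a sample string once and twice, then tries to undo the second encoding. The sample code points and the helper name `undo_double_encoding` are assumptions made for this example, not something taken from the original post. The check is only a heuristic: a correctly encoded text that happens to consist of Latin-1 characters which themselves look like UTF-8 bytes would be "repaired" by mistake.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: the four letters from the question, written as code points.
my $text = "\x{440}\x{441}\x{442}\x{444}";

my $once  = encode('UTF-8', $text);   # correct: characters -> UTF-8 bytes
my $twice = encode('UTF-8', $once);   # bug: the bytes are treated as characters and encoded again

# Heuristic helper (name is an assumption): return singly encoded bytes when
# the input looks doubly encoded, otherwise return the input unchanged.
sub undo_double_encoding {
    my ($bytes) = @_;
    my $strict = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # First decode; if the input is not valid UTF-8 at all, leave it alone.
    my $chars = eval { decode('UTF-8', $bytes, $strict) };
    return $bytes unless defined $chars;

    # Characters above 0xFF cannot be bytes in disguise: singly encoded.
    return $bytes if $chars =~ /[^\x00-\xFF]/;
    # Pure ASCII: nothing to repair.
    return $bytes unless $chars =~ /[\x80-\xFF]/;

    # Map each character back to one byte and test whether that is valid UTF-8.
    my $inner = encode('ISO-8859-1', $chars);
    my $ok = eval { decode('UTF-8', $inner, $strict); 1 };
    return $ok ? $inner : $bytes;
}

printf "once:     %vx\n", $once;      # d1.80.d1.81.d1.82.d1.84
printf "twice:    %vx\n", $twice;
printf "repaired: %vx\n", undo_double_encoding($twice);
```

Running the helper on the singly encoded bytes leaves them untouched, so the same call can in principle be applied to both kinds of rows, provided the data really is doubly encoded and not damaged in some other way.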
Re^2: How to Encode/Decode double encoded string. by Anonymous Monk on Sep 22, 2020 at 05:59 UTC
Hi Haj, thank you for the reply. I basically have the two strings below in the Postgres database:

    select title from TABLE;
    title
    -------------------------------------------
    this is a test international яПНяПНяПНяПН
    # This string was inserted into the database using Perl version 5.24.1.
    this is a test international У УЁУЂУЄ
    # This string was inserted into the database by the legacy version of the application, using an older Perl version. I don't know the exact version, but it seems this version doesn't handle UTF-8 by default.

How can we get around this issue? As I mentioned earlier, the first string gives the correct result when I use the Encode module, but the second doesn't. Thank you.
Re^2: How to Encode/Decode double encoded string. by Anonymous Monk on Sep 22, 2020 at 06:06 UTC
Hi Haj, please disregard my previous reply, as I messed that one up. Thank you for the reply. I basically have the two strings below in the Postgres database:

    select title from TABLE;
    title
    -------------------------------------------
    this is a test international яПНяПНяПНяПН
    # This string was inserted into the database by the legacy version of the application, using an older Perl version. I don't know the exact version, but it seems this version doesn't handle UTF-8 by default.
    this is a test international У УЁУЂУЄ
    # This string was inserted into the database using Perl version 5.24.1.

Thank you.
Re^3: How to Encode/Decode double encoded string. by haj (Vicar) on Sep 22, 2020 at 08:59 UTC
Thanks for the clarifications! It is relevant information that the stuff comes from a Postgres database. There's a lot of encoding done behind the scenes if a database is part of the game. Postgres has a configurable server encoding and a configurable client encoding, and either one or both might have changed between the legacy and the current application.

The string `яПН` is a UTF-8-encoded version of the "Unicode replacement character". You get this from software that tries to decode, as UTF-8, strings containing bytes that are not valid UTF-8, and then encodes the result as UTF-8. I guess that the decoding step gets fed with plain ISO-latin `рстф`.

There is a chance that the bogus decoding happens in Perl's Postgres database driver. You can check that by setting the DBH option `pg_enable_utf8` to zero when connecting. Your application will then be able to examine the "raw" contents and decode accordingly. A convenient way to examine strings is `printf` with the "v" format specifier: `printf "%vx",$string`

From there you can decide how to proceed. Probably you need to rebuild the data with a consistent encoding.
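The mechanism described above can be sketched in Perl as follows. The legacy byte values, the helper name `connect_raw`, and its connection parameters are assumptions for illustration only. `pg_enable_utf8` is the DBD::Pg connection attribute mentioned above, and the helper is defined but not called, so the script runs without a database.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed legacy input: four single-byte, non-UTF-8 characters
# (the byte values are hypothetical, chosen for illustration).
my $legacy = "\xE0\xE1\xE2\xE4";

# Lenient decoding as UTF-8 replaces the invalid bytes with U+FFFD ...
my $decoded = decode('UTF-8', $legacy);

# ... and encoding that result stores the bytes EF BF BD for each U+FFFD.
my $reencoded = encode('UTF-8', $decoded);

# Inspect each stage with the "v" format flag, one hex value per character or byte.
printf "legacy:    %vx\n", $legacy;      # e0.e1.e2.e4
printf "decoded:   %vx\n", $decoded;
printf "reencoded: %vx\n", $reencoded;

# Sketch of a connection that returns raw bytes from Postgres, so the
# application can inspect and decode them itself. Not called here.
sub connect_raw {
    my ($dsn, $user, $password) = @_;    # hypothetical connection parameters
    require DBI;
    return DBI->connect($dsn, $user, $password,
        { RaiseError => 1, pg_enable_utf8 => 0 });
}
```

Once the original bytes have been replaced by U+FFFD they cannot be recovered from the stored value, which is why rows in that state would have to be rebuilt from another source.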
Re^4: How to Encode/Decode double encoded string. by Anonymous Monk on Sep 22, 2020 at 11:08 UTC
Re^5: How to Encode/Decode double encoded string. by haj (Vicar) on Sep 22, 2020 at 11:45 UTC
