http://qs321.pair.com?node_id=11122004


in reply to How to Encode/Decode double encoded string.

To decode characters properly, you need to understand how they were encoded. If all you have are the plain strings, then some ... guesswork ... can't be avoided. Also note that copy-pasting UTF-8-encoded strings doesn't work well: the UTF-8 encoding of àáâä contains bytes which are non-printable control characters. Here, on PerlMonks, these are converted to spaces, so I can't even work with your example string.

Your second test string does not look doubly UTF-8 encoded. Try to find out how that string was created, and then we can work from that. If I doubly encode àáâä, I get:

àáâä.

Nitpick: your question would be easier to read with a little formatting. You can edit your post to add HTML for lists, and code (or examples) are better wrapped between <code> and </code> tags.

Re^2: How to Encode/Decode double encoded string.
by Anonymous Monk on Sep 22, 2020 at 05:59 UTC
    Hi Haj,

    Thank you for the reply.
    I basically have the two strings below in the Postgres database:

    select title from TABLE;
                       title
    -------------------------------------------
     this is a test international ����
     # This string was inserted into the database using perl version 5.24.1.
     this is a test international àáâä
     # This string was inserted into the database using an old version; I don't know the
     # exact version, but it seems that this version doesn't handle utf8 by default.
     # It was inserted using the legacy version of the application.

    How can we get around this issue? As I mentioned earlier, the first string gives the correct result when the Encode module is used, but the second doesn't.

    Thank you.

Re^2: How to Encode/Decode double encoded string.
by Anonymous Monk on Sep 22, 2020 at 06:06 UTC
    Hi Haj,
    Please disregard my previous reply; I mixed things up in it.
    Thank you for the reply. I basically have the two strings below in the Postgres database:

    select title from TABLE;
                       title
    -------------------------------------------
     this is a test international ����
     # This string was inserted into the database using an old version; I don't know the
     # exact version, but it seems that this version doesn't handle utf8 by default.
     # It was inserted using the legacy version of the application.
     this is a test international à áâä
     # This string was inserted into the database using perl version 5.24.1.

    Thank you

      Thanks for the clarifications! It is relevant information that the data comes from a Postgres database. A lot of encoding is done behind the scenes when a database is involved. Postgres has a configurable server encoding and a configurable client encoding; either or both might have changed between the legacy and the current application.

      The string � is a UTF-8-encoded version of the "Unicode replacement character" (U+FFFD). You get this from software which tries to decode, as UTF-8, strings that contain invalid UTF-8 sequences, and then encodes the result as UTF-8. I guess that the decoding step gets fed with plain ISO-8859-1 (Latin-1) àáâä.
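
      The effect can be reproduced with the Encode module. This is a sketch of my guess only, assuming the input bytes were Latin-1; it is not a confirmed description of what the legacy application did:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "àáâä" as ISO-8859-1 bytes. This byte sequence is not valid UTF-8.
my $latin1 = "\xE0\xE1\xE2\xE4";

# Without a CHECK argument, decode() substitutes U+FFFD for
# malformed input instead of dying.
my $chars = decode('UTF-8', $latin1);

# Encoding the result stores each replacement character as the
# bytes EF BF BD.
my $stored = encode('UTF-8', $chars);

printf "decoded: %vx\n", $chars;
printf "stored:  %vx\n", $stored;
```

      Note that the original bytes are gone after the substitution, so strings stored this way cannot be repaired by decoding alone.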

      There is a chance that the bogus decoding happens in Perl's Postgres database driver. You can check that by setting the DBH option pg_enable_utf8 to zero when connecting. Your application will then be able to examine the "raw" contents, and decode accordingly.
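
      A minimal sketch of such a connection, assuming DBI with DBD::Pg. The DSN, credentials, table name and the helper dump_raw_titles are placeholders, not taken from your application:

```perl
use strict;
use warnings;

# Connection attributes: pg_enable_utf8 => 0 asks DBD::Pg to return
# strings as raw bytes, without setting Perl's UTF-8 flag on them.
my %raw_attrs = (
    RaiseError     => 1,
    AutoCommit     => 1,
    pg_enable_utf8 => 0,
);

# Hypothetical helper: prints a hex dump of every title and returns
# the number of rows fetched.
sub dump_raw_titles {
    my ($dsn, $user, $password) = @_;
    require DBI;    # loaded lazily so the sketch compiles without a database
    my $dbh    = DBI->connect($dsn, $user, $password, { %raw_attrs });
    my $titles = $dbh->selectcol_arrayref('SELECT title FROM my_table');
    printf "%vx\n", $_ for grep { defined } @$titles;
    $dbh->disconnect;
    return scalar @$titles;
}

# Example call (placeholder values, not run here):
# dump_raw_titles('dbi:Pg:dbname=mydb', 'me', 'secret');
```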

      A convenient way to examine strings is printf with the "v" (vector) flag, which prints the ordinal of each character in the string, separated by dots:

      printf "%vx\n", $string;
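
      As a rough guide to reading that output, here is what the dump looks like for a few known inputs. The sample labels are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{E0}\x{E1}\x{E2}\x{E4}";    # àáâä as decoded characters

my @samples = (
    [ 'decoded characters'        => $chars ],
    [ 'UTF-8 bytes'               => encode('UTF-8', $chars) ],
    [ 'replacement char as bytes' => encode('UTF-8', "\x{FFFD}") ],
);

# One line per sample: label, then the ordinals in hex.
printf "%-26s %vx\n", @$_ for @samples;
```

      This prints e0.e1.e2.e4 for the decoded characters, c3.a0.c3.a1.c3.a2.c3.a4 for the UTF-8 bytes, and ef.bf.bd for the encoded replacement character.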

      From there you can decide how to proceed. Probably you need to re-build the data with a consistent encoding.

        Hi Haj,

        Thank you for the reply.

        Yes, I did set pg_enable_utf8 = 0; only after that can I see the same raw strings as above in the web application.

        Please note that whatever I see in the database, I see as it is in the application too.

        The Postgres server encoding and client encoding are both 'UTF-8'. I tried to change client_encoding to SQL_ASCII, but it didn't help.

        I am still not quite sure what exactly needs to be done to get around this issue.

        Thank you