http://qs321.pair.com?node_id=11122000

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I have the string below,

whereas in one instance I have the two encoded versions of the above string shown below:
  1. this is a test international àáâä
    This one gets encoded properly to UTF-8 using the Encode module's utf8 function.
  2. this is a test international ����
    This one gives ??? in diamond-like characters using the Encode module's utf8 function.
It seems that in this instance the string is getting double encoded. How can we handle this so that it works in both instances?

Thank you.

Re: How to Encode/Decode double encoded string.
by haj (Vicar) on Sep 21, 2020 at 15:29 UTC

    To decode characters properly, you need to understand how they are encoded. If all you have are the plain strings, then some ... guesswork ... can't be avoided. Also note that copy-pasting encoded UTF-8 strings doesn't work well: the UTF-8 encoding of àáâä contains bytes which are non-printable control characters. Here on PerlMonks these are converted to spaces, so I can't even work with your example string.

    Your second test string does not look doubly UTF-8 encoded. Try to find out how you created that string, and then we can work from that. If I doubly encode àáâä, I get:

    Ã Ã¡Ã¢Ã¤.
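As a minimal sketch of what "doubly encoded" means at the byte level, the following uses the core Encode module. The variable names are illustrative only, and the repair step assumes that exactly one extra layer of UTF-8 encoding was applied and nothing else happened to the data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# à á â ä given as code points, so the result does not depend on
# the encoding of this source file.
my $text = "\x{E0}\x{E1}\x{E2}\x{E4}";

# Correct: encode the character string to UTF-8 bytes exactly once.
my $once = encode('UTF-8', $text);

# Mistake: encode the bytes again. Each byte is treated as a Latin-1
# character, and every byte >= 0x80 becomes two bytes.
my $twice = encode('UTF-8', $once);

printf "once:  %vX\n", $once;     # C3.A0.C3.A1.C3.A2.C3.A4
printf "twice: %vX\n", $twice;    # C3.83.C2.A0.C3.83.C2.A1.C3.83.C2.A2.C3.83.C2.A4

# Undo one layer: decode as UTF-8, then map the resulting characters
# (all below 0x100) back to single bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $twice));
print "repair ok\n" if $repaired eq $once;
```

A doubly encoded string shows the telltale byte pairs C3 83 and C2 xx in its dump; a string that contains EF BF BD instead went through a different kind of damage.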

    Nitpick: your question would be easier to read with a little formatting. You can edit your post to add HTML for lists, and code (or examples) are better wrapped between <code> and </code> tags.

      Hi Haj,

      Thank you for the reply.
      I basically have the two strings below in the Postgres database:

      select title from TABLE;
                         title
      -------------------------------------------
       this is a test international ����   # This string got inserted in the database using perl version 5.24.1
       this is a test international àáâä   # This string got inserted in the database using an older version. I don't know the exact version, but it seems this version doesn't handle utf8 by default. This was inserted using the legacy version of the application.

      How can we get around this issue? As I mentioned earlier, the first string gives the correct result when the Encode module is used, but the second doesn't.

      Thank you.

      Hi Haj,
      Please disregard my previous reply, as I messed that one up.
      Thank you for the reply. I basically have the two strings below in the Postgres database:

      select title from TABLE;
                         title
      -------------------------------------------
       this is a test international ����   # This string got inserted in the database using an older version. I don't know the exact version, but it seems this version doesn't handle utf8 by default. This was inserted using the legacy version of the application.
       this is a test international à áâä   # This string got inserted in the database using perl version 5.24.1

      Thank you

        Thanks for the clarifications! It is relevant information that the stuff comes from a Postgres database. There's a lot of encoding done behind the scenes if a database is part of the game. Postgres has a configurable server encoding and a configurable client encoding; either one or both might have changed between the legacy and current application.

        The string � is the UTF-8-encoded version of the "Unicode replacement character" (U+FFFD, bytes EF BF BD). You get this from software that tries to decode as UTF-8 a string containing bytes that are not valid UTF-8, and then encodes the result as UTF-8. I guess that the decoding step gets fed plain ISO-Latin àáâä.
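The mechanism described above can be reproduced with the core Encode module. This is only a sketch of one plausible path to the replacement characters, not a claim about what the legacy application actually did; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# à á â ä as ISO-8859-1 bytes. None of these is valid UTF-8 on its own.
my $latin1_bytes = "\xE0\xE1\xE2\xE4";

# With the default CHECK value, decode() does not die on malformed
# input; it substitutes U+FFFD for what it cannot decode.
my $decoded = decode('UTF-8', $latin1_bytes);

# Encoding the result stores EF BF BD for every U+FFFD.
my $stored = encode('UTF-8', $decoded);

printf "decoded: %vX\n", $decoded;
printf "stored:  %vX\n", $stored;

print "original bytes are gone\n" if index($stored, "\xE0") < 0;
```

Once U+FFFD has been substituted, the original bytes are no longer in the string, so this kind of damage cannot be undone by decoding again.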

        There is a chance that the bogus decoding happens in Perl's Postgres database driver (DBD::Pg). You can check that by setting the DBH option pg_enable_utf8 to zero when connecting. Your application will then be able to examine the "raw" contents, and decode accordingly.
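A connection with that option switched off might look like the sketch below. The helper name connect_raw and its parameters are hypothetical; only DBI->connect, RaiseError and the DBD::Pg attribute pg_enable_utf8 are real. DBI is loaded on demand inside the helper, so DBI and DBD::Pg need to be installed only when it is actually called.

```perl
use strict;
use warnings;

# Hypothetical helper: DSN, user and password come from the caller.
sub connect_raw {
    my ($dsn, $user, $password) = @_;

    require DBI;    # loaded on demand; DBD::Pg must be installed too

    return DBI->connect($dsn, $user, $password, {
        RaiseError     => 1,
        pg_enable_utf8 => 0,    # driver returns byte strings, no UTF-8 flagging
    });
}
```

With a handle from something like connect_raw('dbi:Pg:dbname=mydb', 'user', 'secret') (placeholder values), the title column comes back as raw bytes that can be dumped and decoded explicitly.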

        A convenient way to examine strings is printf with the "v" format specifier:

        printf "%vx\n", $string;
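To give an idea of what such dumps look like, here is a small sketch that prints a few hand-made byte strings with the "v" flag. The sample values are assumptions standing in for whatever the database returns; each one represents the single character à in a different state.

```perl
use strict;
use warnings;

# Hypothetical sample values standing in for strings read from the database.
my @samples = (
    [ 'UTF-8, encoded once'   => "\xC3\xA0" ],
    [ 'UTF-8, encoded twice'  => "\xC3\x83\xC2\xA0" ],
    [ 'replacement character' => "\xEF\xBF\xBD" ],
    [ 'Latin-1 byte'          => "\xE0" ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    # %vx prints the ordinal of every character in hex, joined by dots.
    printf "%-24s %vx\n", $label, $bytes;
}
```

The dumps are c3.a0, c3.83.c2.a0, ef.bf.bd and e0 respectively, which is usually enough to tell the cases apart.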

        From there you can decide how to proceed. Probably you need to re-build the data with a consistent encoding.