http://qs321.pair.com?node_id=11122067


in reply to Re^2: How to Encode/Decode double encoded string.
in thread How to Encode/Decode double encoded string.

Thanks for the clarifications! It is relevant information that the stuff comes from a Postgres database. There's a lot of encoding done behind the scenes if a database is part of the game. Postgres has a configurable server encoding and a configurable client encoding; either one or both might have changed between the legacy and current application.

The string � is a UTF-8-encoded version of the Unicode replacement character (U+FFFD). You get it from software that tries to decode a string as UTF-8 although it contains bytes that are not valid UTF-8, and then encodes the result as UTF-8. I guess that the decoding step was fed plain ISO-8859-1 (Latin-1) àáâä.
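To make that mechanism concrete, here is a minimal sketch that reproduces it with the Encode module. It assumes the input really was àáâä stored as ISO-8859-1 bytes, which is only my guess from above; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the four characters àáâä as ISO-8859-1 bytes.
my $latin1_bytes = "\xE0\xE1\xE2\xE4";

# Lenient decode (no CHECK argument): malformed input is replaced by
# U+FFFD instead of raising an error.
my $decoded = decode('UTF-8', $latin1_bytes);

# Re-encoding stores each U+FFFD as the three bytes EF BF BD.
my $reencoded = encode('UTF-8', $decoded);

# Count the replacement characters and show the stored bytes in hex.
my $count = () = $decoded =~ /\x{FFFD}/g;
print "replacement characters: $count\n";
print "bytes after round trip: ", unpack('H*', $reencoded), "\n";
```

Once the data has gone through this round trip, the original bytes E0 E1 E2 E4 are no longer present in the result.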

There is a chance that the bogus decoding happens in Perl's Postgres database driver. You can check that by setting the DBH option pg_enable_utf8 to zero when connecting. Your application will then be able to examine the "raw" contents, and decode accordingly.
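A sketch of how the connection could look with that option switched off. The helper name connect_raw, the DSN and the credentials are placeholders I made up; only pg_enable_utf8 comes from DBD::Pg. Adapt it to however your application connects.

```perl
use strict;
use warnings;

# Connection attributes; pg_enable_utf8 => 0 asks DBD::Pg to hand back
# strings without decoding them, i.e. as raw bytes.
my %raw_attrs = (
    RaiseError     => 1,
    AutoCommit     => 1,
    pg_enable_utf8 => 0,
);

# Hypothetical helper; DSN and credentials are supplied by the caller.
sub connect_raw {
    my ($dsn, $user, $password) = @_;
    require DBI;    # loaded lazily so the sketch compiles without a database
    return DBI->connect($dsn, $user, $password, { %raw_attrs });
}

# Usage (needs a reachable server, values are placeholders):
# my $dbh = connect_raw('dbi:Pg:dbname=mydb;host=localhost', 'user', 'secret');
```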

A convenient way to examine strings is printf with the "v" (vector) flag, which prints the ordinal of each character in the string, joined by dots:

printf "%vx",$string
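As an illustration, here is what the dump looks like for three sample strings (the samples are my own, not taken from your database): decoded characters, the same text as UTF-8 bytes, and the UTF-8 bytes of a replacement character.

```perl
use strict;
use warnings;

my $chars = "\x{E0}\x{E1}";        # decoded characters: a-grave, a-acute
my $utf8  = "\xC3\xA0\xC3\xA1";    # the same two characters as UTF-8 bytes
my $lost  = "\xEF\xBF\xBD";        # UTF-8 bytes of U+FFFD

# %vx prints each ordinal in hex, separated by dots.
my @dumps = map { sprintf "%vx", $_ } $chars, $utf8, $lost;
print "$_\n" for @dumps;
# prints:
# e0.e1
# c3.a0.c3.a1
# ef.bf.bd
```

If the raw contents show up as ef.bf.bd sequences, the original characters are already gone at that point.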

From there you can decide how to proceed. Probably you need to re-build the data with a consistent encoding.

Re^4: How to Encode/Decode double encoded string.
by Anonymous Monk on Sep 22, 2020 at 11:08 UTC
    Hi Haj,

    Thank you for the reply.

    Yes, I did set pg_enable_utf8 = 0; after that I still only see the same raw strings as above in the web application.

    Please note that whatever I see in the database, I see exactly the same in the application too.

    The Postgres server encoding and client encoding are both 'UTF-8'. I tried to change client_encoding to SQL_ASCII, but it didn't help.

    I am still not quite sure what exactly needs to be done in order to get around this issue.

    Thank you

      If your database actually contains ����, then the original data is gone: the database can no longer provide the invalid UTF-8 string which was converted to this sequence of replacement characters. So either the legacy version of the application didn't store correct data in the first place, or something went wrong when converting the server encoding to UTF-8. In the latter case, you might be able to grab an old database backup and restart from there.

      We don't have access to your database nor to your application, so it is up to you to find out which piece of software did the bogus UTF-8 decoding (bogus because it failed to check for errors). In Perl, you can (and should) trap this type of error like this:

      use Encode;
      eval { decode('UTF-8', $string, Encode::FB_CROAK) };
      if ($@) {
          # Well, that $string wasn't a valid UTF-8-string.
          # Go now and die in what way seems best to you. -- Denethor
      }
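For completeness, a self-contained sketch of the same check wrapped in a function. The name decode_utf8_strict is made up; Encode::LEAVE_SRC is added here because decode with a CHECK value may otherwise modify its input in place.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not valid UTF-8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;    # undef when decode() died
}

# Valid UTF-8 (C3 A0 is a-grave) decodes; Latin-1 bytes are rejected.
for my $sample ("\xC3\xA0", "\xE0\xE1\xE2\xE4") {
    my $result = decode_utf8_strict($sample);
    printf "%vx: %s\n", $sample, defined $result ? 'valid UTF-8' : 'not valid UTF-8';
}
```

Rejecting the input at this point keeps the original bytes available, so they can still be decoded with the correct encoding instead of being turned into replacement characters.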