PerlMonks  

How to Encode/Decode double encoded string.

by Anonymous Monk
on Sep 21, 2020 at 14:20 UTC (#11122000)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have the string below:

this is a test international

In one case I have the following two encoded versions of the above string:

1. this is a test international àáâä
This one is encoded properly to UTF-8 by the Encode module's UTF-8 encode function.

2. this is a test international ����
This one shows question marks inside diamonds after going through the Encode module's UTF-8 encode function.

It seems that in this case the string is getting double encoded. How can we handle this so that it works in both cases?
Thank you.

Re: How to Encode/Decode double encoded string.
by haj (Curate) on Sep 21, 2020 at 15:29 UTC

    To decode characters properly, you need to understand how they are encoded. If all you have are the plain strings, then some ... guesswork ... can't be avoided. Also note that copy-pasting encoded UTF-8 strings doesn't work well: the UTF-8 encoding of Ã contains bytes which are non-printable control characters. Here, on PerlMonks, these are converted to spaces, so I can't even work with your example string.

    Your second test string does not look doubly UTF-8 encoded. Try to find out how you created that string, and then we can work from that. If I doubly encode àáâä, I get:

     Ã Ã¡Ã¢Ã¤.
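For readers who want to reproduce that comparison, here is a minimal sketch of what double encoding does to the four characters from the question, and how it can be undone. It assumes the Encode module that ships with Perl; the helper name `hex_bytes` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', unpack '(H2)*', $_[0] }

# The four characters from the question, written as code points so the
# example does not depend on the encoding of this source file.
my $chars = "\x{e0}\x{e1}\x{e2}\x{e4}";

# Encoding once gives the normal UTF-8 bytes.
my $once = encode('UTF-8', $chars);

# Encoding again treats each of those bytes as a Latin-1 character,
# so every byte >= 0x80 turns into two bytes.
my $twice = encode('UTF-8', $once);

print 'once:  ', hex_bytes($once),  "\n";   # c3 a0 c3 a1 c3 a2 c3 a4
print 'twice: ', hex_bytes($twice), "\n";   # c3 83 c2 a0 c3 83 c2 a1 ...

# Undoing it takes two decode steps: the first gets back to the singly
# encoded bytes, the second recovers the characters.
my $single = decode('UTF-8', $twice);
my $back   = decode('UTF-8', $single);
print "round trip ok\n" if $back eq $chars;
```

Viewed as Latin-1, the doubly encoded bytes are the Ã-pairs shown above. The diamond characters in the question are a different byte pattern, which is why the second string does not look doubly encoded.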

    Nitpick: your question would be easier to read with a little formatting. You can edit your post to add HTML for lists, and code (or examples) are better wrapped between <code> and </code> tags.

      Hi Haj,

      Thank you for the reply.
      I basically have the two strings below in the Postgres database:

      select title from TABLE;
      title
      -------------------------------------------
      this is a test international ����
      #This string was inserted into the database using perl version 5.24.1
      this is a test international àáâä
      #This string was inserted into the database using an older version; I don't know the exact version, but it seems that this version doesn't handle utf8 by default. It was inserted using the legacy version of the application.

      How can we get around this issue? As I mentioned earlier, the first string gives the correct result when the Encode module is used, but the second doesn't.

      Thank you.

      Hi Haj,
      Please disregard my previous reply, as I mixed up the two strings in it.
      Thank you for the reply. I basically have the two strings below in the Postgres database:

      select title from TABLE;
      title
      -------------------------------------------
      this is a test international ����
      #This string was inserted into the database using an older version; I don't know the exact version, but it seems that this version doesn't handle utf8 by default. It was inserted using the legacy version of the application.
      this is a test international áâä
      #This string was inserted into the database using perl version 5.24.1
      Thank you

        Thanks for the clarifications! It is relevant information that the stuff comes from a Postgres database. There's a lot of encoding done behind the scenes if a database is part of the game. Postgres has a configurable server encoding and a configurable client encoding, and either one or both might have changed between the legacy and the current application.

        The string ���� is a UTF-8-encoded version of the "Unicode replacement character" (U+FFFD). You get this from software which tries to decode as UTF-8 a string containing bytes that are not valid UTF-8, and then encodes the result as UTF-8. I guess that the decoding step gets fed with plain ISO-Latin text.
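The mechanism described above can be reproduced in a few lines. This is only a sketch under the assumption that the legacy data was ISO-8859-1 (the reply itself calls this a guess); it uses Encode's default lenient decoding, which substitutes U+FFFD for malformed input.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The four accented characters as ISO-8859-1 bytes: one byte each.
# None of these bytes forms a valid UTF-8 sequence here.
my $latin1_bytes = "\xe0\xe1\xe2\xe4";

# Lenient decode (no CHECK argument): malformed input is replaced by
# U+FFFD instead of raising an error.
my $decoded = decode('UTF-8', $latin1_bytes);

# Encoding the result stores the replacement characters as UTF-8.
# Each U+FFFD becomes the three bytes ef bf bd.
my $stored = encode('UTF-8', $decoded);

print 'decoded code points: ',
    join(' ', map { sprintf 'U+%04X', ord } split //, $decoded), "\n";
print 'stored bytes:        ',
    join(' ', unpack '(H2)*', $stored), "\n";
```

Once the replacement characters have been stored, the original bytes are gone: every damaged character maps to the same U+FFFD, so no amount of decoding brings the accents back.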

        There is a chance that the bogus decoding happens in Perl's Postgres database driver. You can check that by setting the DBH option pg_enable_utf8 to zero when connecting. Your application will then be able to examine the "raw" contents, and decode accordingly.
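A rough sketch of that check might look as follows. The DSN, credentials and the table name `my_table` are placeholders, and both sub names are made up for this illustration. The fallback to ISO-8859-1 is an assumption about the legacy data, not something the thread confirms.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: treat the raw bytes as UTF-8 when they are valid
# UTF-8, otherwise assume ISO-8859-1 (an assumption, see above).
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Hypothetical fetch routine. With pg_enable_utf8 set to 0 the driver
# hands back the bytes as it received them, without decoding them.
sub fetch_raw_titles {
    my ($dsn, $user, $password) = @_;
    require DBI;    # loaded lazily; DBD::Pg must be installed to call this
    my $dbh = DBI->connect($dsn, $user, $password,
        { RaiseError => 1, pg_enable_utf8 => 0 });
    return $dbh->selectcol_arrayref('SELECT title FROM my_table');
}
```

A call could look like `my @titles = map { decode_guess($_) } @{ fetch_raw_titles('dbi:Pg:dbname=mydb', 'me', 'secret') };`, again with placeholder values. Note that the bytes still pass through the Postgres client encoding, so the "raw" content is only as raw as that setting allows.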

        A convenient way to examine strings is printf with the "v" (vector) flag:

        printf "%vx",$string

        From there you can decide how to proceed. Probably you need to re-build the data with a consistent encoding.
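To help read the `%vx` output, here is a small sketch that prints the hex form of a few typical cases together with a rough hint. The sub name `describe` and the heuristics are made up for this illustration; they are guesses based on byte patterns, not a reliable encoding detector.

```perl
use strict;
use warnings;

# Hypothetical helper: format a string the way printf "%vx" does and
# attach a rough guess about what the pattern means.
sub describe {
    my ($string) = @_;
    my $hex  = sprintf '%vx', $string;
    my $hint =
        $string =~ /\x{FFFD}|\xef\xbf\xbd/      ? 'replacement character, original data lost'
      : $string =~ /\xc3[\x82\x83]\xc2[\x80-\xbf]/ ? 'looks doubly UTF-8 encoded'
      : $string =~ /[\xc2-\xf4][\x80-\xbf]/     ? 'looks like UTF-8 bytes'
      : $string =~ /[^\x00-\x7f]/               ? 'non-ASCII but not UTF-8, maybe Latin-1 or already decoded'
      :                                           'plain ASCII';
    return "$hex  ($hint)";
}

print describe($_), "\n" for
    "\xe0",                # e0
    "\xc3\xa0",            # c3.a0
    "\xc3\x83\xc2\xa0",    # c3.83.c2.a0
    "\xef\xbf\xbd";        # ef.bf.bd
```

Rows that show `ef.bf.bd` cannot be repaired from the database alone and would have to be re-imported from the original source, which fits the suggestion to rebuild the data with a consistent encoding.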
