Re: How to concatenate utf8 safely?

by gregor42 (Parson)
on Oct 25, 2016 at 13:22 UTC

in reply to How to concatenate utf8 safely?

andal : " ... First of all, you have to worry about representation of characters in the octets that you receive from external applications. That depends on locale settings ... "

OP " ... I assume that the problem is my code and not the data coming in since one can usually depend on people to get their own names right ... "

It would appear that my initial assumption was incorrect. I challenged it and, as it turns out, what I am dealing with is a mixture of localized character sets, taken as input from across Europe and cut & pasted between spreadsheets in an HR department spanning multiple offices.


These are conscientious people, mind you, who care about getting the characters just right, potentially editing with multiple programs along the way...

I'm glad that I asked and should have done so sooner.

Now, in the proper mindset and having done my revision, a big thing I was missing was that I was using :utf8 instead of :encoding(utf8). Switching to the latter allowed me to regain the trust factor in the data.

I had all kinds of stupid ideas and bad assumptions that led me to chase phantoms. Now at least I can identify mangled input on the way in.
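
For anyone landing here later, this is a minimal sketch of one way to spot mangled input on the way in. It is not the code from my actual script. It assumes the octets are read raw and then decoded explicitly with Encode; the helper name decode_utf8_strict and the sample names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): strictly decode octets as UTF-8.
# Returns the decoded character string, or undef if the octets are malformed.
sub decode_utf8_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $octets in place.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

# Made-up sample data: the same name in two different encodings.
my $utf8_octets   = "Ren\xC3\xA9 Dupont";   # encoded as UTF-8
my $latin1_octets = "Ren\xE9 Dupont";       # encoded as ISO-8859-1

for my $octets ($utf8_octets, $latin1_octets) {
    my $name = decode_utf8_strict($octets);
    if (defined $name) {
        print "ok: decoded ", length($name), " characters\n";
    }
    else {
        print "mangled input: not valid UTF-8\n";
    }
}
```

The plain :utf8 layer just flags the data as UTF-8 without validating it, which is why bad octets slipped through unnoticed before. Decoding explicitly with FB_CROAK turns a bad record into something the code can catch and report.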

Re^2: How to concatenate utf8 safely?
by choroba (Archbishop) on Oct 25, 2016 at 21:15 UTC
    > :utf8 instead of :encoding(utf8)

    Switch to :encoding(UTF-8); it's even safer. See Re^2: Read and write UTF-8 for an example.
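
To see why the spelling matters, a small sketch along these lines asks Encode which encoding object each name resolves to. Per the Encode documentation, "utf8" is Perl's lax internal variant, while "UTF-8" is the strict one that follows the Unicode standard.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Ask Encode which encoding object each spelling resolves to.
for my $label ('utf8', 'UTF-8') {
    printf "%-6s resolves to %s\n", $label, find_encoding($label)->name;
}
```

The strict variant reports its name as utf-8-strict, which is the one you get from :encoding(UTF-8).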

