http://qs321.pair.com?node_id=1174461

gregor42 has asked for the wisdom of the Perl Monks concerning the following question:

I am humbled and seeking help.

This concerns data containing names so getting it Right is important.

It is likely that I am fundamentally missing something when it comes to safely concatenating strings.

A hand-rolled point solution sometimes works as intended and at other times results in the dreaded:

Wide character in syswrite
error.

I assume that the problem is my code and not the data coming in since one can usually depend on people to get their own names right.. But then i18n characters are tricksy, like Hobbits...

use Encode qw(decode is_utf8);
# Decode each argument unless Perl already flags it as characters, then join.
sub jibe {
    my ($s, $t) = @_;
    return join('', map { is_utf8($_) ? $_ : decode('utf8', $_) } $s, $t);
}

To give it context, let's say that we are creating a common name from a given name plus a surname: (Anglo-centric, I know...)

my $cn = jibe(jibe($givenname," "),$sn);

Thank you in advance for any nudges in the right direction that anyone might provide.



Wait! This isn't a Parachute, this is a Backpack!

Re: How to concatenate utf8 safely?
by choroba (Cardinal) on Oct 21, 2016 at 14:34 UTC
    Using is_utf8 somewhere outside of Encode is usually wrong. It doesn't tell you whether the string is UTF-8, it tells you how Perl internally keeps the value.
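
    To make the point about internal storage concrete, here is a minimal sketch. It uses only the built-in utf8::is_utf8 and utf8::upgrade functions; the variable names are made up for illustration. Both strings hold the same single character, yet the flag differs.

```perl
use strict;
use warnings;

# Two strings holding the same single character, U+00E9 (e with acute).
my $native   = "\xE9";      # stored internally as one byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# Perl considers them equal as strings...
my $same = ($native eq $upgraded) ? 1 : 0;

# ...but the internal flag differs, so it says nothing about the content.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

print "equal=$same native_flag=$flag_native upgraded_flag=$flag_upgraded\n";
```

    So a check like is_utf8($s) cannot tell decoded text from raw octets; it only reports how Perl happens to hold the value.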

    Make sure you have the input encoding layer set up properly, and the same for the output. Then, you can just join the strings safely without any hassle.
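
    A minimal sketch of that setup, assuming the data is UTF-8. The in-memory filehandles and the sample names stand in for real files and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical input: two lines of UTF-8 encoded octets, as they would
# arrive from a file or socket.
my $octets = "Zo\xC3\xAB\nM\xC3\xBCller\n";

# The input layer decodes octets to characters while reading.
open my $in, '<:encoding(UTF-8)', \$octets or die "open in: $!";
chomp(my $givenname = <$in>);
chomp(my $sn        = <$in>);
close $in;

# Plain concatenation; no is_utf8 checks needed.
my $cn = $givenname . ' ' . $sn;

# The output layer encodes characters back to octets while writing.
my $written = '';
open my $out, '>:encoding(UTF-8)', \$written or die "open out: $!";
print {$out} $cn, "\n";
close $out;

printf "characters: %d, octets written: %d\n", length($cn), length($written);
```

    With real files, the same layer strings go into open or binmode on the actual handles.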

    BTW, why do you use syswrite instead of print?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: How to concatenate utf8 safely?
by andal (Hermit) on Oct 24, 2016 at 06:56 UTC

    Normally you should never have to worry about joining strings together in Perl. A simple "a" . "b" should just work. If you have a problem with that, then most likely you are missing how Perl handles characters and encodings. Try reading perldoc Encode carefully.

    Just in case, here is a simplified description. Applications exchange data as bytes, or octets. Octets are not the same as the characters that humans read: one character can be represented by multiple octets. If your program does not care about characters (it does not change their case, does not split on characters, and so on), then it may simply take data in and give data out without worrying about UTF-8, Unicode or whatever. But usually one has to manipulate characters, and that's where the confusion starts.

    First of all, you have to worry about how characters are represented in the octets that you receive from external applications. That depends on locale settings, but most modern Unix systems provide characters encoded as UTF-8. After you receive data from outside, you have to tell Perl the encoding of the data, so that Perl can split that data into characters. This is done either by calling Encode::decode directly, or by adjusting the input stream so that it does this operation for you (by using binmode, for example). After this, Perl is ready to view your data as characters instead of octets.
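
    As an illustration of the explicit route, here is a small sketch that calls Encode::decode directly. The sample name is made up, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as received from outside: UTF-8 for a five-letter name whose
# fourth letter is e with acute (two octets).
my $octets = "Ren\xC3\xA9e";

# Tell Perl the encoding, turning octets into characters.
my $chars = decode('UTF-8', $octets);

# Character operations now see whole characters instead of octets.
my $upper = uc $chars;

printf "octets: %d, characters: %d\n", length($octets), length($chars);
```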

    Of course, you also have to worry about strings that you type directly into Perl code. Perl has to know their encoding as well. If your editor saves files as UTF-8 by default, then you can put "use utf8;" into the code so that Perl decodes all your quoted strings and patterns from UTF-8 automatically. Or, without "use utf8;", you can call Encode::decode on them directly.
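
    A small sketch of the effect of "use utf8;", assuming this snippet itself is saved as UTF-8. The pragma is lexically scoped, so the same literal can be measured with and without it.

```perl
use strict;
use warnings;

# No "use utf8" in scope here: the literal is seen as its raw octets.
my $len_plain = length "é";

my $len_utf8;
{
    use utf8;   # source literals in this block are decoded from UTF-8
    $len_utf8 = length "é";
}

print "without use utf8: $len_plain, with use utf8: $len_utf8\n";
```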

    The two steps above ensure that Perl knows how to split your strings into characters. But if you want to output your character strings to the outside world, you have to do the reverse conversion from a character string to an octet string. Again, you can either call Encode::encode directly, or configure your output stream so that it does this for you automatically.
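
    A sketch of the explicit output step, tied to the syswrite case from the question. The sample name is made up. syswrite expects octets, and handing it a string with characters above 255 is what produces "Wide character in syswrite", so the string is encoded first.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string; U+0141 (L with stroke) is above 255.
my $cn = "Zo\x{EB} \x{141}ukasz";

# Convert characters to UTF-8 octets before they leave the program.
my $octets = encode('UTF-8', $cn);

# syswrite now receives plain octets.
my $written = syswrite(STDOUT, $octets . "\n");

printf "characters: %d, octets: %d\n", length($cn), length($octets);
```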

    If all the steps are handled correctly, then you never have to worry about string concatenation.

Re: How to concatenate utf8 safely?
by gregor42 (Parson) on Oct 25, 2016 at 13:22 UTC

    andal : " ... First of all, you have to worry about representation of characters in the octets that you receive from external applications. That depends on locale settings ... "

    OP " ... I assume that the problem is my code and not the data coming in since one can usually depend on people to get their own names right ... "

    It would appear that my initial assumption was incorrect. I challenged it, and as it turns out, what I am dealing with is a mixture of localized character sets taken as input from across Europe, cut & pasted between spreadsheets in an HR department spanning multiple offices.

    (ノωノ)

    These are conscientious people, mind you, who are concerned about getting the characters just right by potentially editing with multiple programs along the way...

    I'm glad that I asked and should have done so sooner.

    Now, in the proper mindset and having done my revision, a big thing I was missing was that I was using :utf8 instead of :encoding(utf8). Switching to :encoding(utf8) allowed me to regain trust in the data.

    I had all kinds of stupid ideas and bad assumptions that led me to chase phantoms. Now at least I can identify mangled input on the way in.
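
    One way to spot mangled input on the way in is to decode strictly and trap the failure. This is only a sketch: decode_or_undef is a hypothetical helper name, and the sample octets are made up.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns the decoded string, or undef if the octets
# are not well-formed UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    return scalar eval { decode('UTF-8', $octets, FB_CROAK) };
}

my $good = decode_or_undef("M\xC3\xBCller");   # valid UTF-8
my $bad  = decode_or_undef("M\xFCller");       # Latin-1 octets, not UTF-8

print "good input: ", (defined $good ? "decoded" : "rejected"), "\n";
print "bad input:  ", (defined $bad  ? "decoded" : "rejected"), "\n";
```

    Rejected values can then be logged or retried with another encoding instead of being silently passed along.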

      > :utf8 instead of :encoding(utf8)

      Switch to :encoding(UTF-8), it's even safer: "UTF-8" is Encode's strict variant, while "utf8" is Perl's more lenient internal one. See Re^2: Read and write UTF-8 for an example.
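
      A short sketch of the difference between the two spellings, using Encode's find_encoding. The surrogate test sequence is chosen here for illustration and is not from the linked node.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode FB_CROAK);

# The two spellings resolve to different encoding objects.
my $lax_name    = find_encoding('utf8')->name;    # Perl's lenient variant
my $strict_name = find_encoding('UTF-8')->name;   # the strict variant

print "utf8  resolves to: $lax_name\n";
print "UTF-8 resolves to: $strict_name\n";

# Octets that look like UTF-8 but encode a surrogate (U+D800), which the
# UTF-8 standard does not allow.
my $copy = "\xED\xA0\x80";
my $strict_result = eval { decode('UTF-8', $copy, FB_CROAK) };

print "strict decode ", (defined $strict_result ? "accepted" : "rejected"),
      " the sequence\n";
```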