Re^5: Unicode strings internals

in reply to Re^4: Unicode strings internals
in thread [SOLVED] Unicode strings internals

Ok, you probably mean that some UTF-8 multi-byte characters (i.e. non-ASCII-7bit characters) are mapped to a Latin-1 single-byte encoding

Yes, character mappings are Latin-1 (though I think it's actually CP-1252) by default. Sorry for the lack of clarity.

You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.

An essential point, from my perspective, in understanding Unicode is distinguishing between characters and particular encodings of characters: the difference between Unicode/UTF and UTF-8 (or UCS-2 or UTF-EBCDIC or...). The character string doesn't change when the string is upgraded to the internal UTF-8 encoding, even though the underlying bytes do. But this is largely a semantic argument.
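To make the character/encoding distinction concrete, here is a minimal sketch using the core utf8:: functions. The variable names are only illustrative. It builds the same one-character string twice, upgrades one copy, and compares what stays the same (the characters) with what changes (the flag and the internal byte count).

```perl
use strict;
use warnings;

# "\xE9" is the single character U+00E9 (e with acute accent).
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # switch the internal storage to UTF-8

# Same characters: the strings compare equal and have the same length.
my $same_chars = ($native eq $upgraded) && (length($native) == length($upgraded));

# Different storage: only the upgraded copy carries the UTF-8 flag.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Size of the upgraded copy's internal buffer, in bytes.
my $internal_bytes = do { use bytes; length($upgraded) };

print "same characters: ", ($same_chars ? "yes" : "no"), "\n";
print "flags: native=$flag_native upgraded=$flag_upgraded\n";
print "internal bytes of upgraded copy: $internal_bytes\n";
```

The `use bytes` peek is there only for inspection; ordinary code should not need to look at the internal buffer.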

before syswrite to raw filehandle

The use of a raw filehandle for output of non-binary data is what throws me in all this. I don't know your particular application, but it seems much more natural to filter for non-ASCII characters by filtering for non-ASCII characters. You can even do better if you are concerned about corrupted data by excluding most control characters:

# \x{...} takes hex: allow only tab (9), LF (A), CR (D) and printable ASCII (20-7E)
if ($string =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{7E}]/) {
    # someone messed up
}
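A minimal sketch of this kind of check wrapped in a helper, with the allowed code points written in hexadecimal (tab 0x9, LF 0xA, CR 0xD, space 0x20 through tilde 0x7E), since `\x{...}` escapes take hex values. The helper name `has_unexpected_chars` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string contains anything other than
# tab, LF, CR, or printable ASCII (space through tilde), else 0.
sub has_unexpected_chars {
    my ($string) = @_;
    return $string =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{7E}]/ ? 1 : 0;
}

my @samples = (
    "plain text\twith tab\n",    # only allowed characters
    "caf\x{E9}",                 # Latin-1 e-acute
    "123 \x{439}",               # Cyrillic letter
    "bell\x{7}",                 # control character outside the allowed set
);

for my $s (@samples) {
    print has_unexpected_chars($s) ? "rejected\n" : "ok\n";
}
```

The check works on characters, so it behaves the same whether or not the string carries the internal UTF-8 flag.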

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.


Replies are listed 'Best First'.
Re^6: Unicode strings internals by vsespb (Chaplain) on May 10, 2013 at 21:30 UTC
The use of a raw filehandle for output of non-binary data

That was the problem: I used the raw filehandle for binary-only data. Something like this:

# $binarydata is binary data, $id is an ASCII-only number, $command is an ASCII-only string,
# so the result of the concatenation should be binary data
my $line = "$id\t$command\t$datalength\t$binarydata";
syswrite $file, $line ...

However, I received $id in another part of the program, like this:

my ($id, $filename) = split (/\t/, $record);

The problem is that $record was a UTF-8 character string by intention and contained a non-ASCII filename. Thus the ASCII-only $id had the UTF-8 bit set, and thus $line was a UTF-8 non-ASCII character string with $binarydata screwed up (i.e. bytes converted from Latin-1 to UTF-8). Surprisingly, everything worked fine, as the screwed-up $binarydata was converted back (bytes from UTF-8 to Latin-1) when I wrote it using syswrite(). So I noticed that strange implementation only when I added some additional stuff to that code (like using bytes::length somewhere).

So I am thinking now: either I am responsible for making sure that $id never has the UTF-8 bit set, or I should, in addition, test it with "confess if is_utf8($id)". Or maybe I should never concatenate binary data with known ASCII-only data. Or maybe even never concatenate with known binary data...
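The effect described in this reply can be reproduced without any file I/O. The sketch below uses made-up stand-in values for $record and $binarydata and leaves out $command and $datalength for brevity. It compares length with bytes::length on the concatenated line, then tries one possible guard, utf8::downgrade on $id, which is an assumption about how one might fix it rather than what the poster actually did.

```perl
use strict;
use warnings;

# Stand-ins for the values in the post (all hypothetical).
my $binarydata = "\xFF\xFE\x00\x80";          # four raw bytes
my $record     = "42\t\x{439}\x{439}.txt";    # character string, non-ASCII filename

my ($id, $filename) = split /\t/, $record;    # $id is "42" but inherits the UTF-8 flag

my $line = "$id\t$binarydata";                # concatenation upgrades the binary part

my $id_flagged   = utf8::is_utf8($id)   ? 1 : 0;
my $line_flagged = utf8::is_utf8($line) ? 1 : 0;
my $chars = length($line);                    # counts characters
my $bytes = do { use bytes; length($line) };  # counts bytes in the internal buffer

print "id flagged: $id_flagged, line flagged: $line_flagged\n";
print "length: $chars, bytes::length: $bytes\n";

# One possible guard: clear the flag on $id before building the line.
# utf8::downgrade dies if the string holds characters above 0xFF.
utf8::downgrade($id);
my $fixed         = "$id\t$binarydata";
my $fixed_flagged = utf8::is_utf8($fixed) ? 1 : 0;
my $fixed_bytes   = do { use bytes; length($fixed) };

print "after downgrade: flagged $fixed_flagged, bytes::length $fixed_bytes\n";
```

With the flag cleared on $id, length and bytes::length agree again, which is the property the original code was silently relying on.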
Re^7: Unicode strings internals by kennethk (Abbot) on May 10, 2013 at 22:13 UTC
It sounds like your bug would only rear its head when `$id` actually contains non-ASCII characters. The canonical method for handling this, as I understand it, is to explicitly encode incoming text streams that are potentially problematic; i.e.

use Encode qw(encode);
my ($id, $filename) = split (/\t/, $record);
$id = encode ("UTF-8", $id);

I'd watch out for the 'filtering programmer input' trap in all this; the Perl philosophy of giving people as much rope as they like means that a properly motivated foolish programmer can always outwit your filtering. Since you expect that `$id` is printable ASCII, I'd be more inclined to filter using my regex above, and re-examine the logic that introduced UTF encoding sensitivity into the code in the first place. YMMV, of course.
Re^8: Unicode strings internals by vsespb (Chaplain) on May 10, 2013 at 22:32 UTC
It sounds like your bug would only rear its head when $id actually contains non-ASCII characters.

No! ASCII only: letters and digits. Just like in the example in my original posting:

my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
my ($ascii_but_utf, undef) = split ' ', $utfstring;
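The two-line example above can be extended to inspect the flag directly and to try the encode approach suggested earlier in the thread. This is only a sketch; $before, $after and $as_bytes are names introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
my ($ascii_but_utf, undef) = split ' ', $utfstring;

# The substring is pure ASCII, yet it inherits the flag from $utfstring.
my $before = utf8::is_utf8($ascii_but_utf) ? 1 : 0;

# encode() returns a byte string; for pure ASCII the bytes are unchanged.
my $as_bytes = encode("UTF-8", $ascii_but_utf);
my $after    = utf8::is_utf8($as_bytes) ? 1 : 0;

print "flag before encode: $before\n";
print "flag after encode:  $after\n";
print "content unchanged:  ", ($as_bytes eq "123" ? "yes" : "no"), "\n";
```

Note that encode() would turn any non-ASCII characters into multi-byte sequences, so it only leaves the content untouched when the value really is ASCII, as $id is expected to be here.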
