Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re^4: Unicode strings internals

by vsespb (Chaplain)
on May 10, 2013 at 18:49 UTC ( #1033006=note: print w/replies, xml ) Need Help??


in reply to Re^3: Unicode strings internals
in thread [SOLVED] Unicode strings internals

This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte even though they cleanly map to 1-byte characters on output.

Ok, you probably mean that some UTF-8 multi-bytes characters (i.e. non-ASCII-7bit characters) are maped to Latin-1 single-bytes encoding (which is "default" in some cases). They are also have codes greater than 127 but less than 255 in Unicode standard.

Indeed. "Wide character" warning is not thrown when data can be mapped to Latin-1 (single byte encoding). Also data is written in Latin-1, not UTF-8 in this case (different bytes, it's often not what is expected).

That answers my question. I was wrong in assumption that "Wide character" warning is thrown for any non-ASCII-7bit characters, contained in character strings.

What is a "wide character"?
This is a term used both for characters with an ordinal value greater than 127, characters with an ordinal value greater than 255, or any character occupying more than one byte, depending on the context. The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead.
This:
You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.
I do not agree. My understanding that strings with UTF-8 bit on and with length()<>bytes::length() are character strings with non-ASCII-7bit characters. (one might want to check valid() also)
When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.
Yes, agree. This is in faq too.
Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all.
But it looks to me that is_utf8, length, bytes::length still can be used before syswrite to raw filehandle, to detect that data is broken (i.e. detect programmer mistake on early stage). I.e. in this case either syswrite will terminate or will write data as latin-1 (which is what is not expected in my case).

Replies are listed 'Best First'.
Re^5: Unicode strings internals
by kennethk (Abbot) on May 10, 2013 at 21:06 UTC
    Ok, you probably mean that some UTF-8 multi-bytes characters (i.e. non-ASCII-7bit characters) are maped to Latin-1 single-bytes encoding
    Yes, character mappings are Latin-1 (though I think it's actually CP-1252) by default. Sorry for the lack of clarity.
    You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.
    An essential point, from my perspective, in understanding Unicode is making a differentiation between characters and particular encodings of characters; the difference between Unicode/UTF and UTF-8 (or UCS-2 or UTF-EBCDIC or...). The character string doesn't change when the string is upgraded to internal UTF encoding, even though the byte literals do. But this is largely a semantic argument.
    before syswrite to raw filehandle
    The use of a raw filehandle for output of non-binary data is what throws me in all this. I don't know your particular application, but it seems much more natural to filter for non-ASCII characters by filtering for non-ASCII characters. You can even do better if you are concerned about corrupted data by excluding most control characters:
    if ($string =~ /[^\x{9}\x{10}\x{13}\x{20}-\x{126}]/) { # someone messed up }

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      The use of a raw filehandle for output of non-binary data

      That was the problem, that I used raw filehandle for binary-only data.

      something like this:

      # $binarydata is binary data, $id is a ASCII-only number, $command is +ASCII-only string. # so result of concatenation should be binary data my $line = "$id\t$command\t$datalength\t$binarydata"; syswrite $file, $line ...

      However i've received $id in another part of program, like this:

      my ($id, $filename) = split (/\t/, $record);

      Problem that $record was UTF-8 character string by intention and contained non-ASCII filename. Thus ASCII-only $id had UTF-8 bit set.

      And thus $line was UTF-8 non-ASCII character string with $binarydata screwed (i.e. bytes converted from Latin-1 to UTF-8).

      Suprisely everything worked fine, as screwed $binarydata was converted back (bytes from UTF-8 to Latin-1) when I wrote it using syswrite().

      So I notices that strange implementation only when added some additional stuff to that code (like I used bytes::length somewhere).

      So I am thinking now, either I am responsible to make sure that $id never will have UTF-8 bit set. Either I should, in additional, test it with "confess if is_utf8($id)". Or maybe I should never concatenate binary data with known ASCII-only-data.Or maybe even never concatenate with known binary data...

        It sounds like your bug would only rear its head when $id actually contains non-ASCII characters. The canonical method for handling this, as I understand it, is to explicitly encode incoming text streams that are potentially problematic; i.e.
        my ($id, $filename) = split (/\t/, $record); $id = encode ("UTF-8", $id);
        I'd watch out for the 'filtering programmer input' trap in all this; the Perl philosophy of giving people as much rope as they like means that a properly-motivated foolish programmer can always outwit your filtering. Since you expect that $id is printable ASCII, I'd more inclined to filter using my regex above, and re-examine the logic the introduced UTF encoding sensitivity into the code in the first place. YMMV, of course.

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1033006]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (3)
As of 2023-03-30 20:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Which type of climate do you prefer to live in?






    Results (74 votes). Check out past polls.

    Notices?