PerlMonks  

Re^3: Unicode strings internals

by kennethk (Abbot)
on May 10, 2013 at 17:39 UTC ( [id://1032996]=note )


in reply to Re^2: Unicode strings internals
in thread [SOLVED] Unicode strings internals

Note that if you modify line 8 to
my $ascii_but_utf = '123';
the output changes to
SV = PV(0x22ae1d0) at 0x2300b20
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x22d21f0 "123\321\204\321\204\321\204\321\204"\0
  CUR = 11
  LEN = 16
ALL OK
This is because the UTF-8 flag being on is just a historical artifact of your initialization.

If we take a look at the two output files generated by these two cases, you'll note that both contain 11 bytes, despite the fact that a byte dump of the UTF-upgraded case would show 19 bytes (3 ASCII bytes plus 8 high-bit characters at 2 bytes each). This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte, even though they cleanly map to 1-byte characters on output. You wouldn't expect these 1-byte characters to trigger a wide-character warning any more than you'd expect an ASCII character to.
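A minimal sketch of that byte-count behavior, assuming the string from the dump above ("123" followed by eight high-bit bytes). The helper name bytes_written and the in-memory filehandle are stand-ins for the thread's real output files, not part of the original script:

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

# Hypothetical helper: print a string to an in-memory filehandle
# (no encoding layer) and return the bytes that were written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $plain    = "123" . ("\xd1\x84" x 4);   # 11 bytes, UTF8 flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);                  # same 11 characters, UTF8 flag on

printf "characters:     %d vs %d\n", length($plain), length($upgraded);
printf "internal bytes: %d vs %d\n", bytes::length($plain), bytes::length($upgraded);
printf "output bytes:   %d vs %d\n",
    length(bytes_written($plain)), length(bytes_written($upgraded));
```

Only the middle line should differ between the two strings: the internal buffer grows, while the character count and the bytes written stay the same.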

Second case does look like UTF-8 character string,

You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string. When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Replies are listed 'Best First'.
Re^4: Unicode strings internals
by vsespb (Chaplain) on May 10, 2013 at 18:49 UTC
    This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte even though they cleanly map to 1-byte characters on output.

    OK, you probably mean that some UTF-8 multi-byte characters (i.e. non-ASCII characters) are mapped to the Latin-1 single-byte encoding (which is the "default" in some cases). They also have code points greater than 127 but no greater than 255 in the Unicode standard.

    Indeed. The "Wide character" warning is not thrown when the data can be mapped to Latin-1 (a single-byte encoding). Also, the data is written as Latin-1, not UTF-8, in this case (different bytes, which is often not what is expected).

    That answers my question. I was wrong in my assumption that the "Wide character" warning is thrown for any non-ASCII character contained in a character string.

    What is a "wide character"?
    This is a term used both for characters with an ordinal value greater than 127, characters with an ordinal value greater than 255, or any character occupying more than one byte, depending on the context. The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead.
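A small sketch of the rule quoted above, assuming an in-memory filehandle without an encoding layer in place of a real file. The helper name print_and_capture is made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: print a string to a layer-less in-memory handle,
# returning the bytes written and the number of warnings raised.
sub print_and_capture {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return ($buffer, scalar @warnings);
}

# U+00E9 fits in ISO-8859-1; U+0444 does not.
my ($latin1_out, $latin1_warnings) = print_and_capture(chr 0xE9);
my ($wide_out,   $wide_warnings)   = print_and_capture(chr 0x444);

printf "U+00E9: %d byte(s) written, %d warning(s)\n",
    length($latin1_out), $latin1_warnings;
printf "U+0444: %d byte(s) written, %d warning(s)\n",
    length($wide_out), $wide_warnings;
```

The first character should go out as a single Latin-1 byte with no warning; the second should raise "Wide character in print" and go out as UTF-8.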
    This:
    You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.
    I do not agree. My understanding is that strings with the UTF-8 flag on and with length() != bytes::length() are character strings containing non-ASCII characters (one might want to check utf8::valid() also).
    When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.
    Yes, I agree. This is in the FAQ too.
    Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all.
    But it looks to me that is_utf8, length and bytes::length can still be used before a syswrite to a raw filehandle, to detect that the data is broken (i.e. to catch a programmer mistake at an early stage). In this case syswrite will either terminate or write the data as Latin-1 (which is not what is expected in my case).
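A rough sketch of such a pre-write check, built only from utf8::is_utf8, length and bytes::length as described above. The helper names looks_like_character_string and assert_raw_bytes are invented for illustration:

```perl
use strict;
use warnings;
use bytes ();
use Carp qw(confess);

# Hypothetical check: the UTF8 flag is on and at least one character
# needs more than one byte internally, i.e. the string is not pure ASCII.
sub looks_like_character_string {
    my ($data) = @_;
    return utf8::is_utf8($data) && length($data) != bytes::length($data) ? 1 : 0;
}

# Hypothetical guard to call right before syswrite on a raw filehandle.
sub assert_raw_bytes {
    my ($data) = @_;
    confess "character string passed where raw bytes were expected"
        if looks_like_character_string($data);
    return $data;
}

my $binary   = "\xff\x00";
my $upgraded = $binary;
utf8::upgrade($upgraded);   # simulates an accidental upgrade

printf "binary:   %d\n", looks_like_character_string($binary);
printf "upgraded: %d\n", looks_like_character_string($upgraded);
```

Note that a flagged but ASCII-only string passes this check, since its character and byte lengths are equal.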
      Ok, you probably mean that some UTF-8 multi-bytes characters (i.e. non-ASCII-7bit characters) are maped to Latin-1 single-bytes encoding
      Yes, character mappings are Latin-1 (though I think it's actually CP-1252) by default. Sorry for the lack of clarity.
      You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.
      An essential point, from my perspective, in understanding Unicode is distinguishing between characters and particular encodings of characters: the difference between Unicode/UTF and UTF-8 (or UCS-2 or UTF-EBCDIC or...). The character string doesn't change when the string is upgraded to the internal UTF encoding, even though the byte literals do. But this is largely a semantic argument.
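To illustrate the characters-versus-encodings distinction, here is a short sketch using the core Encode module. The sample string and the choice of three encodings are assumptions made for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $native   = "caf\x{e9}";   # 4 characters, stored internally as 4 bytes
my $upgraded = $native;
utf8::upgrade($upgraded);     # same 4 characters, stored internally as 5 bytes

# The character string is unchanged by the upgrade.
my $same_string = $native eq $upgraded ? 1 : 0;

# One character string, several different byte encodings of it.
my %encoded = map { $_ => encode($_, $upgraded) } qw(ISO-8859-1 UTF-8 UTF-16BE);

printf "same characters: %s\n", $same_string ? 'yes' : 'no';
for my $name (sort keys %encoded) {
    printf "%-10s %d bytes\n", $name, length($encoded{$name});
}
```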
      before syswrite to raw filehandle
      The use of a raw filehandle for output of non-binary data is what throws me in all this. I don't know your particular application, but it seems much more natural to filter for non-ASCII characters by filtering for non-ASCII characters. You can even do better if you are concerned about corrupted data by excluding most control characters:
      if ($string =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{7E}]/) {
          # someone messed up
      }
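A quick sketch of that filter in use. Since \x{...} takes hexadecimal, tab, line feed and carriage return are \x{9}, \x{A} and \x{D}, and printable ASCII runs from \x{20} to \x{7E}. The helper name has_unexpected_chars and the sample strings are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string holds anything other than tab, LF, CR,
# or printable ASCII (space through tilde).
sub has_unexpected_chars {
    my ($string) = @_;
    return $string =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{7E}]/ ? 1 : 0;
}

for my $sample ("id\t42\r\n", "caf\x{e9}", "bell\x{7}") {
    printf "%s\n", has_unexpected_chars($sample) ? 'someone messed up' : 'ok';
}
```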

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        The use of a raw filehandle for output of non-binary data

        That was the problem: I used a raw filehandle for binary-only data.

        something like this:

        # $binarydata is binary data, $id is an ASCII-only number, $command is an ASCII-only string,
        # so the result of concatenation should be binary data
        my $line = "$id\t$command\t$datalength\t$binarydata";
        syswrite $file, $line ...

        However, I received $id in another part of the program, like this:

        my ($id, $filename) = split (/\t/, $record);

        The problem is that $record was intentionally a UTF-8 character string and contained a non-ASCII filename. Thus the ASCII-only $id had the UTF-8 flag set.

        And thus $line was a non-ASCII UTF-8 character string with $binarydata corrupted (i.e. its bytes converted from Latin-1 to UTF-8).

        Surprisingly, everything worked fine, as the corrupted $binarydata was converted back (bytes from UTF-8 to Latin-1) when I wrote it using syswrite().
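A sketch that reproduces this round trip under a few assumptions: the record contents and the binary payload are invented, and print to an in-memory filehandle stands in for syswrite, which needs a real file descriptor (both downgrade a flagged string when the handle has no encoding layer):

```perl
use strict;
use warnings;
use bytes ();
use Encode qw(decode);

# Hypothetical record: an ASCII id, a tab, and a non-ASCII filename,
# decoded from UTF-8 into a character string.
my $record = decode('UTF-8', "42\t\xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb.txt");
my ($id, $filename) = split /\t/, $record;

my $binarydata = "\xff\xfe\x00\x01";   # 4 raw bytes
my $line       = "$id\t$binarydata";   # concatenation upgrades the binary part

printf "id flagged:   %s\n", utf8::is_utf8($id)   ? 'yes' : 'no';
printf "line flagged: %s\n", utf8::is_utf8($line) ? 'yes' : 'no';
printf "length %d, bytes::length %d\n", length($line), bytes::length($line);

# Writing without an encoding layer converts the string back to single bytes.
my $written = '';
open my $fh, '>', \$written or die "Cannot open in-memory handle: $!";
print {$fh} $line;
close $fh;

printf "written intact: %s\n", $written eq "42\t\xff\xfe\x00\x01" ? 'yes' : 'no';
```

The mismatch between length and bytes::length is the only visible symptom, which matches the observation that the problem surfaced only once bytes::length entered the picture.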

        So I noticed this strange behavior only when I added some additional stuff to that code (like using bytes::length somewhere).

        So now I am thinking that either I am responsible for making sure that $id never has the UTF-8 flag set, or I should additionally test it with "confess if is_utf8($id)". Or maybe I should never concatenate binary data with data that is known to be ASCII-only. Or maybe never concatenate with known binary data at all...
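One possible way to act on the first option, offered as a sketch rather than a recommendation: normalise each ASCII-only field to a byte string before it touches binary data. The helper name ascii_bytes is made up, and utf8::downgrade is used here because it cannot fail once the field is known to be pure ASCII:

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical helper: return a byte-string copy of an ASCII-only field,
# or die loudly if the field is not ASCII after all.
sub ascii_bytes {
    my ($field) = @_;
    confess "non-ASCII data in a field expected to be ASCII"
        if $field =~ /[^\x00-\x7F]/;
    utf8::downgrade($field);   # clears the UTF8 flag on the local copy
    return $field;
}

my $id = "42";
utf8::upgrade($id);            # simulates an id split out of a character string

my $binarydata = "\xff\xfe";
my $line = ascii_bytes($id) . "\t" . $binarydata;

printf "line flagged: %s, length %d\n",
    utf8::is_utf8($line) ? 'yes' : 'no', length($line);
```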
