http://qs321.pair.com?node_id=208101

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

This program
use utf8; chr(192) =~ /\w/;
tells me "Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xc0) in pattern match (m//)"

As documented in perlfunc, chr seems to take Unicode characters just fine, producing legal UTF-8 for most values, but for some reason is working wrong for arguments in the range 191 through 255.

Here is more proof:

foreach (90, 192, 257, 0x263a) { print "$_ - ", join ('.',unpack('C*', +chr($_))),"\n" }
gives me:
90 - 90
192 - 192
257 - 196.129
9786 - 226.152.186
Note that 192 maps to a single 192 byte, not 195.128.

What's up?

—John

Replies are listed 'Best First'.
(tye)Re: problem with chr function
by tye (Sage) on Oct 25, 2002 at 20:23 UTC

    Well, this is a confusing situation.

    By default, chr turns a value from 0..255 into a single-character/single-byte ASCII string and turns a value of 256 or larger into a single-character/multi-byte UTF8 string (according to testing and the source code, not according to the documentation). If you have said 'use bytes', then a value of 256 or larger is instead &ed with 0xff and the result converted into a single-character/single-byte ASCII string.
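A minimal sketch of how to observe this, assuming a Perl of 5.8.1 or later (newer than the 5.6.1 in the question) where `utf8::is_utf8` and `bytes::length` are available. The helper name `describe_chr` is made up for illustration:

```perl
use strict;
use warnings;
require bytes;    # only for bytes::length; does not switch on byte semantics

# Hypothetical helper: report how Perl stores chr($n) internally.
sub describe_chr {
    my ($n) = @_;
    my $s = chr($n);
    return (
        utf8::is_utf8($s) ? 1 : 0,    # is the internal UTF8 flag set?
        length($s),                   # length in characters
        bytes::length($s),            # length in stored bytes
    );
}

for my $n (90, 192, 257, 0x263a) {
    printf "%5d  utf8_flag=%d  chars=%d  bytes=%d\n", $n, describe_chr($n);
}
```

On such a Perl, 90 and 192 should come back unflagged and one byte long, while 257 and 0x263a come back flagged and two and three bytes long respectively.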

    So chr(192) produces an ASCII string (always). For each string, Perl keeps track of both the bytes that compose the string and whether it thinks those bytes are in ASCII or in UTF8. Regular expressions, however, don't yet know how to distinguish ASCII strings from UTF8 strings in order to treat them differently. Instead, whether to treat strings as ASCII or UTF8 is determined when the regular expression is compiled: a regular expression is compiled to expect UTF8 if you've said 'use utf8' or if there are Unicode characters in the regular expression (not in the string that is being matched).

    So chr(192) creates an ASCII string that is not a valid UTF8 string and 'use utf8' causes the regular expression to be compiled to expect UTF8 strings. Then you give it something that isn't valid as a UTF8 string so it fails.

    It might be smart (and is a very simple patch) to change chr so that, if you've said 'use utf8', it converts values in 128..255 into single-character/multi-byte UTF8 strings. It might even be wise to have it convert values in 0..127 into single-character/single-byte UTF8 strings (that requires a better vision of where Unicode support is headed in Perl than I currently have).

    It is already planned to have regular expressions be compiled into polymorphic code such that the compiled regex can deal with both ASCII and Unicode strings. When that happens, 'use utf8' should no longer affect regular expressions.

    Perl's support for Unicode is still in flux and so there are still some inconsistencies and lots of confusing bits.

    Currently, if you want the UTF8 string for the character 192, you'll need to convert chr(192) into a UTF8 string. See the encode modules for some ways to do this. perlunicode has more on Unicode support for the different releases of Perl.
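One way to do that conversion, sketched under the assumption that the Encode module is installed (it ships with perl 5.8 and later, so not with the 5.6.1 build in the question). The helper name `utf8_octets` is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 octets for code point $n, dot-separated.
sub utf8_octets {
    my ($n) = @_;
    my $octets = encode('UTF-8', chr($n));    # byte string holding the UTF-8 encoding
    return join '.', unpack 'C*', $octets;
}

print "$_ - ", utf8_octets($_), "\n" for 90, 192, 257, 0x263a;
```

With this, 192 comes out as 195.128, the two-byte form the original question expected, and the other three values match the output shown in the question.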

            - tye
      I think I understand. chr will convert all values < 256 into a string having byte persuasion, since it can; and only encodes a string of UTF-8 persuasion if it has to.

      Meanwhile, the regex engine is expecting a UTF-8-encoded string and assumes it is, rather than understanding that it has a character whose ordinal is correctly encoded for the kind of string it is. IOW, the regex engine is not respecting the persuasion of the input argument.

      If the regex engine properly treated the input string as an abstract string of characters, regardless of how they were encoded, then it truly would not matter how chr decided to encode it.

      I expected chr to always emit a UTF-8 string if utf8 was in effect.
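For comparison, a small sketch of how the same match behaves on much later Perls. It assumes 5.14 or newer (for the /u regex modifier), no locale pragma and no `unicode_strings` feature in effect; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $native   = chr(192);    # stored as a single byte, UTF8 flag off
my $upgraded = chr(192);
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

my %match = (
    native_default => ($native   =~ /\w/  ? 1 : 0),   # byte string, default rules
    native_u       => ($native   =~ /\w/u ? 1 : 0),   # byte string, Unicode rules forced
    upgraded       => ($upgraded =~ /\w/  ? 1 : 0),   # UTF-8 flagged string
);

printf "%-15s %d\n", $_, $match{$_} for sort keys %match;
print "strings compare equal\n" if $native eq $upgraded;
```

Under those assumptions there is no "Malformed UTF-8" error, but the two storage forms still differ in whether \w matches unless /u is given, even though the strings compare equal.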

Re: problem with chr function
by fglock (Vicar) on Oct 25, 2002 at 18:53 UTC

    From utf8 docs:

    Note that if you have bytes with the eighth bit on in your script (for example embedded Latin-1 in your string literals), use utf8 will be unhappy since the bytes are most probably not well-formed UTF-8.

    I guess that, since 192 is "malformed", it is not re-encoded to utf8.

    update: From the "use encoding" docs (perl 5.8 only?):

    This pragma also affects encoding of the 0x80..0xFF code point range: normally characters in that range are left as eight-bit bytes (unless they are combined with characters with code points 0x100 or larger, in which case all characters need to become UTF-8 encoded), but if the encoding pragma is present, even the 0x80..0xFF range always gets UTF-8 encoded.

      No, that is saying that if you used an 8-bit character set in all its glory, then the parser will not like that as UTF8. That is, a byte containing 192 within the script, perhaps in a string literal, would mean whatever the console's code page says it means in a legacy script, but under 'use utf8' Perl would get upset because the byte doesn't follow the rules it assumes for multi-byte characters.

P.S. It's AS build 633 (Re: problem with chr function)
by John M. Dlugosz (Monsignor) on Oct 25, 2002 at 18:52 UTC
    P.S.
    This is perl, v5.6.1 built for MSWin32-x86-multi-thread
    (with 1 registered patch, see perl -V for more detail)
    
    Copyright 1987-2001, Larry Wall
    
    Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
    Built 21:33:05 Jun 17 2002
    
Re: problem with chr function
by Arrowhead (Monk) on Oct 26, 2002 at 16:36 UTC
    It could just be a bug in perl versions < 5.8.0, as the code runs without complaints on perl5.8.0.