comment on

Well, this is a confusing situation.

By default, chr turns a value from 0..255 into a single-character/single-byte ASCII string and turns a value of 256 or larger into a single-character/multi-byte UTF8 string (according to testing and the source code, not according to the documentation). If you have said 'use bytes', then a value of 256 or larger is instead &ed with 0xff and the result converted into a single-character/single-byte ASCII string.

So chr(192) produces an ASCII string (always). Now, regular expressions don't yet know how to distinguish ASCII strings from UTF8 strings (for each string, Perl keeps track of both the bytes that compose the string and whether it thinks those bytes are in ASCII or in UTF8) in order to treat them differently. Instead, whether to treat strings as ASCII or UTF8 is determined when the regular expression is compiled. So a regular expression is compiled to expect UTF8 if you've said 'use utf8' or if there are Unicode characters in the regular expression (not in the string that is being matched).

So chr(192) creates an ASCII string that is not a valid UTF8 string and 'use utf8' causes the regular expression to be compiled to expect UTF8 strings. Then you give it something that isn't valid as a UTF8 string so it fails.

It might be smart (and is a very simple patch) to change chr so that, if you've said 'use utf8', it converts values in 128..255 into single-character/multi-byte UTF8 strings. It might even be wise to have it convert values in 0..127 into single-character/single-byte UTF8 strings (that requires a better vision of where Unicode support is headed in Perl than I currently have).

It is already planned to have regular expressions be compiled into polymorphic code such that the compiled regex can deal with both ASCII and Unicode strings. When that happens, 'use utf8' should no longer affect regular expressions.

Perl's support for Unicode is still in flux and so there are still some inconsistancies and lots of confusing bits.

Currently, if you want the UTF8 string for the character 192, you'll need to convert chr(192) into a UTF8 string. See the encode modules for some ways to do this. perlunicode has more on Unicode support for the different releases of Perl.

- tye

In reply to (tye)Re: problem with chr function by tye
in thread problem with chr function by John M. Dlugosz

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


The stupid question is the question not asked
	PerlMonks