Re: use locale broken?

Read perldoc perlunicode. There you'll find:

   Interaction with Locales
       Use of locales with Unicode data may lead to odd results.  Currently, Perl attempts to
       attach 8-bit locale info to characters in the range 0..255, but this technique is
       demonstrably incorrect for locales that use characters above that range when mapped into
       Unicode.  Perl's Unicode support will also tend to run slower.  Use of locales with
       Unicode is discouraged.

In other words, since your locale uses the UTF-8 encoding, you don't need "use locale" in your program. Perl will use the appropriate Unicode definitions to handle your strings. When you request locale support as well, you confuse perl and get unexpected results.

Basically, with Perl's Unicode support you don't need to worry about the locale inside your program. The locale settings become important only when data leaves the Perl script. When that happens, the environment (for example, the shell) gets just a sequence of bytes, which has to be interpreted somehow. The locale defines how those bytes will be interpreted, so your Perl code has to make sure that the data it outputs suits that interpretation. Effectively, you just need to make sure that your filehandles output data in the correct encoding.
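A minimal sketch of that boundary handling, assuming the core Encode module and a made-up two-byte input (the UTF-8 bytes for "é"). The variable names are illustrative only and not from the original question:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the two bytes a UTF-8 terminal sends for "é".
my $bytes_in = "\xC3\xA9";

# Decode at the boundary: from here on the program works with characters.
my $text = decode('UTF-8', $bytes_in);

# length() counts characters after decoding and bytes before it.
my $char_count = length $text;
my $byte_count = length $bytes_in;

# Encode at the boundary on the way out, in whatever encoding the consumer expects.
my $utf8_out   = encode('UTF-8', $text);
my $latin1_out = encode('ISO-8859-1', $text);
```

For filehandles, an I/O layer such as binmode(STDOUT, ':encoding(UTF-8)') applies the same encoding step to everything printed, so the code in between never has to deal with bytes.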

Re^2: use locale broken? by december (Pilgrim) on Mar 17, 2011 at 18:01 UTC
I was hoping to have it work whether the user's (shell) encoding is ISO-8859-1 or UTF-8. Maybe I'm better off forcefully converting all input and output to UTF-8 and having the code itself deal with Unicode only. I still feel this is a bug in Perl, though. Is there a way – perhaps a debugging argument – to see what `\w` applies to?
Re^3: use locale broken? (\w) by ikegami (Patriarch) on Mar 17, 2011 at 19:12 UTC
   Maybe I'm better off forcefully converting all input and output to UTF-8

Yes. For many reasons, it is best to decode all inputs and encode all outputs.

   I still feel this is a bug in Perl, though.

I believe Perl doesn't support multi-byte locales (e.g. UTF-8). Effort is placed on Unicode instead of adding to the locale system.

   Is there a way – perhaps debugging argument – to see what \w applies to?

perlre: Match a "word" character (alphanumeric plus "_").

The following are equivalent: (No, this is wrong)

   /\w/      # When no locale, when not restricted to ASCII
   /\p{Word}/
   /[_\p{Alnum}]/
   /[_\p{Alphabetic}\p{Nd}]/

Derived property "Alphabetic". (100,520 codepoints in Perl 5.12.2)
Unicode character category "Nd". (411 codepoints in Perl 5.12.2)
Actual lists vary by version of Unicode and thus by version of Perl.
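Since the actual list varies by Perl version, one way to see what \w applies to is simply to ask the regex engine. The following is only a sketch: matches_word and word_chars_in are hypothetical helper names, and utf8::upgrade is used on the assumption that you want Unicode rules rather than locale or byte semantics for characters in the range 128..255.

```perl
use strict;
use warnings;

# Hypothetical helper: does this code point match \w under Unicode rules?
sub matches_word {
    my ($code_point) = @_;
    my $char = chr $code_point;
    # Store the string as UTF-8 internally so the regex engine applies
    # Unicode semantics even to code points below 256.
    utf8::upgrade($char);
    return $char =~ /\w/ ? 1 : 0;
}

# Hypothetical helper: every code point in [$from, $to] that matches \w.
sub word_chars_in {
    my ($from, $to) = @_;
    return grep { matches_word($_) } $from .. $to;
}

# Example: inspect the upper half of Latin-1.
my @latin1_word = word_chars_in(0x80, 0xFF);
printf "%d word characters in U+0080..U+00FF\n", scalar @latin1_word;
```

Running the same loop over a wider range, or under a different perl, shows how the set changes between Unicode versions.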

