Maybe I'm better off forcefully converting all input and output to UTF-8
Yes. For many reasons, it is best to decode all inputs, and encode all output.
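A minimal sketch of what "decode all inputs, encode all output" can look like, using the core Encode module. The sample bytes and the variable names are assumptions made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket:
# the two-byte UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decode at the input boundary: bytes become a string of characters.
my $chars = decode('UTF-8', $bytes);

# Inside the program, work on characters, not bytes.
my $length_in_chars = length $chars;
my $is_word_char    = $chars =~ /^\w$/ ? 1 : 0;

# Encode at the output boundary: characters become bytes again.
my $out = encode('UTF-8', $chars);
```

For file handles, an `:encoding(UTF-8)` layer set through `open` or `binmode` does the same job at the handle level.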
I still feel this is a bug in Perl, though.
I believe Perl doesn't support multi-byte locales (e.g. UTF-8).
Effort is placed on Unicode support instead of on extending the locale system.
Is there a way – perhaps debugging argument – to see what \w applies to?
perlre: Match a "word" character (alphanumeric plus "_").
The following are equivalent:
( No, this is wrong )
/\w/ # When no locale, when not restricted to ASCII
/\p{Word}/
/[_\p{Alnum}]/
/[_\p{Alphabetic}\p{Nd}]/
Derived property "Alphabetic". (100,520 codepoints in Perl 5.12.2)
Unicode character category "Nd". (411 codepoints in Perl 5.12.2)
Actual lists vary by version of Unicode and thus by version of Perl.
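Since the lists depend on the perl in use, one way to see what \w applies to is to test every code point against it. The following is only a sketch: `count_matches` is a hypothetical helper name, and the numbers it prints will differ between Perl versions, so none are shown here.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence warnings about non-character code points

# Count the code points in 0..$max whose one-character string matches $re.
# $max defaults to the highest Unicode code point.
sub count_matches {
    my ($re, $max) = @_;
    $max = 0x10FFFF unless defined $max;
    my $count = 0;
    for my $cp (0 .. $max) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
        my $ch = chr $cp;
        utf8::upgrade($ch);   # force Unicode semantics for code points below 256
        $count++ if $ch =~ $re;
    }
    return $count;
}

# Compare the patterns discussed above on the running perl.
my @patterns = (
    [ '\w'                       => qr/\w/ ],
    [ '\p{Word}'                 => qr/\p{Word}/ ],
    [ '[_\p{Alnum}]'             => qr/[_\p{Alnum}]/ ],
    [ '[_\p{Alphabetic}\p{Nd}]'  => qr/[_\p{Alphabetic}\p{Nd}]/ ],
);
for my $p (@patterns) {
    my ($label, $re) = @$p;
    printf "%-26s %d\n", $label, count_matches($re);
}
```

Patterns that are truly equivalent print the same count; a differing count shows that one pattern matches code points the other does not. The `utf8::upgrade` call matters because, without it, characters 128 to 255 are matched by \w under native rather than Unicode rules.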