PerlMonks
Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8)
by John M. Dlugosz (Monsignor) on Dec 24, 2002 at 04:08 UTC ( [id://222033]=perlquestion )
John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:
In a module, I was wondering whether to use utf8 or not, since it affects the regular expressions. In 5.6, the user of the module would have to pass strings of the matching encoding discipline or it would not work right. But I read that in 5.8 the regex is polymorphic and will transparently accept either kind of string, so this is not an issue any more.

But the new perlunicode states:

"The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode character scheme when presented with Unicode data--or instead uses a traditional byte scheme when presented with byte data. use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts." {emph. in original}

So, does that mean I still need to use utf8 in scope in order to generate this polymorphic code, or only if the regex uses Unicode features such as \x{} literals or the enhanced meaning of \w, or what? It seems to be saying two different things here.

And that's not the only place. In encoding, it states:

"The pragma is a per script, not a per block lexical. Only the last use encoding or no encoding matters, and it affects the whole script. ... the use of this pragma inside the module is strongly discouraged (because the influence of this pragma lasts not only for the module but the script that uses). But if you have to, make sure you say no encoding at the end of the module so you contain the influence of the pragma within the module."

So, if you put no encoding at the end of your module's pm file to "contain" it, doesn't that kill any use encoding at the top of the script, since only the last use or no has an effect? And I would think it would affect the file (e.g. module, required or do'ed step), not the whole script, since it would have to make two passes to make the last (overall) one affect the earlier-read files. And for run-time require, that just does not compute.

If you're discouraged from using it inside a module, what good is it?
A Greek can't write his reusable code in a Greek code page. And if he writes his main file that way, then it will mess up any modules (encoded as Latin-1) that he tries to use. That is so nuts that I can only suppose that the documentation is broken. What's the real story here?

Meanwhile, is use utf8 necessary for extended variable names? use encoding doesn't apply, but I wonder if Perl would take the normal G1 range as letters or (I suppose) as unknowns?

--John
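For the regex half of the question, here is a minimal sketch of the kind of test that would show how one build behaves. It is not an answer from the documentation. The helper name first_word is hypothetical, the strings are built with chr() so that the file stays plain ASCII and has no use utf8 anywhere, and it assumes Perl 5.8.1 or later so that utf8::is_utf8 is available.

```perl
use strict;
use warnings;

# Hypothetical helper: a single pattern, compiled with no "use utf8" in scope.
sub first_word {
    my ($s) = @_;
    return $s =~ /(\w+)/ ? $1 : undef;
}

# Byte string: plain ASCII, internal UTF8 flag off.
my $bytes = "hello world";

# Character string: code points above 255 (Greek alpha, beta, gamma),
# built with chr() so the source file itself contains only ASCII.
my $chars = chr(0x3b1) . chr(0x3b2) . chr(0x3b3) . " delta";

for my $s ($bytes, $chars) {
    my $kind = utf8::is_utf8($s) ? "character string" : "byte string";
    my $word = first_word($s);
    # Print only the length so no wide characters reach STDOUT.
    printf "%-16s first word has %d character(s)\n", $kind, length $word;
}

# A \x{} escape in the pattern also compiles without "use utf8".
print "alpha found in character string\n" if $chars =~ /\x{3b1}/;
print "alpha not found in byte string\n"  if $bytes !~ /\x{3b1}/;
```

If the same sub matches both kinds of string here, then the pattern itself does not need use utf8 in scope, which would leave the pragma governing only how the source text of the file is read.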