Unicode and You

... One code to rule them all

Ladies and gentlemen of the Monastery of Perl: perl 5.8. If I could offer you only one tip for the future, 5.8 would be it.

I haven't run perl 5.8 through too many production rigors yet, but as some of you may be aware, I have been doing quite a bit of Unicode development. And for this, 5.8 wins hands down. If you plan on working with Unicode (or probably any exotic encoding), upgrade. Upgrade, upgrade, upgrade. Feed your sysadmin horse tranquilizers if you have to, but upgrade. You'll thank yourself later.

In my recent foray into Unicode I've stumbled across two subtle bugs in 5.6 that'd drive you batty if you didn't know they were there. First, a point that is not so much a bug as a disagreement with the documentation: it seems the utf8 pragma is sometimes required for UTF-8 strings.

Now the bugs. With utf8 on in 5.6.0 and 5.6.1, the regexen s/^\s{1,0}// and s/^\s{0,0}// will consume all leading whitespace. (There are potentially others; those just happen to be what I ran into. And yes, of course they are silly regexps, but they were generated on the fly.) How's that for lovely?

The other bug is a fair bit more subtle. When doing something like print join("", map(chr, 0x17d, 0x17e)) in 5.6.0, an extra pair of bytes is printed before the 4 bytes the code creates. The solution appears to be to not do that. Instead, start with a null, or apparently any other ASCII character, or even print join("", "", map(chr, 0x17d, 0x17e)) :-P
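For anyone who wants to check their own perl, here is a minimal sketch that compares the plain join with the empty-string workaround by counting the UTF-8 bytes each one produces. It assumes the Encode module is available (it ships with 5.8, so on a stock 5.6 you would have to inspect the printed bytes some other way), and the variable names are only for illustration. On a perl without the bug both strings should encode to the same 4 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);    # assumption: Encode is installed (core since 5.8)

# Two characters built from code points U+017D and U+017E.
my $direct  = join("", map(chr, 0x17d, 0x17e));

# The workaround from the post: lead the list with an empty string.
my $guarded = join("", "", map(chr, 0x17d, 0x17e));

# Encode both to UTF-8 so we can look at the actual bytes.
my $direct_bytes  = encode("UTF-8", $direct);
my $guarded_bytes = encode("UTF-8", $guarded);

printf "direct:  %d bytes (%s)\n", length($direct_bytes),  unpack("H*", $direct_bytes);
printf "guarded: %d bytes (%s)\n", length($guarded_bytes), unpack("H*", $guarded_bytes);
```

If the first line reports more than 4 bytes while the second reports exactly 4, you are probably looking at the same problem.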

This is not to say the trip will be easy, though it may help if you didn't bother to try it in 5.6. Unicode is not an easy thing to get your mind around. Good luck.

-- perl -pew "s/\b;([mnst])/'$1/g"


Replies are listed 'Best First'.
Re: Unicode and You by Courage (Parson) on Aug 18, 2002 at 11:25 UTC
Could you please elaborate a bit on the first bug you mentioned? It seems like very weird behaviour. Why does perl behave that way? Is it worth reporting via perlbug?

Courage, the Cowardly Dog

PS. While I agree with you about the benefits of upgrading to 5.8.0 because of better Unicode support, there are some incompatibilities that make migration harder. (One example is that sockets became text mode by default on Win32.)
Re: Unicode and You by crenz (Priest) on Aug 19, 2002 at 13:23 UTC
Yes, I second your comments. Besides working correctly, perl 5.8 also adds a couple of nifty features. For example, it lets you conveniently set the input and output character sets for a filehandle and will take care of all the necessary encoding for you. And it adds more alphabet/character classes for regexps.

One thing I disliked about 5.6.1 was that it was impossible to tell it that I wanted my input and output as UTF-8. In some situations, it kept treating my UTF-8-encoded input as raw 8-bit characters and tried to encode them as UTF-8 again when printing them to STDOUT... I did solve my problems in the end, but it took me a while to work around it.

perl 5.6.0 was worse. I wrote a simple module to convert traditional Chinese characters to simplified ones (yes, I know there are two on CPAN already, but I had a good reason to do so), using a conversion table in a hash. For whatever reason, 5.6.0 would produce malformed characters, but only in some cases -- on 5.6.1 it works fine.

Now the only problem I'm facing is... writing my scripts so they will work well (or fail gracefully) with 5.8.0, 5.6.1, 5.6.0 etc...
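To make the filehandle point concrete, here is a small sketch of what 5.8 allows, assuming the :encoding(UTF-8) PerlIO layer and in-memory filehandles (both arrived with 5.8). The variable names are made up for the example, and the \p{Lu} match at the end is just one of the Unicode property classes you can use in a regexp.

```perl
use strict;
use warnings;

# Write two characters through a UTF-8 encoding layer into an
# in-memory "file" (a plain scalar), so no temp file is needed.
my $buffer = '';
open(my $out, ">:encoding(UTF-8)", \$buffer) or die "open for write: $!";
print {$out} chr(0x17d), chr(0x17e);
close($out) or die "close: $!";

# $buffer now holds raw bytes; the layer did the encoding for us.
printf "bytes written: %d\n", length($buffer);

# Read them back through the same layer and get characters again.
open(my $in, "<:encoding(UTF-8)", \$buffer) or die "open for read: $!";
my $text = <$in>;
close($in);
printf "characters read: %d\n", length($text);

# Unicode property classes work on the decoded text.
my $upper = () = $text =~ /\p{Lu}/g;
printf "uppercase letters: %d\n", $upper;
```

The same layer syntax works with binmode, e.g. binmode(STDOUT, ":encoding(UTF-8)"), which is what stops the double encoding described above.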
Re: Re: Unicode and You by belg4mit (Prior) on Aug 19, 2002 at 18:11 UTC
>Now the only problem I'm facing is... writing my scripts so they will work well (or fail gracefully) with 5.8.0, 5.6.1, 5.6.0 etc...

That's actually what I'm working on, only I'm keeping the span open for /5\.00\d/, although I only have to handle input, not output. My solution has been to handle the raw bytes and do the Unicode conversions myself. It seems to work.

-- perl -pew "s/\b;([mnst])/'$1/g"
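For the curious, doing the conversion by hand can look roughly like the sketch below. This is not the actual code from the project mentioned above, just an illustration of the idea; utf8_bytes_to_codepoints is a hypothetical helper name. It only uses unpack and bit arithmetic, so it does not depend on perl's own Unicode support, and it deliberately skips the finer validation (overlong forms, surrogates) that a real decoder should do.

```perl
use strict;

# Hypothetical helper: turn a string of raw UTF-8 bytes into a list
# of code point numbers, without relying on perl's Unicode support.
sub utf8_bytes_to_codepoints {
    my ($bytes) = @_;
    my @octets = unpack("C*", $bytes);
    my @codepoints;
    while (@octets) {
        my $lead = shift @octets;
        my ($need, $cp);
        # The lead byte tells us how many continuation bytes follow.
        if    ($lead < 0x80)           { ($need, $cp) = (0, $lead) }
        elsif (($lead & 0xE0) == 0xC0) { ($need, $cp) = (1, $lead & 0x1F) }
        elsif (($lead & 0xF0) == 0xE0) { ($need, $cp) = (2, $lead & 0x0F) }
        elsif (($lead & 0xF8) == 0xF0) { ($need, $cp) = (3, $lead & 0x07) }
        else { die sprintf("invalid lead byte 0x%02x\n", $lead) }
        # Each continuation byte contributes its low 6 bits.
        for (1 .. $need) {
            my $cont = shift @octets;
            die "truncated or malformed sequence\n"
                unless defined($cont) && ($cont & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($cont & 0x3F);
        }
        push @codepoints, $cp;
    }
    return @codepoints;
}

# The four bytes from the root post decode to U+017D and U+017E.
my @cps = utf8_bytes_to_codepoints("\xC5\xBD\xC5\xBE");
printf "U+%04X\n", $_ for @cps;
```

Because everything stays as numbers, the caller decides what to do with each code point (table lookup, transliteration, and so on), and the same code behaves identically whichever perl runs it.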
Re: Unicode and You by crenz (Priest) on Aug 20, 2002 at 14:34 UTC
Well, in my case it would mean I would have to do my own UTF-8 conversion -- which would mean reimplementing a lot of code that's already there in later Perl versions... sounds a bit silly. But I guess it depends on your application.

