legacy code, utf8 and Perl 5.8.0

barrachois has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to install PerlPoint (a slide presentation system), but it seems to be choking on Perl 5.8.0's version of Unicode.

The runtime error looks like this:

Malformed UTF-8 character (unexpected non-continuation byte 0xf6, immediately after start byte 0xe4) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0xfc, immediately after start byte 0xf6) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0xc4, immediately after start byte 0xfc) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xc4) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xd6) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0xdf, immediately after start byte 0xdc) at ppParser.yp line 1477.
Malformed UTF-8 character (unexpected non-continuation byte 0x5d, immediately after start byte 0xdf) at ppParser.yp line 1477.

and the corresponding source code line is

# prepare a common pattern
my $patternWUmlauts=qr/[\wäöüÄÖÜß]+/;

Does anyone know of a "use" pragma or other simple fix which would do what the author intends yet allow this to work under Perl 5.8.0?

Re: legacy code, utf8 and Perl 5.8.0 by graff (Chancellor) on Feb 13, 2003 at 01:06 UTC
If you happen to have "use utf8" anywhere in your script, that is what triggers the error messages. Your old code contains single-byte renderings of the accented characters, in whatever character set is native to your data and editor (latin1? cp-something-or-other?). If you don't have "use utf8" anywhere in the script, then something in your environment is probably setting the locale in a way that makes Perl assume "use utf8" ought to be in effect. Either way, if you put "no utf8" in the script, the problem should go away.

Alternatively, if you assign that string of single-byte accented characters to a scalar and use Encode::decode() to create a utf8 version of the string, you should then be able to use the utf8 string in the regex:

use Encode;
...
my $s = 'äöüÄÖÜß';
my $u = decode('latin1', $s);
my $patternWUmlauts=qr/[\w$u]+/;
...

update: Of course, you would only use the decode approach if the data to be tested against the regex are in utf8 now, or if you want to make sure to produce utf8 output from the data. In the latter case, input data that happens to be single-byte would need to be decoded as well before being matched against the utf8 version of the regex. If the input is still single-byte, and you still want the output to be single-byte, just say "no utf8".
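For readers who want to try the decode approach in isolation, here is a minimal, self-contained sketch. It is not taken from PerlPoint: the variable names other than $patternWUmlauts and the sample input "Größe ist" are made up for illustration, and it assumes the legacy data is ISO-8859-1 (latin1). The accented characters are written as byte escapes so the script itself stays pure ASCII and does not depend on the editor's encoding or on "use utf8".

```perl
use strict;
use warnings;
use Encode qw(decode);

# The umlauts and sharp s as Latin-1 bytes (the same byte values that show
# up in the error messages: e4 f6 fc c4 d6 dc df).
my $latin1_bytes = "\xE4\xF6\xFC\xC4\xD6\xDC\xDF";

# Decode the bytes into a Perl character string.
my $umlauts = decode('iso-8859-1', $latin1_bytes);

# Same shape as the pattern in the question, built from decoded characters.
my $patternWUmlauts = qr/[\w$umlauts]+/;

# Hypothetical input arriving as Latin-1 bytes; decode it before matching
# so that pattern and data are both character strings.
my $raw   = "Gr\xF6\xDFe ist";
my $input = decode('iso-8859-1', $raw);

my ($word) = $input =~ /($patternWUmlauts)/;

printf "matched %d characters\n", length $word;   # matched 5 characters
```

Because both the pattern and the input are decoded from the same encoding, the match sees one character per accented letter. If the data stays single-byte from input to output, this step is unnecessary and "no utf8" alone is the simpler route.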
Re: Re: legacy code, utf8 and Perl 5.8.0 by barrachois (Pilgrim) on Feb 13, 2003 at 15:08 UTC
Thanks; the "no utf8;" seems to work fine.

