Re: Re: s/.// increases length - bug or badly documented feature

Reducing length is normal behaviour.

Of course I have example code, but I wanted someone to ask for it first.

A character can be multiple bytes. Usually, when you want multi-byte characters, you use the utf8 pragma. However, multi-byte characters are already possible in strings without that pragma.

#!/usr/bin/perl -l
$_ = chr(12345);
print "Length: ", length; # Length: 1
s/.//;
print "Length: ", length; # Length: 2

Although you can have a multi-byte character without using the pragma, the dot in the regex apparently still uses bytes (if this snippet had "use utf8", the string would be empty after the substitution). Either the regex is wrong to use bytes, or length is wrong to use characters. Both are documented to deal with characters, but apparently a character and a character are not the same thing.

chr(12345) is 3 bytes in size, but a single character. With utf8, s/.// removes the first character, and thus all three bytes. Without utf8, s/.// removes the first _BYTE_, leaving two bytes that can't be seen as a single character, and so making a string of two characters.
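
For comparison, here is a minimal sketch of the character/byte distinction. It assumes Perl 5.8 or later with the bundled Encode module, not the 5.6 behaviour shown above. The helper name char_and_byte_counts is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: returns (character count, UTF-8 byte count).
sub char_and_byte_counts {
    my ($string) = @_;
    return (length($string), length(encode_utf8($string)));
}

my $s = chr(12345);    # U+3039: one character, three bytes in UTF-8
my ($n_chars, $n_bytes) = char_and_byte_counts($s);
print "Characters: $n_chars, UTF-8 bytes: $n_bytes\n";    # Characters: 1, UTF-8 bytes: 3

# On a Perl where regexes work on characters, the dot consumes
# the whole character, so nothing is left over.
(my $rest = $s) =~ s/.//;
print "Length after s/.//: ", length($rest), "\n";        # Length after s/.//: 0
```

Here length counts characters, and the byte count is taken only after explicitly encoding the string to UTF-8.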

Possible solutions would be:

perl dies on the chr(12345) if utf8 is not used
length returns bytes instead of characters when utf8 is not used
regexes use characters instead of bytes when utf8 is not used

I think perl's behaviour is buggy. Please correct me if I'm wrong.

BTW - Please do read perlstyle, because a little extra white space can make your code a lot easier to read.

Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :)   -- Whreq

Re: Re: Re: s/.// increases length - bug or badly documented feature by Biker (Priest) on Mar 01, 2002 at 09:57 UTC
I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:

Important Caveat

WARNING: The implementation of Unicode support in Perl is incomplete. The following areas need further work.

Input and Output Disciplines
There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

Regular Expressions
The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

use utf8 still needed to enable a few features
The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used. However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.

"a little extra white space can make your code a lot easier to read"

Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed. Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)

Everything will go worng!
Re: Re: Re: Re: s/.// increases length - bug or badly documented feature by Juerd (Abbot) on Mar 01, 2002 at 19:32 UTC
"There is currently no easy way to mark data read from a file or other external source as being utf8."

So adding broken unicode-support in a way rendered Perl unusable for external string input. Great! Now we have a really great and fast programming language that can handle text very well, but not if the text has unicode and the utf8 pragma has not been used.

Is the moral of this story: "don't just always use strict, always use utf8 too"?

sub byte_length {
    # depends on bugs
    no utf8;
    my ($string) = @_;
    my $counter;
    $counter++ while $string =~ s/.//s;
    return $counter;
}

sub has_multibytes {
    my ($string) = @_;
    return length($string) != byte_length($string);
}

Alternatives for these subs are welcome, of course.

Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq
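
One possible alternative for those subs, as a minimal sketch that does not rely on the regex behaviour. It assumes Perl 5.8 or later with the bundled Encode module, which the 5.6 discussed in this thread does not have. It also assumes "byte length" means the size of the string once encoded as UTF-8, which may differ from Perl's internal storage for characters in the 128..255 range.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Number of bytes the string occupies when encoded as UTF-8
# (an assumption about what "byte length" should mean here).
sub byte_length {
    my ($string) = @_;
    return length(encode_utf8($string));
}

# True if at least one character needs more than one byte in UTF-8.
sub has_multibytes {
    my ($string) = @_;
    return length($string) != byte_length($string);
}
```

With these definitions, byte_length(chr(12345)) is 3 and has_multibytes("abc") is false. Unlike the version above, this one leaves its argument intact and returns 0 instead of undef for an empty string.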
Re: Re: Re: s/.// increases length - bug or badly documented feature by mirod (Canon) on Mar 01, 2002 at 09:47 UTC
Unicode in regexp (and in hash keys) is known not to work properly in anything but bleadperl (maybe!). It will be fully supported in 5.8.0 though.

