s/.// increases length - bug or badly documented feature

Juerd has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks,

Should it be possible for s/.// to increase a string's length (as reported by length)?
I think it should not be, but if there is a reason it should, please let me know.
It is possible to create a string, apply s/.// to it, and end up with a length greater than the length before the substitution. All I am asking is whether that should be possible.

As a little background information, I quote from perldoc -f length:

Returns the length in characters of the value of EXPR.

And a bit from perldoc perlre:

. Match any character (except newline)

Greetings,

Juerd

Re: s/.// increases length - bug or badly documented feature by gmax (Abbot) on Mar 01, 2002 at 09:22 UTC
Very interesting. I remember reading in the Camel book that characters and bytes are treated consistently in RegExes. Apparently, there is still a grey zone in between. Before you showed some code, I thought that you were referring to some eval trick:

    $_ = '\040';
    print "before: ($_) <",eval 'length("$_")',">\n";
    s/.//;
    print "after: ($_) <",eval 'length("$_")',">\n";
    __END__
    output:
    before: (\040) <1>
    after: (040) <3>

What you have found requires further investigation.
Re: Re: s/.// increases length - bug or badly documented feature by Juerd (Abbot) on Mar 01, 2002 at 09:27 UTC
Also interesting is that the deparsed code is not equivalent: `print length chr 12345` outputs "1", and deparses to `print length "\343\200\271";`, which outputs "3" :)

Between chr and ord, things are consistent: ord chr 12345 is 12345 (maybe it should return 12345 % 256?)

`Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq`
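A minimal sketch of why the two forms differ, assuming a modern perl with the core Encode module (neither is used in the thread, and the variable names are only illustrative): the deparsed literal appears to be the three-octet UTF-8 encoding of the single character chr(12345), so length counts one character in the first case and three in the second.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, code point 12345 (U+3039).
my $char = chr(12345);

# The literal that Deparse printed in the thread: three separate octets.
my $deparsed = "\343\200\271";

# Encoding the character as UTF-8 yields exactly those three octets.
my $encoded = encode('UTF-8', $char);

printf "chars: %d, octets: %d\n", length($char), length($encoded);
print "deparsed literal equals the UTF-8 encoding\n" if $encoded eq $deparsed;

# Decoding the octets gives back the original one-character string.
my $decoded = decode('UTF-8', $deparsed);
print "round trip ok\n" if $decoded eq $char;
```

The octal escapes 343, 200 and 271 are 0xE3, 0x80 and 0xB9, which is how U+3039 is laid out in UTF-8.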
Re: Re: Re: s/.// increases length - bug or badly documented feature by BrentDax (Hermit) on Mar 01, 2002 at 09:39 UTC
    >bleadperl -MO=Deparse -e "print length chr 12345"
    print length "\x{3039}";
    -e syntax OK

This apparently will be fixed in 5.8.0 (due out in May probably), and I imagine the change will be backported to 5.6.2 as well.

=cut
--Brent Dax
There is no sig.
Re: s/.// increases length - bug or badly documented feature by clemburg (Curate) on Mar 01, 2002 at 09:39 UTC
I think this is a bug. But it is a documented bug. From the documentation on perlunicode (emphasis added by me):

The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.

...

Regular Expressions

The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

To see this, I put in a Unicode character in a position guaranteed not to match:

    #!/usr/bin/perl -l
    $_ = chr(12345);
    print "Length: ", length; # Length: 1
    s/.|[^$_]//;
    print "Length: ", length; # Length: 2

    # prints:
    # Length: 1
    # Length: 0

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: s/.// increases length - bug or badly documented feature by Biker (Priest) on Mar 01, 2002 at 08:47 UTC
This script will reduce the length of the string by one character:

    #!/usr/bin/perl -w
    use strict;
    my$str="Biker\n";
    print('Length of Biker: '.length($str)."\n");
    $str=~s/.//;
    print('Length of Biker: '.length($str)."\n");

    #BikerString.pl
    Length of Biker: 6
    Length of Biker: 5

Do you have any code example giving the strange behavior you're referring to?

Everything will go worng!
Re: Re: s/.// increases length - bug or badly documented feature by Juerd (Abbot) on Mar 01, 2002 at 09:07 UTC
Reducing length is normal behaviour. Of course I have example code, but I wanted someone to ask for it first.

A character can be multiple bytes. Usually, when you want multi-byte characters, you use the `utf8` pragma. However, multi-byte characters are already possible in strings without that pragma.

    #!/usr/bin/perl -l
    $_ = chr(12345);
    print "Length: ", length; # Length: 1
    s/.//;
    print "Length: ", length; # Length: 2

Although you can have a multi-byte character without using the pragma, the dot in the regex apparently still uses bytes (if this snippet had "use utf8", the string would be empty after the substitution).

Either the regex is wrong in using bytes, or length is wrong in using characters. Both are documented to deal with characters, but apparently a character and a character are not the same thing.

`chr(12345)` is 3 bytes in size, but a single character. With utf8, `s/.//` removes the first character, and thus all three bytes. Without utf8, `s/.//` removes the first _BYTE_, leaving two bytes that can't be seen as a single character, and so making a string of two characters.

Possible solutions would be:

- perl dies on the chr(12345) if utf8 is not used
- length returns bytes instead of characters when utf8 is not used
- regexes use characters instead of bytes when utf8 is not used

I think perl's behaviour is buggy. Please correct me if I'm wrong.

BTW - Please do read perlstyle, because a little extra white space can make your code a lot easier to read.

`Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq`
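The byte arithmetic described above can be sketched as follows. This is only an illustration under assumptions not in the thread: it uses a modern perl, where the snippet above should leave an empty string, and it simulates the old byte-wise matching by applying `s/.//` to the UTF-8 octets obtained from the core Encode module. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(12345);                  # a single character, U+3039
my $chars_before = length $char;        # counted as characters: 1

# Work on the encoded octets to mimic a regex engine that matches bytes.
my $octets = encode('UTF-8', $char);    # three octets: E3 80 B9
my $octets_before = length $octets;

# The dot now matches exactly one octet, so only the lead byte is removed.
$octets =~ s/.//;
my $units_after = length $octets;       # the two continuation bytes remain

printf "characters before: %d\n", $chars_before;
printf "octets before:     %d\n", $octets_before;
printf "units after:       %d\n", $units_after;
```

Read this way, the reported length goes from 1 to 2 because the two measurements count different things: one character before the substitution, and two orphaned bytes afterwards, each of which is then counted as a character of its own.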
Re: Re: Re: s/.// increases length - bug or badly documented feature by Biker (Priest) on Mar 01, 2002 at 09:57 UTC
I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:

Important Caveat

WARNING: The implementation of Unicode support in Perl is incomplete.

The following areas need further work.

Input and Output Disciplines

There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

Regular Expressions

The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

use utf8 still needed to enable a few features

The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.

However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.

"a little extra white space can make your code a lot easier to read"

Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed. Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)

Everything will go worng!
Re: Re: Re: Re: s/.// increases length - bug or badly documented feature by Juerd (Abbot) on Mar 01, 2002 at 19:32 UTC
Re: Re: Re: s/.// increases length - bug or badly documented feature by mirod (Canon) on Mar 01, 2002 at 09:47 UTC
Unicode in regexes (and in hash keys) is known not to work properly in anything but bleadperl (maybe!). It will be fully supported in 5.8.0 though.
Re: s/.// increases length - bug or badly documented feature by mattr (Curate) on Mar 01, 2002 at 10:38 UTC
Interesting. Even though polymorphic regexes are not available yet, there is a related regex recompiler (ShiftJIS-Regex) for Shift-JIS Japanese, which uses 2 bytes per character. I could certainly see that failing to shrink the string with a dot; I got some very interesting results when I commented out the use statement once last year.
Re: s/.// increases length - bug or badly documented feature by steves (Curate) on Mar 01, 2002 at 09:56 UTC
A note almost identical to the quote above is found on page 409 of The Camel book, 3rd edition.
Re: s/.// increases length - bug or badly documented feature by erikharrison (Deacon) on Mar 01, 2002 at 18:38 UTC
All I have is Camel 2nd, but I remember it stating that in regexes, characters are really just bytes for now (as several people have already mentioned). What I don't understand is why the length increases. I can understand not getting the desired behavior, but deleting a byte from a string should still reduce the length, correct? Or at least keep it the same?

Cheers,
Erik
