http://qs321.pair.com?node_id=148507

Juerd has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks,

Should or should it not be possible that s/.// increases a string's length (as reported by length)?
I think it should not be, but if it for any reason should, please let me know.
It is possible to create a string, s/.// it, and have a length that is greater than the length before the substitution was made. The only thing I ask is if it should be possible.

As a little background information, I quote from perldoc -f length:
Returns the length in characters of the value of EXPR.
And a bit from perldoc perlre:
. Match any character (except newline)


Greetings,

Juerd
  • Comment on s/.// increases length - bug or badly documented feature

Re: s/.// increases length - bug or badly documented feature
by gmax (Abbot) on Mar 01, 2002 at 09:22 UTC
    Very interesting. I remember reading in the Camel book that characters and bytes are treated consistently in RegExes. Apparently, there is still a grey zone in between.

    Before you showed some code, I thought that you were referring to some eval trick:
    $_ = '\040';
    # qq{} interpolates $_ into the code string, so eval sees length("\040")
    print "before: ($_) <", eval qq{length("$_")}, ">\n";
    s/.//;
    print "after: ($_) <", eval qq{length("$_")}, ">\n";
    __END__
    output:
    before: (\040) <1>
    after: (040) <3>
    What you have found requires further investigation.
     _  _ _  _  
    (_|| | |(_|><
     _|   
    
      Also interesting is that the deparsed code is not equal.
      print length chr 12345
      outputs "1", and deparses to:
      print length "\343\200\271";
      which outputs "3" :)

      Between chr and ord, things are consistent: ord chr 12345 is 12345 (maybe it should return 12345 % 256?)

      Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq

        >bleadperl -MO=Deparse -e "print length chr 12345"
        print length "\x{3039}";
        -e syntax OK
        This apparently will be fixed in 5.8.0 (due out in May probably), and I imagine the change will be backported to 5.6.2 as well.

        =cut
        --Brent Dax
        There is no sig.

Re: s/.// increases length - bug or badly documented feature
by clemburg (Curate) on Mar 01, 2002 at 09:39 UTC

    I think this is a bug. But it is a documented bug. From the documentation on perlunicode (emphasis added by me):

    The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.
    ...
    Regular Expressions
    The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

    To see this, I put a Unicode character into the pattern, in a position guaranteed not to match:

    #!/usr/bin/perl -l
    $_ = chr(12345);
    print "Length: ", length;   # Length: 1
    s/.|[^$_]//;
    print "Length: ", length;   # Length: 2

    # prints:
    # Length: 1
    # Length: 0

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Re: s/.// increases length - bug or badly documented feature
by Biker (Priest) on Mar 01, 2002 at 08:47 UTC

    This script will reduce the length of the string by one character:

    #!/usr/bin/perl -w
    use strict;
    my$str="Biker\n";
    print('Length of Biker: '.length($str)."\n");
    $str=~s/.//;
    print('Length of Biker: '.length($str)."\n");
    #BikerString.pl
    Length of Biker: 6
    Length of Biker: 5


    Do you have any code example giving the strange behavior you're referring to?


    Everything will go worng!

      Reducing length is normal behaviour.

      Of course I have example code, but I wanted someone to ask for it first.

      A character can be multiple bytes. Usually, when you want multi-byte characters, you use the utf8 pragma. However, multi-byte characters are already possible in strings without that pragma.
      #!/usr/bin/perl -l
      $_ = chr(12345);
      print "Length: ", length;   # Length: 1
      s/.//;
      print "Length: ", length;   # Length: 2


      Although you can have a multi-byte character without using the pragma, the dot in the regex apparently still uses bytes (if this snippet had "use utf8", the string would be empty after the substitution). Either the regex is wrong in using bytes, or length is wrong in using characters. Both are documented to deal with characters, but apparently a character and a character are not the same thing.

      chr(12345) is 3 bytes in size, but a single character. With utf8, s/.// removes the first character, and thus all three bytes. Without utf8, s/.// removes the first _BYTE_, leaving two bytes that can't be seen as a single character, and thus making a string of two characters.
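
The byte arithmetic in the paragraph above can be reproduced on a current perl. This is only a sketch: it does not trigger the 5.6 regex behaviour itself, it encodes the string to UTF-8 bytes by hand (using the core utf8::encode function) and then removes one byte, which is what the byte-oriented dot is described as doing. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One character, code point 12345 (U+3039).
my $char_string = chr(12345);

# Copy it and turn the copy into its raw UTF-8 bytes.
# utf8::encode converts the string in place.
my $byte_string = $char_string;
utf8::encode($byte_string);

my $chars_before = length $char_string;   # counted in characters
my $bytes_before = length $byte_string;   # counted in bytes

# Remove only the first byte, as a byte-oriented dot would.
(my $damaged = $byte_string) =~ s/.//s;
my $bytes_after = length $damaged;

printf "characters: %d, bytes: %d, bytes left after removing one: %d\n",
    $chars_before, $bytes_before, $bytes_after;
```

The two leftover bytes no longer form a valid UTF-8 sequence, so anything that counts them one by one arrives at 2, which is larger than the original character count of 1.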

      Possible solutions would be:
      1. perl dies on the chr(12345) if utf8 is not used
      2. length returns bytes instead of characters when utf8 is not used
      3. regexes use characters instead of bytes when utf8 is not used

      I think perl's behaviour is buggy. Please correct me if I'm wrong.

      BTW - Please do read perlstyle, because a little extra white space can make your code a lot easier to read

      Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq

        I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:

        Important Caveat

        WARNING: The implementation of Unicode support in Perl is incomplete.

        The following areas need further work.

        Input and Output Disciplines
        There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

        Regular Expressions
        The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

        use utf8 still needed to enable a few features
        The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.
        However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.
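
The last point of the quoted passage, that the pragma is needed for UTF-8 literals in the source text, can be checked with a small sketch like the one below. It assumes a current perl and a script file saved as UTF-8; the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal is two bytes on disk.
# Without the pragma the literal is read byte by byte.
my $without_pragma = do { no utf8;  length "é" };

# With the pragma (lexically scoped to this block) it is read as one character.
my $with_pragma    = do { use utf8; length "é" };

print "without use utf8: $without_pragma\n";
print "with use utf8: $with_pragma\n";
```

Only the way the literal in the source is read changes here; strings built at run time with chr are not affected by the pragma.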


        "a little extra white space can make your code a lot easier to read"
        Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed.
        Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)


        Everything will go worng!

        Unicode in regexps (and in hash keys) is known not to work properly in anything but bleadperl (maybe!). It will be fully supported in 5.8.0 though.

Re: s/.// increases length - bug or badly documented feature
by mattr (Curate) on Mar 01, 2002 at 10:38 UTC
    Interesting. Even though polymorphic regexes are not available yet, there is a related regex recompiler (ShiftJIS-Regex) for Shift-JIS Japanese, which is 2 bytes per character. I could certainly see that failing to shrink the string with a dot; I got some very interesting results when I commented out the use statement once last year.
Re: s/.// increases length - bug or badly documented feature
by steves (Curate) on Mar 01, 2002 at 09:56 UTC

    A note almost identical to the quote above is found on page 409 of The Camel book, 3rd edition.

Re: s/.// increases length - bug or badly documented feature
by erikharrison (Deacon) on Mar 01, 2002 at 18:38 UTC

    All I have is Camel 2nd, but I remember it stating that in regexes, characters are really just bytes for now (as several people have already mentioned). What I don't understand is why the length increases. I can understand not getting the desired behavior, but deleting a byte from a string should still reduce the length, correct? Or at least keep it the same?

    Cheers,
    Erik