Re: s/.// increases length - bug or badly documented feature
by gmax (Abbot) on Mar 01, 2002 at 09:22 UTC
Very interesting. I remember reading in the Camel book that characters and bytes are treated consistently in RegExes. Apparently, there is still a grey zone in between.
Before you showed some code, I thought that you were referring to some eval trick:
$_ = '\040';
print "before: ($_) <", eval qq{length("$_")}, ">\n";
s/.//;
print "after: ($_) <", eval qq{length("$_")}, ">\n";
__END__
output:
before: (\040) <1>
after: (040) <3>
What you have found requires further investigation.
_ _ _ _
(_|| | |(_|><
_|
|
Also interesting is that the deparsed code is not equal.
print length chr 12345
outputs "1", and deparses to:
print length "\343\200\271";
which outputs "3" :)
Between chr and ord, things are consistent: ord chr 12345 is 12345 (maybe it should return 12345 % 256?)
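A small sketch of the same comparison, assuming perl 5.8 or newer (for the \x{...}-style wide characters and utf8::decode); the variable names are only for illustration:

```perl
use strict;
use warnings;

# The literal that the 5.6 deparse printed: three separate characters,
# each below 256 (they happen to be the UTF-8 bytes of U+3039).
my $octal = "\343\200\271";

# The single wide character itself (12345 == 0x3039).
my $wide = chr 12345;

print length($octal), "\n";    # 3
print length($wide), "\n";     # 1
print ord($wide), "\n";        # 12345

# Decoding the three bytes as UTF-8 gives back the one character.
my $decoded = $octal;
utf8::decode($decoded);
print $decoded eq $wide ? "same\n" : "different\n";    # same
```

So the two literals only look alike once you decide whether you are counting bytes or characters.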
Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq
|
>bleadperl -MO=Deparse -e "print length chr 12345"
print length "\x{3039}";
-e syntax OK
This apparently will be fixed in 5.8.0 (due out in May probably), and I imagine the change will be backported to 5.6.2 as well.
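For anyone re-running the test on a later perl, here is a minimal check. It assumes 5.8.0 or newer, where as far as I know the dot matches a whole character in a string like this; $before, $copy and $after are just names picked for the sketch:

```perl
use strict;
use warnings;

my $str    = chr(12345);       # one character, U+3039
my $before = length $str;

(my $copy = $str) =~ s/.//;    # remove whatever the dot matches
my $after = length $copy;

print "before: $before, after: $after\n";
```

On such a perl this should report a length of 1 before and 0 after, rather than the 1 and 2 seen on 5.6.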
=cut
--Brent Dax
There is no sig.
Re: s/.// increases length - bug or badly documented feature
by clemburg (Curate) on Mar 01, 2002 at 09:39 UTC
I think this is a bug. But it is a documented bug.
From the documentation on perlunicode (emphasis added by me):
The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.
...
Regular Expressions
The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.
To see this, I put in a unicode character in a position guaranteed not to match:
#!/usr/bin/perl -l
$_ = chr(12345);
print "Length: ", length; # Length: 1
s/.|[^$_]//;
print "Length: ", length; # Length: 0
# prints:
# Length: 1
# Length: 0
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: s/.// increases length - bug or badly documented feature
by Biker (Priest) on Mar 01, 2002 at 08:47 UTC
#!/usr/bin/perl -w
use strict;
my$str="Biker\n";
print('Length of Biker: '.length($str)."\n");
$str=~s/.//;
print('Length of Biker: '.length($str)."\n");
#BikerString.pl
Length of Biker: 6
Length of Biker: 5
Do you have any code example giving the strange behavior you're referring to?
Everything will go worng!
|
Reducing length is normal behaviour.
Of course I have example code, but I wanted someone to ask for it first.
A character can be multiple bytes. Usually, when you want multi-byte characters, you use the utf8 pragma. However, multi-byte characters are already possible in strings without that pragma.
#!/usr/bin/perl -l
$_ = chr(12345);
print "Length: ", length; # Length: 1
s/.//;
print "Length: ", length; # Length: 2
Although you can have a multi-byte character without using the pragma, the dot in the regex apparently still uses bytes (if this snippet had "use utf8", the string would be empty after the substitution).
Either the regex is wrong using bytes, or length is wrong using characters. Both are documented to deal with characters, but apparently a character and a character are not the same thing.
chr(12345) is 3 bytes in size, but a single character. With utf8, s/.// removes the first character, and thus all three bytes. Without utf8, s/.// removes the first _BYTE_, leaving two bytes that can't be seen as a single character, and so making a string of two characters.
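To make the byte arithmetic concrete, here is a sketch that does by hand what s/.// apparently did on 5.6. It assumes perl 5.8 or newer for utf8::encode, and the variable names are made up for the example:

```perl
use strict;
use warnings;

my $char  = chr(12345);    # one character
my $bytes = $char;
utf8::encode($bytes);      # its UTF-8 encoding, as a plain byte string

my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "characters: ", length($char), "\n";       # 1
print "bytes: ", length($bytes), " ($hex)\n";    # 3 (E3 80 B9)

# Drop only the first byte, as a byte-oriented dot would.
my $rest = substr $bytes, 1;
print "left over: ", length($rest), "\n";        # 2
```

The two leftover bytes are continuation bytes, which is why they cannot be read back as one character and end up counted as two.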
Possible solutions would be:
- perl dies on the chr(12345) if utf8 is not used
- length returns bytes instead of characters when utf8 is not used
- regexes use characters instead of bytes when utf8 is not used
I think perl's behaviour is buggy. Please correct me if I'm wrong.
BTW - Please do read perlstyle, because a little extra white space can make your code a lot easier to read.
Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq
|
I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:
Important Caveat
WARNING: The implementation of Unicode support in Perl is incomplete.
The following areas need further work.
Input and Output Disciplines
There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.
Regular Expressions
The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.
use utf8 still needed to enable a few features
The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.
However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.
"a little extra white space can make your code a lot easier to read"
Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed.
Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)
Everything will go worng!
Re: s/.// increases length - bug or badly documented feature
by mattr (Curate) on Mar 01, 2002 at 10:38 UTC
Interesting. Even if polymorphic regexes are not available yet, there is a related regex recompiler (ShiftJIS-Regex) for Shift-JIS Japanese, which is 2 bytes per character. I could certainly see that failing to shrink the string with a dot; I got some very interesting results when I commented out the use statement once last year.
Re: s/.// increases length - bug or badly documented feature
by steves (Curate) on Mar 01, 2002 at 09:56 UTC
A note almost identical to the quote above is found on page 409 of The Camel book, 3rd edition.
Re: s/.// increases length - bug or badly documented feature
by erikharrison (Deacon) on Mar 01, 2002 at 18:38 UTC
All I have is Camel 2nd, but I remember it stating that in regexes, characters are really just bytes for now (as several people have already mentioned). What I don't understand is why the length increases. I can understand not getting the desired behavior, but deleting a byte from a string should still reduce the length, correct? Or at least keep it the same?
Cheers,
Erik