PerlMonks  

Re: Re: s/.// increases length - bug or badly documented feature

by Juerd (Abbot)
on Mar 01, 2002 at 09:07 UTC ( [id://148514]=note )


in reply to Re: s/.// increases length - bug or badly documented feature
in thread s/.// increases length - bug or badly documented feature

Reducing length is normal behaviour.

Of course I have example code, but I wanted someone to ask for it first.

A character can be multiple bytes. Usually, when you want multi-byte characters, you use the utf8 pragma. However, multi-byte characters are already possible in strings without that pragma.
    #!/usr/bin/perl -l
    $_ = chr(12345);
    print "Length: ", length;   # Length: 1
    s/.//;
    print "Length: ", length;   # Length: 2


Although you can have a multi-byte character without using the pragma, the dot in the regex apparently still matches bytes (if this snippet had "use utf8", the string would be empty after the substitution). Either the regex is wrong to use bytes, or length is wrong to use characters. Both are documented to deal with characters, but apparently they do not mean the same thing by "character".

chr(12345) is 3 bytes in size, but a single character. With utf8, s/.// removes the first character, and thus all three bytes. Without utf8, s/.// removes the first _BYTE_, leaving two bytes that can't be seen as a single character, and so they make up a string of two characters.
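
The character-versus-byte arithmetic above can be checked with a minimal sketch. It assumes Perl 5.8 or later and the core Encode module, neither of which applies to the 5.6 behaviour discussed in this thread, and the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: Perl 5.8+ with the core Encode module.
my $string = chr(12345);                       # one character, U+3039
my $chars  = length $string;                   # counts characters
my $bytes  = length encode('UTF-8', $string);  # counts bytes of the UTF-8 encoding

print "Characters: $chars\n";   # Characters: 1
print "Bytes:      $bytes\n";   # Bytes:      3

# On such a perl the dot matches the whole character, so nothing is left over.
(my $rest = $string) =~ s/.//;
my $rest_chars = length $rest;
print "After s/.//: $rest_chars\n";   # After s/.//: 0
```

Only numbers are printed, so no output layer is needed for the wide character itself.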

Possible solutions would be:
  1. perl dies on the chr(12345) if utf8 is not used
  2. length returns bytes instead of characters when utf8 is not used
  3. regexes use characters instead of bytes when utf8 is not used

I think perl's behaviour is buggy. Please correct me if I'm wrong.

BTW - Please do read perlstyle, because a little extra white space can make your code a lot easier to read.

Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq

Replies are listed 'Best First'.
Re: Re: Re: s/.// increases length - bug or badly documented feature
by Biker (Priest) on Mar 01, 2002 at 09:57 UTC

    I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:

    Important Caveat

    WARNING: The implementation of Unicode support in Perl is incomplete.

    The following areas need further work.

    Input and Output Disciplines
    There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

    Regular Expressions
    The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

    use utf8 still needed to enable a few features
    The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.
    However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.


    "a little extra white space can make your code a lot easier to read"
    Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed.
    Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)


    Everything will go worng!

      There is currently no easy way to mark data read from a file or other external source as being utf8.

      So adding broken Unicode support has, in a way, rendered Perl unusable for external string input. Great! Now we have a really great and fast programming language that can handle text very well, but not if the text contains Unicode and the utf8 pragma has not been used.

      Is the moral of this story: "don't just always use strict, always use utf8 too"?

      sub byte_length {
          # depends on bugs
          no utf8;
          my ($string) = @_;
          my $counter;
          $counter++ while $string =~ s/.//s;
          return $counter;
      }

      sub has_multibytes {
          my ($string) = @_;
          return length($string) != byte_length($string);
      }
      Alternatives for these subs are welcome, of course.

      Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq
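
One possible alternative for the two subs above, as a sketch only: it assumes Perl 5.8 or later with the core Encode module, and it takes "bytes" to mean the length of the string's UTF-8 encoding rather than whatever perl stores internally. The sub names are kept from the post; the bodies are a swapped-in technique that does not rely on the regex matching bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: "byte length" means the number of bytes in the UTF-8 encoding.
sub byte_length {
    my ($string) = @_;
    return length encode('UTF-8', $string);
}

# True if at least one character needs more than one byte in UTF-8.
sub has_multibytes {
    my ($string) = @_;
    return length($string) != byte_length($string);
}

print byte_length(chr(12345)), "\n";                      # 3
print has_multibytes(chr(12345)) ? "yes" : "no", "\n";    # yes
print has_multibytes("abc")      ? "yes" : "no", "\n";    # no
```

Under this definition, characters in the range 128 to 255 also count as multi-byte, because UTF-8 encodes them in two bytes.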

Re: Re: Re: s/.// increases length - bug or badly documented feature
by mirod (Canon) on Mar 01, 2002 at 09:47 UTC

    Unicode in regexps (and in hash keys) is known not to work properly in anything but bleadperl (maybe!). It will be fully supported in 5.8.0 though.
