:punct: vs. {IsPunct} in 5.8

by graff (Chancellor)
on Nov 01, 2003 at 18:12 UTC

graff has asked for the wisdom of the Perl Monks concerning the following question:

I'm wondering if this is a bug in 5.8's handling of regex character classes, or whether it's just a case of the "perlre" man page being a bit off... (this applies to 5.8.0 and 5.8.1 equally)

Reading perlre, I would expect the following two regexes to match the same set of characters (in the ASCII range, at least), because they are said to be "equivalent":

    /[[:punct:]]/
    /\p{IsPunct}/

But when I tried the following little test snippet, I got a bit of a surprise:

    for $x ( 0x20 .. 0x7e ) {
        $_ = chr( $x );
        $res = ( /[[:punct:]]/ ) ? "matches :punct:" : "is not a :punct:";
        $res .= ( /\p{IsPunct}/ ) ? " matches {IsPunct}" : " fails on {IsPunct}";
        printf( " 0x%x (%3d.) %s %s\n", $x, $x, $_, $res ) if ( $res =~ /matches/ );
    }

Actually, when I look at the output, there seems to be some rhyme and reason to the discrepancies, so it looks like a "feature" (not a bug) to have the two different notions of "punctuation" (and the docs should be updated accordingly):
  • [:punct:] -- the "posix" notion of punctuation -- is basically the same as [^0-9A-Za-z\x00-\x20\x7f] in the ASCII range (not [^\x00-\x20\x7f\w], because \w also covers the underscore, which [:punct:] does match)
  • \p{IsPunct} -- the "unicode" notion of punctuation -- refers to things that users of natural languages normally refer to as "punctuation", meaning things that you see/use in text for grouping or separating words in (usually) meaningful ways. (In this sense, it also extends to things outside the ASCII range.)
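
The negated classes offered above as stand-ins for [:punct:] can be checked mechanically. The following is only a minimal sketch, limited to 7-bit code points; the helper name `mismatches` is made up for this illustration and is not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare [[:punct:]] against a hand-written class for every 7-bit
# code point and return the code points where the two disagree.
# (mismatches() is a hypothetical helper, named for this sketch only.)
sub mismatches {
    my ($candidate) = @_;
    return grep {
        my $c = chr $_;
        ( $c =~ /[[:punct:]]/ ) xor ( $c =~ $candidate );
    } 0x00 .. 0x7f;
}

my @explicit  = mismatches( qr/[^0-9A-Za-z\x00-\x20\x7f]/ );
my @with_word = mismatches( qr/[^\x00-\x20\x7f\w]/ );

printf "explicit class:  %d mismatches\n", scalar @explicit;
printf "\\w-based class: mismatches at %s\n",
    join ', ', map { sprintf '0x%02x', $_ } @with_word;
```

The explicit class should agree with [[:punct:]] everywhere in this range, while the \w-based one should disagree only at the underscore (0x5f).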
I wasn't sure which perl mailing list(s) I should post this to, so I decided to check here first.

Replies are listed 'Best First'.
Re: [[:punct:]] vs. {IsPunct} in 5.8
by particle (Vicar) on Nov 02, 2003 at 14:26 UTC

    for some background, perlre (5.008) states:

    The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold:

        [:...:]     \p{...}         backslash

        alpha       IsAlpha
        alnum       IsAlnum
        ascii       IsASCII
        blank       IsSpace
        cntrl       IsCntrl
        digit       IsDigit         \d
        graph       IsGraph
        lower       IsLower
        print       IsPrint
        punct       IsPunct
        space       IsSpace
                    IsSpacePerl     \s
        upper       IsUpper
        word        IsWord
        xdigit      IsXDigit

    For example "[:lower:]" and "\p{IsLower}" are equivalent.

    if your results match mine,

        #!/usr/bin/perl
        use strict;
        use warnings;
        $|++;

        my %classes= qw/
            alpha  IsAlpha
            alnum  IsAlnum
            ascii  IsASCII
            blank  IsBlank
            cntrl  IsCntrl
            digit  IsDigit
            graph  IsGraph
            lower  IsLower
            print  IsPrint
            punct  IsPunct
            space  IsSpace
            upper  IsUpper
            word   IsWord
            xdigit IsXDigit
        /;

        for( keys %classes )
        {
            my( $r_posix, $r_unicode )= ( qr/[[:$_:]]/, qr/\p{$classes{$_}}/ );
            print "testing $r_posix and $r_unicode$/";
            for my $x (0x00..0x7e)
            {
                local $_= chr $x;
                printf "0x%x (%3d.) differ$/" => $x, $x
                    if /$r_posix/ xor /$r_unicode/;
            }
        }

        __END__
        testing (?-xism:[[:digit:]]) and (?-xism:\p{IsDigit})
        testing (?-xism:[[:upper:]]) and (?-xism:\p{IsUpper})
        testing (?-xism:[[:xdigit:]]) and (?-xism:\p{IsXDigit})
        testing (?-xism:[[:cntrl:]]) and (?-xism:\p{IsCntrl})
        testing (?-xism:[[:alnum:]]) and (?-xism:\p{IsAlnum})
        testing (?-xism:[[:space:]]) and (?-xism:\p{IsSpace})
        testing (?-xism:[[:print:]]) and (?-xism:\p{IsPrint})
        testing (?-xism:[[:ascii:]]) and (?-xism:\p{IsASCII})
        testing (?-xism:[[:word:]]) and (?-xism:\p{IsWord})
        testing (?-xism:[[:alpha:]]) and (?-xism:\p{IsAlpha})
        testing (?-xism:[[:punct:]]) and (?-xism:\p{IsPunct})
        0x24 ( 36.) differ
        0x2b ( 43.) differ
        0x3c ( 60.) differ
        0x3d ( 61.) differ
        0x3e ( 62.) differ
        0x5e ( 94.) differ
        0x60 ( 96.) differ
        0x7c (124.) differ
        0x7e (126.) differ
        testing (?-xism:[[:lower:]]) and (?-xism:\p{IsLower})
        testing (?-xism:[[:blank:]]) and (?-xism:\p{IsBlank})
        testing (?-xism:[[:graph:]]) and (?-xism:\p{IsGraph})

    then i'd list this as a bug, and contact p5p. it seems only [[:punct:]] and \p{IsPunct} differ. this is not expected behavior.
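
One way to characterise the gap, offered here only as a sketch restricted to 7-bit code points: the code points that [[:punct:]] matches and \p{IsPunct} does not all fall under Unicode's Symbol general category, so a class combining \p{P} and \p{S} should line up with [[:punct:]] in the ASCII range. The helper `ascii_matches` is a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the 7-bit code points matched by a given pattern.
# (ascii_matches() is a hypothetical helper for this sketch.)
sub ascii_matches {
    my ($re) = @_;
    return grep { chr($_) =~ $re } 0x00 .. 0x7f;
}

my @posix   = ascii_matches( qr/[[:punct:]]/ );
my @unicode = ascii_matches( qr/\p{IsPunct}/ );
my @widened = ascii_matches( qr/[\p{P}\p{S}]/ );   # Punctuation plus Symbol

printf "[[:punct:]]  matches %d ASCII characters\n", scalar @posix;
printf "\\p{IsPunct}  matches %d ASCII characters\n", scalar @unicode;
printf "[\\p{P}\\p{S}] %s [[:punct:]] in ASCII\n",
    "@widened" eq "@posix" ? 'agrees with' : 'differs from';
```

Outside the ASCII range the two notions need not coincide, so this says nothing about non-ASCII text.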

    ~Particle *accelerates*

      It's a bug alright. A documentation bug...

      I checked the Unicode properties, and these are the results:

      Codepoint  Char  Class
      0024       $     Currency Symbol
      002B       +     Math Symbol
      003C       <     Math Symbol
      003D       =     Math Symbol
      003E       >     Math Symbol
      005E       ^     Modifier Symbol
      0060       `     Modifier Symbol
      007C       |     Math Symbol
      007E       ~     Math Symbol

      So those are not "punctuation" according to the Unicode standard... Time for a PunctPerl class, to keep SpacePerl company?
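
The classification above can be reproduced from within Perl using the general-category properties \p{Sc}, \p{Sm} and \p{Sk}. This is a rough sketch rather than the method used in the post; `@categories` and `category_of` are names made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# General-category properties to probe, in the order they are tried.
# (@categories and category_of() are hypothetical names for this sketch.)
my @categories = (
    [ 'Punctuation'     => qr/\p{P}/  ],
    [ 'Currency Symbol' => qr/\p{Sc}/ ],
    [ 'Math Symbol'     => qr/\p{Sm}/ ],
    [ 'Modifier Symbol' => qr/\p{Sk}/ ],
);

# Return the name of the first listed category that matches the character.
sub category_of {
    my ($char) = @_;
    for my $entry (@categories) {
        my ( $name, $re ) = @$entry;
        return $name if $char =~ $re;
    }
    return 'other';
}

# The nine characters on which [[:punct:]] and \p{IsPunct} disagree.
for my $char ( split //, '$+<=>^`|~' ) {
    printf "%04X  %s  %s\n", ord $char, $char, category_of($char);
}
```

None of the nine characters should come back as 'Punctuation', which is consistent with \p{IsPunct} rejecting them.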

      -- 
              dakkar - Mobilis in mobile
      

      Most of my code is tested...

      Perl is strongly typed, it just has very few types (Dan)

      Thanks for such a nicely crafted verification. (I wanted to check the other POSIX vs. Unicode classes as well, so you saved me some trouble -- and showed a neat approach!)

      I have posted the observation to both the perl5-porters and perl-unicode mailing lists.

Re: :punct: vs. {IsPunct} in 5.8
by liz (Monsignor) on Nov 02, 2003 at 09:51 UTC
    I think your assessment is correct, but I don't have that much experience with Unicode regexes.

    One good place to ask this would be the perl-unicode@perl.org mailing list.

    Liz

Node Type: perlquestion [id://303819]
Approved by Corion
Front-paged by Courage