:punct: vs. {IsPunct} in 5.8

graff has asked for the wisdom of the Perl Monks concerning the following question:

I'm wondering if this is a bug in 5.8's handling of regex character classes, or whether it's just a case of the "perlre" man page being a bit off... (this applies to 5.8.0 and 5.8.1 equally)

Reading perlre, I would expect the following two regexes to match the same set of characters (in the ASCII range, at least), because they are said to be "equivalent":

/[[:punct:]]/
/\p{IsPunct}/

But when I tried the following little test snippet, I got a bit of a surprise:

for $x ( 0x20 .. 0x7e ) {
    $_ = chr( $x );
    $res = ( /[[:punct:]]/ ) ? "matches  :punct:" : "is not a :punct:";
    $res .= ( /\p{IsPunct}/ ) ? " matches  {IsPunct}" : " fails on {IsPunct}";
    printf( " 0x%x (%3d.) %s %s\n", $x, $x, $_, $res ) if ( $res =~ /matches/ );
}

Actually, when I look at the output, there seems to be some rhyme and reason to the discrepancies, so it looks like a "feature" (not a bug) to have the two different notions of "punctuation" (and the docs should be updated accordingly):

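[:punct:] -- the "posix" notion of punctuation -- is basically the same as [^0-9A-Za-z\x00-\x20\x7f] within the ASCII range (not [^\x00-\x20\x7f\w], as I first wrote, because \w includes the underscore, which [:punct:] does match)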
\p{IsPunct} -- the "unicode" notion of punctuation -- refers to things that users of natural languages normally refer to as "punctuation", meaning things that you see/use in text for grouping or separating words in (usually) meaningful ways. (In this sense, it also extends to things outside the ASCII range.)
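
One way to check the hand-written approximation of [:punct:] given above is to compare it against the POSIX class over the whole ASCII range. This is only a minimal sketch: the variable names are made up for illustration, and it says nothing about characters above 0x7f.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical check: compare [[:punct:]] with the explicit negated class
# for every code point in the ASCII range (0x00 .. 0x7f).
my @mismatch;
for my $x ( 0x00 .. 0x7f ) {
    my $c        = chr $x;
    my $posix    = $c =~ /[[:punct:]]/               ? 1 : 0;
    my $explicit = $c =~ /[^0-9A-Za-z\x00-\x20\x7f]/ ? 1 : 0;

    # Record any code point where the two classes disagree.
    push @mismatch, sprintf( "0x%02x", $x ) if $posix != $explicit;
}

print @mismatch
    ? "classes differ at: @mismatch\n"
    : "classes agree on 0x00 .. 0x7f\n";
```

If the two classes really are the same within ASCII, @mismatch stays empty and the second message is printed.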

I wasn't sure which perl mailing list(s) I should post this to, so I decided to check here first.

Replies are listed 'Best First'.

Re: [[:punct:]] vs. {IsPunct} in 5.8
by particle (Vicar) on Nov 02, 2003 at 14:26 UTC

for some background, perlre (5.008) states:

    The following equivalences to Unicode \p{} constructs and equivalent
    backslash character classes (if available), will hold:

        [:...:]     \p{...}         backslash

        alpha       IsAlpha
        alnum       IsAlnum
        ascii       IsASCII
        blank       IsSpace
        cntrl       IsCntrl
        digit       IsDigit        \d
        graph       IsGraph
        lower       IsLower
        print       IsPrint
        punct       IsPunct
        space       IsSpace
                    IsSpacePerl    \s
        upper       IsUpper
        word        IsWord
        xdigit      IsXDigit

    For example "[:lower:]" and "\p{IsLower}" are equivalent.

if your results match mine,

#!/usr/bin/perl
use strict;
use warnings;
$|++;

my %classes= qw/
  alpha IsAlpha
  alnum IsAlnum
  ascii IsASCII
  blank IsBlank
  cntrl IsCntrl
  digit IsDigit
  graph IsGraph
  lower IsLower
  print IsPrint
  punct IsPunct
  space IsSpace
  upper IsUpper
  word IsWord
  xdigit IsXDigit
/;

for( keys %classes )
{
  my( $r_posix, $r_unicode )= ( qr/[[:$_:]]/, qr/\p{$classes{$_}}/ );
  print "testing $r_posix and $r_unicode$/";
  for my $x (0x00..0x7e)
  {
    local $_= chr $x;
    printf "0x%x (%3d.) differ$/" => $x, $x 
      if /$r_posix/ xor /$r_unicode/;
  }
}
__END__
testing (?-xism:[[:digit:]]) and (?-xism:\p{IsDigit})
testing (?-xism:[[:upper:]]) and (?-xism:\p{IsUpper})
testing (?-xism:[[:xdigit:]]) and (?-xism:\p{IsXDigit})
testing (?-xism:[[:cntrl:]]) and (?-xism:\p{IsCntrl})
testing (?-xism:[[:alnum:]]) and (?-xism:\p{IsAlnum})
testing (?-xism:[[:space:]]) and (?-xism:\p{IsSpace})
testing (?-xism:[[:print:]]) and (?-xism:\p{IsPrint})
testing (?-xism:[[:ascii:]]) and (?-xism:\p{IsASCII})
testing (?-xism:[[:word:]]) and (?-xism:\p{IsWord})
testing (?-xism:[[:alpha:]]) and (?-xism:\p{IsAlpha})
testing (?-xism:[[:punct:]]) and (?-xism:\p{IsPunct})
0x24 ( 36.) differ
0x2b ( 43.) differ
0x3c ( 60.) differ
0x3d ( 61.) differ
0x3e ( 62.) differ
0x5e ( 94.) differ
0x60 ( 96.) differ
0x7c (124.) differ
0x7e (126.) differ
testing (?-xism:[[:lower:]]) and (?-xism:\p{IsLower})
testing (?-xism:[[:blank:]]) and (?-xism:\p{IsBlank})
testing (?-xism:[[:graph:]]) and (?-xism:\p{IsGraph})

then i'd list this as a bug, and contact p5p. it seems only [[:punct:]] and \p{IsPunct} differ. this is not expected behavior.

~Particle *accelerates*

Re: Re: [[:punct:]] vs. {IsPunct} in 5.8

by dakkar (Hermit) on Nov 02, 2003 at 20:45 UTC

It's a bug alright. A documentation bug...

I checked the Unicode properties, and these are the results:

Codepoint	Char	Class
0024	$	Currency Symbol
002B	+	Math Symbol
003C	<	Math Symbol
003D	=	Math Symbol
003E	>	Math Symbol
005E	^	Modifier Symbol
0060	`	Modifier Symbol
007C	|	Math Symbol
007E	~	Math Symbol

So those are not "punctuation" according to the Unicode standard... Time for a PunctPerl class, to keep SpacePerl company?
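
Since all nine of the characters listed above belong to a Symbol category, one possible workaround is to treat ASCII [[:punct:]] as the union of the Unicode general categories Punctuation and Symbol. The sketch below assumes the short general-category forms \p{P} and \p{S} (they are not in the perlre table quoted earlier) and only checks 0x00 .. 0x7f; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ( @symbols, @mismatch );
for my $x ( 0x00 .. 0x7f ) {
    my $c     = chr $x;
    my $posix = $c =~ /[[:punct:]]/  ? 1 : 0;
    my $union = $c =~ /[\p{P}\p{S}]/ ? 1 : 0;

    # Characters POSIX calls punctuation but Unicode does not.
    push @symbols, $c if $posix && $c !~ /\p{P}/;

    # Code points where [[:punct:]] and the P-or-S union disagree.
    push @mismatch, sprintf( "0x%02x", $x ) if $posix != $union;
}

print "POSIX punct but not \\p{P}: @symbols\n";
print @mismatch
    ? "union differs from [[:punct:]] at: @mismatch\n"
    : "[\\p{P}\\p{S}] agrees with [[:punct:]] on 0x00 .. 0x7f\n";
```

Outside the ASCII range the two notions may diverge again, so this is a check of the idea rather than a drop-in replacement.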

-- 
        dakkar - Mobilis in mobile

Most of my code is tested...

Perl is strongly typed, it just has very few types (Dan)

Re: Re: [[:punct:]] vs. {IsPunct} in 5.8

by graff (Chancellor) on Nov 02, 2003 at 17:03 UTC

I have posted the observation to both perl5-porters and perl-unicode mail lists.

Re: :punct: vs. {IsPunct} in 5.8
by liz (Monsignor) on Nov 02, 2003 at 09:51 UTC

One good place to ask this would be the perl-unicode@perl.org mailing list.

Liz
