Is this a bug in perl regex engine or in my brain?



by nikmit (Sexton)

on Oct 06, 2015 at 15:25 UTC


nikmit has asked for the wisdom of the Perl Monks concerning the following question:

I know most likely it's in my brain but still, here we go...

I am testing a regular expression and can't quite explain the results to myself with anything but a bug.

my $regex = '(2[0-4]|1?[0-9])?[0-9]';
while (<>) {
    chomp;
    if ($_ =~ /^$regex$/) {
        print "$_ matched\n";
    } else {
        print "$_ did not match\n";
    }
}

This (as expected) matches digits in the range 1-249.

Changing $regex to ((2[0-4]|1?[0-9])?[0-9])|25[0-5] suddenly matches anything I type as long as it begins with a digit, as if the regex was \d.*

The intended behaviour was to match integers from 1 to 255. What's wrong?


Re: Is this a bug in perl regex engine or in my brain?
by choroba (Cardinal) on Oct 06, 2015 at 15:39 UTC

#!/usr/bin/perl
use warnings;
use strict;

use Test::More;

my $regex1 = qr/(2[0-4]|1?[0-9])?[0-9]/;
my $regex2 = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
for my $n (0 .. 1000) {
    if ($n < 250 || $n > 255) {
        is($n =~ /^$regex1$/, $n =~ /^$regex2$/, "match for $n");
    } else {
        ok($n =~ /^$regex2$/, "match 2nd regex for $n");
        isnt($n =~ /^$regex1$/, $n =~ /^$regex2$/, "match for $n");
    }
}
done_testing();

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: Is this a bug in perl regex engine or in my brain?

by Crackers2 (Parson) on Oct 06, 2015 at 17:58 UTC

$ cat /tmp/x
my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
while (<>) {
    chomp;
    if ($_ =~ /^$regex$/) {
        print "$_ matched\n";
    } else {
        print "$_ did not match\n";
    }
}
$ perl /tmp/x
100
100 matched
200
200 matched
300
300 matched
^C

my $regex = '(2[0-4]|1?[0-9])?[0-9]|a';

Aha. Looks like switching from

my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';

to

my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;

seems to fix it. I don't immediately see why though.


Re^3: Is this a bug in perl regex engine or in my brain?

by AnomalousMonk (Archbishop) on Oct 06, 2015 at 20:48 UTC

Looks like switching from
my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
to
my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
seems to fix it. I don't immediately see why though.

It's a regex metacharacter/operator precedence issue.

The regex | (alternation) operator has a low (the lowest?) precedence among regex operators. When a raw string like
my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
is interpolated into
/^$regex$/
the final regex becomes
/^(2[0-4]|1?[0-9])?[0-9]|25[0-5]$/

The ^ start-of-string assertion is effectively grouped and evaluated with the (2[0-4]|1?[0-9])?[0-9] expression and disconnected by the alternation from the 25[0-5]$ expression. IOW, the regex will match any string with a [0-9] at the minimum (everything else is optional) at the start or with a 25[0-5] at the end, and nothing else in the string matters!

c:\@Work\Perl\monks>perl -wMstrict -le
"my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
while (<>) {
    chomp;
    if ($_ =~ /^$regex$/) {
        print qq{'$_' matched};
    } else {
        print qq{'$_' did not match};
    }
}
"
100
'100' matched
z100
'z100' did not match
z255
'z255' matched
z250
'z250' matched
100z
'100z' matched
99
'99' matched
9999999
'9999999' matched
99Yikes!99
'99Yikes!99' matched
1
'1' matched
11
'11' matched
111
'111' matched
22
'22' matched
222
'222' matched
33
'33' matched
333
'333' matched
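
To see which side of the alternation is responsible for each of the matches above, the two top-level alternatives can be tried on their own. This is only a rough sketch: `which_branch` is a hypothetical helper written for illustration, and the two patterns inside it are simply the halves of the interpolated regex split at the top-level `|`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): try the two top-level
# alternatives of /^(2[0-4]|1?[0-9])?[0-9]|25[0-5]$/ separately and
# report which of them matched.
sub which_branch {
    my ($str) = @_;
    my @hits;
    push @hits, 'left'  if $str =~ /^(2[0-4]|1?[0-9])?[0-9]/;   # anchored at start only
    push @hits, 'right' if $str =~ /25[0-5]$/;                  # anchored at end only
    return @hits ? join('+', @hits) : 'none';
}

for my $str ('100', 'z100', 'z255', '100z', '99Yikes!99') {
    printf "%-14s %s\n", "'$str'", which_branch($str);
}
```

A string is accepted by the combined regex as soon as either side reports a hit, which is why `100z` and `z255` both get through.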

In contrast, choroba used a qr// operator to define the $regex object (in fact, a Regexp object). (Update: See qr// in Regexp Quote-Like Operators in perlop.) This is not the same as a raw string! Among other things, the qr// operator adds a non-capturing (?:pat) group around the whole expression that, in this application, effectively preserves the desired association between start- and end-of-string assertions after interpolation:
my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
becomes
(?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])
and is interpolated into
/^$regex$/
as
/^(?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])$/
which can be read as "start-of-string, then one of a set of alternations in the range 0-255, then end-of-string" and which gives the desired number range discrimination.

c:\@Work\Perl\monks>perl -wMstrict -le
"my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
  while (<>) {
      chomp;
      if ($_ =~ /^$regex$/) {
          print qq{'$_' matched};
      } else {
          print qq{'$_' did not match};
      }
  }
"
0
'0' matched
1
'1' matched
100
'100' matched
1000
'1000' did not match
25
'25' matched
255
'255' matched
256
'256' did not match
a1
'a1' did not match
1a
'1a' did not match
11
'11' matched
111
'111' matched
222
'222' matched
333
'333' did not match

Bottom line: Wherever possible, prefer qr// to raw strings for regex expressions.
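
Where a raw string cannot be avoided (for example, if the pattern arrives from a config file, which is an assumption made here purely for illustration), much the same protection can be had by adding the non-capturing group by hand at the point of interpolation. A minimal sketch, with `matches_whole` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Raw string, as in the original question.
my $raw = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';

# Hypothetical helper: wrap the interpolated pattern in (?:...) so that
# ^ and $ apply to the whole alternation rather than to one side each.
sub matches_whole {
    my ($pattern, $str) = @_;
    return $str =~ /^(?:$pattern)$/ ? 1 : 0;
}

for my $str (qw(0 100 255 256 300 100z z255)) {
    print "'$str' ", ( matches_whole($raw, $str) ? 'matched' : 'did not match' ), "\n";
}
```

This only restores the grouping; qr// is still the tidier choice when the pattern is written in the source.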

Please see perlre, perlretut, and perlrequick.

Update: Incidentally, the regex qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/ does not match the strings 000 001 012 etc. (Update: The regex does match 00 01 02 etc.) If this is an issue, I suggest
qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms
instead, but whatever you use, verify it with something like Test::More as choroba did!
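
Such a check might look roughly like the following. It is a sketch under stated assumptions: `is_octet` is a hypothetical helper, the choice of negative test cases is arbitrary, and `\A`/`\z` are used instead of `^`/`$` so that a trailing newline is not silently accepted.

```perl
use strict;
use warnings;
use Test::More;

# The pattern suggested above, allowing leading zeros (000 .. 255).
my $octet = qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms;

# Hypothetical helper: true only if the whole string is one octet.
sub is_octet { return $_[0] =~ /\A$octet\z/ ? 1 : 0 }

# Every value 0..255, zero-padded to widths 1, 2 and 3, should be accepted.
for my $n (0 .. 255) {
    for my $width (1 .. 3) {
        my $s = sprintf "%0${width}d", $n;
        ok( is_octet($s), "'$s' accepted" );
    }
}

# Out-of-range numbers and a few malformed strings should be rejected.
ok( !is_octet($_), "'$_' rejected" ) for 256 .. 999, qw(0000 1a a1 -1), '';

done_testing();
```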

Give a man a fish: <%-{-{-{-<


Re^4: Is this a bug in perl regex engine or in my brain?

by nikmit (Sexton) on Oct 07, 2015 at 09:27 UTC

Re: Is this a bug in perl regex engine or in my brain?
by Athanasius (Archbishop) on Oct 06, 2015 at 16:05 UTC

Hello nikmit,

This (as expected) matches digits in the range 1-249.

But it also matches 00, 01, 02, etc. This is easily fixed by removing the first ? quantifier:

my $regex = '(2[0-4]|1[0-9])?[0-9]';

But I would rather use qr here, together with the /x modifier to make it easier to read:

my $regex = qr{ ( 2[0-4] | 1[0-9] )? [0-9] }x;

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Is this a bug in perl regex engine or in my brain?

by Crackers2 (Parson) on Oct 06, 2015 at 17:49 UTC

my $regex = '(2[0-4]|1[0-9]|[1-9])?[0-9]';

Re^3: Is this a bug in perl regex engine or in my brain?

by Athanasius (Archbishop) on Oct 07, 2015 at 00:23 UTC

D’oh! You’re right. Good catch! (and good fix, too).

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
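
The difference between the first attempt and the corrected pattern in this sub-thread can be checked exhaustively over 0 .. 249. A small sketch, where `missed` is a hypothetical helper that lists the integers a pattern fails to match:

```perl
use strict;
use warnings;

my $first = qr/(2[0-4]|1[0-9])?[0-9]/;          # first attempt, with the ? removed
my $fixed = qr/(2[0-4]|1[0-9]|[1-9])?[0-9]/;    # corrected pattern with [1-9] added

# Hypothetical helper: integers in 0 .. 249 that the pattern does not match.
# In scalar context it returns the count instead of the list.
sub missed {
    my ($re) = @_;
    return grep { $_ !~ /^$re$/ } 0 .. 249;
}

my @gaps = missed($first);
print "first attempt misses ", scalar(@gaps), " values, e.g. @gaps[0 .. 4]\n";
print "fixed pattern misses ", scalar( missed($fixed) ), " values\n";
```

The first attempt has no way to match a two-digit number, because its optional group is always two characters long; the extra `[1-9]` alternative closes that gap without letting 00, 01 and so on back in.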

Re: Is this a bug in perl regex engine or in my brain?
by graff (Chancellor) on Oct 07, 2015 at 02:16 UTC

while (<>) {
    chomp;
    my $result = ( /^([1-9]\d*)$/ and $1 < 256 ) ? "matched" : "did not match";
    $_ .= " $result\n";
    print;
}

Re^2: Is this a bug in perl regex engine or in my brain? (matching from 0 to 255)

by Discipulus (Canon) on Oct 07, 2015 at 07:28 UTC


[01]?\d\d?|2[0-4]\d|25[0-5]

Don't Use Regular Expressions To Parse IP Addresses!

Regexp::Common::net

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
