PerlMonks  

Is this a bug in perl regex engine or in my brain?

by nikmit (Sexton)
on Oct 06, 2015 at 15:25 UTC ( [id://1143935]=perlquestion )

nikmit has asked for the wisdom of the Perl Monks concerning the following question:

I know most likely it's in my brain but still, here we go...

I am testing a regular expression and can't quite explain the results to myself with anything but a bug.

my $regex = '(2[0-4]|1?[0-9])?[0-9]';
while (<>) {
    chomp;
    if ($_ =~ /^$regex$/) {
        print "$_ matched\n";
    }
    else {
        print "$_ did not match\n";
    }
}
This (as expected) matches digits in the range 1-249.

Changing $regex to ((2[0-4]|1?[0-9])?[0-9])|25[0-5] suddenly matches anything I type as long as it begins with a digit, as if the regex was \d.*

The intended behaviour was to match integers from 1 to 255. What's wrong?

Replies are listed 'Best First'.
Re: Is this a bug in perl regex engine or in my brain?
by choroba (Cardinal) on Oct 06, 2015 at 15:39 UTC
    Works for me. You can also remove one pair of parentheses:
    #!/usr/bin/perl
    use warnings;
    use strict;

    use Test::More;

    my $regex1 = qr/(2[0-4]|1?[0-9])?[0-9]/;
    my $regex2 = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;

    for my $n (0 .. 1000) {
        if ($n < 250 || $n > 255) {
            is($n =~ /^$regex1$/, $n =~ /^$regex2$/, "match for $n");
        }
        else {
            ok($n =~ /^$regex2$/, "match 2nd regex for $n");
            isnt($n =~ /^$regex1$/, $n =~ /^$regex2$/, "match for $n");
        }
    }

    done_testing();
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Weird. I can reproduce the OP's issue:
      $ cat /tmp/x
      my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
      while (<>) {
          chomp;
          if ($_ =~ /^$regex$/) {
              print "$_ matched\n";
          }
          else {
              print "$_ did not match\n";
          }
      }
      $ perl /tmp/x
      100
      100 matched
      200
      200 matched
      300
      300 matched
      ^C
      In fact adding any "|<something>" seems to trigger it, i.e.
      my $regex = '(2[0-4]|1?[0-9])?[0-9]|a';
      gives the exact same result, and additionally matches anything starting with "a".

      Aha. Looks like switching from

      my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
      to
      my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
      seems to fix it. I don't immediately see why though.
        Looks like switching from
        my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
        to
        my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
        seems to fix it. I don't immediately see why though.

        It's a regex metacharacter/operator precedence issue.

        The regex | (alternation) operator has the lowest precedence of the regex operators. When a raw string like
            my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
        is interpolated into
            /^$regex$/
        the final regex becomes
            /^(2[0-4]|1?[0-9])?[0-9]|25[0-5]$/

        The ^ start-of-string assertion is effectively grouped and evaluated with the (2[0-4]|1?[0-9])?[0-9] expression and disconnected by the alternation from the 25[0-5]$ expression. IOW, the regex will match any string with at least one [0-9] at the start (everything else in that branch is optional) or with a 25[0-5] at the end, and nothing else in the string matters!

        c:\@Work\Perl\monks>perl -wMstrict -le "my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]'; while (<>) { chomp; if ($_ =~ /^$regex$/) { print qq{'$_' matched}; } else { print qq{'$_' did not match}; } } "
        100
        '100' matched
        z100
        'z100' did not match
        z255
        'z255' matched
        z250
        'z250' matched
        100z
        '100z' matched
        99
        '99' matched
        9999999
        '9999999' matched
        99Yikes!99
        '99Yikes!99' matched
        1
        '1' matched
        11
        '11' matched
        111
        '111' matched
        22
        '22' matched
        222
        '222' matched
        33
        '33' matched
        333
        '333' matched

        In contrast, choroba used the qr// operator to define $regex as a Regexp object. (Update: See qr// in Regexp Quote-Like Operators in perlop.) This is not the same as a raw string! Among other things, the qr// operator adds a non-capturing (?:pat) group around the whole expression that, in this application, effectively preserves the desired association between start- and end-of-string assertions after interpolation:
            my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
        becomes
            (?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])
        and is interpolated into
            /^$regex$/
        as
            /^(?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])$/
        which can be read as "start-of-string, then one of a set of alternatives covering the range 0-255, then end-of-string" and which gives the desired number range discrimination.

        c:\@Work\Perl\monks>perl -wMstrict -le "my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/; while (<>) { chomp; if ($_ =~ /^$regex$/) { print qq{'$_' matched}; } else { print qq{'$_' did not match}; } } "
        0
        '0' matched
        1
        '1' matched
        100
        '100' matched
        1000
        '1000' did not match
        25
        '25' matched
        255
        '255' matched
        256
        '256' did not match
        a1
        'a1' did not match
        1a
        '1a' did not match
        11
        '11' matched
        111
        '111' matched
        222
        '222' matched
        333
        '333' did not match

        Bottom line: Wherever possible, prefer qr// to raw strings for regexes.
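
        If the pattern has to stay a plain string (read from a config file, say), an explicit non-capturing group at the point of interpolation should give the same protection. A minimal sketch, not taken from any of the posts; the helper name check and the sample inputs are made up for illustration:

        ```perl
        use strict;
        use warnings;

        my $raw = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';    # plain string, as in the OP
        my $obj = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;  # compiled Regexp object

        # Hypothetical helper: report how each way of anchoring treats one input.
        sub check {
            my ($s) = @_;
            return (
                bare     => ($s =~ /^$raw$/     ? 1 : 0),  # ^ and $ end up in separate alternatives
                grouped  => ($s =~ /^(?:$raw)$/ ? 1 : 0),  # anchors apply to the whole alternation
                compiled => ($s =~ /^$obj$/     ? 1 : 0),  # qr// brings its own group
            );
        }

        for my $s (qw(255 300 z255 99Yikes!99)) {
            my %r = check($s);
            print "'$s': bare=$r{bare} grouped=$r{grouped} compiled=$r{compiled}\n";
        }
        ```

        The grouped and compiled columns should agree for every input; only the bare interpolation lets 300, z255 and 99Yikes!99 through.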

        Please see perlre, perlretut, and perlrequick.

        Update: Incidentally, the regex qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/ does not match the strings 000 001 012 etc. (Update: The regex does match 00 01 02 etc.) If this is an issue, I suggest
            qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms
        instead, but whatever you use, verify it with something like Test::More as choroba did!
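
        One way such a check might look, sketched with Test::More over 0 .. 1000. The \A and \z anchors and the test labels are choices made for this sketch, not something taken from the posts above:

        ```perl
        use strict;
        use warnings;
        use Test::More;

        my $octet = qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms;

        # Every integer up to 255 should match; everything above should not.
        for my $n (0 .. 1000) {
            my $matched = ($n =~ /\A$octet\z/) ? 1 : 0;
            my $wanted  = ($n <= 255)          ? 1 : 0;
            is($matched, $wanted, "octet check for $n");
        }

        # This variant accepts leading zeros, up to three digits in total.
        ok('007'  =~ /\A$octet\z/, '007 matches');
        ok('0007' !~ /\A$octet\z/, '0007 does not match');

        done_testing();
        ```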


        Give a man a fish:  <%-{-{-{-<

Re: Is this a bug in perl regex engine or in my brain?
by Athanasius (Archbishop) on Oct 06, 2015 at 16:05 UTC

    Hello nikmit,

    This (as expected) matches digits in the range 1-249.

    But it also matches 00, 01, 02, etc. This is easily fixed by removing the first ? quantifier:

    my $regex = '(2[0-4]|1[0-9])?[0-9]';

    But I would rather use qr here, together with the /x modifier to make it easier to read:

    my $regex = qr{ ( 2[0-4] | 1[0-9] )? [0-9] }x;

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Your regex no longer matches 10-99. You'd need something like
      my $regex = '(2[0-4]|1[0-9]|[1-9])?[0-9]';
      to fix that.
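
      A quick way to see what each variant covers is to list the integers in 0 .. 249 that it fails to match. A rough sketch; the sub name missed and the variant labels are invented here:

      ```perl
      use strict;
      use warnings;

      my %variant = (
          original  => '(2[0-4]|1?[0-9])?[0-9]',
          no_quant  => '(2[0-4]|1[0-9])?[0-9]',
          with_1to9 => '(2[0-4]|1[0-9]|[1-9])?[0-9]',
      );

      # Hypothetical helper: integers in 0 .. 249 that the pattern does not match.
      sub missed {
          my ($pattern) = @_;
          return grep { !/\A(?:$pattern)\z/ } 0 .. 249;
      }

      for my $name (sort keys %variant) {
          my $pattern = $variant{$name};
          my @gaps    = missed($pattern);
          # Only the first and last missed values are shown; the sketch does not
          # check whether the gap is contiguous.
          my $note = @gaps ? "misses $gaps[0] .. $gaps[-1]" : 'covers 0 .. 249';
          my $zero = '01' =~ /\A(?:$pattern)\z/ ? 'accepts' : 'rejects';
          print "$name: $note, $zero '01'\n";
      }
      ```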
Re: Is this a bug in perl regex engine or in my brain?
by graff (Chancellor) on Oct 07, 2015 at 02:16 UTC
    I'd be inclined to do it like this:
    while (<>) {
        chomp;
        my $result = ( /^([1-9]\d*)$/ and $1 < 256 ) ? "matched" : "did not match";
        $_ .= " $result\n";
        print;
    }
    That rejects anything with a leading zero (hence also "0" itself), and anything greater than 255.
      I like this approach very much. It saves us from the "I had a problem, I used a regex, now I have two problems" situation.
      If you (the OP) want a review of most of the implications of parsing an IP address (numbers from 0 to 255 are the octets of an IP address in dotted decimal form), you can read the relevant section of "Mastering Regular Expressions", where the proposed result is:
      [01]?\d\d?|2[0-4]\d|25[0-5]
      On the same subject, you may find an old thread interesting: Don't Use Regular Expressions To Parse IP Addresses! There are also some examples on another site.

      Regexp::Common::net is used for exactly this purpose, and if you feel brave you can dive into its source code to see how it matches an octet.

      L*
      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
