http://qs321.pair.com?node_id=282640

I was testing some regular expressions when I came across some amusing behavior of a regex compiled with the qr// operator.

Synopsis

my $rx = 'abc';
my $qr = qr/$rx/;
if ('ABC' =~ /$qr/i) {
    print "ABC matches /abc/i\n"
}
else {
    print "ABC does not match /abc/i\n"
}

If I use the /i modifier, the regex is supposed to match in case-insensitive mode, i.e. "ABC" =~ /abc/i returns a match. However, if I compile the pattern with qr// first, the result is different.

This is intriguing, and I have eventually found out why it happens, but before telling you, I would like to show some more examples and let you meditate on what may be happening behind the scenes.

More food for thought

This script shows some variations on the same tune. First a pattern that is applied in case insensitive mode won't match when we would expect it to. Then a pattern in dot-matches-all mode does not match a newline character.

However, when I use a literal pattern instead of a pre-compiled one, it matches.

#!/usr/bin/perl -w
use strict;

my @patterns = ('abc', 'xyz');
my %regexes = map { $_, qr/$_/} @patterns;
my @strings = ('the alphabet starts with ABC',
               'the alphabet ends with XYZ' );

# Pre-compiled patterns, applied with /i at the call site
for my $str(@strings) {
    for ( keys %regexes ) {
        print qq("$str" =~ /$_/i => );
        if ($str =~ /$regexes{$_}/i) {
            print "(qr) match\n";
        }
        else {
            print "(qr) no match\n"
        }
    }
}

# Plain string patterns, applied with /i at the call site
for my $str(@strings) {
    for ( @patterns ) {
        print qq("$str" =~ /$_/i => );
        if ($str =~ /$_/i) {
            print "(pattern) match\n";
        }
        else {
            print "(pattern) no match\n"
        }
    }
}

my $string = <<END;
This text
spawns across
multiple
lines
END

my $pattern = 'multiple .+ lines';
my $regex = qr/$pattern/x;

if ($string =~ /$regex/s) {
    print "dot-matches-all (qr) matches\n";
}
else {
    print "dot-matches-all (qr) does not match\n";
}

if ($string =~ /$pattern/xs) {
    print "dot-matches-all (literal) matches\n";
}
else {
    print "dot-matches-all (literal) does not match\n";
}

__END__
"the alphabet starts with ABC" =~ /abc/i => (qr) no match
"the alphabet starts with ABC" =~ /xyz/i => (qr) no match
"the alphabet ends with XYZ" =~ /abc/i => (qr) no match
"the alphabet ends with XYZ" =~ /xyz/i => (qr) no match
"the alphabet starts with ABC" =~ /abc/i => (pattern) match
"the alphabet starts with ABC" =~ /xyz/i => (pattern) no match
"the alphabet ends with XYZ" =~ /abc/i => (pattern) no match
"the alphabet ends with XYZ" =~ /xyz/i => (pattern) match
dot-matches-all (qr) does not match
dot-matches-all (literal) matches

It is puzzling, isn't it?

OK. Enough suspense. Let's solve the mystery.

Why?

The reason for this behavior is that qr// compiles the pattern together with the modifiers we specify at its end. For example, qr/perl/i will happily match "Perl", "perl", and "PERL". The interesting thing that silently happens, though, is that qr// also fixes the state of the /x, /m and /s modifiers. The ones we mention explicitly are switched on; the ones we omit are explicitly switched off. Let's ask Perl itself to unveil the truth.

$ perl -e 'for (qw( i x s m )) {print eval "qr/perl/$_", "\n"}'
(?i-xsm:perl)
(?x-ism:perl)
(?s-xim:perl)
(?m-xis:perl)

As you can see, each pattern is compiled as if we had inserted a (?y-z:...) block inside a regular expression. For those who don't recall it, such a block allows the insertion of a sub-expression with modifiers that apply only within the block's boundaries. Thus, we can insert a case-sensitive sub-expression within a case-insensitive regex. Each modifier following the question mark is set. The ones preceded by a minus sign are unset.
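A minimal sketch of such a block in isolation. The pattern and the test strings are made up for illustration and are not from the examples above. The outer /i applies to everything except the (?-i:...) group, which stays case sensitive.

```perl
use strict;
use warnings;

# Hypothetical pattern: "hello" may be in any case, "World" must match exactly.
my $re = qr/hello (?-i:World)/i;

for my $str ('HELLO World', 'hello WORLD') {
    my $result = $str =~ $re ? 'match' : 'no match';
    print "'$str' => $result\n";
}
```

The first string should match and the second should not, because only the part outside the group honors /i.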

Looking at the output of the last example, we can see that for each modifier we set explicitly, qr// implicitly unsets the others.

Coming back to our main example, the values in %regexes are (?-xism:abc) and (?-xism:xyz). Keeping in mind the above explanation of sub-expressions, it is clear why these pre-compiled patterns can't match: the /i on the enclosing regex does not apply inside a block that explicitly unsets it. The same is true for the "dot-matches-all" modifier. A pattern compiled with qr//x ends up as (?x-ism:pattern), and even though it is later embedded in a regex with the /s modifier, /s cannot take effect inside the block.
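Given that explanation, two ways around the problem suggest themselves. This is only a sketch (the variable names are mine): either put the modifier on qr// itself, or keep the pattern as a plain string and let the enclosing regex supply the modifiers.

```perl
use strict;
use warnings;

my $str = 'the alphabet starts with ABC';

# Option 1: put the modifier on qr// itself, so it is compiled into the pattern.
my $qr_i = qr/abc/i;
print "qr//i:       ", ($str =~ /$qr_i/ ? "match" : "no match"), "\n";

# Option 2: keep the pattern as a plain string; the enclosing regex
# applies its own modifiers when the string is interpolated.
my $pattern = 'abc';
print "string + /i: ", ($str =~ /$pattern/i ? "match" : "no match"), "\n";

# For comparison: the outer /i cannot reach inside a qr// compiled without it.
my $qr_plain = qr/abc/;
print "qr// + /i:   ", ($str =~ /$qr_plain/i ? "match" : "no match"), "\n";
```

Which option fits depends on whether the modifiers belong to the pattern (option 1) or to the place where it is used (option 2).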

Further reading

perlre and perlop are vague about this issue. The only place I've found it mentioned and explained in plain English is Mastering Regular Expressions, 2nd Ed.

Update
Changed title upon Aristotle's suggestion. (Was qr// hidden risks)

 _  _ _  _  
(_|| | |(_|><
 _|   

Replies are listed 'Best First'.
Re: qr// hidden risks
by Aristotle (Chancellor) on Aug 10, 2003 at 16:58 UTC
    I have to agree with perrin, and note that it is documented behaviour. I did have to be bitten once by it to actually pay attention as I RTFM, though, so it can't hurt to point this out more prominently. The node should maybe have been called "Risks in the oblivious use of qr//", though. :)

    Makeshifts last the longest.

Re: qr// hidden risks
by perrin (Chancellor) on Aug 10, 2003 at 16:22 UTC
    I don't mean to burst your bubble, but isn't this sort of obvious? The modifiers are an essential part of any regex. The qr// operator would be pretty lame if it didn't honor the modifiers that you compiled it with.
      The qr// operator would be pretty lame if it didn't honor the modifiers that you compiled it with.

      Yes, but that's a fair bit different than forcibly disabling the modifiers that you omitted...

      What's surprising is not that qr/perl/i produces (?i:perl), but that it produces (?i-xsm:perl), with the other modifiers explicitly disabled.

      That makes plenty of sense, but I'd never seen it explicitly documented before.

      but isn't this sort of obvious?

      Well, this is the whole point. It may not be obvious to everyone. What is not clearly explained is that qr// will unset the modifiers that we haven't explicitly set.

      I didn't mean to rewrite the Bible :), but just to warn against something that may not be immediately perceived. The problem only exists when you embed a pre-compiled regex within a larger expression.

       _  _ _  _  
      (_|| | |(_|><
       _|   
      

        Well, people should read documentation. If you see "qr" and don't understand what it does, the first thing you should do is read the docs -- don't just assume you know how it works.

        From perldoc -f qr

               qr/STRING/
        
               qx/STRING/
        
               qw/STRING/
                       Generalized quotes.  See the Regexp Quote-Like
                       Operators entry in the perlop manpage.
        
        

        From perldoc perlop

               qr/STRING/imosx
                       This operators quotes--and compiles--its STRING as
                       a regular expression.  STRING is interpolated the
                       same way as PATTERN in `m/PATTERN/'.  If "'" is
                       used as the delimiter, no interpolation is done.
                       Returns a Perl value which may be used instead of
                       the corresponding `/STRING/imosx' expression.
        
        

Re: Risks in the oblivious use of qr//
by chunlou (Curate) on Aug 10, 2003 at 18:40 UTC

    This is a logic error, and no logic error is "obvious" a priori when one debugs any code, except maybe the ten-line ones. Neither one's instinct nor a debugger will always readily point to qr// as one of the potential sources of error.

Re: Risks in the oblivious use of qr// (warn)
by tye (Sage) on Aug 11, 2003 at 17:52 UTC

    I think the best solution would be to warn when useless modifiers are specified. (And, having formed this opinion, it is starting to sound familiar and I think I formed this same opinion the last time this came up.)

    That is, /(?-i:hello)/i under -w should complain about something like "regex qualifier i applies to nothing and is ignored" (I hope someone can wordsmith that a bit) while /(?m-x:hi(?-m:lo))/x would complain twice. While /(?i:hi)/i would not complain.

    Note that whether the ignoring of the flag has a practical impact is not important -- the warning is that conflicting flags caused flags to be ignored, not whether the flags had or would have had any effect on the regex parts they (would have) applied to.

                    - tye
Re: Risks in the oblivious use of qr//
by TomDLux (Vicar) on Aug 11, 2003 at 01:15 UTC

    I would expect the call site to be able to override the regex; in fact, the regex overrides the call site.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      Agreed. I think this is a clear bug and just indicates that when a qr// object is used as an entire expression with different flags, the local flags should override. Similarly, when it is interpolated into a larger expression it should also be subject to local rules.

        hmm, are you saying that:
        $myregex = qr/test/i;
        if (/$myregex = yes/) {...}
        should or should not be case insensitive for the /test/ part of the regex? To me the lack of options at the end of a regex means don't use them, but it makes sense to me that the precompiled regex would force the flags given to it. I don't see the bug there...

        -Waswas
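For what it's worth, that snippet can be tried directly. A small sketch with two made-up input strings: only the "test" part, which comes from the qr//i object, ignores case, while the rest of the enclosing regex stays case sensitive.

```perl
use strict;
use warnings;

my $myregex = qr/test/i;

# Hypothetical inputs: the first varies the case of "test", the second of "yes".
for my $str ('TEST = yes', 'test = YES') {
    my $result = $str =~ /$myregex = yes/ ? 'match' : 'no match';
    print "'$str' => $result\n";
}
```

The first string should match and the second should not.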
Re: Risks in the oblivious use of qr//
by halley (Prior) on Aug 11, 2003 at 13:16 UTC
    The first time I printed a qr// as a scalar, I was enlightened. If I would change anything about the current documentation, I'd add: "try print qr/ABC/i as a scalar to see how all of the options are permanently compiled in."

    The odd thing to me is that it bothers to keep the /x option. If you precompile a regex, why not strip all the whitespace and comments and force (?-x)? But that's not a big deal to me. When wearing my app-space hat, I presume the actual compiled regex is hidden and this is just a magic scalar representation.

    --
    [ e d @ h a l l e y . c c ]

      Which is indeed what's going on.
      $ perl -wle'$_=qr/foo/; print ref; print $$_'
      Regexp
      Use of uninitialized value in print at -e line 1.
      qr// returns a blessed reference which, well, doesn't point at anything. At least not anything you can see from the Perl end of things.

      Makeshifts last the longest.

      The implementation stores the original string internally to the compiled (and opaque) regex and when it constructs a stringified version it just slaps on a "(?..:" and ")" around the original string. Perl doesn't go to the trouble of actually normalizing the regex - it just outputs what it was originally given.
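A quick way to see this for yourself, as a sketch with a made-up /x pattern: the whitespace and the comment survive in the stringified form. The exact flag prefix printed differs between Perl versions, so the code only checks for the original text.

```perl
use strict;
use warnings;

# Hypothetical pattern with /x-style whitespace and a comment.
my $qr = qr/ a b c   # three letters
           /x;

# Stringifying shows the source text wrapped in a (?...:...) group.
print "$qr\n";

# The whitespace and the comment are still present in the string form.
print "whitespace kept\n" if "$qr" =~ /a b c/;
print "comment kept\n"    if "$qr" =~ /three letters/;

# The compiled regex itself still ignores both, thanks to /x.
print "still matches 'abc'\n" if 'abc' =~ $qr;
```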