
Re: Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why.

by AnomalousMonk (Archbishop)
on Aug 17, 2021 at 02:20 UTC ( [id://11135893]=note )


in reply to Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why.

... the real pattern I'm matching is more like (untested)

/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
...
What should I have done here to have something always match that date and occasionally also have the variable assignment later in the string?

I've put together some example strings (which you did not provide) and some regexes to try to answer your question.

Consider:

    Win8 Strawberry 5.8.9.5 (32) Mon 08/16/2021 20:36:57
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
      '2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
      '2021-08-18 a=bcd', '2021-08-19 a=b', '2021-08-20 a=',
      '2021-08-21 ', '2021-08-22', 'xyzzy',
      ) {
      my $matched = $s =~
          /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/     # .* greedy - fails all
        # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/    # .*? lazy - fails some
        # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/     # works
        # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/    # works
        ;
      dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
      }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", undef)
    ("2021-08-19 a=b", "2021-08-19", undef)
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

/(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/   (Update: This is the regex quoted above, but with a greedy .* in place of the lazy .*?.) This fails to extract the optional assignment variable in all cases. Why? (Note that the date substring is always properly extracted, as also in all code below.)

.* is greedy and will consume everything (except, by default, newlines) to the end of the string. Then (?:a=([a-z]+)) tries to match and cannot because the match point is at the end of the string. That's OK because (?:a=([a-z]+))? is optional (update: and so the RE need not backtrack); the overall match can succeed. However, the assignment variable is never captured because .* has already run past it in the string: it's not there to capture.
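To see this for yourself, here is a minimal sketch that wraps the .* in an extra capture group so you can look at what it swallowed. The helper name greedy_probe and the added group are my own additions for illustration, not part of the code above; it uses only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): same greedy regex,
# but with .* wrapped in a capture group so its contents can be inspected.
# Returns (text consumed by .*, captured variable or undef), or an empty
# list if the regex does not match at all.
sub greedy_probe {
    my ($s) = @_;
    return unless $s =~ /(\d\d\d\d-\d\d-\d\d) (.*)(?:a=([a-z]+))?/;
    return ($2, $3);
}

my ($junk, $var) = greedy_probe('2021-08-16 foo a=bcd bar');
printf "greedy .* took: '%s'; variable: %s\n",
    $junk, defined $var ? "'$var'" : 'undef';
```

This should report that .* took 'foo a=bcd bar' and that the variable is undef: the assignment is sitting inside what .* consumed.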

Next:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
      '2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
      '2021-08-18 a=bcd', '2021-08-19 a=b', '2021-08-20 a=',
      '2021-08-21 ', '2021-08-22', 'xyzzy',
      ) {
      my $matched = $s =~
        # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/     # .* greedy - fails all
          /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/    # .*? lazy - fails some
        # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/     # works
        # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/    # works
        ;
      dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
      }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/   (Update: This is the regex quoted above.) Making .*? lazy helps a bit, but some of the variables that are present are still not captured. (Again, the date substrings are always captured.)

Failure to capture happens when something like 'foo' is present before the assignment substring. (I assume junk like 'foo' may be present because what's the point of the .* otherwise?) A lazy .*? first tries to match zero characters. If the assignment substring starts right at that point, (?:a=([a-z]+))? matches it and the variable is captured. If anything else (e.g., 'foo') is at that point, the group cannot match there, but because (?:a=([a-z]+))? is still completely optional, it matches nothing and the overall match succeeds at once; the .*? is never asked to consume more, and the assignment variable is not captured.
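Again, a small sketch may help: wrap the .*? in its own capture group and look at how much it actually consumed. The helper name lazy_probe and the extra group are hypothetical additions of mine, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): same lazy regex,
# but with .*? wrapped in a capture group so its contents can be inspected.
# Returns (text consumed by .*?, captured variable or undef), or an empty
# list if the regex does not match at all.
sub lazy_probe {
    my ($s) = @_;
    return unless $s =~ /(\d\d\d\d-\d\d-\d\d) (.*?)(?:a=([a-z]+))?/;
    return ($2, $3);
}

for my $s ('2021-08-17 foo a=bcd', '2021-08-18 a=bcd') {
    my ($junk, $var) = lazy_probe($s);
    printf "%-24s lazy .*? took: '%s'; variable: %s\n",
        "'$s'", $junk, defined $var ? "'$var'" : 'undef';
}
```

In both cases .*? should turn out to have consumed nothing at all; the only difference is whether the assignment happens to start exactly where .*? stopped.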

What about:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
      '2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
      '2021-08-18 a=bcd', '2021-08-19 a=b', '2021-08-20 a=',
      '2021-08-21 ', '2021-08-22', 'xyzzy',
      ) {
      my $matched = $s =~
        # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/     # .* greedy - fails all
        # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/    # .*? lazy - fails some
          /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/     # works
        # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/    # works
        ;
      dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
      }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", "bcd")
    ("2021-08-17 foo a=bcd", "2021-08-17", "bcd")
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

/(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/   This is a lot better. It captures the assignment variable in every case in which it is fully present, even when it's preceded by junk.

The whole (?:.*a=([a-z]+))? expression is optional, but within the expression, the a=([a-z]+) must match (even if preceded by junk) and if it matches, the variable will be captured.
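If you want to use this in real code, one possible shape is a small sub that returns the date and the (possibly undefined) variable. This is only a sketch: the name parse_line is made up for illustration, and I'm assuming a list return is convenient for your caller.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns
# (date, variable-or-undef) on a match, or an empty list on no match.
# Assigning the captures straight from the match in list context avoids
# reading $1/$2 separately from the test of whether the match succeeded.
sub parse_line {
    my ($line) = @_;
    my ($date, $var) = $line =~ /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/
        or return;
    return ($date, $var);
}

for my $s ('2021-08-16 foo a=bcd bar', '2021-08-20 a=', '2021-08-22') {
    my @parsed = parse_line($s);
    if (@parsed) {
        printf "%-28s date=%s variable=%s\n", "'$s'", $parsed[0],
            defined $parsed[1] ? "'$parsed[1]'" : 'undef';
    }
    else {
        printf "%-28s no match\n", "'$s'";
    }
}
```

Note that '2021-08-22' gives no match because the regex still requires a space after the date, just as in the runs above.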

/(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/   What happens if the .* is changed to .*?, i.e., made lazy? Try it for yourself. Is there any difference in output? Can you explain what's going on?

This is a bit off-topic, but what's with the commented-out
  # dd $s, $1, $2;  # ???
statement at the end of the code? If you un-comment this statement and comment out the
    dd $s, $1, $2 if $matched;
statement that's been used so far, how does the displayed output differ? Do we start to see "dates" extracted from strings from which they should not be extracted, like '2021-08-22' (no required space following the date substring) and 'xyzzy' (no date substring whatsoever)? Why does "dates" have scare-quotes? What's going on here?

And yes, regexes be tricky.


Give a man a fish:  <%-{-{-{-<
