... the real pattern I'm matching is more like (untested)
/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
...
What should I have done here to have something always match that date and occasionally also have the variable assignment later in the string?
I've put together some example strings (which you did not provide) and some regexes to try to answer your question.
Consider:
Win8 Strawberry 5.8.9.5 (32) Mon 08/16/2021 20:36:57
C:\@Work\Perl\monks
>perl -Mstrict -Mwarnings
use Data::Dump qw(dd);
for my $s (
'2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
'2021-08-18 a=bcd', '2021-08-19 a=b',
'2021-08-20 a=', '2021-08-21 ', '2021-08-22', 'xyzzy',
) {
my $matched = $s =~
/(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/ # .* greedy - fails all
# /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/ # .*? lazy - fails some
# /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/ # works
# /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/ # works
;
dd $s, $1, $2 if $matched;
# dd $s, $1, $2; # ???
}
^Z
("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
("2021-08-17 foo a=bcd", "2021-08-17", undef)
("2021-08-18 a=bcd", "2021-08-18", undef)
("2021-08-19 a=b", "2021-08-19", undef)
("2021-08-20 a=", "2021-08-20", undef)
("2021-08-21 ", "2021-08-21", undef)
/(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/
(Update: This is the regex quoted above.)
This fails to extract the optional assignment variable in all cases. Why? (Note that the date substring is always properly extracted, as also in all code below.)
.* is greedy and will consume everything (except, by default, newlines) to the end of the string. Then (?:a=([a-z]+)) tries to match and cannot because the match point is at the end of the string. That's OK because (?:a=([a-z]+))? is optional (update: and so the RE need not backtrack); the overall match can succeed. However, the assignment variable is never captured because .* has already run past it in the string: it's not there to capture.
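One way to see this for yourself is to wrap the .* in a capture group of its own and look at what it swallowed. This is only a sketch: the show_greedy sub and the extra capture group are my additions for illustration, and it uses plain printf rather than Data::Dump so that it needs nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original question): same regex as
# above, but with .* wrapped in an extra capture group so we can see
# what it consumed.
sub show_greedy {
    my ($s) = @_;
    return unless $s =~ /(\d\d\d\d-\d\d-\d\d) (.*)(?:a=([a-z]+))?/;
    my @caps = ($1, $2, $3);   # date, text eaten by .*, assignment variable
    return @caps;
}

for my $s ('2021-08-16 foo a=bcd bar', '2021-08-18 a=bcd') {
    my ($date, $eaten, $var) = show_greedy($s);
    printf "%-26s .* ate '%s', variable is %s\n",
        "'$s'", $eaten, defined $var ? "'$var'" : 'undef';
}
```

For both strings the .* group should report the entire tail of the string, a=bcd included, and the variable should come back undef.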
Next:
>perl -Mstrict -Mwarnings
use Data::Dump qw(dd);
for my $s (
'2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
'2021-08-18 a=bcd', '2021-08-19 a=b',
'2021-08-20 a=', '2021-08-21 ', '2021-08-22', 'xyzzy',
) {
my $matched = $s =~
# /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/ # .* greedy - fails all
/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/ # .*? lazy - fails some
# /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/ # works
# /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/ # works
;
dd $s, $1, $2 if $matched;
# dd $s, $1, $2; # ???
}
^Z
("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
("2021-08-17 foo a=bcd", "2021-08-17", undef)
("2021-08-18 a=bcd", "2021-08-18", "bcd")
("2021-08-19 a=b", "2021-08-19", "b")
("2021-08-20 a=", "2021-08-20", undef)
("2021-08-21 ", "2021-08-21", undef)
/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
(Update: This is the regex quoted above.)
Making .*? lazy helps a bit, but some of the variables that are present are still not captured. (Again, the date substrings are always captured.)
Failure to capture happens when something like 'foo' is present before the assignment substring. (I assume junk like 'foo' may be present because what's the point of the .* otherwise?) Lazy .*? first tries to match nothing at all. If the assignment substring immediately follows the date and its space, (?:a=([a-z]+))? matches right there and the variable is captured. If anything else (e.g., 'foo') comes first, .*? still matches nothing, (?:a=([a-z]+))? cannot match at that point, and there is still an overall match because (?:a=([a-z]+))? is completely optional; nothing forces .*? to stretch forward to the assignment, so the assignment variable is not captured.
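A quick way to check how far the lazy version actually got is to look at the text of the overall match. The following is a rough sketch; the lazy_match_text sub is a made-up name for this example, and it reads $& (the whole matched substring) right after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by the whole lazy regex,
# or undef if the regex does not match at all.
sub lazy_match_text {
    my ($s) = @_;
    return undef unless $s =~ /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/;
    my $whole = $&;   # copy the overall match before anything can change it
    return $whole;
}

for my $s ('2021-08-16 foo a=bcd bar', '2021-08-18 a=bcd') {
    my $whole = lazy_match_text($s);
    print "'$s' -> overall match '$whole'\n";
}
```

With 'foo' in the way, the overall match should stop right after the date and its trailing space; with the assignment directly after the date, the overall match should extend through a=bcd.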
What about:
>perl -Mstrict -Mwarnings
use Data::Dump qw(dd);
for my $s (
'2021-08-16 foo a=bcd bar', '2021-08-17 foo a=bcd',
'2021-08-18 a=bcd', '2021-08-19 a=b',
'2021-08-20 a=', '2021-08-21 ', '2021-08-22', 'xyzzy',
) {
my $matched = $s =~
# /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/ # .* greedy - fails all
# /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/ # .*? lazy - fails some
/(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/ # works
# /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/ # works
;
dd $s, $1, $2 if $matched;
# dd $s, $1, $2; # ???
}
^Z
("2021-08-16 foo a=bcd bar", "2021-08-16", "bcd")
("2021-08-17 foo a=bcd", "2021-08-17", "bcd")
("2021-08-18 a=bcd", "2021-08-18", "bcd")
("2021-08-19 a=b", "2021-08-19", "b")
("2021-08-20 a=", "2021-08-20", undef)
("2021-08-21 ", "2021-08-21", undef)
/(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/
This is a lot better. It captures the assignment variable in every case in which it is fully present, even when it's preceded by junk.
The whole (?:.*a=([a-z]+))? expression is optional, but within the expression, the a=([a-z]+) must match (even if preceded by junk) and if it matches, the variable will be captured.
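If you want to package the working regex up for reuse, something along these lines might do. Treat it as a sketch under my own assumptions: parse_line is a hypothetical name, and returning an empty list on failure is just one possible convention.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, variable) on a match, where
# variable is undef when no complete assignment is present, and an
# empty list when the string does not match at all.
sub parse_line {
    my ($s) = @_;
    return unless $s =~ /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/;
    my ($date, $var) = ($1, $2);   # copy captures only after a successful match
    return ($date, $var);
}

for my $s ('2021-08-16 foo a=bcd bar', '2021-08-20 a=', '2021-08-22', 'xyzzy') {
    my @fields = parse_line($s);
    if (!@fields) {
        print "'$s': no match\n";
        next;
    }
    my ($date, $var) = @fields;
    print "'$s': date $date, variable ", (defined $var ? "'$var'" : 'undef'), "\n";
}
```

Copying $1 and $2 into lexicals only after the match is known to have succeeded keeps the caller from ever looking at capture variables it should not trust.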
/(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/
What happens if the .* is changed to .*?, i.e., made lazy? Try it for yourself. Is there any difference in output? Can you explain what's going on?
This is a bit off-topic, but what's with the commented-out
# dd $s, $1, $2; # ???
statement at the end of the code? If you un-comment this statement and comment out the
dd $s, $1, $2 if $matched;
statement that's been used so far, how does the displayed output differ? Do we start to see "dates" extracted from strings from which they should not be extracted, like '2021-08-22' (no required space following the date substring) and 'xyzzy' (no date substring whatsoever)? Why does "dates" have scare-quotes?
What's going on here?
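For anyone who wants a hint, here is a small sketch that runs two matches in the same scope and looks at $1 after each one, whether or not the match succeeded. The stale_capture_demo sub and the simplified date-only regex are my own inventions for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: match a good string, then a bad one, and record
# what $1 holds after each match without checking for success first.
sub stale_capture_demo {
    my $re = qr/(\d\d\d\d-\d\d-\d\d) /;

    my $ok_first    = '2021-08-21 ' =~ $re;   # succeeds and sets $1
    my $after_first = $1;

    my $ok_second    = 'xyzzy' =~ $re;        # fails; $1 is left untouched
    my $after_second = $1;

    return ($ok_first ? 1 : 0, $after_first, $ok_second ? 1 : 0, $after_second);
}

my ($ok1, $cap1, $ok2, $cap2) = stale_capture_demo();
print "first match:  ok=$ok1, \$1 is '$cap1'\n";
print "second match: ok=$ok2, \$1 is '$cap2'\n";
```

A failed match does not reset the capture variables, so the second line still shows the date left over from the first match.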
And yes, regexes be tricky.
Give a man a fish: <%-{-{-{-<