I am trying to use a regex to match lines that will always have an opening <div tag and could optionally have a closing </div tag on the same line. If the closing </div tag is present, additional code will be executed. Here is a sample that illustrates my problem:
#!/usr/bin/perl -w
use strict;
my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";
if ($line =~ /<div.+?(<\/div)*/)
{
printf("line matched\n");
if (defined($1))
{
printf("right after match, 1 is defined\n");
}
}
The output of running the above is:
line matched
I can't figure out why the closing div tag isn't being captured. I thought adding the non-greedy ? would prevent any closing div tags from getting consumed by the .+, but even with that addition the closing div tag isn't being captured.
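One way to see what the pattern actually consumed is to print the overall match ($&) alongside $1. This is only a minimal sketch, assuming the sample line is written on a single line; the variables $matched and $group_defined are added here purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Illustrative variables (not in the original script) that record
# what the engine matched overall and whether group 1 took part.
my $matched;
my $group_defined;

if ($line =~ /<div.+?(<\/div)*/) {
    $matched       = $&;                  # text of the whole match
    $group_defined = defined $1 ? 1 : 0;  # did (<\/div)* match at least once?
    print "matched text: '$matched'\n";
    print "group 1 defined: $group_defined\n";
}
```

If the overall match is only `<div` plus a single character, then the non-greedy .+? stopped as early as it could, and the starred group was satisfied by matching zero times at that position rather than by finding the closing tag later in the line.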
EDIT: After more searching I found this SO thread which describes the same basic problem I have: https://stackoverflow.com/questions/28782603/regex-optional-capturing-group
After reviewing the example in that thread, I modified my original code to the following, which does work:
#!/usr/bin/perl -w
use strict;
my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";
if ($line =~ /<div.+?(<\/div>)*$/)
{
printf("line matched\n");
if (defined($1))
{
printf("right after match, 1 is defined\n");
}
}
I had to add the $ anchor at the end and also the closing > to the optional div capture group. I still don't quite understand how the regex engine is parsing this regex, however:
1. Why is it necessary to add the '>' in order for the capture group to work?
2. If I replace the '*' at the end of the optional capture group with a '?' (the same character that makes .+ non-greedy), the group is still captured. Are '*' and '?' equivalent when applied to a group?
3. If I omit the '$' from the above regex, the optional div is not captured. The referenced SO thread says this regarding why the regex without the '$' fails to capture the optional group ('cat' changed to 'div' to be consistent with my code):
The reason that you do not get an optional div after a reluctantly-qualified .+? is that it is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the div as the "tail" of the .+? sequence.
My question is: generally speaking, how does Perl handle the case in which an optional or non-greedy match (.+? in this case) is followed by another optional or non-greedy match ((<\/div>)* in this case)? Does it always prefer to use more characters for one match (i.e. act greedy) rather than make additional matches (when those matches are optional)?
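To compare the variants discussed above side by side, here is a rough sketch that runs each pattern against the same line and reports the overall match and $1. The helper try_pattern, the labels, and the %result hash are hypothetical names introduced only for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns (whole match, group 1), or an empty
# list when the pattern does not match at all.
sub try_pattern {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    return ($&, $1);
}

# The four variants mentioned in the question.
my @patterns = (
    [ 'no anchor, no >',  qr/<div.+?(<\/div)*/   ],
    [ 'anchor, no >',     qr/<div.+?(<\/div)*$/  ],
    [ 'anchor and >, *',  qr/<div.+?(<\/div>)*$/ ],
    [ 'anchor and >, ?',  qr/<div.+?(<\/div>)?$/ ],
);

my %result;
for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    my ($whole, $group) = try_pattern($re, $line);
    $result{$label} = { whole => $whole, group => $group };
    printf "%-18s matched '%s', \$1 = %s\n",
        $label,
        defined $whole ? $whole : '(no match)',
        defined $group ? "'$group'" : 'undef';
}
```

As far as I can tell, the unanchored pattern stops right after `<div` plus one character, because nothing forces .+? to grow. With the $ anchor but without the >, the engine can only reach the end of the line by letting .+? swallow everything, including the closing tag, so the group again matches zero times. Only when the group can itself reach the end of the line (with the > included) does capturing it become the first alternative that succeeds, and in that case '*' and '?' happen to give the same result because the tag occurs exactly once.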