Re: Arbitrary number of captures in a regular expression

... and if you still want a one-step approach ...

This must be a job for the /g modifier and its side-kick, \G:

my (@match) = $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;

(The negative lookbehind is there to prevent matching strings that start with the m \d+ pattern; the positive lookahead prevents matching strings that don't properly close with bar.)
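To see what the lookbehind buys, here is a small sketch comparing the pattern above with a variant that drops the (?<!^) guard. The sub names guarded and unguarded, and the second test string, are made up for this illustration and are not part of the original post.

```perl
use strict;
use warnings;

# The pattern from the post, with the (?<!^) guard in front of \G.
sub guarded {
    my ($str) = @_;
    return $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;
}

# Hypothetical variant without the guard, for comparison only.
# On the first attempt pos() is unset, so \G matches at offset 0.
sub unguarded {
    my ($str) = @_;
    return $str =~ /(?:^foo |\G)m (\d+) (?=(?:m \d+ )*bar)/g;
}

for my $str ('foo m 2 m 3 bar', 'm 2 m 3 bar') {
    my $g = join ', ', guarded($str);
    my $u = join ', ', unguarded($str);
    print "'$str' => guarded ($g), unguarded ($u)\n";
}
```

Both versions agree on the string that starts with foo. On 'm 2 m 3 bar' the guarded pattern returns nothing, while the unguarded one lets \G stand in for the missing foo and returns (2, 3).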

With a little test case, it looks like this:

my @test = ( 'foo m 1 m 2 m 3 m 4 bar',
             'foo m 2 m 4 m 7 bar',
             'foo m 1 bar',
             'm 2 foo m 1 bar',
             'foo m 1 c 2 bar',
             'foo m 1 bar m 2',
             'foo m 1 m 5 m 7',
           );

for my $str (@test) {
  my (@match) = $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;
  local $" = ', ';
  print "'$str' => (@match)\n";
}

... and outputs like this:

'foo m 1 m 2 m 3 m 4 bar' => (1, 2, 3, 4)
'foo m 2 m 4 m 7 bar' => (2, 4, 7)
'foo m 1 bar' => (1)
'm 2 foo m 1 bar' => ()
'foo m 1 c 2 bar' => ()
'foo m 1 bar m 2' => (1)
'foo m 1 m 5 m 7' => ()

Update: Almost missed the "bar" requirement. Fixed now, right?

print "Just another Perl ${\(trickster and hacker)},"
The Sidhekin proves Sidhe did it!

Re^2: Arbitrary number of captures in a regular expression by throop (Chaplain) on Sep 24, 2007 at 04:44 UTC
For those for whom '\G' is deep into 'executable line noise' country: The \G anchor forces the next match to start where the last match left off. Use \G analogously to ^ at the beginning of a string. ^ matches only at the beginning of the string; \G matches only at the position where the previous /g match on that string ended, that is, at the 'beginning' of whatever is left once earlier matches have chewed off the front of the string. perlfaq5 has more detail. (The internal hyperlink at perldoc.perl.org is broken; apparently the backslash discombobulated the escapeHTML routines. But this link will get you there.)

The other piece of the puzzle is the '(?='. This handy expression, the 'zero-width positive lookahead', and its evil twin '(?!' are explained in more detail at perlretut. You may also want to review Non-capturing-groupings.

Let's take Sidhekin's piece of work apart, and not be quite so terse. As perlretut says:

"Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the //x modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form"

So using the x modifier, the heart of Sidhekin's code becomes

                    # We're hunting the (properly bracketed)
$str =~             # 'm \d+' occurrences. They must be
/ (?:^foo\s         #  - preceded by an initial foo
   |                #    OR
   (?<!^)\G)        #  - the end of a previous successful match
                    #  - but not the beginning of the string
  m \s (\d+) \s     # Here's the guy we really want.
                    # But he must be followed by the right stuff
  (?=               # Lookahead says he must be followed by:
   (?:m \s \d+ \s)* # Any number of m \d+ groups.
   bar)/xg;         # Finally terminated with bar (though not
                    # necessarily the end of string.)

Notice that whitespace is not significant when using the //x modifier, so where Sidhekin used a single blank space, I had to use a '\s'. This is a straightforward way for a programmer to do a greedy capture in the middle of the string.
Realize, though, that it is not the most straightforward way for the computer. For each 'm \d+' expression in the string, the computer

- starts at the current 'beginning' of the string
- matches the 'm \d+' at the current position
- matches (fore) the foo and all the 'm \d+' before the current position
- matches (aft) all the remaining 'm \d+' and the final bar
- and THROWS AWAY the fore and aft matches (they're non-capturing)

This is a trivial amount of extra work on a single line. But if you are attempting to do something similar by, say, matching across line breaks and pattern searching a set of 120 page MS-Word documents, you may notice some performance problems.

Update: added detail
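For anyone who wants to watch \G at work in isolation, here is a minimal sketch. The sub names and the sample string are invented for the illustration. The only difference between the two subs is the \G anchor.

```perl
use strict;
use warnings;

# Unanchored: with /g the engine may skip over junk to find the next 'm N '.
sub scan_anywhere {
    my ($str) = @_;
    return $str =~ /m (\d+) /g;
}

# Anchored with \G: every match must begin exactly where the previous
# one ended (or at offset 0 for the first attempt).
sub scan_contiguous {
    my ($str) = @_;
    return $str =~ /\Gm (\d+) /g;
}

my $sample = 'm 1 m 2 x m 3 ';
print "anywhere:   ", join(', ', scan_anywhere($sample)),   "\n";
print "contiguous: ", join(', ', scan_contiguous($sample)), "\n";
```

The unanchored scan returns 1, 2, 3. The \G version stops at the stray 'x' and returns only 1, 2, because the third 'm 3 ' does not start where the second match left off.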
Re^2: Arbitrary number of captures in a regular expression by demerphq (Chancellor) on Sep 25, 2007 at 10:44 UTC
I'd do something like this myself. Except I'd probably not use lookahead and instead would approach it a different way. (I might even follow up with some code later if I get some time.)

---
$world=~s/war/peace/g
Re^3: Arbitrary number of captures in a regular expression by Sidhekin (Priest) on Sep 25, 2007 at 11:06 UTC
"I'd do something like this myself. Except I'd probably not use lookahead and instead would approach it a different way."

I was annoyed with the lookahead myself, but it's unlikely to be a big deal, and I could not at the time see any way to avoid it.

After some thinking, however, I believe I see a way to avoid looking ahead more than once -- just include it in the first alternation, which is matched precisely once on a successful match (anchored to the beginning of the string, and the only alternation that can match there):

my (@match) = $str =~ /(?:^foo (?=(?:m \d+ )+bar)|(?<!^)\G)m (\d+) /g;

... or, in the less-terse form:

my (@match) = $str =~ /
  (?: ^foo\  (?= (?:m\ \d+\ )+bar)  # overall match from ^foo
    | (?<!^) \G                     # or continue from not-^
  )
  m\ (\d+)\                         # grab each digit sequence
/xg;

I think that's the best I got. Match that? :)

print "Just another Perl ${\(trickster and hacker)},"
The Sidhekin proves Sidhe did it!
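A quick way to check that the single-lookahead pattern agrees with the earlier one is to run both over the seven test strings from the top of the thread. This is only a sketch; the sub names lookahead_every_time and lookahead_once are invented here, and agreement on these seven strings is all it demonstrates.

```perl
use strict;
use warnings;

my @test = ( 'foo m 1 m 2 m 3 m 4 bar',
             'foo m 2 m 4 m 7 bar',
             'foo m 1 bar',
             'm 2 foo m 1 bar',
             'foo m 1 c 2 bar',
             'foo m 1 bar m 2',
             'foo m 1 m 5 m 7',
           );

# Earlier pattern: the lookahead runs after every captured number.
sub lookahead_every_time {
    my ($str) = @_;
    return $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;
}

# Revised pattern: the lookahead runs once, in the ^foo branch.
sub lookahead_once {
    my ($str) = @_;
    return $str =~ /(?:^foo (?=(?:m \d+ )+bar)|(?<!^)\G)m (\d+) /g;
}

for my $str (@test) {
    my $old = join ', ', lookahead_every_time($str);
    my $new = join ', ', lookahead_once($str);
    my $verdict = $old eq $new ? 'same' : 'DIFFERENT';
    print "'$str' => old ($old), new ($new): $verdict\n";
}
```

Every line should end in 'same', with the captures matching the output listed in the original post.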
Re^4: Arbitrary number of captures in a regular expression by demerphq (Chancellor) on Sep 25, 2007 at 16:12 UTC
I don't have time to put together a working example, but what I had in mind was using while, \G and also the /gc modifier in scalar context. Maybe from that you can come up with a working example, or prove me wrong, before I get the time to do anything useful with it. Using the underused /gc modifier was the key point I was thinking of, though.

Oh, and to be clear, I wasn't trying to say my way would be better, just different. :-)

---
$world=~s/war/peace/g
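No code for this was posted in the thread, so the following is only a guess at the shape of a while / \G / /gc solution. The sub name captures_gc is hypothetical, and the exact accept/reject rules are an assumption chosen to mirror the test cases at the top of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured numbers, or an empty list
# when the string does not look like 'foo m N m N ... bar'.
sub captures_gc {
    my ($str) = @_;
    my @found;

    # Scalar-context /g sets pos($str) just past the leading 'foo '.
    return unless $str =~ /^foo /gc;

    # Each success advances pos($str). Because of /c, the final failed
    # attempt leaves pos($str) where it was instead of resetting it.
    while ($str =~ /\Gm (\d+) /gc) {
        push @found, $1;
    }

    # \G still points just past the last 'm N ', so this checks that
    # 'bar' follows immediately, with no lookahead needed.
    return unless @found && $str =~ /\Gbar/gc;
    return @found;
}

for my $str ('foo m 1 m 2 m 3 m 4 bar', 'foo m 1 c 2 bar', 'foo m 1 bar m 2') {
    print "'$str' => (", join(', ', captures_gc($str)), ")\n";
}
```

The design point is the /c flag: without it, the failed attempt that ends the while loop would reset pos(), and the final \Gbar test would be tried at the start of the string instead of after the last number.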

