Re^2: Arbitrary number of captures in a regular expression

For those for whom '\G' is deep into 'executable line noise' country:

The \G anchor forces the next match to start where the last match left off. Think of \G as analogous to ^: ^ matches only at the beginning of the string, while \G matches only at the beginning of what remains after earlier //g matches have chewed off the front of the string.
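A minimal sketch of the difference, using a made-up sample string (not from the thread). With \G, each match has to begin exactly where the previous one stopped, so the scan halts at the first thing that doesn't fit instead of skipping over it:

```perl
use strict;
use warnings;

my $str = "11 22 x 33";   # hypothetical sample input

# With \G every match must start where the previous one ended,
# so the stray "x" stops the scan before it reaches 33.
my @anchored = $str =~ /\G(\d+)\s*/g;

# Without \G the engine is free to skip over the "x".
my @floating = $str =~ /(\d+)\s*/g;

print "anchored: @anchored\n";   # anchored: 11 22
print "floating: @floating\n";   # floating: 11 22 33
```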

perlfaq5 has more detail. (The internal hyperlink at perldoc.perl.org is broken; apparently the backslash discombobulated the escapeHTML routines. But this link will get you there.) The other piece of the puzzle is the '(?='. This handy expression, the 'zero-width positive lookahead', is explained in more detail at perlretut, along with its evil twin '(?!'.
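A small sketch of both lookaheads, again on a made-up string. The point to notice is that a lookahead only checks what follows and does not consume it:

```perl
use strict;
use warnings;

my $str = "100 USD 200 EUR";   # hypothetical sample input

# (?=...) : capture a number only when " USD" follows it.
my @usd = $str =~ /(\d+)(?=\sUSD)/g;

# (?!...) : capture a whole number only when " USD" does NOT follow it.
# The \b keeps \d+ from backtracking to a partial number such as "10".
my @other = $str =~ /(\d+)\b(?!\sUSD)/g;

# Zero width: the " USD" that was looked at is still there after the
# substitution, because the lookahead never consumed it.
(my $masked = $str) =~ s/\d+(?=\sUSD)/N/;

print "@usd\n";      # 100
print "@other\n";    # 200
print "$masked\n";   # N USD 200 EUR
```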

You may also want to review Non-capturing-groupings.
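For the impatient, a quick sketch of what non-capturing buys you (the sample string and the 'baz' alternative are made up for illustration): '(?:' groups just like '(' but doesn't hand anything back.

```perl
use strict;
use warnings;

my $str = "foo m 7";   # hypothetical sample input

# Plain parentheses group AND capture, so the prefix comes back too.
my @with_capture = $str =~ /(foo|baz)\sm\s(\d+)/;

# (?:...) groups without capturing, so only the number comes back.
my @without = $str =~ /(?:foo|baz)\sm\s(\d+)/;

print scalar(@with_capture), " captures: @with_capture\n";   # 2 captures: foo 7
print scalar(@without), " capture: @without\n";              # 1 capture: 7
```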

Let's take Sidhekin's piece of work apart, and not be quite so terse. As perlretut says:

Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the //x modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form

So using the x modifier, the heart of Sidhekin's code becomes

                # We're hunting the (properly bracketed)
$str =~         # 'm \d+' occurrences.  They must be
  / (?:^foo\s   # - preceded by an initial foo
    |           #  OR
    (?<!^)\G)   # - the end of a previous successful match
                # - but not the beginning of the string
     m \s (\d+) \s  # Here's the guy we really want (and his trailing space).
                # But he must be followed by the right stuff
    (?=         # Lookahead says he must be followed by:
     (?:m \s \d+ \s)*   # Any number of m \d+ groups.
     bar)/xg;   # Finally terminated with bar (though not
                #  necessarily the end of string.)

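Notice that whitespace is not significant when using the //x modifier, so where Sidhekin used a single blank space, I had to use a '\s'. (Strictly, '\s' matches any whitespace character, not just a space; an escaped space or '[ ]' would match a space only.)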
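To see it run, here is a sketch on a made-up input string. The terse pattern is my reconstruction from the commented form, so treat it as an assumption rather than Sidhekin's exact code. Both versions spell out the separator after each number, which the lookahead needs in order to line up with the next 'm':

```perl
use strict;
use warnings;

my $str = "foo m 1 m 22 m 333 bar";   # hypothetical sample input

# Commented form; \s stands in for the literal spaces.
my @verbose = $str =~ /
    (?: ^foo \s          # an initial foo
      | (?<!^) \G )      # or the end of the previous match
    m \s (\d+) \s        # the number we want, plus its trailing space
    (?= (?: m \s \d+ \s )* bar )   # only more groups, then bar, may follow
/xg;

# Terse form reconstructed from the commented one (an assumption).
my @terse = $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;

# No bar after the groups, so nothing is captured.
my @none = "foo m 1 m 22 baz" =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;

print "@verbose\n";          # 1 22 333
print "@terse\n";            # 1 22 333
print scalar(@none), "\n";   # 0
```

In list context the //g match returns every capture at once, which is what makes the 'arbitrary number of captures' possible.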

This is a straightforward way for a programmer to do a greedy capture in the middle of the string. Realize, though, that it is not the most straightforward way for the computer. For each 'm \d+' expression in the string, the computer:

  - starts at the current 'beginning' of the string
  - matches the 'm \d+' at the current position
  - matches (fore) the foo and all the 'm \d+' before the current position
  - matches (aft) all the remaining 'm \d+' and the final bar
     -  and THROWS AWAY the fore and aft matches  (they're non-capturing)

This is a trivial amount of extra work on a single line. But if you are attempting to do something similar by, say, matching across line breaks and pattern searching a set of 120-page MS-Word documents, you may notice some performance problems.
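If the rescanning ever does hurt, one possible alternative (my own sketch, not from the thread, and not benchmarked) is to validate the whole bracketed run with a single match and then pull the numbers out of it in a second pass. The helper name bracketed_numbers is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: capture the whole run once,
# then extract the numbers from it.
sub bracketed_numbers {
    my ($str) = @_;

    # Step 1: one match checks the foo, the run of "m N " groups, and the bar.
    my ($run) = $str =~ /^foo\s((?:m\s\d+\s)+)bar/
        or return;

    # Step 2: pull the numbers out of the already validated run.
    return $run =~ /(\d+)/g;
}

my @nums = bracketed_numbers("foo m 1 m 22 m 333 bar");
my @none = bracketed_numbers("foo m 1 m 22 baz");

print "@nums\n";             # 1 22 333
print scalar(@none), "\n";   # 0
```

The trade-off is two passes and a temporary string instead of one clever pattern, but each pass walks its text only once.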

Update: added detail
