http://qs321.pair.com?node_id=11142984

fireblood has asked for the wisdom of the Perl Monks concerning the following question:

Dear wise ones,

I’m unable to constrain the effect of a negative lookahead to just a specific scope. What I want to do is identify in a text value called SYSPBUFF all instances of the pattern “A = B” that occur prior to the first instance of the pattern “batch =”. There are instances of the pattern “A = B” that occur after the first instance of the pattern “batch =” which I don’t want to match.

An example of a typical value of SYSPBUFF is:

run_type = dev, max_monitor_time = 0.25 verbosity_level = 2 batch = ( source = sample_document_collection_1 files = Confucius.docx dest = Enterprise:Department )

I want my regex to capture “run_type = dev”, “max_monitor_time = 0.25”, and “verbosity_level = 2”, because they precede the string “batch =”, but not to capture “source = sample_document_collection_1”, “files = Confucius.docx”, or “dest = Enterprise:Department” because they do NOT precede the string “batch =”.

Each of the patterns “A = B” may be followed by an optional comma.
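To make the intended result concrete, here is a minimal sketch of what I am after, done the simple way rather than with my single regex. It cuts SYSPBUFF at the first "batch =" and collects the pairs from the front part only. The names $head and @pairs are made up for this illustration and are not part of my real code.

```perl
use strict;
use warnings;

my $syspbuff = 'run_type = dev, max_monitor_time = 0.25 verbosity_level = 2 '
             . 'batch = ( source = sample_document_collection_1 '
             . 'files = Confucius.docx dest = Enterprise:Department )';

# Keep only the text before the first "batch =" (non-greedy .*? stops early).
my ($head) = $syspbuff =~ / \A (.*?) \b batch \s* = /isx;
$head = $syspbuff unless defined $head;   # no "batch =" at all: use everything

# Collect every "A = B" pair found in that front part.
my @pairs;
while ( $head =~ / (\S+) \s* = \s* ([^\s,]+) /gx ) {
    push @pairs, [ $1, $2 ];
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

This yields only the three pairs I want, but it does not give me the reconstructed SYSPBUFF that the rest of my processing relies on.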

My regex is applied repeatedly to the value of SYSPBUFF. Every time it finds a match on the pattern “A = B” which precedes the value “batch =” in the value of SYSPBUFF, it saves the captured information, then reconstructs the value of SYSPBUFF as being everything EXCEPT for the string “A = B” that it just matched, and then tries to match again on the revised value of SYSPBUFF. Being able to do this reconstruction is the reason why all parts of the value of SYSPBUFF are captured into capture buffers.
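The match-and-reconstruct loop described above can be sketched as follows. This is not my actual regex: as an assumption, it replaces the single lookahead at the start with a lookahead that is re-checked before every character of buffer 1, so that buffer 1 can never run across "batch =". The names $pair_before_batch and %setting are hypothetical.

```perl
use strict;
use warnings;

my $syspbuff = 'run_type = dev, max_monitor_time = 0.25 verbosity_level = 2 '
             . 'batch = ( source = sample_document_collection_1 '
             . 'files = Confucius.docx dest = Enterprise:Department )';

my $pair_before_batch = qr{
    \A
    ( (?: (?! batch \s* = ) . )*? )   # 1: text before the pair, never crossing "batch ="
    (?! batch \s* = )                 # the key itself must not be the "batch" keyword
    ( \S+ ) \s* = \s* ( [^\s,]+ )     # 2 and 3: the "A" and the "B" of "A = B"
    (?: \s* , \s* )?                  # optional comma after the pair
    ( .* batch \s* = .* )             # 4: the rest, which must still contain "batch ="
    \z
}isx;

my %setting;
while ( $syspbuff =~ $pair_before_batch ) {
    $setting{$2} = $3;
    $syspbuff = $1 . $4;              # rebuild SYSPBUFF without the matched pair
}

print "$_ = $setting{$_}\n" for sort keys %setting;
print "remaining: $syspbuff\n";
```

With the example value, the loop stops once only the "batch = ( ... )" part is left, because no position in front of "batch =" can start a pair any more.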

My regex is the following:

    /                       # Beginning of regex
    ^                       # align regex to start of SYSPBUFF
    (?:                     # non-capturing group to limit the effect of the
                            # following negative lookahead
      (?!.*batch\s*=)       # negative lookahead, no "batch=" pattern
                            # allowed starting at the start of SYSPBUFF
      (                     # start of capture buffer 1
        .*?                 # a sequence of any chars, non-greedy
      )                     # end of capture buffer 1
      (                     # start of capture buffer 2
        \S+                 # one or more non-blank chars -- this is the "A" of "A = B"
      )                     # end of capture buffer 2
      \s*                   # optional white space
      =                     # an equals sign
      \s*                   # optional white space
      (                     # start of capture buffer 3
        [^\s\,]+            # one or more chars other than a space or comma
                            # -- this is the "B" of "A = B"
      )                     # end of capture buffer 3
                            # Now account for the optional comma after each
                            # of the A = B strings
      (?:                   # start of non-capturing optional group
        \s*                 # white space after the specified value
        \,                  # a comma
        \s*                 # optional white space after the comma
      )?                    # end of non-capturing optional group
    )                       # end of the negative lookahead limiting group
    (                       # start of capture group 4
      .*                    # a sequence of any chars, greedy
      batch                 # the literal string "batch" (not quoted)
      \s*                   # optional white space
      =                     # an equals sign
      .*                    # a sequence of any chars, greedy
    )                       # end of capture group 4
    $                       # align regex to end of SYSPBUFF
    /ix                     # end of regex

What happens is that the regex fails to match even once. I suspect the problem is that I don’t understand how the scope of the negative lookahead (?!.*batch\s*=) can be limited. I had thought that its scope would be confined to the parentheses labeled “non-capturing group to limit the effect of the following negative lookahead”. But what I think is really happening is that when the pattern (?!.*batch\s*=) is encountered, its effect extends beyond those parentheses even though it sits inside a parenthesized group. In effect it sets the condition that no “batch =” pattern may be present anywhere from that point in SYSPBUFF to the end. So when the latter part of the regex requires the pattern “batch =” to be present in that part of SYSPBUFF, the result is an impossible match: first (?!.*batch\s*=) stipulates that “batch =” must not be present, and then the later part stipulates that it must be.
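A cut-down test seems to confirm this suspicion. The sketch below uses a shortened, made-up input ($text) and a simplified version of my pattern; the only difference between the two matches is whether the lookahead is present inside the (?: ... ) group.

```perl
use strict;
use warnings;

my $text = 'a = 1 batch = ( b = 2 )';

# Lookahead placed inside a non-capturing group, as in my regex.
my $with_lookahead = $text =~ m{
    \A
    (?: (?! .* batch \s* = ) (\S+) \s* = \s* ([^\s,]+) )
    .* batch \s* =
}x ? 'match' : 'no match';

# Identical pattern with the lookahead deleted.
my $without_lookahead = $text =~ m{
    \A
    (?: (\S+) \s* = \s* ([^\s,]+) )
    .* batch \s* =
}x ? 'match' : 'no match';

print "with lookahead:    $with_lookahead\n";
print "without lookahead: $without_lookahead\n";
```

The first reports "no match" and the second "match". As far as I can tell, the lookahead is tested against the text of SYSPBUFF from the current position onward, and the .* inside it is free to run to the end of the string no matter which group of the pattern encloses it.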

Is there a way to specify that a lookahead pattern that begins with the pattern .* applies only to a particular part of a larger pattern after which it is no longer in effect?

Thank you.