PerlMonks  

Unable to constrain the effect of a negative lookahead

by fireblood (Scribe)
on Apr 14, 2022 at 14:38 UTC

fireblood has asked for the wisdom of the Perl Monks concerning the following question:

Dear wise ones,

I’m unable to constrain the effect of a negative lookahead to just a specific scope. What I want to do is identify in a text value called SYSPBUFF all instances of the pattern “A = B” that occur prior to the first instance of the pattern “batch =”. There are instances of the pattern “A = B” that occur after the first instance of the pattern “batch =” which I don’t want to match.

An example of a typical value of SYSPBUFF is:

run_type = dev, max_monitor_time = 0.25 verbosity_level = 2 batch = ( source = sample_document_collection_1 files = Confucius.docx dest = Enterprise:Department )

I want my regex to capture “run_type = dev”, “max_monitor_time = 0.25”, and “verbosity_level = 2”, because they precede the string “batch =”, but not to capture “source = sample_document_collection_1”, “files = Confucius.docx”, or “dest = Enterprise:Department” because they do NOT precede the string “batch =”.

Each of the patterns “A = B” may be followed by an optional comma.

My regex is applied repeatedly to the value of SYSPBUFF. Every time it finds a match on the pattern “A = B” which precedes the value “batch =” in the value of SYSPBUFF, it saves the captured information, then reconstructs the value of SYSPBUFF as being everything EXCEPT for the string “A = B” that it just matched, and then tries to match again on the revised value of SYSPBUFF. Being able to do this reconstruction is the reason why all parts of the value of SYSPBUFF are captured into capture buffers.

My regex is the following:

    /                       # Beginning of regex
    ^                       # align regex to start of SYSPBUFF
    (?:                     # non-capturing group to limit the effect of the following negative lookahead
        (?!.*batch\s*=)     # negative lookahead, no "batch=" pattern
                            # allowed starting at the start of SYSPBUFF
        (                   # start of capture buffer 1
            .*?             # a sequence of any chars, non-greedy
        )                   # end of capture buffer 1
        (                   # start of capture buffer 2
            \S+             # one or more non-blank chars – this is the “A” of “A = B”
        )                   # end of capture buffer 2
        \s*                 # optional white space
        =                   # an equals sign
        \s*                 # optional white space
        (                   # start of capture buffer 3
            [^\s\,]+        # one or more chars other than a space or comma -- this is the “B” of “A = B”
        )                   # end of capture buffer 3
                            # Now account for the optional comma after each of the A = B strings
        (?:                 # start of non-capturing optional group
            \s*             # white space after the specified value
            \,              # a comma
            \s*             # optional white space after the comma
        )?                  # end of non-capturing optional group
    )                       # end of the negative lookahead limiting
                            # group
    (                       # start of capture group 4
        .*                  # a sequence of any chars, non-greedy
        batch               # the literal string "batch" (not quoted)
        \s*                 # optional white space
        =                   # an equals sign
        .*                  # a sequence of any chars, greedy
    )                       # end of capture group 4
    $                       # align regex to end of SYSPBUFF
    /ix                     # end of regex

What is happening is that the regex fails to match even once. I suspect the problem is that I don’t understand how the scope of the negative lookahead (?!.*batch\s*=) can be limited. I had thought that its scope would be confined to within the parentheses that are labeled “non-capturing group to limit the effect of the following negative lookahead”. But what I think is really happening is that when the pattern (?!.*batch\s*=) is encountered, despite being within a parenthesized group, its effect extends beyond those parentheses, effectively setting the condition that from that point in the regex to the end of the regex there can be no “batch =” pattern present. So when the latter part of the regex stipulates that the pattern “batch =” is a mandatory component of that part of the value of SYSPBUFF, the result is an impossible match. First there is the stipulation (?!.*batch\s*=) that the pattern “batch =” must not be present, and then there is the later stipulation that the pattern “batch =” must be present.

Is there a way to specify that a lookahead pattern that begins with the pattern .* applies only to a particular part of a larger pattern after which it is no longer in effect?

Thank you.

Replies are listed 'Best First'.
Re: Unable to constrain the effect of a negative lookahead
by johngg (Canon) on Apr 14, 2022 at 15:50 UTC

    Instead of a complex and possibly difficult to maintain regex I would approach the problem a different way. I would treat $SYSPBUFF as a file and read it line by line until I reached the "batch =" line, at which point I'd last out of the loop to cease reading any further. Lines of interest would be pushed onto an array ready for further processing. I haven't tried tidying the lines up, leading spaces, trailing comma etc., I'll leave that to the reader.

    johngg@abouriou:~/perl/Monks$ perl -Mstrict -Mwarnings -E '
    say q{};
    my $SYSPBUFF = <<__EOD__;
        run_type = dev,
        max_monitor_time = 0.25
        verbosity_level = 2

        batch =

            (
                source = sample_document_collection_1
                files = Confucius.docx
                dest = Enterprise:Department
            )
    __EOD__

    open my $inFH, q{<}, \ $SYSPBUFF
        or die qq{open: < in mem data: $!\n};
    my @lines;
    while ( <$inFH> ) {
        last if m{batch\s*\=};
        next if m{^\s*$};
        chomp;
        push @lines, $_;
    }
    close $inFH
        or die qq{close: < in mem data: $!\n};
    say for @lines;'

        run_type = dev,
        max_monitor_time = 0.25
        verbosity_level = 2

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Unable to constrain the effect of a negative lookahead
by tybalt89 (Monsignor) on Apr 14, 2022 at 18:21 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11142984
    use warnings;

    $_ = <<END;
        run_type = dev,
        max_monitor_time = 0.25
        verbosity_level = 2

        batch =

            (
                source = sample_document_collection_1
                files = Confucius.docx
                dest = Enterprise:Department
            )
    END

    my @want = grep defined,
      /\bbatch =/ &&                  # there is a 'batch ='
      /^(?:\h*batch\s+=[\s\S]*\z      # skip rest of string
      | \h*(.*=.*?),?\h*\n)           # the pattern we want
      /gmx;

    use Data::Dump 'dd'; dd \@want;

    Outputs:

    [
      "run_type = dev",
      "max_monitor_time = 0.25",
      "verbosity_level = 2",
    ]
Re: Unable to constrain the effect of a negative lookahead
by AnomalousMonk (Archbishop) on Apr 14, 2022 at 20:17 UTC

    Here's another approach that avoids look-arounds.

    Win8 Strawberry 5.30.3.1 (64) Thu 04/14/2022 15:59:05
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);

    my $rx_ab   = qr{ A \s* = \s* [BCDEFGH] }xms;
    my $rx_stop = qr{ batch \s* = }xms;

    my @strings = (
        'A = B foo A=C batch = bar A = D',
        'A = B foo A=C bar A =D',
        );

    for my $s (@strings) {
        my @captures =
            $s =~ m{ ($rx_ab) | $rx_stop (*COMMIT) (*FAIL)  # wrong order? see update below.
                   }xmsg;
        dd \@captures;
        }
    ^Z
    ["A = B", "A=C"]
    ["A = B", "A=C", "A =D"]

    You will have to elaborate the $rx_ab and $rx_stop regexes until they match what you want. The (*COMMIT) special backtracking control verb is available from Perl version 5.10 onward (update: and is non-experimental from version 5.20 on).

    Update: Because of the way Perl ordered alternations work, it seems to me that the order of the alternation used above should be
        $rx_stop (*COMMIT) (*FAIL) | ($rx_ab)
    This eliminates the possibility that the A pattern can accidentally match the 'batch' stop marker.
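A minimal sketch of the reordered alternation applied to data shaped like the original question. The name $rx_pair and its definition are assumptions made for illustration (any run of non-blank, non-"=", non-comma characters as the name); they are not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical "A = B" pattern: a name, "=", then a value without blanks or commas.
my $rx_pair = qr{ [^\s=,]+ \s* = \s* [^\s,]+ }xms;
my $rx_stop = qr{ batch \s* = }xms;

my $s = 'run_type = dev, max_monitor_time = 0.25 verbosity_level = 2 '
      . 'batch = ( source = sample_document_collection_1 )';

# The stop marker is tried first. Once it matches, (*COMMIT) (*FAIL) makes
# the whole match fail without trying any later start position, so the
# global match ends there and nothing after "batch =" is captured.
my @captures = $s =~ m{ $rx_stop (*COMMIT) (*FAIL) | ($rx_pair) }xmsg;

print "$_\n" for @captures;
```

With the stop marker in the first branch, "batch = (" can never be picked up as an ordinary pair, which is the risk the update describes.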


    Give a man a fish:  <%-{-{-{-<

Re: Unable to constrain the effect of a negative lookahead
by hv (Parson) on Apr 15, 2022 at 02:39 UTC

    But what I think is really happening is that when the pattern (?!.*batch\s*=) is encountered, despite being within a parenthesized group, its effect extends beyond those parentheses, effectively setting the condition that from that point in the regex to the end of the regex there can be no “batch =” pattern present.

    Close - but it's "from that point in the string to the end of the string". The fragment (?!.*batch\s*=) means "match at this point in the string only if at this point we do not satisfy /.*batch\s*=/". There's no scoping going on - the whole of the string (from the point we've matched to so far) is fair game. So this fragment can only match locations in the string that don't have a "batch =" anywhere after them, and since the second half of the pattern requires "batch", we have a contradiction and the whole will never match.
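A small sketch of the point above, using a shortened, made-up sample string. It shows the assertion being evaluated against the rest of the string from the current position, and one common way to bound it: testing a narrower assertion before each character, so that the prefix can never run across "batch =". The variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = 'a = 1 batch = ( b = 2 )';

# At the start of the string "batch =" lies ahead, so the assertion fails
# there, whatever group it is wrapped in.
my $at_start = $str =~ /^(?:(?!.*batch\s*=))/ ? 1 : 0;

# Right after the only "batch =" nothing further lies ahead, so it succeeds.
my $after_batch = $str =~ /batch\s*=(?!.*batch\s*=)/ ? 1 : 0;

# Bounded variant: check "not at batch =" before consuming each character.
my ($prefix) = $str =~ /^((?:(?!batch\s*=).)*)/s;

# The pairs can then be pulled out of the prefix alone.
my @pairs = $prefix =~ /([^\s=,]+)\s*=\s*([^\s,]+)/g;

print "at_start=$at_start after_batch=$after_batch prefix='$prefix'\n";
print "pairs: @pairs\n";
```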

      Hi hv, Yes, you stated it quite well; that's what I meant to say, but I was sloppy in using the phrase "that point in the regex" instead of "that point in the string". Thanks for the correction. And you have also confirmed what I suspected was the problem: because the effect of the (?!.*batch\s*=) was over the entire string rather than the part of the string to which the parenthesized expression applied (as is the case with, for example, (?i) for temporarily matching case insensitively), the expression as a whole contained a contradiction and the whole would never match. -fireblood
Re: Unable to constrain the effect of a negative lookahead
by Anonymous Monk on Apr 14, 2022 at 16:29 UTC

    This looks to me like an X-Y problem: you are trying to do X (which I presume is capture the values defined in a string), have decided the way to do this is Y (pull the string apart and rebuild it as you go until you hit some sentinel value), and are having trouble doing this.

    To answer the question you asked: if you want any part of a regex to match only under certain circumstances, you put it in the regular expression where those circumstances apply. Your negative lookahead is anchored to the start of the string by '^', and will match only there. As for why the whole regular expression fails to match, you can try use re 'debug'; to give a somewhat cryptic insight on what the regex is trying.

    But if my X-Y assumption is correct, I would implement differently, with a much simpler and more comprehensible regular expression.

    #!/usr/bin/env perl
    
    use 5.010;	# For branch reset
    use strict;
    use warnings;
    
    my $SYSPBUFF = <<'EOD';
    		run_type = dev,
    		max_monitor_time = 0.25
    		verbosity_level = 2
    
    		batch =
    
    			(
    				source = sample_document_collection_1
    				files = Confucius.docx
    				dest = Enterprise:Department
    			)
    EOD
    
    while ( $SYSPBUFF =~ m/
        (?|		# Branch reset
    	\s* ( batch ) \s* = \s* ( .* ) |	# Match batch = ...
    	\s* ( [^\s=]+ ) \s* = \s* ( [^\s,]+ )	# Match A = B
        )
        /smxg
    ) {
        print "Captured $1 = $2\n";
    }
    

    produces

    Captured run_type = dev
    Captured max_monitor_time = 0.25
    Captured verbosity_level = 2
    Captured batch = (
    				source = sample_document_collection_1
    				files = Confucius.docx
    				dest = Enterprise:Department
    			)
    
    

    Is this the sort of thing you are after?

    What the branch reset ((?| ... | ... | ... )) does is to re-use capture buffer numbers. In this specific regular expression I used it so that the name of the value would always appear in $1 and the value itself in $2, no matter which branch of the alternation matched.

    Note that there is no explicit termination logic for the loop. I understood your problem statement to imply that the batch = was the last thing in the input, so that once you saw it you wanted the entire remainder of the input. If this is wrong the regex will need to match a parenthesis-delimited string. If the parentheses can be nested, that is more complicated. Module Regexp::Common can be helpful here.
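If the batch value does need to stop at its closing parenthesis, one possible sketch uses a recursive subpattern from core Perl (5.10 or later) rather than a module. The sample string, including the trailing "tail = 3", is made up for illustration.

```perl
use strict;
use warnings;

my $s = 'batch = ( a = 1 ( b = 2 ) ) tail = 3';

# Group 1 matches "(", then any mix of non-parenthesis runs and nested
# groups (the (?1) recursion into group 1), then the closing ")".
my ($group) = $s =~ / batch \s* = \s* ( \( (?: [^()]++ | (?1) )* \) ) /x;

print "$group\n";
```

The (?1) construct refers to the first capture group of the whole pattern, so this only works as written when the parenthesised group is group 1; a named group with (?&name) avoids that coupling if the pattern is embedded in a larger one.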

Re: Unable to constrain the effect of a negative lookahead
by Fletch (Bishop) on Apr 15, 2022 at 13:24 UTC

    Just an (almost) meta comment: Trying to do something like this with a single regexp is when you may start getting into "Now you have two problems" territory. As has been shown you can do this in one pass with deep regex engine magic, but were I to do it I'd instead break it down into a simple state machine and parse things line-by-line that way instead. And were the "language" being handled much more complicated then I'd look to use a proper parser module (Parse::RecDescent, maybe Marpa::R2) and let that write the state machine. If/when things change about the "language" I just would need to update the higher level grammar description instead of manually monkeying in more states.
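A rough sketch of what such a line-by-line state machine might look like. It assumes one "A = B" pair per line and a "batch =" line that starts the batch section; the state names 'options' and 'batch' and the variables %option and @batch_lines are invented for this example.

```perl
use strict;
use warnings;

my $SYSPBUFF = <<'EOD';
    run_type = dev,
    max_monitor_time = 0.25
    verbosity_level = 2

    batch =

        (
            source = sample_document_collection_1
            files = Confucius.docx
            dest = Enterprise:Department
        )
EOD

my $state = 'options';          # hypothetical states: 'options', then 'batch'
my ( %option, @batch_lines );

for my $line ( split /\n/, $SYSPBUFF ) {
    next if $line =~ /^\s*$/;   # ignore blank lines in either state

    if ( $state eq 'options' ) {
        if ( $line =~ /^\s*batch\s*=/ ) {
            $state = 'batch';   # everything from here on belongs to the batch
        }
        elsif ( $line =~ /^\s*([^\s=]+)\s*=\s*([^\s,]+)\s*,?\s*$/ ) {
            $option{$1} = $2;   # an "A = B" pair, optional trailing comma
        }
        else {
            warn "Unrecognised option line: $line\n";
        }
    }
    else {
        push @batch_lines, $line;
    }
}

printf "%s = %s\n", $_, $option{$_} for sort keys %option;
```

Each state only needs simple, anchored patterns, and supporting a new kind of line means adding a branch rather than reworking one large regex.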

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Unable to constrain the effect of a negative lookahead
by Marshall (Canon) on Apr 17, 2022 at 03:27 UTC
    Here is another technique for you. Just use a split to get the part of the string before the "batch =", then an unconstrained regex match global upon that. Yes, this does have to process the string twice, but the code is clear, short and does not require fancy regex features.

    use strict;
    use warnings;

    my $SYSPBUFF = <<__EOD__;
        run_type = dev,
        max_monitor_time = 0.25
        verbosity_level = 2

        batch =

            (
                source = sample_document_collection_1
                files = Confucius.docx
                dest = Enterprise:Department
            )
    __EOD__

    my ($top) = split(/\s*batch\s*=/,$SYSPBUFF,2);
    my %hash = $top =~ m/([\w\.]+)\s*=\s*([\w\.]+)/g;

    print "$_ => $hash{$_}\n" for (keys %hash);

    __END__
    max_monitor_time => 0.25
    verbosity_level => 2
    run_type => dev
Re: Unable to constrain the effect of a negative lookahead
by fireblood (Scribe) on Apr 15, 2022 at 12:14 UTC
    Dear JohnGG, tybalt89, Anonymous Monk, AnomalousMonk, and hv,

    Thanks so much for your replies. I had never considered in-lining the string and then processing it as a file, but it's clearly a very elegant solution, not only by virtue of solving the problem but also making the code a lot clearer and easier to read and maintain. And to Anonymous, I was not aware of the technique that your code shows whereby capture buffer numbers are reused so that regardless of which alternative matched, the capture buffer numbers are still the same. Very cool. And thanks for the referral to Regexp::Common, a good resource.

    And to Anomalous, wow, thanks for the reference to the special backtracking control verbs, that's another new area for me to explore.

    My understanding of how the voting works when there are multiple responders is that I click on the O++, O--, or O+=0 buttons to allocate points across the responders, and these are limited to awarding a single upvote, a single downvote, or a single nothing vote respectively, there's no mechanism for the OP to award 50% to one, 25% to another, and so forth. But other readers can come along, read through all of the responses, and add their single upvotes or downvotes, and eventually with enough additional readers chiming in, the distributions of 50%, 25%, etc. will emerge from the population of voters rather than from a single voter. And I guess that the box at the bottom of all of the posts labeled vote! is for awarding a point to the discussion as a whole if it seems meritorious. I'm still learning about how the voting system works.

    Thanks again to all!
      And I guess that the box at the bottom of all of the posts labeled vote! is for awarding a point to the discussion as a whole if it seems meritorious.

      The "vote!" button is for submitting all the votes you have allocated with the radio buttons previously. Just selecting the radio buttons without subsequently pressing "vote!" does nothing at all.

      In all other aspects you are correct. See the section entitled "How do I vote?" in Voting/Experience System for more info.


      🦛

Re: Unable to constrain the effect of a negative lookahead
by Polyglot (Hermit) on Apr 29, 2022 at 22:20 UTC

    In the spirit of TIMTOWTDI, here's yet another method to handle this, one that I have often used with complex situations. The code might look something like this (untested).

    my $replace = sub {
        my $prefix = $1;
        #COLLECT YOUR PAIRS SOMEHOW; A m// OR s/// ALSO POSSIBLE
        my @pairs = split(/(.*?[=].*?)(?:[,])?/, $prefix);
        #DEAL WITH PAIRS HOWEVER YOU LIKE
        push @allpairs, @pairs;
        #RETURN WHATEVER REPLACEMENT YOU LIKE FOR $prefix
        return $prefix; #NO CHANGE AT ALL
    };

    #CAPTURE EVERYTHING UP TO, NOT INCLUDING, "batch ="
    $syspbuff =~ s/(.*?)(?=batch\s*=)/$replace->()/em;
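Since the snippet above is marked as untested, here is one possible working variation of the same idea, offered as a sketch rather than as the author's code. It assumes multi-line input, so it anchors with \A and uses /s so that "." can cross newlines, and it collects the pairs with a global match instead of split. The names $collect and @allpairs and the shortened sample data are illustrative.

```perl
use strict;
use warnings;

my $syspbuff = <<'EOD';
run_type = dev,
max_monitor_time = 0.25
verbosity_level = 2

batch =
    ( source = sample_document_collection_1 )
EOD

my @allpairs;

my $collect = sub {
    my ($prefix) = @_;    # copy first, before any inner match resets $1
    while ( $prefix =~ /([^\s=,]+)\s*=\s*([^\s,]+)/g ) {
        push @allpairs, [ $1, $2 ];
    }
    return $prefix;       # put the prefix back unchanged
};

# Capture everything from the start up to, but not including, "batch =".
$syspbuff =~ s/\A(.*?)(?=batch\s*=)/$collect->($1)/se;

print "$_->[0] => $_->[1]\n" for @allpairs;
```

If the buffer contains no "batch =" at all, the substitution simply does not match and nothing is collected, so that case would need separate handling.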

    Blessings,

    ~Polyglot~

      Hi Polyglot,

      That's a brilliant approach. I had not thought about setting the /e regex qualifier and then putting code to be executed in the regex upon the finding of a match. That certainly opens the door to a broad span of possibilities. I had gone with the approach of splitting $SYSPBUFF into two parts one before and one beginning with the first presence of "batch =". The /e regex qualifier approach has much appeal in that with the (?=batch\s*=) lookahead it already automatically does that splitting on the fly while preserving the value of $SYSPBUFF other than for the specific subpatterns being sought. I will try creating a new version of my real world project code using the approach you've put forth.

      fireblood

      P.S. Just out of curiosity, how many languages do you speak and what are they? I'm a native speaker of US English and a tiny fraction of Cherokee, and also took courses in German, Chinese, Hebrew, and Vietnamese in high school and college. Each language studying experience not only provides access to the language, but also to the music, food, dress, customs, and other aspects of others' cultures. I think that people who are polyglottal are also polycultural.

      Cheers!

        fireblood,

        Yes, every language is a culture. I'm a native speaker of English, of course, and perhaps one could say a "half-native" speaker of Spanish (fluent and have native accent due to early childhood exposure). Beyond that, I speak/read Thai, Lao, and elemental Mandarin, can type Burmese Karen, am learning Hmong, can understand about 60-80% of written French (I studied it some in college), and I've completed multiple courses in the Biblical languages, i.e. Ancient Hebrew, Greek, and Aramaic. (Hebrew is fascinating, despite its complexity. Chinese and Hebrew are both very difficult to master, but I would rank Hebrew as more difficult than Chinese. The only difficulty with Chinese is its 86,000+ ideograms; its grammar doesn't hold a candle to Hebrew for complexity.) Of course, these are only "human" languages--and I think I could add a few computer languages to the list. It seems much of my life has revolved around language. When it comes to programming, I feel rather amateur; I am limited to the realm of the concrete, as I seem to have no understanding of the abstraction (think OOP: objects, references, and the like). Naturally, as with most of the human languages, I have picked up computer programming all on my own. Maybe if I had taken courses in it I would be better off.

        In general, Asian languages are inferior, despite their surface complexity. They lack plurals, have no verb conjugation, and most lack even word-spacing between words. Asian languages seem to be just beginning to adopt punctuation, and their vocabulary is often limited to the real and concrete, with many deficiencies showing up in the philosophical and abstract terminologies. For example, there are no words for "character" or "soul" or "faith" in Thai or Lao, and even their word for "God" is just borrowed from the word for "king." Thai and Lao have no: of, lest, never, either, neither, nor, etc. ("never" can be used for a non-event in the past, but there is no way to address the future by this concept), and there is no grammatical structure for differentiating between restrictive and non-restrictive clauses. Essentially, many concepts are non-translatable into the language, so "lost in translation" takes on a whole new meaning. There is no word for "brother" or "sister" because those words imply equality, and there is no equality within the culture. One is forced by the language to specify "older brother" or "younger brother"; or to imply a plurality as in "older-younger" = brothers and sisters. Complex concepts are made by the combining of simple words. In Chinese, saying "East-West" (dtong-shi/東西) is how one expresses "things." In Thai/Lao, saying "is go not can" is how one says "impossible." With any of the unspaced alphabetic Asian languages, speed reading is impossible, and reading in general is discouraged within the language. Very few Thai or Lao people enjoy reading, and their education suffers as a result. Chinese, being a different writing system, may be a little easier to read, once learned; but one must study the characters continuously through school, and no one knows them all. Even among the Chinese, I have yet to hear of a speed reader. I wonder if it is possible.

        Chinese translators have to be good at math, too. Chinese figures are chunked by 10,000's, whereas Western figures are divided by 1,000's. It can take a few moments to perform the mental conversions between them: for example, 70 thousand becomes 7 ten-thousands --> that's an easy example; but it gets more complex when one adds more digits....say, 7.5 million (7,500,000 --> 750 ten-thousands).

        Language is certainly intriguing at times.

        Blessings,

        ~Polyglot~

Node Type: perlquestion [id://11142984]
Approved by LanX
Front-paged by haukex