http://qs321.pair.com?node_id=390085


in reply to On zero-width negative lookahead assertions

First, don't forget to escape @ and ..

>perl -lne "/^root:\s*(?!admin\@somewhere\.here)(.*)/ and print $1" \
    aliases.txt
someone@somewhere.else
 admin@somewhere.here

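Note the leading space in the second line of output. When the match failed with \s* consuming all the spaces (the lookahead then sits right in front of the address), the regexp engine backtracked to \s* matching all but one space. From there the lookahead sees ' admin@somewhere.here', which is not 'admin@somewhere.here', so it succeeds. One way to fix it is to anchor it as follows: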

>perl -ne "/^root:\s*(?!admin\@somewhere\.here)\S/ and print;" \
    aliases.txt
root: someone@somewhere.else
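If you don't have an aliases.txt handy, here is a small sketch that runs both patterns side by side. The two sample lines are my assumption, reconstructed from the output shown above; the real file isn't shown in this thread.

```perl
use strict;
use warnings;

# Sample lines assumed from the output above (the real aliases.txt isn't shown).
my @lines = (
    'root: someone@somewhere.else',
    'root: admin@somewhere.here',
);

my $addr = qr/admin\@somewhere\.here/;

# First attempt: \s* gives back a space, so the unwanted address is
# still captured, with a leading space.
my @captured;
for my $line (@lines) {
    push @captured, $1 if $line =~ /^root:\s*(?!$addr)(.*)/;
}

# With \S right after the lookahead, the lookahead has to hold at the
# first non-space character.
my @kept = grep { $_ =~ /^root:\s*(?!$addr)\S/ } @lines;

print "captured: [$_]\n" for @captured;
print "kept:     [$_]\n" for @kept;
```

The brackets in the output make the leading space in the second captured value easy to spot.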

Re^2: On zero-width negative lookahead assertions
by bronto (Priest) on Sep 10, 2004 at 14:37 UTC

    That works, and I thank you for explaining why. Unfortunately, I can't understand why, in the first case, the parentheses match the leading space, and why putting the \S makes it match, even if there is no non-space character at the end...

    I would be glad if you (or anyone else) could explain that further. I think I'll then discover what I didn't understand about zwnla assertions.

    Thanks a lot!

    Ciao!
    --bronto


    In theory, there is no difference between theory and practice. In practice, there is.

      Note: Perl regexp matching is not necessarily implemented as described below. I'm totally ignorant as to how it is actually implemented. One could say this document describes the specs rather than the implementation.

      It has nothing to do with lookaheads, really. For example, let's look at
      /^ab*bc/

      The regexp can be read as:
      1. Starting at the beginning of the string
      2. Match an 'a'.
      3. Match as many 'b's as possible, but not matching any is ok.
      4. Match a 'b'.
      5. Match a 'c'.

      Match against 'abbbbbbc'
                     01234567
      1) ok! pos = 0. (zw)
      2) ok! Found an 'a' at pos 0. pos = 1.
      3) ok! Found 6 'b's at pos 1 through 6. pos = 7.
      4) fail! Did not find a 'b' at pos 7. Backtrack!
      3) ok! Found 5 'b's at pos 1 through 5. pos = 6.
      4) ok! Found a 'b' at pos 6. pos = 7.
      5) ok! Found a 'c' at pos 7. pos = 8.
      Match!
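You can watch the outcome of that backtracking from Perl itself. This is only a sketch: the capture group around b* is my addition so the amount it ended up keeping is visible, and it shows the result, not the engine's internal steps.

```perl
use strict;
use warnings;

my $str = 'abbbbbbc';

# Same pattern as above, with b* wrapped in a capture group so we can
# see how much it kept after backtracking.
my ($bs) = $str =~ /^a(b*)bc/;
my $end  = $+[0];    # offset just past the end of the whole match

printf "b* kept %d of the 6 b's; the match ended at pos %d\n",
    length($bs), $end;
```

This prints "b* kept 5 of the 6 b's; the match ended at pos 8", which agrees with the second pass through step 3 in the trace.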

      Something similar is occurring with your
      /^root:\s*(?!email)/

      The regexp can be read as:
      1. Starting at the beginning of the string
      2. Match 'root:'.
      3. Match as many '\s's as possible, but not matching any is ok.
      4. Match something other than 'email'.

      Match against 'root: email'
                     01234567890
      1) ok! pos = 0. (zw)
      2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
      3) ok! Found 1 '\s' at pos 5. pos = 6.
      4) fail! Found 'email' at pos 6 through 10. Backtrack!
      3) ok! Found 0 '\s' at pos 5. pos = 5. (zw)
      4) ok! Found something other than 'email' at pos 5. pos = 5. (zwla) (found ' email')
      Match!
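A short sketch to confirm where that match stops. 'email' is the same stand-in for the address used in the trace above; the variable names are mine.

```perl
use strict;
use warnings;

my $str = 'root: email';

my $matched = $str =~ /^root:\s*(?!email)/ ? 1 : 0;

# $& is the text the pattern consumed; $+[0] is the offset where it stopped.
my ($text, $end) = ($&, $+[0]);

print "matched [$text], stopped at pos $end\n" if $matched;
```

This prints "matched [root:], stopped at pos 5": the pattern succeeded with \s* holding zero spaces, so the lookahead was tested in front of the space rather than in front of 'email'.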

      Now let's look at my solution
      /^root:\s*(?!email)\S/

      The regexp can be read as:
      1. Starting at the beginning of the string
      2. Match 'root:'.
      3. Match as many '\s's as possible, but not matching any is ok.
      4. Match something other than 'email'.
      5. Match a '\S'.

      Match against 'root: email'
                     01234567890
      1) ok! pos = 0. (zw)
      2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
      3) ok! Found 1 '\s' at pos 5. pos = 6.
      4) fail! Found 'email' at pos 6 through 10. Backtrack!
      3) ok! Found 0 '\s' at pos 5. pos = 5. (zw)
      4) ok! Found something other than 'email' at pos 5. pos = 5. (zwla) (found ' email')
      5) fail! Did not find a '\S' at pos 5. Backtrack!
      Nothing more to try.
      No match!

      Match against 'root: hisemail'
                     01234567890123
      1) ok! pos = 0. (zw)
      2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
      3) ok! Found 1 '\s' at pos 5. pos = 6.
      4) ok! Found something other than 'email' at pos 6. pos = 6. (zwla) (found 'hisemail')
      5) ok! Found a '\S' at pos 6. pos = 7.
      Match!
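The same two strings can be checked in a few lines. Again a sketch with 'email' standing in for the real address; the %end hash is just my way of recording where each match stopped.

```perl
use strict;
use warnings;

my $re = qr/^root:\s*(?!email)\S/;

my %end;    # string => offset where the match ended, or undef if no match
for my $str ('root: email', 'root: hisemail') {
    $end{$str} = $str =~ $re ? $+[0] : undef;
    printf "%-18s %s\n", "'$str'",
        defined $end{$str} ? "match, ends at pos $end{$str}" : 'no match';
}
```

'root: email' is rejected, while 'root: hisemail' matches and ends at pos 7, just after the 'h' that \S consumed.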

      Backtracking means: (might not be an exhaustive list)

      In the case of the first rule,
          look for a match further on.
      In the case of a * rule or ? rule,
          try matching less.
      In the case of a *? rule or ?? rule,
          try matching more.
      In the case of a | or [] rule,
          try matching the next choice.
      Else,
          no match, so backtrack the last matching rule.
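Each of those rules can be seen in a tiny example. This is a sketch with strings and variable names I made up for illustration; it shows the observable results, not how perl gets there internally.

```perl
use strict;
use warnings;

# First rule: nothing matches at pos 0 or 1, so the engine looks further on.
'xxab' =~ /ab/ or die "no match";
my $start = $-[0];

# * rule (greedy): take as much as possible, give back only when forced.
my ($greedy, $g_rest) = 'abbbc' =~ /^a(b*)(b*)c/;

# *? rule (non-greedy): take as little as possible...
my ($lazy, $l_rest) = 'abbbc' =~ /^a(b*?)(b*)c/;

# ...but take more when the rest of the pattern cannot match otherwise.
my ($forced) = 'abbbc' =~ /^a(b*?)c/;

# | rule: 'a' is tried first, 'c' then fails against 'b', so 'ab' is tried.
my ($alt) = 'abc' =~ /^(a|ab)c/;

print "first match starts at pos $start\n";
print "greedy: [$greedy][$g_rest]\n";
print "lazy:   [$lazy][$l_rest]\n";
print "forced: [$forced]\n";
print "alt:    [$alt]\n";
```

The greedy group ends up with 'bbb' and leaves nothing for the second group, the lazy group does the opposite, and the forced case shows a lazy group being pushed to 'bbb' anyway because 'c' could not match any earlier.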
      The regexp engine will match if it can find any way to. So what you're asking for is "root, followed by some number (possibly zero) of whitespace characters, followed by something that is not 'admin@somewhere.here'". So it matches with root, followed by zero spaces, followed by ' admin@somewhere.here' (with a leading space). Since the string ' admin@somewhere.here' isn't 'admin@somewhere.here' (without the space), the lookahead works. That's why you need the \s* inside the lookahead, making it "try to find spaces followed by admin@somewhere.here, and if you can, fail" instead of "look for spaces, but make sure it's not followed by admin@somewhere.here". Subtle, but important.
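A sketch of the "\s* inside the lookahead" version. The sample lines are assumptions of mine, chosen to vary the amount of whitespace.

```perl
use strict;
use warnings;

# \s* now lives inside the lookahead, so there is nothing in front of it
# that could give back a space.
my $re = qr/^root:(?!\s*admin\@somewhere\.here)/;

# Hypothetical sample lines with zero, one and several spaces.
my @lines = (
    'root: someone@somewhere.else',
    'root:admin@somewhere.here',
    'root: admin@somewhere.here',
    'root:    admin@somewhere.here',
);

my @kept = grep { $_ =~ $re } @lines;
print "$_\n" for @kept;
```

Only the someone@somewhere.else line survives, however many spaces precede the admin address.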
        Not exactly. It is not

        "followed by something that is not 'admin@somewhere.here'"

        but rather

        "not followed by 'admin@somewhere.here'"

        That makes a difference, because the latter also matches if nothing follows at all.
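To make the distinction concrete, here is a sketch that tries a bare 'root:' line against both readings. 'email' stands in for the address, and the second pattern (with a trailing .) is my own way of spelling "followed by something that is not 'email'".

```perl
use strict;
use warnings;

my $bare = 'root:';

# "Not followed by 'email'": succeeds even though nothing follows at all.
my $not_followed = $bare =~ /^root:\s*(?!email)/ ? 1 : 0;

# "Followed by something that is not 'email'": the . demands a character.
my $needs_char = $bare =~ /^root:\s*(?!email)./ ? 1 : 0;

print "not followed by 'email':       $not_followed\n";
print "followed by something else:    $needs_char\n";
```

The first pattern matches the bare line and the second does not.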

        Uhmmmmm... so the old adage that "* is greedy" has an exception when zwnla assertions come into play; I expected that the \s* would have eaten all the whitespace before the e-mail address. Ok. Now I have yet to understand why that \S thing works...

        Oh, by the way, I am doing:

        perl -i.bak -pe 'BEGIN { $status = 0 } /^root:(?!\s*admin\@somewhere\.here\s*$)/ and $status = 1 ; END { exit $status }' aliases

        and it seems to work great!

        Ciao!
        --bronto


      There is a non-space character after the \s*. The (?!) part is a zero-width assertion. Zero-width means just that: it doesn't consume any of the string being matched. Instead of using the \S, one could also have used:
      /root:(?>\s*)(?!...)/
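A sketch of that atomic-group variant, with 'email' filled in as a stand-in for the elided part of the lookahead (the real address would go there). The test strings are my own.

```perl
use strict;
use warnings;

# (?>\s*) grabs all the whitespace and never gives any of it back, so the
# lookahead is only ever tried at the first non-space character.
my $re = qr/root:(?>\s*)(?!email)/;

my %result;
for my $str ('root: email', 'root:    email', 'root: hisemail', 'root:') {
    $result{$str} = $str =~ $re ? 'match' : 'no match';
    printf "%-18s %s\n", "'$str'", $result{$str};
}
```

Both 'email' lines are rejected regardless of the number of spaces. 'root: hisemail' matches, and so does the bare 'root:', since the lookahead only says what must not follow.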