http://qs321.pair.com?node_id=973238

bronto has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks

I am trying to solve a problem which seems to require negative lookbehind regexes, which I have never used. And, based on the results I am getting, it looks like I don't understand a **** of it :) I hope you can easily point out where my reasoning is failing.

I have a "normalized" hosts file; each line matches one of the following patterns:

    ^\s*$                                        (empty line)
    ^\s*#.*$                                     (comment line)
    ^[a-fA-F0-9:\.]+(\s+[a-zA-Z0-9\.-]+)+\s*$    (host record)

(The third regex could be narrowed down quite a bit, but let's KISS).

Now, in order to delete spurious lines, I would like to delete all those records which contain the host name (qualified or unqualified), but don't start with the right address of the machine (either the "local" version, 127.0.1.1 or, say, 10.20.11.99).

I tried several ways to match a whole line with no success, so I tried to start over simple. So far, I got only a partial check working (partial as in: covers only one, very specific subcase), namely:

(?<!10\.20\.11\.99\s)kvm-test-v06.+

and I must say I am quite surprised that a slightly different pattern doesn't match (or actually matches more than expected):

(?<!10\.20\.11\.99\s)kvm-test-v06.*

All other attempts to match a whole line, containing fully-qualified/unqualified name of the host somewhere, but preceded by an address other than 127.0.1.1 or 10.20.11.99 failed badly.

I am sure I am missing something fundamental and stupid, but I can't see that. Can you help me find it out?

A sample hosts file to test against:

    127.0.0.1 localhost
    10.20.11.99 kvm-test-v06.example.com kvm-test-v06
    # The following lines are desirable for IPv6 capable hosts
    ::1 ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    # Spurious records
    1.2.3.4 litter trash kvm-test-v06.example.com garbage
    1.2.3.5 kvm-test-v06 more garbage
    1.2.3.6 litter kvm-test-v06
    1.2.4.7 kvm-test-v06.example.com kvm-test-v06

Thanks in any case

Ciao!
--bronto


In theory, there is no difference between theory and practice. In practice, there is.

Re: Misunderstanding negative look behind
by BrowserUk (Patriarch) on May 30, 2012 at 10:25 UTC

    Untested speculation, but based purely upon reading your post, I'd guess that the problem is that this:

    (?<!10\.20\.11\.99\s)kvm-test-v06

    is anticipating exactly 1 whitespace char between the IP address and the name and perhaps some of your lines have more than one?

    Does this: (?<!10\.20\.11\.99)\s+kvm-test-v06 more closely meet your needs?
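
    A quick way to test this guess is to run both patterns against the good record written once with a single space and once with two. This is only a sketch: the two test lines and the %result hash are made up for illustration and are not taken from the real hosts file.

    ```perl
    use strict;
    use warnings;

    # The two patterns discussed in this reply: the original and the variant.
    my $original  = qr/(?<!10\.20\.11\.99\s)kvm-test-v06/;
    my $suggested = qr/(?<!10\.20\.11\.99)\s+kvm-test-v06/;

    # Hypothetical test lines: the "good" record with one and with two spaces.
    my %line = (
        one_space  => "10.20.11.99 kvm-test-v06",
        two_spaces => "10.20.11.99  kvm-test-v06",
    );

    # Record whether each pattern matches each line (1 = match, 0 = no match).
    my %result;
    for my $name (sort keys %line) {
        $result{original}{$name}  = ($line{$name} =~ $original)  ? 1 : 0;
        $result{suggested}{$name} = ($line{$name} =~ $suggested) ? 1 : 0;
        printf "%-10s original=%d suggested=%d\n",
            $name, $result{original}{$name}, $result{suggested}{$name};
    }
    ```

    In this sketch both patterns reject the single-space line and both still match the two-space line. With the second pattern the engine is free to start the match at the second space, where the 11 characters before it are no longer exactly the address.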


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.


Re: Misunderstanding negative look behind
by moritz (Cardinal) on May 30, 2012 at 12:26 UTC

    You never show where exactly in a regex you insert your look-behind, so I can only guess what you're doing wrong.

    If you put it at the start of a regex, you need a negative look-ahead, because the IP address is not yet matched. Example:

        use strict;
        use warnings;
        use 5.010;

        my $localhost = qr/(?!10\.20\.11\.99\s)/;

        while (<DATA>) {
            next if /^\s*$/;
            next if /^\s*#.*$/;
            if (/^$localhost([a-fA-F0-9:\.]+)(\s+[a-zA-Z0-9\.-]+)+\s*$/) {
                say "IP: $1; HOST: $2";
            }
            else {
                chomp;
                say "No match for '$_'";
            }
        }
        __DATA__
        127.0.0.1 localhost
        10.20.11.99 kvm-test-v06.example.com kvm-test-v06
        # The following lines are desirable for IPv6 capable hosts
        ::1 ip6-localhost ip6-loopback
        fe00::0 ip6-localnet
        ff00::0 ip6-mcastprefix
        ff02::1 ip6-allnodes
        ff02::2 ip6-allrouters
        # Spurious records
        1.2.3.4 litter trash kvm-test-v06.example.com garbage
        1.2.3.5 kvm-test-v06 more garbage
        1.2.3.6 litter kvm-test-v06
        1.2.4.7 kvm-test-v06.example.com kvm-test-v06

    Produces the output:

        IP: 127.0.0.1; HOST: localhost
        No match for '10.20.11.99 kvm-test-v06.example.com kvm-test-v06'
        IP: ::1; HOST: ip6-loopback
        IP: fe00::0; HOST: ip6-localnet
        IP: ff00::0; HOST: ip6-mcastprefix
        IP: ff02::1; HOST: ip6-allnodes
        IP: ff02::2; HOST: ip6-allrouters
        IP: 1.2.3.4; HOST: garbage
        IP: 1.2.3.5; HOST: garbage
        IP: 1.2.3.6; HOST: kvm-test-v06
        IP: 1.2.4.7; HOST: kvm-test-v06

    Note that this uses the negative look-ahead directly after the ^ anchor, so that the look-ahead never backtracks.

    I would like to delete all those records which contain the host name (qualified or unqualified), but don't start with the right address of the machine

    It is much easier to do such a check after you have parsed the line, and extracted host name and IP address.
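
    A minimal sketch of that parse-first approach, assuming the addresses and host names from the question. The names is_spurious, %good_ip and %host_name are invented here for illustration:

    ```perl
    use strict;
    use warnings;

    # Assumed values taken from the question; adjust for the real machine.
    my %good_ip   = map { $_ => 1 } qw(127.0.1.1 10.20.11.99);
    my %host_name = map { $_ => 1 } qw(kvm-test-v06 kvm-test-v06.example.com);

    # Return 1 if a hosts line names our machine under a wrong address, else 0.
    sub is_spurious {
        my ($line) = @_;
        return 0 if $line =~ /^\s*(?:#.*)?$/;    # blank or comment line
        my ($ip, @names) = split ' ', $line;     # address first, then names
        return 0 if $good_ip{$ip};               # right address: keep it
        return scalar(grep { $host_name{$_} } @names) ? 1 : 0;
    }

    my @sample = (
        "127.0.0.1 localhost",
        "10.20.11.99 kvm-test-v06.example.com kvm-test-v06",
        "# Spurious records",
        "1.2.3.4 litter trash kvm-test-v06.example.com garbage",
        "1.2.3.5 kvm-test-v06 more garbage",
        "1.2.3.6 litter kvm-test-v06",
    );

    # Keep everything that is not a spurious record.
    my @kept = grep { !is_spurious($_) } @sample;
    print "$_\n" for @kept;
    ```

    Splitting on whitespace first means the check itself is two hash lookups, with no lookaround needed.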

Re: Misunderstanding negative look behind
by cosimo (Hermit) on May 30, 2012 at 11:00 UTC

    Maybe what you want is positive look-behind instead of negative?

        $ perl -lane 'print if /(?<=10\.20\.11\.99\s)kvm-test-v06.*/' your-hosts-file
        10.20.11.99 kvm-test-v06.example.com kvm-test-v06

Re: Misunderstanding negative look behind
by bronto (Priest) on May 31, 2012 at 11:31 UTC

    First of all, thanks to all those who commented. Unfortunately, nobody really captured what my problem is, and the answers don't match the question. Usually, when this happens it's the asker's fault, so please excuse me if I didn't state my question clearly.

    My question is: I need one regular expression which matches lines in a hosts file where the name of a certain machine appears (qualified or not) and is not associated with that machine's address or with 127.0.1.1.

    BrowserUk's solution doesn't work, because it doesn't cover the case where there is something else between the hostname and the address.

    cosimo's example is catching a good record, not a bad one, so it's out too.

    moritz's procedure doesn't use a single regex, so that doesn't apply either.

    I was trying to solve that with negative lookbehind (match this name when it's not preceded somewhere in the line by...), but that didn't work for some reason. And, as it turned out, it was overly complicated anyway.
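
    One likely reason, sketched here under the assumption that the patterns were tried unanchored against each line: a lookbehind only inspects the characters immediately before the position where the rest of the pattern starts matching, so a second occurrence of the name later in the line slips past it. The variable names are illustrative.

    ```perl
    use strict;
    use warnings;

    # The "good" record from the sample file: it names the host twice.
    my $good = "10.20.11.99 kvm-test-v06.example.com kvm-test-v06";

    # The two patterns from the original question.
    my $with_plus = qr/(?<!10\.20\.11\.99\s)kvm-test-v06.+/;
    my $with_star = qr/(?<!10\.20\.11\.99\s)kvm-test-v06.*/;

    # Capture what each pattern matched; undef means there was no match.
    my ($plus_match) = $good =~ /($with_plus)/;
    my ($star_match) = $good =~ /($with_star)/;

    print "with .+ : ", (defined($plus_match) ? "'$plus_match'" : "no match"), "\n";
    print "with .* : ", (defined($star_match) ? "'$star_match'" : "no match"), "\n";
    ```

    Here the .+ version finds no match in the good record only because nothing follows the second kvm-test-v06, while the .* version matches that second occurrence. That would explain the surprise in the original question.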

    The right solution was found by oha. It uses negative lookahead instead of lookbehind, and it's much simpler:

    ^(?!(127\.0\.1\.1|10\.20\.11\.99)\s)([a-fA-F0-9:\.]+)(\s+[a-zA-Z0-9\.\-]+)*(\s+kvm-test-v06(?:\.example\.com)?)(\s+[a-zA-Z0-9\.\-]+)*\s*

    (Some extra grouping is my fault, not oha's).

    I didn't realize that I could say "the beginning of the line is not followed by..." and use lookahead instead of lookbehind.
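
    For anyone who wants to try it, here is a sketch of a variant of that regex applied to lines from the sample file. It is written with /x for readability; the optional domain part and the trailing $ anchor are assumptions added here so that both forms of the name match and only whole names count. The names $spurious, @hosts and @clean are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $spurious = qr{
        ^(?!(?:127\.0\.1\.1|10\.20\.11\.99)\s)   # line does not start with a good address
        [a-fA-F0-9:.]+                           # some other address
        (?:\s+[a-zA-Z0-9.-]+)*                   # any names before ours
        \s+kvm-test-v06(?:\.example\.com)?       # our host, qualified or not
        (?:\s+[a-zA-Z0-9.-]+)*                   # any names after ours
        \s*$
    }x;

    my @hosts = (
        "127.0.0.1 localhost",
        "10.20.11.99 kvm-test-v06.example.com kvm-test-v06",
        "# Spurious records",
        "1.2.3.4 litter trash kvm-test-v06.example.com garbage",
        "1.2.3.5 kvm-test-v06 more garbage",
        "1.2.3.6 litter kvm-test-v06",
        "1.2.4.7 kvm-test-v06.example.com kvm-test-v06",
    );

    # Drop every line the regex flags as spurious and print the rest.
    my @clean = grep { $_ !~ $spurious } @hosts;
    print "$_\n" for @clean;
    ```

    Because the lookahead sits right after ^, the address check happens once per line, no matter where the host name appears afterwards.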

    Thanks all for trying! And double thanks to oha!

    Ciao!
    --bronto


    In theory, there is no difference between theory and practice. In practice, there is.