PerlMonks  

Blacklisting with a Regular Expression

by BenjiSmith (Novice)
on Aug 18, 2005 at 21:52 UTC ([id://484969]=perlquestion)

BenjiSmith has asked for the wisdom of the Perl Monks concerning the following question:

I have a tricky regular expression problem that, as it turns out, may not actually be possible. (But if it is possible, I'm sure someone here will know the answer.)

Here's what the problem boils down to...

    my @blacklist = ('evil', 'bad', 'wrong');
    my $a = "this string contains no blacklisted tokens";
    my $b = "this string is evil and wrong";

    # The regex should express the blacklist in such a way that
    # it will match on any string which DOES NOT contain any of
    # the tokens in the blacklist, and it will fail to match on
    # any string which DOES contain tokens from the blacklist.
    my $regex = '?????';

    if (($a =~ m/$regex/) && !($b =~ m/$regex/)) {
        # SUCCESS
    }

Anyone have any ideas? I've been tinkering with negative lookahead (which is not my strong suit), but I don't think that's the right approach.

Ideally, what I'd prefer to do is to just search for the blacklist tokens and then negate the result of my match (so that CLEAN strings always return true), but unfortunately, I'm working with a large established codebase, and the regexes are loaded at runtime. I can't modify any of the surrounding code except the regex itself, and I need the expression to match on strings which don't contain any of the tokens in the set. Very tricky.

Thanks for your help.

--Benji

Replies are listed 'Best First'.
Re: Blacklisting with a Regular Expression
by Eimi Metamorphoumai (Deacon) on Aug 18, 2005 at 22:50 UTC
    /^(?!.*(?:evil|bad|wrong))/ seems to work for me. The key was the ^ at the beginning. With that, it's forced to try from the beginning. Otherwise, it keeps searching until it finds a position with none of the blacklisted words ahead of it (for instance, at the end of the string) and then succeeds.
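
A minimal sketch of why the anchor matters, assuming single-line input strings (a `.` does not match a newline unless /s is added). The quotemeta call and the names $alternation, $anchored and $unanchored are additions for illustration and do not come from the reply above:

```perl
use strict;
use warnings;

my @blacklist = ('evil', 'bad', 'wrong');

# Assumption: escape each token so regex metacharacters are taken literally.
my $alternation = join '|', map { quotemeta($_) } @blacklist;

# Anchored: the lookahead is tried only at position 0 and scans the whole line.
my $anchored   = qr/^(?!.*(?:$alternation))/;
# Unanchored: the engine may retry at later positions until the lookahead passes.
my $unanchored = qr/(?!.*(?:$alternation))/;

my $clean = "this string contains no blacklisted tokens";
my $dirty = "this string is evil and wrong";

for my $string ($clean, $dirty) {
    printf "%-45s anchored=%d unanchored=%d\n",
        $string,
        ($string =~ $anchored)   ? 1 : 0,
        ($string =~ $unanchored) ? 1 : 0;
}
```

The unanchored pattern reports a match for both strings, which is the false positive described above; only the anchored one rejects the second string.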
      /^(?!.*(?:evil|bad|wrong))/
      Bingo. This is the correct solution.

      I tried this one a little earlier (knowing why it wasn't working):
      /^.*(?!evil|bad|wrong)/
      ...but it never occurred to me to put the .* INSIDE the negative lookahead.

      <slaps forehead/>

      Thanks to the rest of you for helping, but most of you actually reversed the problem. The regex needed to return a match from the clean string, not from the string containing the blacklisted tokens.

      Thanks!!

      --Benji

        Just for the sake of completeness, here's an alternative:

        /^(?:(?!evil|bad|wrong).)*\z/

        Eimi's solution is much faster for the problem as given, but this approach gives more control - if your problem becomes any more complex in the future it's a useful trick to have available.

        Hugo
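
One possible reading of "more control", as a sketch only: because this form consumes the string one character at a time, the per-character check can be combined with other constraints. The $before_hash pattern and its sample strings below are hypothetical, made up for illustration; they are not from the thread. Single-line input is assumed.

```perl
use strict;
use warnings;

# Hugo's form: no consumed character may start a blacklisted token.
my $tempered  = qr/^(?:(?!evil|bad|wrong).)*\z/;
# Eimi's form, for comparison.
my $lookahead = qr/^(?!.*(?:evil|bad|wrong))/;

my @samples = (
    "this string contains no blacklisted tokens",
    "this string is evil and wrong",
    "",
);

for my $string (@samples) {
    my $t = ($string =~ $tempered)  ? 1 : 0;
    my $l = ($string =~ $lookahead) ? 1 : 0;
    print "'$string': tempered=$t lookahead=$l\n";
}

# Hypothetical extension: apply the blacklist only to the text before the
# first '#', so blacklisted words are tolerated in a trailing comment.
my $before_hash = qr/^(?:(?!evil|bad|wrong)[^#])*(?:#.*)?\z/;

print "comment allowed\n" if "fine text # evil note" =~ $before_hash;
print "body rejected\n"   if "evil text # fine note" !~ $before_hash;
```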

Re: Blacklisting with a Regular Expression
by AReed (Pilgrim) on Aug 18, 2005 at 22:03 UTC
    Maybe I'm misunderstanding the problem but does this do what you want?
        use strict;
        use warnings;

        while(<DATA>) {
            print unless /(evil|bad|wrong)/i;
        }

        __DATA__
        This string contains no blacklisted tokens.
        This string is evil and wrong.

    Updated: added i switch.

Re: Blacklisting with a Regular Expression
by davidrw (Prior) on Aug 18, 2005 at 22:05 UTC
    Yes, I think you should definitely see if anything matches, and then negate that. So if it matches /evil|bad|wrong/ then it is a bad string that should be blacklisted. Now, that regex can be formed dynamically:

        my $good = "this string contains no blacklisted tokens";
        my $bad = "this string is evil and wrong";
        my @blacklist = ('evil', 'bad', 'wrong');
        my $re = join '|', @blacklist;
        warn 'good string is ok' if $good !~ /\b(?:$re)\b/i;
        warn 'bad string failed' if $bad =~ /\b(?:$re)\b/i;

    Note that I took the liberty of using the /i modifier, and also added the word boundaries. Also note that ?: is just so it doesn't needlessly capture.
    Note also that look-aheads aren't needed here unless you care for some reason about the order of the blacklist hits, but it seems like you only care if they exist at all..
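
For the original constraint (only the regex can change), the same /i modifier and word boundaries can presumably be folded into the inverted form. This is a sketch under that assumption; the names $clean_words and $clean_substr and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

my @blacklist = ('evil', 'bad', 'wrong');
my $re = join '|', map { quotemeta($_) } @blacklist;

# Matches only strings with no blacklisted whole word, ignoring case.
my $clean_words  = qr/^(?!.*\b(?:$re)\b)/i;
# Same idea without \b: any substring hit counts as blacklisted.
my $clean_substr = qr/^(?!.*(?:$re))/i;

for my $string ("a badge of honour", "a BAD idea") {
    printf "%-20s whole-word=%d substring=%d\n",
        $string,
        ($string =~ $clean_words)  ? 1 : 0,
        ($string =~ $clean_substr) ? 1 : 0;
}
```

With \b, "badge" is not treated as a hit for "bad", whereas the substring version rejects it. Which behaviour is wanted depends on the data.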
Re: Blacklisting with a Regular Expression
by Nkuvu (Priest) on Aug 18, 2005 at 22:01 UTC
    Not necessarily the optimal solution, but one approach:
        my @blacklist = ('evil', 'bad', 'wrong');
        my $a = "this string contains no blacklisted tokens";
        my $b = "this string is evil and wrong";
        my $regex = join '|', @blacklist;

        # The !~ is a better way to do !(foo =~ /regex/)
        if ($a !~ /$regex/) {
            print "Success with a!\n";
        }
        else {
            print "Failure with a\n";
        }

        if ($b !~ /$regex/) {
            print "Success with b!\n";
        }
        else {
            print "Failure with b\n";
        }
    Note that the if condition you had did not match your comments. "Match on any string" generally means you're looking at one string at a time. Of course if you can't modify that code then the comment should change to match.

    Update: Sigh. And looking too much at the "match any string" bit caused me to reverse the logic of what you need. Must think on that a bit more...

    Update the second: The problem with negative lookaheads is that it's easy to match when you don't want to. To simplify your problem a bit, just look at the word evil and try to match something without that in the text.

        my $regex = '.(?!evil)';
        if ("this is so evil" =~ /($regex)/) {
            print "yay: $1\n";
        }

    Prints "yay: t" for the simple fact that it matches at the beginning of the string. So you'll get false positives, because you need to check the whole string for the bad token, and negate that. I'd love to see a solution to this regex issue, personally, but I don't know of one.

    Honestly I think the solution will not be a regex, but a change in the surrounding code. Even if it is possible to create a regex that does what you need, it's going to be more complex than making a few simple changes to the conditional. Reverse the logic of the if statement and the regex becomes trivial. Therefore future programmers won't curse your name and all will be happy in the world. Or something.

Re: Blacklisting with a Regular Expression
by greenFox (Vicar) on Aug 19, 2005 at 06:12 UTC

    I'd be inclined to do that as a sub like this:

        sub is_blacklisted {
            my $string = $_[0];
            for (@blacklist) {
                # add \b /i to re as required
                return 1 if ( $string =~ /$_/ );
            }
            return 0;
        }
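
One way the return value might be used, sketched under assumptions: the \Q...\E quoting, the lexical loop variable, and the @input list are additions for illustration, not part of the sub as posted.

```perl
use strict;
use warnings;

my @blacklist = ('evil', 'bad', 'wrong');

# Same logic as the sub above; \Q...\E is an added precaution so that
# tokens containing regex metacharacters are matched literally.
sub is_blacklisted {
    my ($string) = @_;
    for my $token (@blacklist) {
        return 1 if $string =~ /\Q$token\E/;
    }
    return 0;
}

# Hypothetical input, reusing the strings from the original question.
my @input = (
    "this string contains no blacklisted tokens",
    "this string is evil and wrong",
);

# Keep only the strings for which the sub returns false.
my @clean = grep { !is_blacklisted($_) } @input;
print "clean: $_\n" for @clean;
```

The sub returns 1 or 0, so it can sit directly in an if condition or a grep block as shown.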

    <Rant> Where I am working at the moment they have a set of key words blocked on the proxy server. I discovered it by accident doing a search for "death and taxes", apparently "death" is not an acceptable topic even though it was completely benign in the context of my search, you can't even do a dictionary search for death! "Dead" on the other hand is ok...

    --
    Murray Barton
    Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

      I had a similar problem a couple of days ago and opted for the for-loop solution. Are there any advantages either way between using this and a single regex?

      Also could you outline what you do with the return value on exiting the sub?

      Thanks
