http://qs321.pair.com?node_id=601766

spivey3587 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I'm writing a vulgarity checker for my company that uses a dictionary of regex's to check for bad words. It works great except for one problem I'm having.

This is used for a message board that has a lot of sports related discussions. Some of you may be familiar with the South Carolina Gamecocks, and that's creating a problem in my engine for statements like "I'm a Game Cock fan". The engine basically does this:

    study($input);
    foreach my $word (keys %$vulgar_list) {
        $regex = $vulgar_list->{$word}->{regex};
        if ($input =~ m/$regex/) {
            $foundvulgar = $word;
            last;
        }
    }

Pardon the profanity here :) What I need is a check for /cocks?/ as long as it isn't preceded by (a|the|game). Since perl doesn't have a "look behind", and there's no way to do !(a|the|game) within the regex, I'm stumped! I experimented with things like /(?!a|the|game)\s*cock/ but negative lookahead doesn't work that way.

I'm trying to figure out a way to do this in one or two regex's without having a special case in the code for this word (keeping in mind that I have to use a positive regex test with =~).

Any suggestions (even if it has to be multiple regex's) ??

Thanks guys. -Darin

Re: Regex solution needed
by Sidhekin (Priest) on Feb 23, 2007 at 16:51 UTC

    Since perl doesn't have a "look behind"

    Actually, Perl has a lookbehind (see perlre). What Perl doesn't have is variable length lookbehind, so if you do want lookbehinds of different lengths, they need to be multiple lookbehinds. Furthermore, the \s* makes it require one more lookbehind. This ought to do it though:

    /(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/

    ... though I suspect you really want to ensure word boundaries ... and add some /x whitespace for readability:

    / (?<! \b a ) (?<! \b the ) (?<! \b game ) (?<! \s ) \s* cock \b /x

    (But that's just a guess, and in any case beyond your question, so you may want to pretend I did not say that.)
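As the list of allowed preceding words grows, the chain of fixed-length lookbehinds can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named build_guarded_regex (not from the post above) and keeping the match case-sensitive like the regex it imitates:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): emit one fixed-length
# negative lookbehind per allowed preceding word, then the whitespace
# guard and the word itself with an optional plural "s".
sub build_guarded_regex {
    my ($word, @allowed_before) = @_;
    my $guards = join '', map { '(?<!\b' . quotemeta($_) . ')' } @allowed_before;
    return qr/${guards}(?<!\s)\s*\Q$word\Es?\b/;
}

my $re = build_guarded_regex('cock', qw(a the them game));

for my $phrase ("I love the cocks", "gamecocks rule", "my cocks crow at dawn") {
    my $verdict = $phrase =~ $re ? 'flagged' : 'ok';
    print "$verdict: $phrase\n";
}
```

Each allowed word still gets its own lookbehind, so the fixed-length restriction is respected; only the bookkeeping moves into code.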

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

      And you beat me to whipping up a solution, though plugging your regex into my test proves that your regex works for the test cases I was able to think up:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my @input = (
          "cocks are roosters!",
          "my cocks crow at dawn",
          "i'm a fan of the cocks",
          "game cocks all the way!",
          "do you like the gamecocks?"
      );

      my $vulgar_list = {
          cocks => { regex => qr/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/ }
      };

      my $foundvulgar;
      for my $input (@input) {
          study($input);
          foreach my $word (keys %$vulgar_list) {
              my $regex = $vulgar_list->{$word}->{regex};
              if ($input =~ m/$regex/) {
                  $foundvulgar = $word;
                  last;
              }
          }
          print "phrase: $input found?: $foundvulgar\n";
          $foundvulgar = '';
      }

      __OUTPUT__
      phrase: cocks are roosters! found?: cocks
      phrase: my cocks crow at dawn found?: cocks
      phrase: i'm a fan of the cocks found?:
      phrase: game cocks all the way! found?:
      phrase: do you like the gamecocks? found?:


      --chargrill
      s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
        Thanks for all the input, guys. I enhanced the initial suggestion to include a few more lookahead possibilities, which I think cover enough bases for me to keep it in the dictionary. I omitted the dirty test cases since these are "sacred" boards, but they did get caught correctly :)

        my @tests = (
            "How bout them cocks?",
            "I'm a big cocks fan",
            "I love the cocks",
            "That cocks game was sweet",
            "Anyone know the cocks score from last night?",
            "gamecocks rule",
            "I love the gamecocks, but...",
            "My favorite cocks player is..."
        );

        foreach my $s (@tests) {
            if ($s =~ /
                    (?<! \b a )
                    (?<! \b the )
                    (?<! \b them )
                    (?<! \b game )
                    (?<! \s )
                    \s* cocks? \b
                    (?! \s fan )
                    (?! \s game )
                    (?! \s score )
                    (?! \s player )
                /x) {
                print "Vulgar: '$s'\n";
            }
        }
        -Darin
      Awesome suggestion, and yes, I do need boundaries. Thank you. Now, I just have to evaluate whether the ratio of legitimate to false positives warrants keeping it in the dictionary. I figured with a tuned solution like the one you suggest, I can get the accuracy good enough to keep it. Considering that team name is one of the only legitimate uses for that word on our boards, I really wanted to find a solution in order to cover the thousands of negative uses I've encountered for it.

      Couldn't that big regexp be simplified to /\b cock s?/x?

      ( Nevermind, yours doesn't censor "game cocks". I was thinking "gamecocks". )

        Correct. Users chatting about the team will either incorrectly separate it into two words, or use a shortened version such as 'the cocks game'.
Re: Regex solution needed
by chakram88 (Pilgrim) on Feb 23, 2007 at 16:49 UTC
    I don't have a regex solution for you just a couple of comments.
    1. Is there a reason you're not using the CPAN Profanity Modules ?
    2. On an ESPN board (obviously sports related conversations) I had the following phrase edited: "Well, you beat me this week." it became "Well, you &*%$#%^ this week".

    I think the edited version made it sound like I said something much worse than I did!

      I did test the CPAN profanity modules so as not to re-invent the wheel, but the results for the data I was testing against threw a lot of false positives. I figured it may be trying to do too much so I went with my own design.

      Funny about your posting, and obviously that's a problem with these sorts of things. My module will only return true/false, so it's up to the calling program to decide what to do. I've tested with a LOT of real thread postings and have slimmed the dictionary down so that it's lenient. Testing for 'beat' as a vulgar term is ridiculous, IMHO.

      "Well, you beat me this week." it became "Well, you &*%$#%^ this week".

      Once possible profanities have been identified, censoring is just one option. Another would be to list the possible profanities to the user, warn him that profanities are not allowed on the board and allow him to proceed without further editing. This would allow the moderators to delete the post without further warning. On a board that supports moderation, the post could even be withheld until approved by the moderators.

        Once possible profanities have been identified, censoring is just one option.
        This is getting political / philosophical, but I tend to say that censoring is never an option!

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Regex solution needed
by moklevat (Priest) on Feb 23, 2007 at 16:45 UTC
    Where did you get the idea that perl regexes do not have a look-behind? perlre describes both positive and negative look-behinds.

    /(?<!game)cocks/

    matches an occurrence of cocks that does not follow game, but I'm sure someone with stronger regex-fu will provide a more complete solution.

    Bear in mind that your solution does not capture all possible non-vulgar use of cock(s). How about fighting cocks or cock-a-doodle-doo? A popular cheer at South Carolina football games (I am told) is "Hey, how 'bout them cocks!".

      Look-behind is supported but not for variable length patterns, e.g. (?<!game|the)\s+cocks will not work.
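Where the fixed-length restriction gets in the way, one alternative that still works as a single positive =~ test is to consume the allowed phrases first and then discard them with backtracking control verbs. A rough sketch, assuming Perl 5.10 or later (where (*SKIP) and (*FAIL) are available); the sub name is_flagged and the word list are illustrative only:

```perl
use strict;
use warnings;

# Assumes Perl 5.10+ for the (*SKIP) and (*FAIL) verbs.
# Branch 1: match an allowed phrase, then force failure and forbid any
#           new attempt from starting inside the text just consumed.
# Branch 2: any standalone "cock"/"cocks" still reachable is flagged.
my $vulgar_re = qr/
      \b (?: a | the | them | game ) \s* cocks? \b (*SKIP) (*FAIL)
    |
      \b cocks? \b
/x;

# Illustrative wrapper returning a plain true/false value.
sub is_flagged {
    my ($text) = @_;
    return $text =~ $vulgar_re ? 1 : 0;
}

for my $phrase ("I love the cocks", "game cocks all the way!",
                "gamecocks rule", "my cocks crow at dawn") {
    my $verdict = is_flagged($phrase) ? 'flagged' : 'ok';
    print "$verdict: $phrase\n";
}
```

Because the allowed words sit in an ordinary alternation rather than a lookbehind, they can have any length, including variable whitespace between the words.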

      The best solution IMO is to just allow the word 'cocks'.

      Doh! I was a little misguided--my Mastering Regular Expressions book says on page 229 'Lookbehind, were it supported, would somehow look back to toward the left'. Turns out the book was published in 1999, so shame on me for not checking further.

      You're correct in that it doesn't cover all possibilities, and that's something I still have to consider. I just got sidetracked with the problem at hand and wanted to address it for the learning experience. Thanks for pointing me in the right direction with lookbehind.

Re: Regex solution needed
by crashtest (Curate) on Feb 23, 2007 at 20:13 UTC

    Certainly an interesting problem. I participate on a message board for my favorite hockey team, which has a too-stringent filter that gets old quite quickly.

    Specifically, the team has drafted a Russian goaltending prospect with the unfortunate name "Semen Varlamov", which is transformed to "Shipsmen Varlamov". I actually sometimes have trouble remembering what his real first name is.

    And, apropos to your "gamecocks" problem, whenever a discussion touches on a coach named "Hitchcock", his name is changed to "Hitchjeffrey".
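A substitution like that is what an unanchored pattern produces: it fires on the letters wherever they occur. A small sketch of the difference word boundaries make (the variable names and sample strings are illustrative):

```perl
use strict;
use warnings;

my $loose  = qr/cock/i;         # matches anywhere, even inside a surname
my $strict = qr/\bcocks?\b/i;   # matches only the standalone word

for my $text ("Coach Hitchcock was fired", "what a cock") {
    printf "%-28s loose=%d strict=%d\n", $text,
        ($text =~ $loose  ? 1 : 0),
        ($text =~ $strict ? 1 : 0);
}
```

The anchored version also leaves the one-word "gamecocks" alone, though not the two-word "game cocks".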

    I'm hoping the message board you're working on is the one I post to, but I doubt I'd have that kind of luck.

Re: Regex solution needed
by hangon (Deacon) on Feb 24, 2007 at 10:32 UTC

    Darin, something about your problem kept nagging me. Have you considered also keeping an *exceptions* dictionary? This would allow flexibility in adjusting for future problem words without modifying your code. It would also let you handle phrases, and uses two trivial regex's. Something similar to the following:

    # exceptions dictionary
    our @Exceptions = (
        'game cocks',
        'gamecocks',
        #etc
    );

    # censor dictionary
    our @Vulgarities = (
        'cocks',
        'cock',
        #etc
    );

    $censored = censor($post);

    sub censor {
        # get a copy of the post
        my $copy = shift;

        # remove all exceptions
        for my $except (@Exceptions) {
            $copy =~ s/$except//gi;
        }

        # now check for banned words or phrases
        for my $vulgar (@Vulgarities) {
            if ($copy =~ /$vulgar/i) {
                return 1;
            }
        }
        return;
    }
Re: Regex solution needed
by hangon (Deacon) on Feb 23, 2007 at 20:46 UTC

    You mentioned that your module will only return true/false and the calling program will decide what to do. However, it may help to look at the bigger picture - like what the calling program actually is going to do.

    This is one of the problems in dealing with natural language - so much depends on context. While certain words are only used in a vulgar context, there are also a lot of gray words besides "cock". You can even use normal words in a vulgar context, for example "up yours" can offend, so now you have to match phrases, not just words.

    If the positives are sent to a moderator, you could be strict without too much problem. If they're deleted, you may want to be more lenient. Also, knowing what the calling program does, you may be able to provide better solutions, such as obfuscating gray area words, e.g. cocks = c****, or providing a true/false/maybe response.
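A true/false/maybe response and the obfuscation idea could look something like the following. This is only a sketch: the two word lists are placeholders, and the names classify and obfuscate are made up for illustration.

```perl
use strict;
use warnings;

# Placeholder word lists, purely for illustration.
my @always_vulgar = qw(badword);
my @gray_area     = qw(cock cocks);

# Return 'vulgar', 'maybe', or 'clean' so the caller can decide
# whether to reject, queue for moderation, or accept the post.
sub classify {
    my ($text) = @_;
    for my $w (@always_vulgar) {
        return 'vulgar' if $text =~ /\b\Q$w\E\b/i;
    }
    for my $w (@gray_area) {
        return 'maybe' if $text =~ /\b\Q$w\E\b/i;
    }
    return 'clean';
}

# Obfuscate gray-area words: keep the first letter, star out the rest.
sub obfuscate {
    my ($text) = @_;
    for my $w (@gray_area) {
        $text =~ s/\b(\Q$w\E)\b/substr($1, 0, 1) . ('*' x (length($1) - 1))/gie;
    }
    return $text;
}

print classify("I love the cocks"), "\n";
print obfuscate("I love the cocks"), "\n";
```

A caller that sends 'maybe' results to a moderator can afford a much broader gray list than one that deletes posts outright.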

    I don't mean to throw you off track, but sometimes it's too easy to get caught up in the details and miss other possible solutions to the overall problem.

Re: Regex solution needed
by educated_foo (Vicar) on Feb 24, 2007 at 14:29 UTC
    If you want to keep out porn spam, what you really want is probably a machine learning solution, e.g. Naive Bayes, of which many free implementations exist.
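To give a feel for what Naive Bayes involves, here is a toy multinomial classifier with add-one smoothing in plain Perl. It is a sketch only: no particular CPAN module is assumed, the training sentences are invented, and a real filter would need far more data and better tokenizing.

```perl
use strict;
use warnings;

# Toy multinomial Naive Bayes with add-one (Laplace) smoothing.
# All training data below is invented for illustration.
my (%word_count, %total_words, %doc_count, %vocab);

# Lowercase the text and split it into word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

sub train {
    my ($label, $text) = @_;
    $doc_count{$label}++;
    for my $w (tokens($text)) {
        $word_count{$label}{$w}++;
        $total_words{$label}++;
        $vocab{$w} = 1;
    }
}

# Pick the label with the highest log prior plus summed log likelihoods.
sub classify {
    my ($text) = @_;
    my $docs = 0;
    $docs += $_ for values %doc_count;
    my $vocab_size = keys %vocab;
    my ($best, $best_score);
    for my $label (sort keys %doc_count) {
        my $score = log($doc_count{$label} / $docs);
        for my $w (tokens($text)) {
            my $c = $word_count{$label}{$w} || 0;
            $score += log(($c + 1) / ($total_words{$label} + $vocab_size));
        }
        ($best, $best_score) = ($label, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

train(spam => 'hot pics click here now');
train(spam => 'click here for free pics');
train(ham  => 'the gamecocks game was great');
train(ham  => 'great game last night');

print classify('free pics here'), "\n";
print classify('what a great game'), "\n";
```

Unlike a word list, the classifier weighs every word in the post, so a team name surrounded by sports vocabulary tends to score as ordinary discussion.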

    If you want to let people say "ah, fuck me ;)" but not "I'm going to rip off your head and doody down your throat," you probably need moderation, meta-moderation, etc.

    If you just want to keep Teh Naughty off the site, you can probably do what another poster mentioned and have a set of exceptions that you manually add to when you notice particular false positives. In this case, you could probably start with a CPAN module. I don't personally think this last goal makes sense, but it's not my site...

Re: Regex solution needed
by CountZero (Bishop) on Feb 24, 2007 at 22:15 UTC
    It pains me to see that our beloved Perl is abused for censorship (and then even censorship of the most silly and hypocritical kind). Fortunately, regex solutions for this kind of "problem" never work and after a lot of time and effort spent, the whole idea is scrapped (as it should have been from the very beginning).

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Dear CountZero,

      Your post 601925 has been flagged by PerlMonks's UnProfaneMe v2.1 system as containing profanity. Please review the change(s) suggested by UnProfaneMe v2.1 and make appropriate corrections to your text. We thank you for doing your part to keep Teh Internets safe for children who are bright enough to read, but not smart enough to understand context.

      Original: "...after a lot of time and effort spent, the whole idea is scrapped (as it should have been from the very beginning)."

      Better: "...after a lot of time and effort spent, the whole idea is sdungped (as it should have been from the very beginning)."

      Sincerely,

      The UnProfaneMe Team (v2.1)

      Hear, hear!

      Like the time AOL refused to let people use the location "Scunthorpe" because it failed to get past their filters.
      AOL and Scunthorpe
Re: Regex solution needed
by hangon (Deacon) on Feb 26, 2007 at 19:08 UTC

    I don't like the idea of censorship either. However, there is a big difference between censoring the free flow of opinions, ideas and information, and censoring someone's poor behavior.

    Censorship can get downright ridiculous, but IMHO the owners of a board have the right to limit the content to whatever suits their sensibilities. If you don't like their policies, you don't have to use their board.

Re: Regex solution needed
by NatureFocus (Scribe) on Feb 25, 2007 at 16:39 UTC

    If someone wants to get past this type of filter, it is easy. How about these variations: cokc c0ck cocck c_o_c_k fukc fcuk fucck siht sh!t shti or "h"-"e"-double hockey sticks. (I find the last one particularly offensive)
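Some of these disguises can be undone by normalizing a copy of the text before matching, though the transposed spellings show how quickly that runs out. A crude sketch, assuming a hypothetical normalize routine whose rules are illustrative rather than complete:

```perl
use strict;
use warnings;

# Hypothetical normalizer: undo a few common disguises on a copy of the
# text. Dictionary entries would have to be normalized the same way,
# because collapsing repeats also turns "good" into "god".
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/013457!/oieasti/;               # digit and symbol stand-ins
    $text =~ s/(?<=[a-z])[_.*](?=[a-z])//g;     # c_o_c_k -> cock
    $text =~ s/(.)\1+/$1/g;                     # cocck   -> cock
    return $text;
}

for my $variant (qw(c0ck cocck c_o_c_k cokc)) {
    printf "%-8s -> %s\n", $variant, normalize($variant);
}
```

Transposed letters such as "cokc" pass through unchanged, so a determined poster still gets past this kind of cleanup.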

    Do you need to block these? They are really not obscene, or are they?

    Block the 2-3 worst offending words if you have to, but you might want to leave the rest alone.

    Censorship is a slippery slope. In the US, a TV station can be fined by the FCC $550,000.00 for airing a show/movie where a soldier says "War is hell!". Now that's obscene!

    -Eugene
Re: Regex solution needed
by Moron (Curate) on Feb 27, 2007 at 16:35 UTC
    If the idea is to permit non-profane uses of words that have profane uses, you can't do it without a far greater level of artificial intelligence than can be achieved with regexps alone (you would instead need to use very simple regexps to feed a correspondingly heavyweight parser plus some kind of meaning analyser). For example, "Cock your rifle" could be a valid sentence related to the biathlon.

    And then there's the fact that you say "a lot of ..." rather than "exclusively" suggesting there might be other allowable topics of discussion that use ambiguous words when considered singly which you want to let through.

    Therefore it seems to me that the stated goal is infeasible and can best be exchanged for a feasible one like checking for words which only have a profane meaning or taking on the (far) greater burden of analysing meaning rather than syntax.

    -M

    Free your mind