Re: Regex solution needed
by Sidhekin (Priest) on Feb 23, 2007 at 16:51 UTC
|
Since perl doesn't have a "look behind"
Actually, Perl has a lookbehind (see perlre). What Perl doesn't have is variable length lookbehind, so if you do want lookbehinds of different lengths, they need to be multiple lookbehinds. Furthermore, the \s* makes it require one more lookbehind. This ought to do it though:
/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/
... though I suspect you really want to ensure word boundaries ... and add some /x whitespace for readability:
/ (?<! \b a )
(?<! \b the )
(?<! \b game )
(?<! \s )
\s* cock \b
/x
(But that's just a guess, and in any case beyond your question, so you may want to pretend I did not say that.)
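As a quick check of the word-boundary variant above, here is a minimal sketch that runs it over a few phrases. The phrases and the names $re, @phrases and %flagged are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# The /x regex from this post, stored in a variable for reuse.
my $re = qr/ (?<! \b a )
             (?<! \b the )
             (?<! \b game )
             (?<! \s )
             \s* cock \b
           /x;

# Hypothetical sample phrases, chosen to exercise each lookbehind.
my @phrases = ('cock', 'a cock', 'the cock', 'game cock', 'banana cock', 'cocktail');

# Map each phrase to 1 (flagged) or 0 (allowed).
my %flagged = map { $_ => ($_ =~ $re ? 1 : 0) } @phrases;

printf "%-12s %s\n", $_, $flagged{$_} ? 'flagged' : 'allowed' for @phrases;
```

With the \b in place, "banana cock" is flagged even though "banana" ends in "a", and "cocktail" is left alone because of the trailing \b. Without the boundaries, both would go the other way.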
print "Just another Perl ${\(trickster and hacker)},"
The Sidhekin proves Sidhe did it!
|
#!/usr/bin/perl
use strict;
use warnings;
my @input = ( "cocks are roosters!",
"my cocks crow at dawn",
"i'm a fan of the cocks",
"game cocks all the way!",
"do you like the gamecocks?" );
my $vulgar_list = {
cocks => { regex =>
qr/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/ }
};
my $foundvulgar;
for my $input( @input ){
study($input);
foreach my $word (keys %$vulgar_list) {
my $regex = $vulgar_list->{$word}->{regex};
if ($input =~ m/$regex/) {
$foundvulgar = $word;
last;
}
}
print "phrase: $input found?: $foundvulgar\n";
$foundvulgar = '';
}
__OUTPUT__
phrase: cocks are roosters! found?: cocks
phrase: my cocks crow at dawn found?: cocks
phrase: i'm a fan of the cocks found?:
phrase: game cocks all the way! found?:
phrase: do you like the gamecocks? found?:
--chargrill
s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; =
qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
|
Thanks for all the input, guys. I enhanced the initial suggestion to include a few negative lookaheads as well, which I think covers enough bases for me to keep it in the dictionary. I omitted the dirty test cases since these are "sacred" boards, but they did get caught correctly :)
my @tests = (
"How bout them cocks?",
"I'm a big cocks fan",
"I love the cocks",
"That cocks game was sweet",
"Anyone know the cocks score from last night?",
"gamecocks rule",
"I love the gamecocks, but...",
"My favorite cocks player is..."
);
foreach my $s (@tests)
{
if ($s =~ /
(?<! \b a )
(?<! \b the )
(?<! \b them )
(?<! \b game )
(?<! \s )
\s* cocks? \b
(?! \s fan )
(?! \s game )
(?! \s score )
(?! \s player )
/x)
{
print "Vulgar: '$s'\n";
}
}
-Darin
|
Awesome suggestion, and yes, I do need boundaries. Thank you. Now, I just have to evaluate whether the ratio of legitimate to false positives warrants keeping it in the dictionary. I figured with a tuned solution like the one you suggest, I can get the accuracy good enough to keep it. Considering that team name is one of the only legitimate uses for that word on our boards, I really wanted to find a solution in order to cover the thousands of negative uses I've encountered for it.
|
Couldn't that big regexp be simplified to /\b cock s?/x?
(Never mind, yours doesn't censor "game cocks". I was thinking "gamecocks".)
|
Correct. Users chatting about the team will either incorrectly separate it into two words, or use a shortened version such as 'the cocks game'.
Re: Regex solution needed
by chakram88 (Pilgrim) on Feb 23, 2007 at 16:49 UTC
|
|
I did test the CPAN profanity modules so as not to re-invent the wheel, but the results for the data I was testing against threw a lot of false positives. I figured it may be trying to do too much so I went with my own design.
Funny about your posting, and obviously that's a problem with these sorts of things. My module will only return true/false, so it's up to the calling program to decide what to do. I've tested with a LOT of real thread postings and have slimmed the dictionary down so that it's lenient. Testing for 'beat' as a vulgar term is ridiculous, IMHO.
|
"Well, you beat me this week." it became "Well, you &*%$#%^ this week".
Once possible profanities have been identified, censoring is just one option. Another would be to list the possible profanities to the user, warn him that profanities are not allowed on the board and allow him to proceed without further editing. This would allow the moderators to delete the post without further warning. On a board that supports moderation, the post could even be withheld until approved by the moderators.
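To list the possible profanities back to the poster, the checker has to return the matched words rather than a bare true/false. A minimal sketch of that idea follows; find_suspect_words, @dictionary and the sample dictionary entries are hypothetical, not part of any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical dictionary; a real board would load its own list.
my @dictionary = ('beat', 'darn');

# Return every dictionary word that occurs in $text as a whole word,
# in dictionary order, so the caller can show the list to the poster.
sub find_suspect_words {
    my ($text, @words) = @_;
    return grep { $text =~ /\b\Q$_\E\b/i } @words;
}

my @hits = find_suspect_words('Well, you beat me this week.', @dictionary);
print "Please review these words before posting: @hits\n" if @hits;
```

The caller can then decide whether to warn, hold the post for moderation, or let it through.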
|
Once possible profanities have been identified, censoring is just one option.
This is getting political / philosophical, but I tend to say that censoring is never an option!
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Regex solution needed
by moklevat (Priest) on Feb 23, 2007 at 16:45 UTC
|
Where did you get the idea that Perl regexes do not have a look-behind? perlre describes both positive and negative look-behinds. /(?<!game)cocks/ matches an occurrence of cocks that does not follow game, but I'm sure someone with stronger regex-fu will provide a more complete solution. Bear in mind that your solution does not capture all possible non-vulgar uses of cock(s). How about fighting cocks or cock-a-doodle-doo? A popular cheer at South Carolina football games (I am told) is "Hey, how 'bout them cocks!".
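To make the limits of a single look-behind concrete, here is a small sketch that applies /(?<!game)cocks/ to three phrases. The phrases and the names $re, @phrases and %verdict are assumptions for illustration only.

```perl
use strict;
use warnings;

# The single negative look-behind from this post.
my $re = qr/(?<!game)cocks/;

# Hypothetical sample phrases.
my @phrases = ('gamecocks rule', 'game cocks rule', 'fighting cocks');

# Record a verdict per phrase.
my %verdict = map { $_ => ($_ =~ $re ? 'flagged' : 'allowed') } @phrases;

print "$_: $verdict{$_}\n" for @phrases;
```

Only "gamecocks" is allowed. The look-behind inspects exactly the four characters before "cocks", so the space in "game cocks" is enough to defeat it, and "fighting cocks" is flagged as well.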
Re: Regex solution needed
by crashtest (Curate) on Feb 23, 2007 at 20:13 UTC
|
Certainly an interesting problem. I participate on a message board for my favorite hockey team, which has a too-stringent filter that gets old quite quickly.
Specifically, the team has drafted a Russian goaltending prospect with the unfortunate name "Semen Varlamov", which is transformed to "Shipsmen Varlamov". I actually sometimes have trouble remembering what his real first name is.
And, apropos of your "gamecocks" problem, whenever a discussion touches on a coach named "Hitchcock", his name is changed to "Hitchjeffrey".
I'm hoping the message board you're working on is the one I post to, but I doubt I'd have that kind of luck.
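I can only guess at how that board's filter works, but the "Hitchjeffrey" result is what a plain substring substitution would produce. A minimal sketch, assuming a replacement word of "jeffrey" and a made-up sample post, compares it with a whole-word substitution:

```perl
use strict;
use warnings;

my $post = 'Coach Hitchcock has a new system.';   # hypothetical sample post

# Substring substitution: rewrites the word wherever it occurs, even inside a name.
(my $substring = $post) =~ s/cock/jeffrey/gi;

# Whole-word substitution: \b on both sides leaves "Hitchcock" alone.
(my $whole_word = $post) =~ s/\bcock\b/jeffrey/gi;

print "$substring\n";    # Coach Hitchjeffrey has a new system.
print "$whole_word\n";   # Coach Hitchcock has a new system.
```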
Re: Regex solution needed
by hangon (Deacon) on Feb 24, 2007 at 10:32 UTC
|
Darin, something about your problem kept nagging me. Have you considered also keeping an *exceptions* dictionary? This would allow flexibility in adjusting for future problem words without modifying your code. It would also let you handle phrases, and it needs only two trivial regexes. Something similar to the following:
use strict;
use warnings;

# exceptions dictionary
our @Exceptions = (
    'game cocks',
    'gamecocks', #etc
);
# censor dictionary
our @Vulgarities = (
    'cocks',
    'cock', #etc
);
my $post = 'game cocks all the way!';   # hypothetical sample post
my $censored = censor($post);
sub censor {
    # get a copy of the post
    my $copy = shift;
    # remove all exceptions (\Q...\E treats each entry as literal text)
    for my $except (@Exceptions) {
        $copy =~ s/\Q$except\E//gi;
    }
    # now check for banned words or phrases
    for my $vulgar (@Vulgarities) {
        if ($copy =~ /\Q$vulgar\E/i) {
            return 1;
        }
    }
    return;
}
Re: Regex solution needed
by hangon (Deacon) on Feb 23, 2007 at 20:46 UTC
|
You mentioned that your module will only return true/false and the calling program will decide what to do. However, it may help to look at the bigger picture - like what the calling program actually is going to do.
This is one of the problems in dealing with natural language: so much depends on context. While certain words are only used in a vulgar context, there are also a lot of gray words besides "cock". You can even use normal words in a vulgar context. For example, "up yours" can offend, so now you have to match phrases, not just words.
If the positives are sent to a moderator, you could be strict without too much of a problem. If they're deleted, you may want to be more lenient. Also, knowing what the calling program does, you may be able to provide better solutions, such as obfuscating gray-area words (i.e. cocks = c****) or providing a true/false/maybe response.
I don't mean to throw you off track, but sometimes it's too easy to get caught up in the details and miss other possible solutions to the overall problem.
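The true/false/maybe response and the c**** obfuscation could be sketched roughly as follows. Everything here is an assumption for illustration: the function names classify and obfuscate, the two word lists, and the placeholder entry in @always_vulgar.

```perl
use strict;
use warnings;

# Hypothetical word lists for illustration only.
my @always_vulgar = ('vulgarword');        # placeholder entry
my @gray          = ('cocks', 'cock');     # context-dependent words

# Return 'vulgar', 'maybe' or 'clean' instead of a bare true/false.
sub classify {
    my ($text) = @_;
    for my $word (@always_vulgar) {
        return 'vulgar' if $text =~ /\b\Q$word\E\b/i;
    }
    for my $word (@gray) {
        return 'maybe' if $text =~ /\b\Q$word\E\b/i;
    }
    return 'clean';
}

# Keep the first letter and star out the rest: cocks -> c****
sub obfuscate {
    my ($word) = @_;
    return substr($word, 0, 1) . '*' x (length($word) - 1);
}

print classify('I love the cocks'), "\n";   # maybe
print obfuscate('cocks'), "\n";             # c****
```

A caller could send 'maybe' results to a moderator and reserve deletion for 'vulgar' ones.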
Re: Regex solution needed
by educated_foo (Vicar) on Feb 24, 2007 at 14:29 UTC
|
If you want to keep out porn spam, what you really want is probably a machine learning solution, e.g. Naive Bayes, of which many free implementations exist.
If you want to let people say "ah, fuck me ;)" but not "I'm going to rip off your head and doody down your throat," you probably need moderation, meta-moderation, etc.
If you just want to keep Teh Naughty off the site, you can probably do what another poster mentioned and have a set of exceptions that you manually add to when you notice particular false positives. In this case, you could probably start with a CPAN module. I don't personally think this last goal makes sense, but it's not my site...
Re: Regex solution needed
by CountZero (Bishop) on Feb 24, 2007 at 22:15 UTC
|
It pains me to see that our beloved Perl is abused for censorship (and then even censorship of the most silly and hypocritical kind). Fortunately, regex solutions for this kind of "problem" never work and after a lot of time and effort spent, the whole idea is scrapped (as it should have been from the very beginning).
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
|
Dear CountZero,
Your post 601925 has been flagged by PerlMonks's UnProfaneMe v2.1 system as containing profanity. Please review the change(s) suggested by UnProfaneMe v2.1 and make appropriate corrections to your text. We thank you for doing your part to keep Teh Internets safe for children who are bright enough to read, but not smart enough to understand context.
Original: "...after a lot of time and effort spent, the whole idea is scrapped (as it should have been from the very beginning)."
Better: "...after a lot of time and effort spent, the whole idea is sdungped (as it should have been from the very beginning)."
Sincerely,
The UnProfaneMe Team (v2.1)
|
Hear, hear!
Like the time AOL refused to let people use the location "Scunthorpe" because it failed to get past their filters.
AOL and Scunthorpe
Re: Regex solution needed
by hangon (Deacon) on Feb 26, 2007 at 19:08 UTC
|
I don't like the idea of censorship either. However, there is a big difference between censoring the free flow of opinions, ideas and information, and censoring someone's poor behavior.
Censorship can get downright ridiculous, but IMHO the owners of a board have the right to limit the content to whatever suits their sensibilities. If you don't like their policies, you don't have to use their board.
Re: Regex solution needed
by NatureFocus (Scribe) on Feb 25, 2007 at 16:39 UTC
|
If someone wants to get past this type of filter, it is easy. How about these variations: cokc c0ck cocck c_o_c_k fukc fcuk fucck siht sh!t shti or "h"-"e"-double hockey sticks. (I find the last one particularly offensive)
Do you need to block these? They are really not obscene, or are they?
Block the 2-3 worst offending words if you have to, but you might want to leave the rest alone.
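For what it's worth, here is a tiny sketch showing how easily those variations slip through. The whole-word filter $filter is an assumption standing in for whatever pattern the board uses; the variants are the ones quoted in the post above.

```perl
use strict;
use warnings;

# Hypothetical whole-word filter.
my $filter = qr/\bcock\b/i;

# Variations quoted in the post above.
my @variants = ('cokc', 'c0ck', 'cocck', 'c_o_c_k');

# Collect the variants the filter actually catches.
my @caught = grep { $_ =~ $filter } @variants;

printf "caught %d of %d variants\n", scalar @caught, scalar @variants;
```

None of the four is caught, and every pattern added to chase them invites a new misspelling.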
Censorship is a slippery slope. In the US, a TV station can be fined $550,000.00 by the FCC for airing a show/movie where a soldier says "War is hell!". Now that's obscene!
-Eugene
Re: Regex solution needed
by Moron (Curate) on Feb 27, 2007 at 16:35 UTC
|
If the idea is to permit non-profane uses of words that have profane uses, you can't do it without a far greater level of artificial intelligence than can be achieved with regexps alone (you would instead need to use very simple regexps to feed a correspondingly heavyweight parser plus some kind of meaning analyser). For example, "Cock your rifle" could be a valid sentence related to the biathlon. And then there's the fact that you say "a lot of ..." rather than "exclusively", suggesting there might be other allowable topics of discussion that use words which are ambiguous when considered singly and which you want to let through.
Therefore it seems to me that the stated goal is infeasible and can best be exchanged for a feasible one like checking for words which only have a profane meaning or taking on the (far) greater burden of analysing meaning rather than syntax.