prohibiting certain strings

keiusui has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: prohibiting certain strings by atcroft (Abbot) on Dec 29, 2005 at 00:48 UTC
As there is often more than one way to do it, here is what I mentioned in the CB (for your reference): `12/28 at 18:19:49 <atcroft> keiusui: exit if ($input =~ m/b\Wa\Wd\W* +w\Wo\Wr\W*d/i);, maybe? (note: untested)` You may also want to look at Regexp::Common::profanity as another possibility. HTH.
Re^2: prohibiting certain strings by keiusui (Monk) on Dec 29, 2005 at 04:15 UTC
Thank you so much for all the insight and help. I will be using atcroft's solution: `m/b\Wa\Wd\W* +w\Wo\Wr\W*d/i`
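Since atcroft marked the pattern as untested, here is a minimal sketch that runs it against a few inputs. The helper name `looks_bad` and the sample strings are assumptions made for illustration; "badword" is the thread's placeholder, not a real word list.

```perl
use strict;
use warnings;

# atcroft's pattern from the post above, compiled once.
my $BAD_RE = qr/b\Wa\Wd\W* +w\Wo\Wr\W*d/i;

# Hypothetical helper: returns 1 if the pattern matches, 0 otherwise.
sub looks_bad {
    my ($input) = @_;
    return $input =~ $BAD_RE ? 1 : 0;
}

# Sample inputs chosen for illustration only.
for my $sample ('b/a/d w/o/rd', 'b.a.d   w-o-r.d', 'b/a/d/w/o/r/d', 'badword') {
    printf "%-16s => %s\n", $sample, looks_bad($sample) ? 'blocked' : 'allowed';
}
```

As written, the pattern needs exactly one non-word character between most letters and at least one space between "bad" and "word", so the first two samples are blocked while 'b/a/d/w/o/r/d' and plain 'badword' get through. It may be worth pairing it with a check for the undisguised word.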
Re: prohibiting certain strings by phaylon (Curate) on Dec 28, 2005 at 23:29 UTC
If you want to develop a bad-word filter for a commenting system or the like, I'm afraid my experience has been that no automatic filter can stop people from using unwanted language. I have found features that make moderation and approval easier to be much more successful. Ordinary morality is for ordinary people. -- Aleister Crowley
Re: prohibiting certain strings by dimar (Curate) on Dec 29, 2005 at 00:01 UTC
Not to mention the fact that if you get too clever with your 'badword' filter, you will inevitably wind up filtering legitimate words. This is often more annoying than not having a filter at all. Can you guess why all of the following lines might not pass a badword test? `A: who reads the news? B: if they publish it, flip reads it A: flip is so well-read B: yup, pen is too, but he's a bit cocky A: probably cause he drives a FiretrUCK, You!` "Bad words" are everywhere, if you look hard enough. If you look too hard, you will wind up irritating the polite people and giving the naughty potty mouths one more way to mock you and your 'clever' filter. =oQDlNWYsBHI5JXZ2VGIulGIlJXYgQkUPxEIlhGdgY2bgMXZ5VGIlhGV
Re^2: prohibiting certain strings by Anomynous Monk (Scribe) on Dec 29, 2005 at 04:48 UTC
Everywhere!
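dimar's point can be reproduced with a small sketch. It assumes a filter that drops everything except letters and then looks for the thread's placeholder "badword" as a substring; the helper name `naive_filter` and the sample sentences are made up for illustration.

```perl
use strict;
use warnings;

# Assumed filter: keep only letters, then search for the placeholder word.
sub naive_filter {
    my ($input) = @_;
    (my $letters = lc $input) =~ s/[^a-z]//g;
    return index($letters, 'badword') >= 0 ? 1 : 0;
}

# Perfectly polite sentences (invented for this example).
my @innocent = (
    "That's not bad. Wordsmiths will like it.",
    "Too bad words fail me.",
);

for my $line (@innocent) {
    print naive_filter($line) ? "FLAGGED: " : "ok:      ", $line, "\n";
}
```

Both sentences are flagged, because removing spaces and punctuation joins "bad" to the following "word...". The more aggressively a filter normalizes its input, the more of these accidental matches it is likely to produce.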
Re: prohibiting certain strings by diotalevi (Canon) on Dec 28, 2005 at 23:18 UTC
Insert something like (?s:.) between each character. `/b(?s:.)a(?s:.)d(?s:.)w(?s:.)o(?s:.)r(?s:.*)d/i` ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
Re^2: prohibiting certain strings by ikegami (Patriarch) on Dec 29, 2005 at 02:12 UTC
Not quite. The sentence "She hid in the closet" would be flagged as containing a badword.
Re^3: prohibiting certain strings by diotalevi (Canon) on Dec 29, 2005 at 03:36 UTC
I considered that but figured that I'd rather be restrictive than permissive. ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
Re: prohibiting certain strings by bart (Canon) on Dec 28, 2005 at 23:31 UTC
Drop the nonword characters, and try again. `my $input = "There's the b/a/d/w/o/r/d."; (my $test = $input) =~ s/\W+//g; if ($test =~ /badword/) { print "You sneaky devil!\n"; }`
Re^2: prohibiting certain strings by ptum (Priest) on Dec 28, 2005 at 23:39 UTC
Of course, that still leaves 'badw0rd' and 'b_a_d_w_o_r_d', since \W matches neither digits nor underscores, but it is a start. I would be more restrictive: `(my $test = $input) =~ s/[^A-Z]//ig;` This will still permit 'baaadwooord', unfortunately. I'm inclined to agree with phaylon, in that there is no substitute for moderation. :) Update: forgot the ^ character. <blush>
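The three escapes mentioned above ('badw0rd', 'b_a_d_w_o_r_d', 'baaadwooord') can be narrowed with a slightly longer normalization step. This is only a sketch: the digit-to-letter table is an assumption, the helper names `normalize` and `contains_bad` are hypothetical, and "badword" remains the thread's placeholder.

```perl
use strict;
use warnings;

# Reduce input to a canonical form before matching.
sub normalize {
    my ($input) = @_;
    my $s = lc $input;
    $s =~ tr/01345/oieas/;   # assumed look-alike digits: 0->o 1->i 3->e 4->a 5->s
    $s =~ s/[^a-z]//g;       # ptum's stricter strip: keep letters only
    $s =~ tr/a-z//s;         # squeeze runs of a repeated letter down to one
    return $s;
}

# Normalize the word list the same way so both sides are comparable.
my @BAD = map { normalize($_) } ('badword');

sub contains_bad {
    my ($input) = @_;
    my $n = normalize($input);
    return (grep { index($n, $_) >= 0 } @BAD) ? 1 : 0;
}

for my $sample ('badw0rd', 'b_a_d_w_o_r_d', 'baaadwooord', 'a fine post') {
    printf "%-14s => %s\n", $sample, contains_bad($sample) ? 'blocked' : 'allowed';
}
```

Squeezing repeated letters also turns "good" into "god" and "book" into "bok", so this trades one kind of miss for more false positives, which is the same trade-off the rest of the thread warns about.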
Re: prohibiting certain strings by TedPride (Priest) on Dec 29, 2005 at 03:18 UTC
The problem is that any script restrictive enough to catch everything will also flag legitimate posts (as mentioned above). Probably the best thing to do is to score users on the number of bad words they use and ban them automatically once the count exceeds both a certain total and a certain average per post. This avoids the problem of people seeing that their post has been blocked and then making creative variations (since they won't know you're counting the bad words until the count slays them). It should also be pretty easy to look up people who rank high on the bad-word count but haven't yet been officially banned by you (as opposed to auto-banned) and check for genuine bad words in their posts. Needless to say, an exact match for a bad word would be scored high, whereas a match found only after characters have been removed (or added, since symbols are often used in place of letters) would be scored lower. One might argue that a swear word creative enough to defeat all algorithms is not as damaging as a regular swear word anyhow.
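A rough sketch of the scoring idea might look like the following. Every weight, threshold, and function name here is an assumption chosen for illustration, and the word list is assumed to hold lowercase letters only; TedPride's post does not specify any of them.

```perl
use strict;
use warnings;

# Made-up weights and thresholds; tune them against real data.
my $EXACT_SCORE = 3;      # undisguised, whole-word hit
my $FUZZY_SCORE = 1;      # hit found only after stripping non-letters
my $MAX_TOTAL   = 10;     # ban needs total score above this...
my $MAX_AVERAGE = 1.5;    # ...and average score per post above this
my @BAD_WORDS   = ('badword');

my %stats;                # user => { total => score sum, posts => post count }

sub score_post {
    my ($text) = @_;
    my $score = 0;
    for my $word (@BAD_WORDS) {
        # Count exact whole-word occurrences.
        my $exact = () = $text =~ /\b\Q$word\E\b/gi;
        $score += $exact * $EXACT_SCORE;
        next if $exact;
        # Otherwise look for a disguised form and score it lower.
        (my $letters = lc $text) =~ s/[^a-z]//g;
        $score += $FUZZY_SCORE if index($letters, $word) >= 0;
    }
    return $score;
}

# Silently accumulate scores; the poster never sees a rejection.
sub record_post {
    my ($user, $text) = @_;
    my $s = $stats{$user} ||= { total => 0, posts => 0 };
    $s->{total} += score_post($text);
    $s->{posts}++;
    return $s;
}

sub should_ban {
    my ($user) = @_;
    my $s = $stats{$user} or return 0;
    return 0 unless $s->{posts};
    return ($s->{total} > $MAX_TOTAL
         && $s->{total} / $s->{posts} > $MAX_AVERAGE) ? 1 : 0;
}
```

Calling `record_post($user, $text)` for every submission and checking `should_ban($user)` afterwards keeps the counting invisible to posters. Sorting `%stats` by total would give the "high on the ranking but not yet banned" list for manual review.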

