http://qs321.pair.com?node_id=957247

jfrm has asked for the wisdom of the Perl Monks concerning the following question:

I need to conjure a routine that, given a set of strings, will find any swear words or other words of great dubiousness within them. Are there any existing modules that do something like this? I worshipped at CPAN for a few mins but couldn't find anything.

Failing that, the routine is fairly straightforward using a regexp. The hardest bit is just getting a hefty list of naughty words - does anyone know of a temple where I might find such a list?

Re: Profanity and expletives
by moritz (Cardinal) on Mar 01, 2012 at 16:14 UTC

      That does it. Only thing wrong there was my searching skills. Thanks.

Re: Profanity and expletives
by CountZero (Bishop) on Mar 01, 2012 at 17:23 UTC
    Take the lists of words found in "Delete Expletives" and "What not to swear" and assemble them into one big regex with Regexp::Assemble.

    Of course your regex will only be as good (or bad) as the lists you use.
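
    As a rough illustration of the "one big regex" idea, here is a minimal sketch in core Perl only. The three-word list and the find_bad_words name are placeholders invented for this example, not taken from any of the lists mentioned above. Regexp::Assemble would likely produce a more compact pattern than this plain alternation.

```perl
use strict;
use warnings;

# Placeholder entries (an assumption for this sketch); a real list
# would be loaded from whichever word lists you decide to trust.
my @words = qw(darn heck blast);

# Sort longest first so longer entries are tried before their own prefixes,
# and quotemeta each entry so punctuation in a word is matched literally.
my $alternation = join '|',
    map  { quotemeta }
    sort { length $b <=> length $a } @words;

# \b restricts matches to whole words; /i ignores case.
my $bad_re = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: returns every listed word found in the text.
sub find_bad_words {
    my ($text) = @_;
    return $text =~ /($bad_re)/g;
}

my @hits = find_bad_words('Heck, that was a darn close call.');
print "@hits\n";
```

    The result is only as good as the list behind it, so swapping in a larger list changes the coverage but not the code.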

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Profanity and expletives
by JavaFan (Canon) on Mar 01, 2012 at 16:25 UTC
    The biggest problem I have with this is that what one considers a swear word is very subjective. I find words like "gun", "God" and "religion" far more dubious and harmful to children than words like "sex" or "breast".

    Do you really want to rely on some unknown figure to come up with a list of "taboo" words?

      Good point in many contexts, but in my case it doesn't matter because I only want to use it for statistical analysis of risk - a few false positives are fine. The larger the list the better, really.

        > statistical analysis of risk

        You most certainly want to search for statistical spam filters.

        Cheers Rolf

        If you don't mind false positives, start with /usr/share/dict/words. Or, as a regexp with false positives, /\S+/g.
Re: Profanity and expletives
by tweetiepooh (Hermit) on Mar 01, 2012 at 16:38 UTC
    And without care you can end up flagging bad words that are contained within others.

    And context is important. A word like tit can be offensive but not to ornithologists. At least not all the time.
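
    To make the "contained within others" point concrete, here is a small sketch comparing a plain substring match with a word-boundary match. The variable names and sample sentences are made up for the illustration.

```perl
use strict;
use warnings;

my $word = 'tit';    # the example word used above

my $substring  = qr/\Q$word\E/i;        # matches anywhere, even inside other words
my $whole_word = qr/\b\Q$word\E\b/i;    # matches only as a standalone word

# Sample sentences invented for this sketch.
for my $text ('The title of the book', 'A blue tit landed on the feeder') {
    printf "%-35s substring=%d whole_word=%d\n",
        $text,
        ( $text =~ $substring  ? 1 : 0 ),
        ( $text =~ $whole_word ? 1 : 0 );
}
```

    The \b anchors stop "title" from being flagged, but the second sentence still matches: a regex can handle containment, not context.
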
      > bad words that are contained within others.

      Reminds me of a university friend back in the 90s who was totally confused that the emails of his English girlfriend never reached his mailbox.

      Took him weeks to find out that the University of Sussex was blocked for xxx spamming ...

      Cheers Rolf

      PS: Those filthy Brits again! ;-)

        There's also the famous Scunthorpe problem. Masak mentioned a few of these at the last London Perl Workshop.

        Regards,

        John Davies