Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Bad Language In Contact Messages

by Milti (Sexton)
on Jul 07, 2022 at 17:33 UTC ( [id://11145339]=perlquestion: print w/replies, xml ) Need Help??

Milti has asked for the wisdom of the Perl Monks concerning the following question:

My website has a contact form that uses PERL to send messages to me via an email service. I want to prevent sending messages that contain words such as "porn", "sex", etc. The present program looks like this:
$Name=param('Name'); $Organization=param('Organization'); $Email=param('Email'); $Phone=param('Phone'); $Message=param('Message'); $Number=param('Spam Number'); $Spam=param('Spam'); $Robot=param('Robot'); $Result=param('Result'); if ($Name eq ""||$Organization eq ""||$Email eq ""||$Phone eq ""||$Messag +e eq "") { print "The FORM is INCOMPLETE. All Fields Marked With An * Are Require +d"; exit; } if ($Number ne "$Spam") { print "You Entered An Incorrect Anti Spam Number. Please try again."; exit; } if ($Result ne "5") { print "You Entered An Incorrect Answer. Please try again."; exit; } if ($Robot ne "1") { print "The Form Is Not Complete. Please try again."; exit }
The part that sends the message was omitted.

I want establish an array that will contain prohibited words and then search the $Message. If $Message contains any of those words I want to print a response such as "Bad Words Not Permitted. Please try again." I have seen some examples of how to do all this but have not been able to get anything to work. Either the program responds to the first word in the array or nothing happens. Here's an example that I found and tried:

@forbidden =("porn","sex"); foreach $forbidden (@forbidden) { if ($Message =~/$forbidden/) { print "Bad Language Not Permitted";] exit;}
This works for the first word but not the second, and if neither are present the rest of the program does not work.

Can someone suggest a solution or tell me what I'm doing wrong? Any and all help will be appreciated.Tks!

Replies are listed 'Best First'.
Re: Bad Language In Contact Messages
by haukex (Archbishop) on Jul 07, 2022 at 17:49 UTC
    Can someone suggest a solution or tell me what I'm doing wrong?

    The exit needs to be inside the if.

    For a more efficient way to do what you're doing without foreach, see my tutorial Building Regex Alternations Dynamically.

Re: Bad Language In Contact Messages
by LanX (Saint) on Jul 07, 2022 at 18:30 UTC
    There are various modules regarding profanity on CPAN

    Like https://metacpan.org/pod/Regexp::Profanity::US

    So you don't need to implement your own solution.

    More general you might need a generic spam word solution, e.g. "Nigerian prince" is not really a profanity as such... ;)

    On another note:

    I'm not sure if returning a detailed error message is a good approach against spammers, they might use it to train against your filter.

    So at least consider a JS based feedback, to make it harder to be spotted by bots.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Bad Language In Contact Messages
by roho (Bishop) on Jul 08, 2022 at 05:32 UTC
    That is a noble goal you have, for which I applaud you, but as written your sample code does not compile because the print statement ends with a square bracket instead of a curly brace. Also remember to always use strict and warnings.

    "It's not how hard you work, it's how much you get done."

Re: Bad Language In Contact Messages
by harangzsolt33 (Chaplain) on Jul 08, 2022 at 10:44 UTC
    Hmm... I hate when I type something, and the website discards my message, because it thinks that I used some kind of bad language. I can think of a few situations. Also, you should maybe remove spaces from the text, because I could still play the program and get past with some bad words by inserting spaces between the letters. However, if your program uses tr to get rid of everything except letters, then the words will flow together, and you will falsely reject some letters which seem like they contain bad words but they don't. For example 'helloworld' starts with the word 'hell' so you should check if that bad word is maybe part of another word. Also, "Holy" in itself is not a bad word. But if you use it together with "Holy sh*t" then it is. Okay, so you only look for the S word. Great. But if they write "weaver's hitch" and remove everything except letters, then you got the bad word. So, it's really easy to get flagged, and it's hard to catch only the bad stuff. Like I said, I hate when some websites prevent me from posting. Facebook does this a lot. They have some sort of filter. I press "Send" and my message simply disappears. There are combination of words they dislike probably, but I haven't been able to figure it out yet.

    #!/usr/bin/perl use strict; use warnings; my @BADWORDS = ('hell'); $a = ' H e l l o w o r l d ! '; $a =~ tr|a-zA-Z||cd; $a = lc($a); print $a;
Re: Bad Language In Contact Messages
by JayBee (Scribe) on Jul 09, 2022 at 02:02 UTC
    My preference is to set a variables to true or assign point system, then judge afterwards.
    my $fail=0; my @ErrMsg; my @forbidden=('porn','sex'); if (map {$Message =~ /$_/i} @forbidden){ $fail=1; # or $fail++; } if ($fail){ # or ($fail > 2) push @ErrMsg,"Bad Language Not Permitted"; }

      This problem space is pretty difficult to get right even in naïve cases.

      ack -i sex /usr/share/dict/words | nl 1 ambosexous # ... 45 misexplain # ... 199 Wessexman

      Quite a few false positives there and it will pass through these and their variations or the even more common emoji replacements for the topic I won’t include.

      s3x & pr0n
      s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
      รєא คภ๔ ק๏гภ
      ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
      𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
      uɹod puɐ xǝs
      🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
      🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
      𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻

      Update: I ran those through Text::Unidecode to see if it would untangle them because it’s surprisingly good at that, usually. Not here. Best conversion on any of them was s[?]x [?]n[?] [?][?]rn. The rest were nowhere close.

        Okay, this is silly and I probably could have found a better way to spend the last couple hours but I was curious and surprised by Text::Unidecode’s lack of efficacy here so… (has to be <pre/> instead of <code/> tags because of content.)

        Posting it because I do these things so rarely I’ll lose the memory of how to do it if I don’t. :P

        #!/usr/bin/env perl
        use 5.14.0; # Not sure this is earliest/best.
        use utf8;
        use strictures;
        use open ":std", ":encoding(utf8)";
        use Unicode::UCD "charinfo";
        use charnames ":full";
        use YAML; # For introspection if you're curious.
        
        my @sex_and_porn = <DATA>;
        
        for my $sap ( @sex_and_porn )
        {
            chomp $sap;
            my @string;
            for my $chr ( split "", $sap )
            {
                next unless $chr =~ /\A[[:print:]]+\z/;
                my $info = charinfo(ord($chr));
                # print Dump($info);
                my $name = $info->{name};
                next if $name =~ /COMBINING/;
                my $replacement;
                if ( $name =~ /\b(SMALL|CAPITAL)?(?: LETTER)?(?: TURNED| SHARP)? ([A-Z])\b/ )
                {
                    no warnings "uninitialized";
                    my $case = $1 eq "SMALL" ? "SMALL" : "CAPITAL";
                    $replacement = "LATIN $case LETTER $2";
                }
        
                my $final = charnames::string_vianame( $replacement || $name );
        
                push @string, $final =~ /(?:\b[A-Z]\b| )/i ?
                    $final : "\x{0}";
            }
            s/\x{0}+/[?]/g for @string;
            printf " %15s -> %s\n\n", $sap, join "", @string;
        }
        
        __DATA__
        s3x & pr0n
        s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
        รєא คภ๔ ק๏гภ
        ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
        𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
        uɹod puɐ xǝs
        🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
        🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
        𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻 
        ßêx and pørñ
        ẞӬ𐊜 AƝḊ ⓅỚℝꞐ
        

        Output–

              s3x & pr0n -> s[?]x [?] pr[?]n
        
         s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝ -> sex and porn
        
            รєא คภ๔ ק๏гภ -> [?][?][?] [?][?][?] [?][?][?][?]
        
            ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ -> sex and porn
        
            𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗 -> sex and porn
        
            uɹod puɐ xǝs -> urod pua xes
        
            🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽 -> SEX AND PORN
        
         🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀 -> [?] sex and p[?]rn [?]
        
           𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻  -> sex and porn 
        
            ßêx and pørñ -> sex and porn
        
            ẞӬ𐊜 AƝḊ ⓅỚℝꞐ -> SEX AND PORN
        
        

        And it gets worse if you have to support multiple languages. If the company is in North America, they can probably get away with supporting 3 different languages (spanish, french and english).

        If the company in the EU, there are 24 official languages at last count: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

        You might get away with supporting fewer languages in some private sectors, but if you are running a government system (or a system that wants to receive funding from the government), you might have to support all of those. And as anyone can imagine, a "bad" word in one of those languages might be a standard word with a different meaning in another language.

        Plus, even for countries speaking the same language, there might be big cultural differences. For U.S. citizens, a "tea party" means throwing valuable cargo into the harbor because they endorse "taxation without representation" but want to keep all the taxes for their own government(*). For Brits, it means "5 p.m."

        (*) Puerto Rico

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11145339]
Approved by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-04-26 02:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found