Please help with Regexp::Common

scorpio17 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to teach myself how to use Regexp::Common, and I'm having a little trouble.

The following works as expected, and finds the number 1234 embedded in the string aaaa1234cccc:

use strict;
use Regexp::Common;

while ( my $word = <DATA> ) {
  chomp $word;
  if ( $word =~ /$RE{num}{int}/ ) {
    print "Integer detected: \"$word\"\n";
  } else {
    print "$word\n";
  }
}

__DATA__
aaaabbbbcccc
aaaa1234cccc
ddddeeeeffff

However, this does NOT work as I would expect:

use strict;
use Regexp::Common;

while ( my $word = <DATA> ) {
  chomp $word;
  if ( $word =~ /$RE{profanity}/ ) {
    print "Profanity detected: \"$word\"\n";
  } else {
    print "$word\n";
  }
}

__DATA__
aaaabbbbcccc
aaaaXXXXcccc
ddddeeeeffff

In this case, change XXXX into your favorite 4 letter offensive word. If I change the data string to "aaaa XXXX cccc" (i.e., add spaces around the XXXX), then it finds it.

It seems like the profanity patterns have start of word / end of word anchors built into the patterns, and thus don't work if the word is embedded inside another string? Is there any way to control this behavior? I've gone through the docs, but so far I can't find a way.

I'm using perl 5.14 (activestate) on Win7. Thanks for any push in the right direction.


Replies are listed 'Best First'.
Re: Please help with Regexp::Common by LanX (Saint) on Jan 18, 2017 at 23:55 UTC
> However, this does NOT work as I would expect:

Really? Well, swear words having word boundaries is what I expect.

> It seems like the profanity patterns have start of word / end of word anchors built into the patterns,

Well, it seems so. Why don't you just dump the regex to be sure? Personally I wouldn't want words like Essex to be flagged. (Or Dickens or zaddick)

> don't work if the word is embedded inside another string? Is there any way to control this behavior?

After browsing thru the code ...

http://cpansearch.perl.org/src/ABIGAIL/Regexp-Common-2016060801/lib/Regexp/Common/profanity.pm

I saw this:

pattern name   => [qw (profanity)],
        create => '(?:\b(?k:' . $profanity . ')\b)',
        ;

So I doubt there is any possible flag to disable the hard-coded `\b` meta character.

But if you really need this feature you could just copy the code into your own subclass and change the pattern to your needs.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
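To make the effect of the `\b` wrapper concrete, here is a minimal sketch in core Perl. It does not load Regexp::Common; the word list `darn|heck` is a made-up stand-in, and only the surrounding `(?:\b ... \b)` shape is taken from the `create` string quoted above.

```perl
use strict;
use warnings;

# Stand-in word list (hypothetical; NOT the list Regexp::Common ships).
my $words = 'darn|heck';

# Same shape as the create string quoted above: alternation wrapped in \b ... \b.
my $anchored = qr/(?:\b(?:$words)\b)/;

# Same alternation with the \b assertions left out.
my $unanchored = qr/(?:$words)/;

for my $text ( 'aaaa darn cccc', 'aaaadarncccc' ) {
    printf "%-18s anchored: %-3s  unanchored: %s\n",
        "'$text'",
        ( $text =~ $anchored   ? 'yes' : 'no' ),
        ( $text =~ $unanchored ? 'yes' : 'no' );
}
```

With the word set off by spaces both patterns match. With the word embedded between letters only the unanchored one does, because `\b` needs a word character on one side and a non-word character (or the string edge) on the other.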
Re: Please help with Regexp::Common by Paladin (Vicar) on Jan 18, 2017 at 23:52 UTC
You can see in the source that the `\b` anchors are embedded in the regex itself. I would imagine this is because of the Scunthorpe Problem.
Re^2: Please help with Regexp::Common (Kant ) by LanX (Saint) on Jan 19, 2017 at 00:28 UTC
> because of the Scunthorpe Problem.

I once ran into a similar phonetic problem after mentioning Kant in an English conversation :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re: Please help with Regexp::Common by AnomalousMonk (Archbishop) on Jan 19, 2017 at 00:01 UTC
You might try to trim the boundary assertions off of the stringized `Regexp` object (sorry for all the wrap-around):

c:\@Work\Perl\monks>perl -wMstrict -le
"use Regexp::Common;
 ;;
 print qq{$RE{profanity}};
 print qq{A: match '$1'} if 'xxxpissxxx' =~ m{ ($RE{profanity}) }xms;
 ;;
 print '--------';
 (my $erp = $RE{profanity}) =~ s{ \A \Q(?:\b\E (.*) \Q\b)\E \z }{$1}xms;
 print qq{'$erp'};
 ;;
 print qq{B: match '$1'} if 'xxxpissxxx' =~ m{ ($erp) }xms;
 "
(?:\b(?:(?:piss(?:\ take|\-take|take|e(?:rs|[srd])|ing|y)?|quims?|shit(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?|t(?:urds?|wats?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:\ hole|\-hole|hole|[sd])|ing|e)|ss(?:\ holes?|\-holes?|ed|holes?|ing))|b(?:ull(?:\ shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\-shit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?)|low(?:\ jobs?|\-jobs?|jobs?))|c(?:ock(?:\ suck(?:ers?|ing)|\-suck(?:ers?|ing)|suck(?:ers?|ing))|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|dick(?:\ head|\-head|ed|head|ing|less|s)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|ha(?:rd[\-\ ]?on|lf(?:\ a[sr]|\-a[sr]|a[sr])sed)|m(?:other(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing))|uth(?:a(?:\ fuck(?:ers?|ing|[aaa])|\-fuck(?:ers?|ing|[aaa])|fuck(?:ers?|ing|[aaa]))|er(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing)))|erde?)))\b)
--------
'(?:(?:piss(?:\ take|\-take|take|e(?:rs|[srd])|ing|y)?|quims?|shit(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?|t(?:urds?|wats?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:\ hole|\-hole|hole|[sd])|ing|e)|ss(?:\ holes?|\-holes?|ed|holes?|ing))|b(?:ull(?:\ shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\-shit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?)|low(?:\ jobs?|\-jobs?|jobs?))|c(?:ock(?:\ suck(?:ers?|ing)|\-suck(?:ers?|ing)|suck(?:ers?|ing))|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|dick(?:\ head|\-head|ed|head|ing|less|s)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|ha(?:rd[\-\ ]?on|lf(?:\ a[sr]|\-a[sr]|a[sr])sed)|m(?:other(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing))|uth(?:a(?:\ fuck(?:ers?|ing|[aaa])|\-fuck(?:ers?|ing|[aaa])|fuck(?:ers?|ing|[aaa]))|er(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing)))|erde?)))'
B: match 'piss'

Update: Of course, this gets you right back to the Scunthorpe Problem noted above by Paladin!

Give a man a fish: `<%-{-{-{-<`
Re^2: Please help with Regexp::Common by scorpio17 (Canon) on Jan 19, 2017 at 15:50 UTC
I followed your suggestion and tried this:

use strict;
use Regexp::Common;

(my $reg = $RE{profanity}) =~ s{\A \Q(?:\b\E (.*) \Q\b)\E \z}{$1}xms;

while ( my $word = <DATA> ) {
  chomp $word;
  if ( $word =~ m/$reg/ ) {
    print "Profanity detected: \"$word\"\n";
  } else {
    print "$word\n";
  }
}

__DATA__
aaaabbbbcccc
aaaashitcccc
aaaa1234cccc
ddddeeeeffff

This way it will find embedded "bad words" without the need for spaces around them, which is what I wanted.

I realize the logic in requiring the word boundaries. But I think the fact that $RE{num}{int} finds embedded numbers made me assume that $RE{profanity} should work the same way, or else there might be a switch to toggle the behavior one way or the other.

The reason I need this is to generate temporary (one-use) passwords (like when someone requests a password reset on a website). The generated password should, ideally, be a jumble of random letters and/or numbers, but I don't want to accidentally send someone a password with an "obvious" obscenity embedded, so a simple filter like this is helpful.

Thanks!
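The password use case described above could be sketched roughly as follows, again in core Perl. Everything named here is an assumption for illustration: `make_password` is a hypothetical helper, and `$blocklist` is a made-up stand-in for the trimmed `$RE{profanity}` pattern. The `/i` flag is added because a random mixed-case password can spell a word in any capitalization.

```perl
use strict;
use warnings;

# Hypothetical stand-in blocklist; in the thread this role is played by the
# trimmed $RE{profanity} pattern.
my $blocklist = qr/(?:darn|heck)/i;

my @chars = ( 'a' .. 'z', 'A' .. 'Z', 0 .. 9 );

# make_password() is an illustrative helper, not part of any module.
# It keeps drawing random candidates until one passes the blocklist check.
sub make_password {
    my ($length) = @_;
    while (1) {
        # NOTE: rand() is not a cryptographically secure source; it is used
        # here only to keep the sketch dependency-free.
        my $candidate = join '', map { $chars[ rand @chars ] } 1 .. $length;
        return $candidate unless $candidate =~ $blocklist;
    }
}

print make_password(12), "\n";
```

Rejecting and redrawing keeps the filter separate from the generator, so the blocklist can be swapped without touching the loop.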
Re^3: Please help with Regexp::Common by AnomalousMonk (Archbishop) on Jan 19, 2017 at 18:11 UTC
You might consider adding a test to check if the expected alteration to the original regex was successful. The `\Q(?:\b\E` and `\Q\b)\E` parts of the substitution are rather fragile IMO and may break if the maintainer(s) of Regexp::Common ever change his/her/their notion of what a proper profane regex should look like.

c:\@Work\Perl\monks>perl -wMstrict -le
"use Regexp::Common;
 ;;
 (my $reg = $RE{profanity}) =~ s{\A \Q(?:\b\E (.*) \Q\b)\E \z}{$1}xms
    or die 'profanity anchor trim failed';
 ;;
 print qq{bad: '$1'} if 'Matsushita' =~ m{ ($reg) }xms;
 "
bad: 'shit'

Give a man a fish: `<%-{-{-{-<`
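The same check can be wrapped in a small subroutine so a script fails loudly if the pattern layout ever changes. This is only a sketch: `trim_word_boundaries` is a hypothetical name, and `$stand_in` is a made-up string that assumes the `(?:\b ... \b)` shape shown earlier in the thread rather than the real stringified `$RE{profanity}`.

```perl
use strict;
use warnings;

# trim_word_boundaries() is a hypothetical helper name. It expects a pattern
# string shaped like '(?:\b ... \b)' and dies if that shape is not found.
sub trim_word_boundaries {
    my ($pattern) = @_;
    ( my $trimmed = "$pattern" ) =~ s{\A \Q(?:\b\E (.*) \Q\b)\E \z}{$1}xms
        or die "word-boundary trim failed; pattern layout may have changed\n";
    return $trimmed;
}

# Stand-in for the stringified $RE{profanity} (assumed shape, made-up words).
my $stand_in = '(?:\b(?:darn|heck)\b)';
my $trimmed  = trim_word_boundaries($stand_in);
print "$trimmed\n";

# A pattern without the expected wrapper is rejected instead of silently kept.
my $ok = eval { trim_word_boundaries('(?:darn|heck)'); 1 };
print "unexpected layout rejected\n" unless $ok;
```

In the real script the argument would be `$RE{profanity}`; the helper stringifies its argument before attempting the substitution.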
Re^3: Please help with Regexp::Common by Mr. Muskrat (Canon) on Jan 19, 2017 at 16:14 UTC
Shouldn't you be generating passwords that do not contain any words?

