PerlMonks  

Re: Regex solution needed

by Sidhekin (Priest)
on Feb 23, 2007 at 16:51 UTC ( #601772=note )


in reply to Regex solution needed

Since perl doesn't have a "look behind"

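Actually, Perl has a lookbehind (see perlre). What Perl doesn't have is variable-length lookbehind, so if you want lookbehinds of different lengths, they need to be multiple lookbehinds. Furthermore, because \s* can match zero characters, the match could otherwise start right at "cock", just after the whitespace, where none of the word lookbehinds apply; that is why one more lookbehind, (?<!\s), is required. This ought to do it though: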

/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/

... though I suspect you really want to ensure word boundaries ... and add some /x whitespace for readability:

/ (?<! \b a ) (?<! \b the ) (?<! \b game ) (?<! \s ) \s* cock \b /x

(But that's just a guess, and in any case beyond your question, so you may want to pretend I did not say that.)
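To see what the word boundaries change, here is a minimal sketch that runs both patterns from this post over a few phrases. The helper flags_phrase and the sample phrases are made up for illustration; only the two regexes come from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Without \b: any preceding "a", "the" or "game" suppresses the match,
# even when those letters are only the tail of a longer word.
my $plain   = qr/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/;

# With \b: only the standalone words suppress the match.
my $bounded = qr/ (?<! \b a ) (?<! \b the ) (?<! \b game ) (?<! \s ) \s* cock \b /x;

# Hypothetical helper (not from the thread): 1 if $re flags $phrase, else 0.
sub flags_phrase {
    my ($re, $phrase) = @_;
    return $phrase =~ $re ? 1 : 0;
}

for my $phrase ("cock of the walk", "the cock crows", "a cock", "gamecock", "banana cock") {
    printf "%-18s plain=%d bounded=%d\n", $phrase,
        flags_phrase($plain, $phrase), flags_phrase($bounded, $phrase);
}
```

Both patterns flag "cock of the walk" and leave "the cock crows", "a cock" and "gamecock" alone. They differ on "banana cock": the plain version treats the trailing "a" of "banana" as the article and lets it through, while the \b version flags it.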

print "Just another Perl ${\(trickster and hacker)},"
The Sidhekin proves Sidhe did it!

Replies are listed 'Best First'.
Re^2: Regex solution needed
by chargrill (Parson) on Feb 23, 2007 at 17:19 UTC

    You beat me to a solution, but plugging your regex into my test shows that it works for the test cases I was able to think up:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @input = (
        "cocks are roosters!",
        "my cocks crow at dawn",
        "i'm a fan of the cocks",
        "game cocks all the way!",
        "do you like the gamecocks?",
    );

    # One entry per dictionary word, each with its own compiled regex.
    my $vulgar_list = {
        cocks => { regex => qr/(?<!a)(?<!the)(?<!game)(?<!\s)\s*cock/ },
    };

    my $foundvulgar = '';
    for my $input (@input) {
        study($input);
        foreach my $word (keys %$vulgar_list) {
            my $regex = $vulgar_list->{$word}->{regex};
            if ($input =~ m/$regex/) {
                $foundvulgar = $word;
                last;
            }
        }
        print "phrase: $input found?: $foundvulgar\n";
        $foundvulgar = '';
    }

    __OUTPUT__
    phrase: cocks are roosters! found?: cocks
    phrase: my cocks crow at dawn found?: cocks
    phrase: i'm a fan of the cocks found?:
    phrase: game cocks all the way! found?:
    phrase: do you like the gamecocks? found?:


    --chargrill
    s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
      Thanks for all the input, guys. I enhanced the initial suggestion to include a few more lookahead possibilities, which I think cover enough bases for me to keep it in the dictionary. I omitted the dirty test cases since these are "sacred" boards, but they did get caught correctly :)

      my @tests = (
          "How bout them cocks?",
          "I'm a big cocks fan",
          "I love the cocks",
          "That cocks game was sweet",
          "Anyone know the cocks score from last night?",
          "gamecocks rule",
          "I love the gamecocks, but...",
          "My favorite cocks player is...",
      );

      foreach my $s (@tests) {
          if ($s =~ / (?<! \b a ) (?<! \b the ) (?<! \b them ) (?<! \b game ) (?<! \s )
                      \s* cocks? \b
                      (?! \s fan ) (?! \s game ) (?! \s score ) (?! \s player ) /x) {
              print "Vulgar: '$s'\n";
          }
      }
      -Darin

        The \s should probably be \s+, and

        (?! \s+ fan ) (?! \s+ game ) (?! \s+ score ) (?! \s+ player )
        is slower than
        (?! \s+ (?: fan | game | score | player ) )

        This factors out the constant \s+, and it uses | which probably has a lower overhead than (?!...). Furthermore, alternations of constant strings can be highly optimized by re engine modifications demerphq added to 5.9. (I don't think those particular strings can be optimized, though.)
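A small sketch of the two suggestions above, under some assumptions: the lookbehinds are left out to keep the focus on the lookaheads, and the helper is_flagged and the sample phrases are invented for illustration. It compares the four separate lookaheads, the single factored lookahead, and the original single-\s form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Four separate negative lookaheads (already using \s+).
my $separate = qr/ \b cocks? \b
    (?! \s+ fan ) (?! \s+ game ) (?! \s+ score ) (?! \s+ player ) /x;

# One lookahead with the shared \s+ factored out of an alternation.
my $combined = qr/ \b cocks? \b
    (?! \s+ (?: fan | game | score | player ) ) /x;

# The single-\s form, kept only for comparison.
my $single_ws = qr/ \b cocks? \b
    (?! \s (?: fan | game | score | player ) ) /x;

# Hypothetical helper (not from the thread): 1 if $re flags $phrase, else 0.
sub is_flagged {
    my ($re, $phrase) = @_;
    return $phrase =~ $re ? 1 : 0;
}

my @phrases = (
    "I'm a big cocks fan",
    "That cocks game was sweet",
    "How bout them cocks?",
    "My favorite cocks  player is...",   # two spaces before "player"
);

for my $phrase (@phrases) {
    printf "%-34s separate=%d combined=%d single_ws=%d\n", $phrase,
        is_flagged($separate,  $phrase),
        is_flagged($combined,  $phrase),
        is_flagged($single_ws, $phrase);
}
```

The separate and combined forms agree on every phrase here, so the factoring changes only how the engine does the work, not what it matches. The phrase with two spaces is flagged only by the single-\s form, which is the case \s+ is meant to cover. To check the speed claim on a particular perl, the core Benchmark module's cmpthese could be pointed at the first two patterns.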

Re^2: Regex solution needed
by spivey3587 (Acolyte) on Feb 23, 2007 at 17:16 UTC
    Awesome suggestion, and yes, I do need boundaries. Thank you. Now, I just have to evaluate whether the ratio of legitimate to false positives warrants keeping it in the dictionary. I figured with a tuned solution like the one you suggest, I can get the accuracy good enough to keep it. Considering that team name is one of the only legitimate uses for that word on our boards, I really wanted to find a solution in order to cover the thousands of negative uses I've encountered for it.
Re^2: Regex solution needed
by ikegami (Patriarch) on Feb 23, 2007 at 19:01 UTC

    Couldn't that big regexp be simplified to /\b cock s?/x?

    ( Nevermind, yours doesn't censor "game cocks". I was thinking "gamecocks". )

      Correct. Users chatting about the team will either incorrectly separate it into two words, or use a shortened version such as 'the cocks game'.
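The difference between the short pattern and the lookbehind version can be sketched as follows. The guarded pattern here is an assumption: it is the lookbehind regex from earlier in the thread with the cocks? ending, and the helper is_flagged and the phrases are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The short pattern: only requires a word boundary before "cock".
my $simple  = qr/ \b cock s? /x;

# The lookbehind pattern (assumed variant with the plural ending).
my $guarded = qr/ (?<! \b a ) (?<! \b the ) (?<! \b game ) (?<! \s ) \s* cocks? \b /x;

# Hypothetical helper (not from the thread): 1 if $re flags $phrase, else 0.
sub is_flagged {
    my ($re, $phrase) = @_;
    return $phrase =~ $re ? 1 : 0;
}

for my $phrase ("gamecocks rule", "game cocks all the way!",
                "the cocks game", "cocks are roosters!") {
    printf "%-26s simple=%d guarded=%d\n", $phrase,
        is_flagged($simple, $phrase), is_flagged($guarded, $phrase);
}
```

Both patterns leave "gamecocks rule" alone and flag "cocks are roosters!". Only the short pattern flags "game cocks all the way!" and "the cocks game", which are exactly the two-word and shortened team references described above.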
