Re^2: Spam filtering and regular expressions

by fokat (Deacon)
on Jul 30, 2005 at 19:30 UTC ( #479638=note )

in reply to Re: Spam filtering and regular expressions
in thread Spam filtering and regular expressions

I agree with jhourcle's words:

(...) distinctions are context sensitive (...)

This is entirely true: spammers know it and exploit it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (i.e., perform just one translation at a time, such as 1 (one) to l (ell)) and evaluates each resulting variant against the regular expressions.
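To make the idea concrete, here is a minimal sketch in Python (not our actual code). The translation table, the patterns, and the names `TRANSLATIONS`, `SPAM_PATTERNS`, `variants` and `matching_variants` are all made up for illustration. Each variant has exactly one translation applied, and the number of variants is capped.

```python
import re

# Hypothetical look-alike table: (obfuscated character, intended letter).
# "1" appears twice because it can stand for either "l" or "i".
TRANSLATIONS = [("1", "l"), ("1", "i"), ("0", "o"), ("@", "a")]

# Hypothetical spam-flag patterns.
SPAM_PATTERNS = [
    re.compile(r"\bviagra\b", re.I),
    re.compile(r"\blow rates\b", re.I),
]

def variants(text, max_variants=8):
    """Yield the original text, then copies with exactly one translation
    applied, stopping once max_variants strings have been produced."""
    yield text
    produced = 1
    for src, dst in TRANSLATIONS:
        if produced >= max_variants:
            return
        if src in text:
            yield text.replace(src, dst)
            produced += 1

def matching_variants(text, max_variants=8):
    """Return every variant that matches at least one spam pattern."""
    return [
        candidate
        for candidate in variants(text, max_variants)
        if any(p.search(candidate) for p in SPAM_PATTERNS)
    ]
```

With this table, "Get 1ow rates today" only matches after the 1-to-l translation, while "cheap V1AGRA" only matches after the 1-to-i translation, which is exactly the context sensitivity being discussed.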

The results we're getting with this are better than with regular expressions alone, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, how much each match should add to the score, and what the threshold should be, for instance) in addition to the set of regexes used to detect spam-flag phrases.
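As a rough illustration of those knobs, the scoring step could look something like the sketch below. The weights, the threshold, and the names `WEIGHTED_PATTERNS`, `THRESHOLD`, `score` and `is_spam` are invented for the example and are not values we actually use.

```python
import re

# Hypothetical weights: how much each matching pattern adds to the score.
WEIGHTED_PATTERNS = [
    (re.compile(r"\bviagra\b", re.I), 3.0),
    (re.compile(r"\blow rates\b", re.I), 1.5),
]

# Hypothetical cut-off at or above which a message is flagged.
THRESHOLD = 4.0

def score(candidates):
    """Add a pattern's weight once if it matches any candidate string
    (the original message or one of its translated variants)."""
    total = 0.0
    for pattern, weight in WEIGHTED_PATTERNS:
        if any(pattern.search(c) for c in candidates):
            total += weight
    return total

def is_spam(candidates, threshold=THRESHOLD):
    """Flag the message when the accumulated score reaches the threshold."""
    return score(candidates) >= threshold
```

Counting each pattern only once, however many variants it matches, is a design choice made for this sketch. Counting every hit is another option, and it is one more knob to tune.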

A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes would likely be much harder to maintain, and I suspect they would be more expensive to run than the approach described above. However, we have not tested this, because we do not yet have a satisfactory solution to benchmark against.
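For comparison, the regex-only route could be sketched as follows: every letter of a flagged word is expanded into a character class of its look-alikes. The `LOOKALIKES` table and the `obfuscation_regex` helper are hypothetical, and the sketch is in Python purely for illustration.

```python
import re

# Hypothetical look-alike sets: letter -> characters that may stand in for it.
LOOKALIKES = {"a": "a@4", "e": "e3", "i": "i1l!", "l": "l1", "o": "o0"}

def obfuscation_regex(word):
    """Build a case-insensitive regex in which each letter of `word`
    becomes a character class listing its look-alikes."""
    parts = []
    for ch in word.lower():
        chars = LOOKALIKES.get(ch, ch)
        # re.escape keeps class metacharacters such as ']' or '^' literal.
        parts.append("[" + re.escape(chars) + "]")
    return re.compile("".join(parts), re.I)
```

Even for a six-letter word, the generated pattern is already harder to read than the word itself, and every addition to the look-alike table widens every pattern built from it.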

Oh... and UTF is going to make for a very, very large set of glyphs.

Indeed. This is why you must cap the number of replacements performed when using this method.

Best regards

-lem, but some call me fokat
