Re: Spam filtering and regular expressions

by jhourcle (Prior)
on Jul 30, 2005 at 14:46 UTC


in reply to Spam filtering and regular expressions

You might want to ask around at the spam tools mailing list. I used to read it religiously when I was responsible for maintaining spam filters.

I'm guessing someone's probably already done what you describe. If they haven't, I would probably handle it like soundex, but instead of grouping letters that sound alike, group glyphs that look alike. (Note: I specifically didn't say to try to get them to the 'right' value, because the (0Oo) and (1lIi) distinctions are context sensitive ... (100K! M3ds @ lO% 0ff!), and the true meaning doesn't really matter unless you're trying to determine whether the text is intentionally obfuscated, as opposed to just containing suspicious keywords.)
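A rough sketch of that soundex-style folding, written in Python for brevity. Only the (0Oo) and (1lIi) classes come from the post; the other classes and the names `LOOKALIKE_CLASSES`, `fold`, and `suspicious_keywords` are made up for illustration, and a real table would be much larger.

```python
# Hypothetical look-alike classes. Only "0Oo" and "1lIi" come from the post;
# the remaining classes are assumptions added for the example.
LOOKALIKE_CLASSES = ["0Oo", "1lIi", "3Ee", "5Ss", "@Aa"]

# Map every glyph to the first member of its class, much as soundex maps
# similar-sounding letters to one digit. No attempt is made to pick the
# "right" letter; all members of a class simply become indistinguishable.
FOLD = {ch: cls[0] for cls in LOOKALIKE_CLASSES for ch in cls}


def fold(text):
    """Collapse look-alike glyphs to one representative per class."""
    return "".join(FOLD.get(ch, ch.lower()) for ch in text)


def suspicious_keywords(text, keywords):
    """Return the keywords whose folded form appears in the folded text."""
    folded = fold(text)
    return [kw for kw in keywords if fold(kw) in folded]


if __name__ == "__main__":
    print(suspicious_keywords("100K! M3ds @ lO% 0ff!", ["look", "meds", "off"]))
```

Because both the message and the keywords are folded the same way, "100K" and "look" end up as the same string without anyone having to decide which one was meant.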

Oh... and UTF is going to make for a very, very large set of glyphs.

Re^2: Spam filtering and regular expressions
by fokat (Deacon) on Jul 30, 2005 at 19:30 UTC

    I agree with jhourcle's words:

    (...) distinctions are context sensitive (...)

    This is totally true: spammers know this and use it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (i.e., perform just one translation, such as 1 (one) to l (ell), at a time) and evaluates each result against the regular expressions.

    The results we're getting with this are better than with just regular expressions, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, how much each match should add to the score, and where to set the threshold, for instance) in addition to the set of regexes used to detect spam-flag phrases.
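A minimal sketch of how the capped-replacement idea and its knobs might fit together, in Python for brevity. This is not the actual filter being described: the replacement sets, patterns, weights, threshold, and all names (`REPLACEMENT_SETS`, `SPAM_PATTERNS`, `spam_score`, `is_spam`) are assumptions, as is the choice to count each pattern at most once per message.

```python
import re

# Hypothetical replacement sets: each entry is one glyph translation,
# applied on its own (never combined with the others).
REPLACEMENT_SETS = [("1", "l"), ("0", "o"), ("3", "e"), ("@", "a"), ("$", "s")]

# Hypothetical spam-flag phrases, each with the weight a match adds.
SPAM_PATTERNS = [
    (re.compile(r"\bmeds\b", re.IGNORECASE), 2.0),
    (re.compile(r"\boff\b", re.IGNORECASE), 1.0),
    (re.compile(r"\blook\b", re.IGNORECASE), 0.5),
]


def variants(text, max_sets):
    """Yield the text itself, then one variant per replacement set (capped)."""
    yield text
    for src, dst in REPLACEMENT_SETS[:max_sets]:
        candidate = text.replace(src, dst)
        if candidate != text:
            yield candidate


def spam_score(text, max_sets=3):
    """Sum the weights of the patterns that match any variant of the text."""
    matched = set()
    for variant in variants(text, max_sets):
        for index, (pattern, _weight) in enumerate(SPAM_PATTERNS):
            if pattern.search(variant):
                matched.add(index)  # assumption: count each pattern once
    return sum(SPAM_PATTERNS[index][1] for index in matched)


def is_spam(text, threshold=2.5, max_sets=3):
    """Compare the score against a (hypothetical) threshold."""
    return spam_score(text, max_sets) >= threshold


if __name__ == "__main__":
    print(spam_score("100K! M3ds @ lO% 0ff!"))
```

With the cap at three sets, "M3ds" and "0ff" are each caught by a single translation, but "100K" is not, because it needs both the 1-to-l and the 0-to-o translation at once. That kind of miss is one reason the cap and the weights need tuning.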

    A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain and I guess they might be more expensive than the described approach. However, no testing has been done because we do not have a satisfactory solution to benchmark against yet.
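For comparison, one way such regexes could be generated rather than written by hand, again sketched in Python. The `LOOKALIKES` table and the `obfuscation_regex` helper are hypothetical; the point is only to show how quickly the patterns get hairy once every letter becomes a character class.

```python
import re

# Hypothetical table: for each plain letter, the glyphs that may stand in for it.
LOOKALIKES = {
    "o": "o0",
    "l": "l1i",
    "i": "i1l",
    "e": "e3",
    "s": "s5$",
    "a": "a@",
}


def obfuscation_regex(keyword):
    """Build a case-insensitive regex with one character class per letter."""
    parts = []
    for ch in keyword.lower():
        glyphs = LOOKALIKES.get(ch, ch)
        # Escape each glyph so characters like "$" are taken literally.
        parts.append("[" + "".join(re.escape(g) for g in glyphs) + "]")
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    for word in ("look", "meds", "off"):
        rx = obfuscation_regex(word)
        print(word, bool(rx.search("100K! M3ds @ lO% 0ff!")))
```

A single pattern covers every combination of substitutions in one pass, which the capped approach cannot do, but each new glyph widens every class it belongs to, and the table has to be kept in sync with the keyword list.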

    Oh... and UTF is going to make for a very, very large set of glyphs.

    Indeed. This is why you must cap the number of replacements performed when using this method.

    Best regards

    -lem, but some call me fokat
