Update

2002-04-09: The code has been updated to reflect changes in the SpamAssassin module. Any changes have been noted in the comments, and the code as it originally appeared has been retained in the comments as well.

Purpose

My purpose in writing this tutorial is not to extensively cover the capabilities of Mail::Audit or Mail::SpamAssassin. My purpose is to show how I implemented these tools in order to address my filtering needs. I believe that my needs are not unique, and therefore, I hope this tutorial proves valuable in providing a step-by-step guide to using Perl for your mail filtering requirements.

Acknowledgements

Much of the following is taken from the applicable CPAN pages and Simon Cozens' page, as well as a conglomeration of other pages, including the CPAN page for Mail::Procmail, which is not actually used here. To the authors of these pages, I am eternally grateful.

The History of My Problem (Or Identifying The Itch)

Recently, in the space of about 30 minutes, I received more than 15 emails from the same person carrying the same subject and body. Over the course of the next week, I received over 200 such emails. Over the following month . . . well, you get the picture.

I use fetchmail, Procmail and Mutt for processing my email. Naturally, fetchmail retrieves my email, Procmail filters my email into the appropriate folders, and Mutt reads my email.

Fetchmail serves my needs well, and I rarely have any complaints with Mutt. Procmail, however, is another story.

Procmail, for those who are unfamiliar, relies on recipes for filtering email. For example:

    :0
    * ^From: lll@hotmail\.com
    friends

will filter all email from lll@hotmail.com into the friends folder. This is known as a recipe, and must be called from ~/.procmailrc.

When I began experiencing the aforementioned flood of spam, I configured the following and placed it in my .procmailrc:

    :0
    * ^From:. mortgagef6e@canada\.com
    /dev/null

Naturally, I expected all email from mortgagef6e@canada.com to be routed into oblivion. However, for reasons I have yet to determine (and believe me when I say that I worked long and hard on this problem), it did not work. I continued to experience the flood of email into my incoming folder.

The real reason I couldn't solve this problem is that, in my oh-so-humble opinion, Procmail's recipes are too damned difficult. Navigating this maze is the equivalent of a 4-hour college course.
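
One possible explanation, offered as a guess rather than a diagnosis: if the recipe read exactly as printed above, the condition `^From:. mortgagef6e` demands one arbitrary character after the colon and then a literal space. A typical header has only a single space there, so the condition cannot match. Procmail uses egrep-style regular expressions rather than Perl's, but for a pattern this simple the two should behave alike, so a few lines of Perl can illustrate the idea. The sample header and the alternative `.*` pattern are assumptions made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical header line, as it would typically appear in a message.
my $header = 'From: mortgagef6e@canada.com';

# The condition as printed in the recipe: any one character, then a space.
my $as_written = qr/^From:. mortgagef6e\@canada\.com/;

# A more forgiving condition: anything at all between "From:" and the address.
my $forgiving = qr/^From:.*mortgagef6e\@canada\.com/;

# Return 1 if the line matches the pattern, 0 otherwise.
sub matches {
    my ($pattern, $line) = @_;
    return $line =~ $pattern ? 1 : 0;
}

printf "as written: %s\n", matches($as_written, $header) ? "match" : "no match";
printf "forgiving:  %s\n", matches($forgiving,  $header) ? "match" : "no match";
```

The forgiving form also copes with headers that carry a display name or angle brackets around the address, which the stricter form would miss.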

And, frankly, filtering email just shouldn't be that difficult.

Identifying the Solution (Or Finding a Back-Scratcher)

So, I began searching for a solution. I mean, I know enough Perl to get through the day. And Perl does excel at pattern matching. And filtering email is nothing more than pattern matching, right?

A quick search of CPAN led me to Mail::Audit. Perfect! A Perl module to filter email. Additionally, the author has provided a fairly detailed example of using Mail::Audit.

As I began writing the script, it quickly became obvious that, while I could easily identify and route my email, I had no mechanism for filtering spam. And, the repetitious email that started this whole adventure was spam. I needed a solution which would allow me to separate the spam from my legitimate email, while filtering my legitimate email into the appropriate folders. I certainly did not want to reinvent the wheel and have to figure out all of the patterns and tricks of the trade used by spammers. Following tilly's advice that I should assume that Perl has what I want, I once again hit CPAN.

A little more searching led me to Mail::SpamAssassin, a plugin for Mail::Audit that has a very high success rate for filtering spam. SpamAssassin is available as a command-line utility, a daemon, and (obviously) a Perl module.

Now all I needed was to modify my script to bring these pieces together.

The Script (Or Scratching The Itch)

Following is the commented script in its entirety.

    #!/usr/bin/perl -w
    #
    # program: filter.pl
    # description: filters email into appropriate folders

    use diagnostics;
    use strict;
    use Mail::Audit;
    use Mail::SpamAssassin;

    #
    # The default mailbox for delivery
    # my $default = "/var/spool/mail/".getpwuid($>) is also an option.
    # However, I keep all of my email in ~/mail. Additionally, while I
    # have a ~/mail/mbox, I route *all* of my email to a specific
    # folder. my mbox should never contain any email, and only exists
    # for aesthetic reasons.
    #
    my $folder = "$ENV{HOME}/mail/";

    #
    # Specify the location of our log file. We'll be writing several
    # status messages here. It is opened before the spam check because
    # the spam() subroutine writes to it.
    #
    open (LOG, ">$ENV{HOME}/syslog/.audit_log");

    ####################################################################
    # Filter spam first
    # We knock the spam out of the way immediately. This saves us from
    # wasting time processing mail which is obviously spam.
    #
    # Spam is swept to its own folder, $ENV{HOME}/mail/spam.incoming.
    # Mail::SpamAssassin will prepend *****SPAM***** to the subject line
    # of the email. Additionally, it prepends something similar to the
    # following paragraph to the body of the email (this, as well as
    # pattern matches, can be modified by editing the
    # spamassassin.cf file):
    #
    # SPAM: -------------------- Start SpamAssassin results -------------
    # SPAM: This mail is probably spam. The original message has been altered
    # SPAM: so you can recognise or block similar unwanted mail in future, using
    # SPAM: the built-in mail filtering support in your mail reader.
    # SPAM:
    # SPAM: Content analysis details: (7.9 hits, 5 required)
    # SPAM: Hit! (2.1 points) BODY: /http\:\/\/\d+\.\d+\.\d+\.\d+\//is
    # SPAM: Hit! (2.5 points) BODY: Link to a URL containing "remove"
    # SPAM: Hit! (3.3 points) BODY: /click here.{0,100}<\/a>/is
    # SPAM:
    # SPAM: -------------------- End of SpamAssassin results ------------
    ####################################################################
    #
    # This statement gets the next email from the queue
    #
    # my $item = Mail::SpamAssassin::MyMailAudit->new();
    # The above line is the original code. MyMailAudit no
    # longer exists, so we rely on Mail::Audit to retrieve
    # the next email from the queue:
    #
    my $item = Mail::Audit->new();

    #
    # This statement sets up our handle to SpamAssassin
    #
    my $spamtest = Mail::SpamAssassin->new();

    #
    # Now we retrieve the status to determine whether the email is,
    # in fact, spam
    #
    my $status = $spamtest->check ($item);

    #
    # If the email is spam, write the email back with the aforementioned
    # subject and body modifications, then call the spam() subroutine
    # for processing (see end of script).
    #
    if ($status->is_spam ()) {
        $status->rewrite_mail ();
        spam("SpamAssassin",$folder);
    }

    ####################################################################
    # Mail::Audit initialization stuff
    ####################################################################
    #
    # If we get here, Mail::SpamAssassin did not identify the email as
    # spam
    #
    # Get relevant fields from the message. These are pretty
    # self-explanatory.
    #
    my $from    = $item->from();
    my $to      = $item->to();
    my $cc      = $item->cc();
    my $subject = $item->subject();
    my $body    = $item->body();
    chomp($from, $to, $cc, $subject);

    ####################################################################
    # Note that we just retrieved $body. Although I
    # don't use it here, this provides the ability to
    # filter based on the content of the body of the
    # email. For example:
    #
    # if ($body =~ /some_pattern/i) { #do stuff };
    ####################################################################
    #
    # Start logging.
    #
    print LOG ("From: $from\n");
    print LOG ("To: $to\n");
    print LOG ("Subject: $subject\n");

    ####################################################################
    # End initialization stuff
    ####################################################################

    # I know certain people. We all do. They're L-O-S-E-R-S. And,
    # frankly, I don't enjoy receiving email from them. The following
    # will identify these email addresses and route them immediately to
    # my trash folder (via the trash() subroutine). \Q...\E makes the
    # address match literally, so the dots are not wildcards.
    #
    for (qw(gar079@yahoo.ca badguy@loser.net nasty@whimp.org enemy@hate-u.com)) {
        if ($from =~ /\Q$_\E/) {
            trash("From a loser",$folder);
        }
    }

    # I have some programs that email me from various machines. I want
    # these email to be immediately routed to ~/mail/home.
    #
    if ($from =~ /\@exitwound\.org/i) {
        $item->accept("$folder"."home");
    }

    # Now we come to email lists and people who commonly send me email
    # (hi Mom!). First, we set up a hash. The key is a pattern to be
    # matched against the From:, To: and CC: lines. The content is the
    # folder name where the mail should be stored.
    #
    my %lists = (
        "apache"            => "apache",
        "buckaroo"          => "buckaroo",
        "christianhusbands" => "christian",
        "kde-linux"         => "kde",
        "lawtech"           => "lawtech",
        "debian-user"       => "linux",
        "linux"             => "linux",
        "win4lin"           => "linux",
        "lll\@hotmail"      => "Lori",
        "perlbot"           => "perlbot",
        "dynamite"          => "metal",
        "80s_Rock_Metal"    => "metal",
        "metal"             => "metal",
        "screamsofabel"     => "metal",
        "mavericks"         => "MomDad",
        "hargrojj"          => "MomDad",
        "mutt"              => "mutt",
        "rl2"               => "rl2",
        "focus-linux"       => "security",
    );

    # Here, we compare the From:, To: and CC: fields with each key of
    # the hash and store the email in the corresponding folder
    #
    for my $pattern (keys %lists) {
        if (($from =~ /$pattern/i) or
            ($to   =~ /$pattern/i) or
            ($cc   =~ /$pattern/i)) {
            $item->accept("$folder"."$lists{$pattern}");
        }
    }

    # The following code checks whether the To: or CC: field contains the
    # phrase "shock." If not, it means that the email is being sent to a
    # list (which has not been identified in the previous section).
    # Therefore, if my email address is not in the To: or CC: field, I
    # assume that it is spam
    #
    if ($to !~ /shock/i and $cc !~ /shock/i) {
        spam("Apparently not to me",$folder);
    }

    # If we've made it this far, I'm not sure what it is. Therefore, I
    # store it in the Bulk folder.
    #
    $item->accept("$folder"."Bulk");

    # Bye-bye
    #
    exit;

    ################ Subroutines ################
    #
    # This subroutine handles anything identified as spam. It is called
    # thusly:
    #
    # spam("Reason for calling",$folder);
    #
    # The subroutine will store the email in the ~/mail/spam.incoming
    # folder. It will also print a message to the log file identifying:
    #
    # (1) The spam subroutine;
    # (2) The line number which called the spam subroutine; and
    # (3) The reason for calling (i.e. "Reason for calling").
    #
    sub spam {
        my ($tag, $reason, $folder) = ("spam", @_);
        # caller(0) describes the call to this subroutine, so element 2
        # is the line number where spam() was invoked.
        my $line = (caller(0))[2];
        print LOG ("$tag [$line]: $reason\n");
        $item->accept("$folder"."spam.incoming");
    }

    #
    # This subroutine handles anything identified as trash. It is called
    # thusly:
    #
    # trash("Reason for calling",$folder);
    #
    # The subroutine will store the email in the ~/mail/trash folder.
    # It will also print a message to the log file identifying:
    #
    # (1) The trash subroutine;
    # (2) The line number which called the trash subroutine; and
    # (3) The reason for calling (i.e. "Reason for calling").
    #
    sub trash {
        my ($tag, $reason, $folder) = ("trash", @_);
        my $line = (caller(0))[2];
        print LOG ("$tag [$line]: $reason\n");
        $item->accept("$folder"."trash");
    }

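
One thing worth noting about the %lists loop: Perl does not guarantee the order in which `keys` returns hash keys, and several of the patterns overlap (an address containing `kde-linux` also matches `linux`), so which folder wins may vary from run to run. A minimal sketch of one way to make the choice predictable is to keep the rules in an ordered list and stop at the first match. The sub name `folder_for` and the sample addresses are made up for illustration, and only a few of the rules are repeated here.

```perl
use strict;
use warnings;

# Rules are checked top to bottom, so more specific patterns go first.
my @rules = (
    [ qr/kde-linux/i,    'kde'      ],
    [ qr/focus-linux/i,  'security' ],
    [ qr/linux/i,        'linux'    ],
    [ qr/lll\@hotmail/i, 'Lori'     ],
);

# Return the folder of the first rule that matches any of the given
# header values, or undef when nothing matches.
sub folder_for {
    my @fields = grep { defined } @_;
    for my $rule (@rules) {
        my ($pattern, $dest) = @$rule;
        return $dest if grep { $_ =~ $pattern } @fields;
    }
    return;
}

# Demo with hypothetical addresses. In filter.pl the call might look like:
#   my $dest = folder_for($from, $to, $cc);
#   $item->accept("$folder$dest") if defined $dest;
for my $addr ('kde-linux@lists.example.org', 'someone@example.com') {
    my $dest = folder_for($addr);
    print "$addr -> ", (defined $dest ? $dest : 'no match'), "\n";
}
```

Skipping undefined fields also avoids the warnings that `-w` would otherwise raise when a message has no CC: header.
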
Procmail Modifications

The following modifications were necessary to my .procmailrc file in order to get this baby rolling. There may be better or more efficient ways to do this, and if so, I welcome the input.

    #
    # The following will force all messages from Procmail to be logged in
    # ~/syslog/procmail
    #
    LOGFILE=$HOME/syslog/procmail

    #
    # Turn verbose logging and log abstract off, unless you're the
    # wordy type.
    #
    VERBOSE=off
    LOGABSTRACT=off

    #
    # From the procmailrc man page:
    #
    #   By default, procmail returns an exitcode of zero (success) if it
    #   successfully delivered the message or if the HOST variable was
    #   misset and there were no more rcfiles on the command line;
    #   otherwise it returns failure. Before doing so, procmail examines
    #   the value of this variable. If it is set to a positive numeric
    #   value, procmail will instead use that value as its exitcode. If
    #   this variable is set but empty and TRAP is set, procmail will set
    #   the exitcode to whatever the TRAP program returns. If this
    #   variable is not set, procmail will set it shortly before calling
    #   up the TRAP program.
    #
    # So, by setting EXITCODE to nothing, we can have procmail return
    # whatever exit code our filter.pl script determines is necessary.
    #
    EXITCODE=

    #
    # Point to our program to handle all of the filtering. By running
    # our program as a TRAP program (see the procmailrc docs for more
    # information about this), Procmail will pass the exit code of our
    # script back to the MTA (sendmail, postfix, exim, etc.) that
    # called procmail.
    #
    TRAP=$HOME/bin/filter.pl

    #
    # The following is for safety purposes. All email is copied to this
    # file, so if something gets lost, you can retrieve it from here.
    # Once you're comfortable with your filter.pl, you can remove the
    # following two lines.
    #
    :0:
    $HOME/syslog/mail

At this point, we're happening. A simple fetchmail -d 90 (or whatever), and we're good to go. fetchmail will retrieve the email, Procmail will receive it and invoke the filter.pl script, which will filter the email accordingly.
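
As far as the procmailrc man page describes it, a TRAP program can read a copy of the message on standard input, which means the matching logic can be tried out by hand on a saved message before it is trusted with real mail. The sketch below is not Mail::Audit. It is a deliberately simplified, standalone header reader (the sub name `read_headers` is hypothetical) that keeps only the last occurrence of each header and unfolds continuation lines.

```perl
use strict;
use warnings;

# Read header lines from a filehandle up to the first blank line and
# return a hash ref keyed by lowercased header name.
sub read_headers {
    my ($fh) = @_;
    my (%headers, $last);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';                  # blank line ends the headers
        if (defined $last && $line =~ /^\s+(.*)/) {
            $headers{$last} .= " $1";         # folded continuation line
        }
        elsif ($line =~ /^([^:\s]+):\s*(.*)/) {
            $last = lc $1;
            $headers{$last} = $2;
        }
    }
    return \%headers;
}

# Demo with an in-memory message; in real use, pass \*STDIN instead.
my $sample = "From: lll\@hotmail.com\nSubject: hello\n there\n\nBody text\n";
open my $fh, '<', \$sample or die "cannot open in-memory message: $!";
my $h = read_headers($fh);
print "$_: $h->{$_}\n" for sort keys %$h;
```

With a reader like this, patterns can be checked against a saved message without delivering anything, which is a gentler way to experiment than pointing a half-finished filter at live mail.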

Conclusion (Or The Itch Has Been Scratched)

I've been running this script for a few weeks now, and Mail::SpamAssassin is proving to be very reliable. I'd estimate its accuracy at somewhere in the high 90s percent, and on many days, it's 100% accurate. In conjunction with the other filters I've added in the script, all spam that I receive is currently being filtered to the spam.incoming folder.

For me,

$item->accept("$folder"."home") if ($subject =~ /\@exitwound.org/i);

makes far more sense than

    :0
    * ^Subject:\/.*exitwound
    home

or whatever the hell the correct Procmail syntax might be. Who has time for that? Give me a good old Perl script any day. After all, filtering email just shouldn't be that hard.


In reply to A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin by shockme
