Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

RFC: spam reporting assistant

by davies (Prior)
on Dec 14, 2007 at 15:37 UTC ( [id://657057]=perlmeditation: print w/replies, xml ) Need Help??

I've written some code that saves me time. I hope that, in the long run, it will save me more time than it took me to write it. But as I'm still new to Perl, I would appreciate any advice on where I might have gone wrong. There are a couple of areas where I'm sure things could be done more elegantly, and I have documented these.

In fact, I think I have documented everything pretty thoroughly. So I would be just as grateful for comments on the documentation as on the code or style. Feel free to go all perlmonks on me or attempt a Socratic dialogue - I'm trying to learn. I have tried to change all the http references to PM nodes to the appropriate id format, but I may have missed some, and I want to keep them in http format at my end. In fact, the id format references don't seem to work very well in a code block. Nonetheless, I'd prefer to be too conservative than anything else.

The copyright clause is one I have used for other stuff. I resent people making money from my work while I'm on the dole! Obviously, I can be /msged here.

TIA and regards,

John Davies
use strict; use warnings; use diagnostics; =head1 Synopsis Does the donkey work for advising banks of phishing spams. Intended fo +r use by people who can read - at least at a basic level - email headers +. LARTs are explained here: http://catb.org/jargon/html/L/LART.html =head1 Version Version 1.0 - first release version =head1 Description =head2 Inputs =head3 Handoffs The user may enter a number for the handoffs. This will be included (i +f > 0) at the top of the output, advising that ''the first (n) Received: headerZ<>(s) (is anŠare) internal handoffZ<>(s) at my ISP''. If this i +tem is blank or zero, no such information will be produced. =head3 Insertion point The insertion point is copied in from the email. I use Thunderbird, in which the most reliable version always appears as a dotted quad. To the best of my knowledge and belief, this is not a function of Thunderbird, but of standard email practices in terms of the RFCs. It is the dotted quad format that should be entered. It is up to the user to determine the insertion point. It is always possible for an ISP to be assigned new IP addresses and to use these in handoffs, but the user is expected to be able to identify these - at least after parsing them - and to tell the bank what they are in the LART. If multiple addresses are parsed, that is up to the user. The user is able to save lines from multiple insertion point parses, but this is probably a bad idea. A line will be produced in the output that will quote the IP address being parsed as the insertion point. It is up to the user to decide whether to include this in the LART, but it is something I always do. =head3 Landing point The landing point will normally be in an alphaZ<>(numeric) form. There is no need to look for a dotted quad, as there is for the insertion point. The parse of the landing point will normally do two things. First, it will identify the IP address of the landing point, putting this in the output. Then it will do a whois on that IP address, and put the resulting information in the output as well. A spam may have more than one landing point, or it may take several parses to get the information needed. On the assumption that there will be more than one more often than it will be necessary to do multiple parses of a single landing point, the landing points are numbered. The first will appear simply as ''Landing point:''. Subsequent ones will appear as ''Landing point 2:'' or whatever the number is. If you are doing multiple parses of a single landing point, you will probably want to correct this manually in the final LART. A blank line will automatically be inserted in the LART to separate each landing point. If there are multiple parses of a single landing point, these blank lines may also have to be removed. =head3 Server The whois server defaults to whois.completewhois.com. This can be changed by moving a different server to the top of the list in the __DATA__ section. The user can select a different server at any time - there is no need to use the same server for all queries, even within a single LART. The server timeout is set to 10 seconds, but seems to take more like 20. This may be due to processing time in Perl, but parses that work typically take under 2 seconds, so there would have to be a lot of time spent in error handling routines. The following problems exist with various servers: completewhois: Reportedly down a lot of the time. LACNIC sometimes blocks their queries, so it may be necessary to use LACNIC directly. geektools: Limits you to 50 queries a day (unless you make special arrangements with them). Also reports YOUR IP address in the output, which can confuse you. domaintools: Seems to be dead at the time of writing. networksolutions: Output contains a lot of advertising. abuse: Intended to give Postmaster@ addresses. Useful if you want to LART the hosts of the injection point or landing point. However, these are usually chosen by spammers for their unresponsiveness. The bank is more likely to get action. YMMV. All others currently listed: Cover only their own subsets of whois. Users can enter their own whois server only by adding them in the __DATA__ block. If this is invalid, the error should be trapped. =head3 Banks A list of known banks is on the system, with the phish reporting address of each and part of the SpamGourmet address I give them to reply to. Citibank reject my emails because they look like spam! Of course they bleeping do. They contain the phishing spam I am trying to help them deal with. ''Hey, we aren't getting any more phishes reported! We must be doing something right!''. Not every organisation on this list is a bank in the strict sense. PayPal isn't. However, most are banks, and if they can be phished, I think it makes sense to advise them. Two ''entry'' boxes appear to the right of the list box containing the list of banks. These contain the information for the selected bank. Clicking on either of these entry boxes copies the contents to the clipboard, replacing anything already there. =head4 ''From'' information This is probably useless to most people. I never give out my real email address. Instead I give out SpamGourmet addresses. This means that anything coming to my real address without SpamGourmet headers is 100% spam. I can also prevent spammy suppliers getting their rubbish through to me by setting that individual address to allow nothing more through. By holding the data here, I avoid the danger of giving two addresses to the same organisation. This should be no problem even if it does happen, but it's tidier if it's kept under control. =head4 ''To'' information This is more likely to be of use to the average user. Every bank seems to have a different format of address to report phish attempts. It helps to keep a record, rather than going through past emails, especially if the user does not get frequent phishes for a bank. =head3 Storing marked lines This button can be hit only when there are some lines to mark. However, it is not necessary to have marked any lines. This command will copy any lines that have been ticked into the final LART, but will also clear the lines. So if a parse has returned useless information, the output can be cleared and a new attempt made by hitting this button with no lines marked. However, this isn't necessary, as any new parse will start by clearing out old data. Once lines have been stored, they cannot be removed except by starting a new LART. Hitting the ''Prepare LART'' button will do this - the output to the clipboard can be ignored. =head3 Preparing the LART Clicking on the ''Prepare LART'' button will copy the stored lines of the various parses into the Windows clipboard, and will clear all information ready for a new LART. =head2 Outputs All outputs are to the Losedows clipboard. They are produced, as described under the relevant sections above, by: =over 4 =item * Clicking on the ''From'' entry box =item * Clicking on the ''To'' entry box =item * Clicking on the ''Prepare LART'' button =back Any of these actions replaces whatever is already in the clipboard. =head1 Author Dr. John Davies 51 Elephant Lane London SE16 4JD =head1 To Do I want to improve on the mindbogglingly ugly kludge that removes the scrollbars in clearframe. The only thing to be said in its favour is that it works. My instincts tell me there's got to be a better way of doing it than $hoaBank{$sFields[0]} = [ $sFields[1], $sFields[2] ]; When I know how to, I plan to build a series of tests for the programm +e. Validate the contents of the __DATA__ block, so that invalid data does +n't end up in the list boxes. This may not be possible. Different whois server +s may accept different parameters, and the current system means that users c +an put whois parameters in with the server name. A server can quite reaso +nably appear twice in the list, with different parameters built in. One fine day, it might be nice to maintain external files of whois ser +vers and bank information via the programme, instead of holding them within + the programme as __DATA__. It might also be nice to put in a command line option to ignore the '' +From'' data, as this is something that only a few people like me are likely t +o want. =head1 Bugs I use OpenIDE. It seems to me that it doesn't always know when to (and + not to) colour code purple. #== appears at the end of several lines, and s +tops the purple colour. =head1 Copyright All aspects of this programme are my copyright. I don't care what peop +le do with it for fun, but if anyone is making money from it or any development they may have made from it, I want my cut. This includes, but is not restricted to, putting it on a disc for which ''only a distribution charge'' (or similar weasel words) is charged, or which i +s free if you buy something else, like a magazine. If you put it on a we +b site for free download, that's fine, but if you charge for access, I want my cut, even if you don't charge specifically for this. If in doubt, contact me first. Don't assume that I will accept your payment scales. Don't assume that you can email me - I have aggressive spam blocking in place. If contacting me isn't worth a stamp to you, it's not worth bandwidth and disc space to me. If you don't like any of thi +s, don't use it. =head1 Code =cut use Tk; use Tk::Pane; use Tk::JComboBox; use IO::Socket; use Net::Nslookup; use Win32::Clipboard; =head2 Naming Conventions Hungarian notation has been used. Out of laziness, a single character has been used most of the time. However, to avoid possible ambiguity, three characters have been used in some places. There is one exception. The variable ''$sock'' has been copied from another source, documented in the code comments. I'm too lazy to work out what it contains, so I haven't used Hungarian notation for it. I know that the use of capital letters in variable names is deprecated in ''Perl Best Practices'', although I haven't read the book, but I find them easier to read. =head2 Global Variables Global variables are used for: =over 4 =item * Data that are needed at all stages of processing =item * Tk widgets =item * Variables bound to Tk widgets =back =cut my @sLart; #Assembly for the elements of the email my @aoaCheck; #Array of checkbox data #[][0] is the checkbox itself #[][1] is the 'ticked' variable #[][2] is the text my $nLandings = 0; #Number of landing points analysed my $sServer; #Whois server my %hoaBank; #Hash of Arrays of banks my $sBank; #Selected bank from combobox my $sTo; #Email address of $sBank my $sFrom; #Spamgourmet prefix for $sBank my $wMain = MainWindow->new; $wMain->geometry('890x500'); $wMain->resizable(0,0); my $fraChk = $wMain->Scrolled('Pane', -scrollbars => 'oe', -height => 440, -width => 890, -sticky => 'w') ->place(-x => 0, -y => 60);#== my $lHandoffs = $wMain->Label(-text => 'Internal handoff(s)') ->place(-x => 0, -y => 0); #== my $eHandoffs = $wMain->Entry(-width => 3) ->place(-x => 100, -y => 0); #== my $lInsertion = $wMain->Label(-text => 'Insertion point') ->place(-x => 130, -y => 0); #== my $eInsertion = $wMain->Entry(-width => 16) ->place(-x => 220, -y => 0); #== my $butInsertion = $wMain->Button( -text => 'Parse', -command => sub{insertion($eInsertion->g +et())}) ->place( -x => 330, -y => 0); #== my $lLanding = $wMain->Label(-text => 'Landing point') ->place(-x => 380, -y => 0); #== my $eLanding = $wMain->Entry(-width => 30) ->place(-x => 460, -y => 0); #== my $butLanding = $wMain->Button(-text => 'Parse', -command => sub{landing($eLanding->get() +)}) ->place(-x => 655, -y => 0); #== my $butStore = $wMain->Button(-text => 'Store marked lines', -state => 'disabled', -command => sub{store()}) ->place(-x => 703, -y => 0); #== my $butLart = $wMain->Button(-text => 'Prepare LART', -state => 'disabled', -command => sub{lart()}) ->place(-x => 810, -y => 0); #== my $cboServer = $wMain->JComboBox( -textvariable => \$sServer ) ->place(-x => 0, -y => 30); #== my $lBank = $wMain->Label(-text => 'Bank') ->place(-x => 300, -y => 30); #== my $cboBank = $wMain->JComboBox( -textvariable => \$sBank, -selectcommand => sub {$sFrom = $hoaBank{$sBank}[0]; $sTo = $hoaBank{$sBank}[1]; } ) ->place(-x => 330, -y => 30); #== my $eTo = $wMain->Entry(-width => 30, -textvariable => \$sTo) ->place(-x => 460, -y => 30); #== $eTo ->bind('<Button-1>', sub {Win32::Clipboard::Set($sTo); } ); my $eFrom = $wMain->Entry(-width => 30, -textvariable => \$sFrom) ->place(-x => 670, -y => 30); #== $eFrom ->bind('<Button-1>', sub {Win32::Clipboard::Set($sFrom); } ); =head2 Bare block before MainLoop; This is written as a bare block to ensure that the variable created inside it does not become a global variable. This block reads the __DATA__ block and puts the contents into the banks and servers listboxes, depending on the format. Blank lines are ignored. Lines with an ''@'' symbol are assumed to be bank email addresses, of the fo +rm name (white space) email@domain (white space) from Deviating from this format will produce invalid information in the listbox or ''entry'' boxes for the banks. Non-blank lines without an ''@'' symbol are put into the list of servers. Including white space on such lines is likely to cause parses to fail or produce invalid data if that data line is selected as the server, unless they are valid server options. The first line inserted into the list of servers is also used as the default server. =cut { my $nDefault = 1; while (<DATA>) { chomp; if (/@/) { my @sFields = split /\s+/, $_; $hoaBank{$sFields[0]} = [ $sFields[1], $sFields[2] ]; $cboBank->addItem($sFields[0]); } elsif (length) { $cboServer->addItem($_, -selected => $nDefault); $nDefault = 0; } } } MainLoop; =head2 Subroutine ''output'' Although ''place'' is used everywhere else for geometry management, it has been necessary to use ''grid'' here. This is because ''place'' doe +s not coexist happily with scrollbars, as discussed here: [id://id=651906]. According to ''Mastering Perl/Tk'' by Steve Lidie and Nancy Walsh, First Edition, published January 2002, + it is permissible to use different geometry managers in different frames. =head3 Inputs Array of data usually produced by a whois or nslookup query, but somet +imes as a failure message. =head3 Outputs =over =item * To screen - the input array with a checkbox at the start of each line =item * @aoaCheck - contains checkboxes (the Tk widget) in element 0, the vari +able to indicate if the box is ticked in element 1, and the text in element + 2 =item * The ''store marked lined'' button is enabled =back =head3 Called by =over =item * landing =item * insertion =back =head3 Calls made =over =item * clearframe =back =cut sub output { clearframe(); for (@_) { chomp; push @aoaCheck, [$fraChk->Checkbutton( -text => $_, -font => 'courier 10', -justify => 'left' ) ->grid(-column => 0, -row => $#aoaCheck + 1, -sticky => 'w' ), undef, $_ ]; #The variable can't be set in the 'push' statement. Until it has been #comleted, the value of $#aoaCheck is not incremented. Either you have #to increment it by one, meaning that the variable is misaligned, or #you get the first one being assigned to index -1, which the compiler #dislikes, even though it seems to get corrected later! $aoaCheck[$#aoaCheck][0]->configure( -variable => \$aoaCheck[$#aoaCheck][1]); } } =head2 Subroutine ''landing'' =head3 Inputs A single datum, which may be in the form of a dotted quad or of an alphaZ<>(numeric) URL. This is entered in the landing entry box, and passed as a parameter by the Tk call. =head3 Outputs An array of text. If not a dotted quad, then the resolution of the web site to a dotted +quad is produced. The dotted quad (whether entered or resolved) is resolved + and this resolution is included in the output array. =head3 Called by Tk. Bound to the "Parse" button next to the landing entry box. =head3 Calls made =over =item * output =back =cut sub landing { my @sOutput; my $sLanding = shift(@_); $nLandings++; my $sHeader = "Landing point"; if ($nLandings > 1) { $sHeader = "$sHeader $nLandings" } $sHeader = "$sHeader" . ':' . $sLanding; @sOutput= $sHeader; #[id://221512] #warns against using a regex to parse IP addresses. #However, Thunderbird always provides addresses #as dotted quads, nslookup seems to do the same, #and since the node offers no suggested #code for doing better, this will just have to do. if ($sLanding =~ /^(\d+\.){3}\d+$/) { push (@sOutput, whois ($sLanding)); } else { if (defined($sLanding)) { my $sResolved = nslookup $sLanding; unless ($sResolved eq "") {push(@sOutput, $sLanding . " resolves + to " . $sResolved)}; push (@sOutput, whois($sLanding)); unless ($sResolved eq "") {push (@sOutput , whois ($sResolved))} } else { push (@sOutput, "Undefined landing point"); } } $butStore->configure(-state => 'normal'); output @sOutput; } =head2 Subroutine ''insertion'' =head3 Inputs A single datum, which should be in the form of a dotted quad. This is entered in the insertion entry box, and passed as a parameter by the T +k call. =head3 Outputs An array of text =head3 Called by Tk. Bound to the "Parse" button next to the insertion entry box. =head3 Calls made =over =item * output =back =cut sub insertion { my @sOutput; my $insertion = shift(@_); #The insertion point must be a dotted quad. If it's not, it's not #a reliable insertion point per my version of Thunderbird if ($insertion =~ /^(\d+\.){3}\d+$/) { @sOutput = ("Insertion point: $insertion", whois ($insertion)); } else { $sOutput[0] = "No insertion point identified"; } $butStore->configure(-state => 'normal'); output @sOutput; } =head2 Subroutine ''store'' =head3 Inputs No parameters passed. Uses @aoaCheck =head3 Outputs =over =item * @sLart (may already contain data) =item * ''Prepare LART'' button enabled if any lines actually stored =back =head3 Called by Tk. Bound to ''Store marked lines'' button. =head3 Calls made =over =item * clearframe =back =cut sub store{ $butStore->configure(-state => 'disabled'); for (0 .. $#aoaCheck) { if ($aoaCheck[$_][1]) { push (@sLart, $aoaCheck[$_][2]); $butLart ->configure(-state => 'normal'); } } push (@sLart,""); clearframe(); } =head2 Subroutine ''whois'' Plagiarised from [id://=23082] =head3 Inputs A single datum, which may be a dotted quad or an alphaZ<>(numeric) URL +. The selected whois server is used to parse the data passed. =head3 Outputs An array of text, containing the whois information on the input =head3 Called by =over =item * landing (3 places) =item * insertion =back =head3 Calls made None =cut sub whois { use vars qw/$sock $sPort @sData/; $sPort = "43"; my $sTarget = $_[0]; die "Whois Usage: whois(<domain>)" unless(defined($sTarget)); $sock = IO::Socket::INET->new(PeerAddr => $sServer, PeerPort => $sPort, Proto => 'tcp', Timeout => 10); eval { print $sock $sTarget,"\r\n" }; if ($@) { @sData =("Error while parsing. Perhaps the whois server is down?", +$@); $butStore->configure(-state => 'normal'); } else { @sData = <$sock>; } undef($sock); @sData; } =head2 Subroutine ''lart'' =head3 Inputs =over =item * No parameters passed =item * @sLart used as input =item * $nHandoffs used as input =back =head3 Outputs =over =item * LART put into Losedows clipboard =item * $eHandoffs (handoffs entry box) cleared =item * $eLanding (landing point entry box) cleared =item * $eInsertion (insertion point entry box) cleared =item * @sLart cleared =item * @aoaCheck cleared =item * $nLandings set to zero =item * ''Prepare LART'' button disabled =back =head3 Called by Tk. Bound to ''Prepare LART'' button. =head3 Calls made None =cut sub lart{ my $nHandoffs = $eHandoffs->get(); my $sClip = ""; if ($nHandoffs == 1) { $sClip = "The first Received: header is an internal handoff at my +ISP\r\n" } elsif ($nHandoffs > 1) { $sClip = "The first $nHandoffs Received: headers are internal hand +offs at my ISP\r\n" } for (@sLart) { $sClip = "$sClip\r\n$_"; } Win32::Clipboard::Set($sClip); $eHandoffs->delete(0, 'end'); $eLanding->delete(0, 'end'); $eInsertion->delete(0, 'end'); undef @sLart; undef @aoaCheck; $nLandings = 0; $butLart ->configure(-state => 'disabled'); } =head2 Subroutine ''clearframe'' =head3 Inputs None =head3 Outputs =over =item * @aoaCheck - cleared & undefined =back =head3 Called by =over =item * output =item * store =back =head3 Calls made None =cut sub clearframe { for (0 .. $#aoaCheck) { $aoaCheck[$_][0]->gridForget( ); } undef @aoaCheck; #The following kludge succeeds in getting rid of the scrollbars afte +r #clearing the frame. I can't find any other way of doing it. my $chkDummy = $fraChk->Checkbutton( -text => '') ->grid(-column => 0, -row => 0); $wMain->update; $chkDummy->gridForget( ); $wMain->update; } close DATA; exit; =head1 Data =head2 Blank lines Ignored =head2 Email address lines Defined as being any line containing the ''@'' sign Expected format: Bank (white space) Email address (white space) From information =head2 Server lines Defined as being any non-blank line not containing the ''@'' sign The entire line is treated as being the server. It is possible to include multiple copies of the same server with different options. =head2 Other lines Assumed to be server lines. =cut __DATA__ whois.completewhois.com whois.geektools.com whois.domaintools.com whois.networksolutions.com whois.abuse.net whois.internic.net whois.arin.net whois.ripe.net whois.apnic.net whois.lacnic.net whois.apnic.net whois.afrinic.net whois.denic.de Citibank NotAccepted@braindead CitibankPhish CitizensBank fraud_prevention@citizensbank.com CitizensBankPhish HSBC phishing@hsbc.com HSBCPhish Nationwide phishing@nationwide.co.uk NationwidePhish NatWest phishing@natwest.com NatWestPhish PayPal spoof@paypal.com PayPalPhish RBS phishing@rbs.co.uk RBSPhish

Replies are listed 'Best First'.
Re: RFC: spam reporting assistant
by Jaap (Curate) on Dec 18, 2007 at 15:42 UTC
    Wow that is quite long. One comment: In stead of this:
    my @sLart; #Assembly for the elements of the email my @aoaCheck; #Array of checkbox data #[][0] is the checkbox itself #[][1] is the 'ticked' variable #[][2] is the text my $nLandings = 0; #Number of landing points analysed my $sServer; #Whois server my %hoaBank; #Hash of Arrays of banks my $sBank; #Selected bank from combobox my $sTo; #Email address of $sBank my $sFrom; #Spamgourmet prefix for $sBank
    it is usually better to declare your variables where you use them first (that way, and with strict, you get a nice warning when you double-use a variable name).
      The length is mostly down to documentation, as I'm sure you realise. Without the POD statements, it's more like 250 lines, and can be reduced still further by putting all assignments (especially in the Tk statements) on a single line. Isn't that always the way with documentation?

      As far as the global variables are concerned, about half of them are bound to Tk widgets. I wanted to keep like with like, and declare them before I started declaring the widgets. On the same basis, I wanted all the global variables declared together. "All the rotten eggs in one basket". The code is laid out thuswise:
      use
      global variables
      Tk
      Bare block to load __DATA__
      MainLoop
      Subroutines

      If this is bad style, I'd like to change it. But doing it this way fits my brain, so if there are reasons, I'd love to know them, whether by direct explanation or by reference. I can see that declaring a variable before it's needed can mean that it gets used accidentally. But given that there are only 8 of them, I felt that having them in a single block would mean that there was a reduced chance of double use, and an increased chance of seeing what each declaration was intended to achieve.

      Thanks and regards,

      John Davies

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://657057]
Approved by Arunbear
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-04-20 02:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found