I am writing a quick script that parses my Apache access log and prints the most recent file accesses. The catch is that I want it to print only the most recent file accessed from each unique IP, and I want the output sorted by date.
#!/usr/bin/perl
use warnings;
use strict;
use Date::Manip;
use vars qw(%ipHash);
while(<DATA>)
{
/ ^((\d{1,3}\.){3}\d{1,3}) # grab the IP address into $1
\s\-\s\-\s\[
(\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d) # grab the date into $3
\s\-\d{4}\]\s"\w{1,4}\s
([\/\w.]+) # grab the file path into $5 (no "|" needed inside a character class)
/x
and $ipHash{&UnixDate($3,"%s")} = [$1, $3, $5];
}
print join "\n",
map
{$ipHash{$_}[0] . " => " . $ipHash{$_}[1] . "\t" . $ipHash{$_}[2]}
sort { $a <=> $b } keys %ipHash; # compare the epoch keys as numbers, not strings
__DATA__
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin HTTP/1.1" 301 322
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898
68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024
129.22.39.158 - - [17/Oct/2002:18:05:10 -0400] "OPTIONS / HTTP/1.1" 200 0
160.79.211.121 - - [17/Oct/2002:19:51:31 -0400] "GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0" 400 303
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 4440
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2962
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302
That code produces the following output:
209.36.83.252 => 17/Oct/2002:05:53:17 /scripts/..
68.9.44.75 => 17/Oct/2002:06:50:34 /phpMyAdmin/
68.9.44.75 => 17/Oct/2002:06:50:36 /phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31 /default.ida
129.22.82.8 => 17/Oct/2002:20:37:10 /index.php
129.22.82.8 => 17/Oct/2002:21:25:44 /manual/images/apache_pb.gif
Ideally, though, I don't want any IP repeated; I just want to see the last file accessed by each IP. So the output would look like:
209.36.83.252 => 17/Oct/2002:05:53:17 /scripts/..
68.9.44.75 => 17/Oct/2002:06:50:36 /phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31 /default.ida
129.22.82.8 => 17/Oct/2002:21:25:44 /manual/images/apache_pb.gif
However, to keep the output sorted by date, I key the hash by the Unix timestamp rather than by the IP. Would I need to set up some kind of dual-hash arrangement so I can sort by date but keep only one entry for each IP address?
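For what it's worth, here is a rough sketch of the alternative I keep circling around: key the hash by IP, store the timestamp inside each entry, overwrite an entry only when a newer line turns up, and then sort the IPs by the stored timestamp at print time. The names log_epoch and latest_per_ip are made up for illustration, and I swapped Date::Manip for the core Time::Local module here just to keep the sketch self-contained. It ignores the zone offset, so it assumes every line in the log shares the same offset. I am not at all sure this is the cleanest way to do it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Hypothetical helper: turn "17/Oct/2002:05:53:17" into epoch seconds.
# The zone offset is ignored (assumption: all lines share one offset).
sub log_epoch {
    my ($stamp) = @_;
    my ($d, $mon, $y, $h, $m, $s) =
        $stamp =~ m{^(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d)$}
        or return;
    return unless exists $MONTH{$mon};
    return timegm($s, $m, $h, $d, $MONTH{$mon}, $y);
}

# Hypothetical helper: returns [ip, date, path] triples, one per IP,
# ordered by the time of that IP's most recent request.
sub latest_per_ip {
    my (@lines) = @_;
    my %latest;    # IP => [epoch, date string, path]
    for my $line (@lines) {
        my ($ip, $date, $path) = $line =~ m{
            ^((?:\d{1,3}\.){3}\d{1,3})        # IP address
            \s-\s-\s\[
            (\d\d/\w{3}/\d{4}(?::\d\d){3})    # date without the zone
            \s[-+]\d{4}\]\s"\w+\s
            ([\w./]+)                         # path up to the first other character
        }x or next;
        my $epoch = log_epoch($date);
        next unless defined $epoch;
        # ">=" lets a later line with the same second replace an earlier one
        $latest{$ip} = [$epoch, $date, $path]
            if !$latest{$ip} || $epoch >= $latest{$ip}[0];
    }
    return map  { [$_, @{ $latest{$_} }[1, 2]] }
           sort { $latest{$a}[0] <=> $latest{$b}[0] or $a cmp $b }
           keys %latest;
}

# A few lines taken from the log above, deliberately out of order.
my @sample = (
    '68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898',
    '68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024',
    '129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673',
    '209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313',
);
print "$_->[0] => $_->[1]\t$_->[2]\n" for latest_per_ip(@sample);
```

Unlike my script above, this version accepts any request method (so the OPTIONS line would show up too), and because the hash holds one entry per IP, memory grows with the number of distinct clients rather than with the number of distinct timestamps.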
I just can't seem to wrap my head around this and wondered if any monks had some nifty ideas.
Thanks,
enoch
P.S. You gotta love those 'default.ida?NNNNNNN' entries.