I am writing a quick script that parses my Apache access log and prints out the most recent file accesses. The catch is that I want it to print only the most recent file accessed from each unique IP, and I want the output sorted by date.

Here is my code:

#!/usr/bin/perl
use warnings;
use strict;
use Date::Manip;
use vars qw(%ipHash);

while(<DATA>) {
    /
    ^((\d{1,3}\.){3}\d{1,3})              # grab the IP address into $1
    \s\-\s\-\s\[
    (\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d)    # grab the date into $3
    \s\-\d{4}\]\s"\w{1,4}\s
    ([\/|\w|\.|_]+)                       # grab the file path into $5
    /x
    and $ipHash{&UnixDate($3,"%s")} = [$1, $3, $5];
}

print join "\n",
    map {$ipHash{$_}[0] . " => " . $ipHash{$_}[1] . "\t" . $ipHash{$_}[2]}
    sort keys %ipHash;

__DATA__
- - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
- - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
- - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
- - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
- - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin HTTP/1.1" 301 322
- - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898
- - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024
- - [17/Oct/2002:18:05:10 -0400] "OPTIONS / HTTP/1.1" 200 0
- - [17/Oct/2002:19:51:31 -0400] "GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0" 400 303
- - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430
- - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 4440
- - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2962
- - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673
- - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302

That code produces the following output:

=> 17/Oct/2002:05:53:17 /scripts/..
=> 17/Oct/2002:06:50:34 /phpMyAdmin/
=> 17/Oct/2002:06:50:36 /phpMyAdmin/left.php
=> 17/Oct/2002:19:51:31 /default.ida
=> 17/Oct/2002:20:37:10 /index.php
=> 17/Oct/2002:21:25:44 /manual/images/apache_pb.gif

Now, ideally, I would not want the IPs repeated. Rather, I just want to see the last file accessed by each IP. So, the output would look like:

=> 17/Oct/2002:05:53:17 /scripts/..
=> 17/Oct/2002:06:50:34 /phpMyAdmin/
=> 17/Oct/2002:19:51:31 /default.ida
=> 17/Oct/2002:21:25:44 /manual/images/apache_pb.gif

But, to maintain my sorting by date, I key the hash by the Unix timestamp and not by the IP. Would I need to set up some kind of dueling-hash arrangement so I can sort by date but keep only one entry for each IP address? I just can't seem to wrap my head around this and wondered if any monks had some nifty ideas.
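
For what it's worth, here is a rough sketch of the direction I suspect this goes: key the hash by IP, store the epoch inside each entry, and sort the values by that epoch at print time. Treat it as an untested idea rather than the answer. The sub names log_epoch and latest_by_ip are made up for illustration, the 192.0.2.x addresses are placeholder documentation addresses and not real log data, and it uses the core Time::Local module in place of Date::Manip only so the sketch runs without extra installs. The method match is also loosened to \w+ so longer verbs such as OPTIONS are not skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Convert an Apache stamp such as "17/Oct/2002:05:53:17 -0400" to epoch seconds.
# Returns undef if the stamp does not have that shape.
sub log_epoch {
    my ($stamp) = @_;
    my ($d, $mon, $y, $h, $m, $s, $sign, $oh, $om) =
        $stamp =~ m{^(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)$}
        or return;
    return unless exists $month{$mon};
    # Offset of local time from UTC, in seconds (negative west of Greenwich).
    my $offset = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return timegm($s, $m, $h, $d, $month{$mon}, $y) - $offset;
}

# Keep only the newest hit per IP (hash keyed by IP, not by time),
# then return the surviving entries sorted oldest to newest.
sub latest_by_ip {
    my (@lines) = @_;
    my %latest;
    for my $line (@lines) {
        my ($ip, $stamp, $path) =
            $line =~ m{^(\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[([^\]]+)\] "\w+ ([/\w.]+)}
            or next;
        my $epoch = log_epoch($stamp);
        next unless defined $epoch;
        # ">=" lets a later line with the same timestamp replace an earlier one.
        $latest{$ip} = { ip => $ip, stamp => $stamp, path => $path, epoch => $epoch }
            if !$latest{$ip} || $epoch >= $latest{$ip}{epoch};
    }
    return sort { $a->{epoch} <=> $b->{epoch} || $a->{ip} cmp $b->{ip} } values %latest;
}

# Hypothetical sample lines: placeholder IPs, paths borrowed from the log above.
my @sample = (
    '192.0.2.10 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898',
    '192.0.2.10 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1 HTTP/1.1" 200 1024',
    '192.0.2.20 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313',
    '192.0.2.30 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430',
    '192.0.2.30 - - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302',
);

print "$_->{ip} => $_->{stamp}\t$_->{path}\n" for latest_by_ip(@sample);
```

If that holds up, a second hash would not be needed: the IP key does the de-duplication and the stored epoch does the ordering.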


P.S. You gotta love those 'default.ida?NNNNNNN' entries.

In reply to Parsing Apache Log to Get Most Recent File Access by enoch
