PerlMonks  

I am writing a quick script that will parse my Apache access log and print out the most recent file accesses. The gotcha is that I want it to print only the most recent file accessed from each unique IP, and I want the output sorted by date.

Here is my code:

#!/usr/bin/perl
use warnings;
use strict;
use Date::Manip;
use vars qw(%ipHash);

while(<DATA>) {
    /
    ^((\d{1,3}\.){3}\d{1,3})              # grab the IP address into $1
    \s\-\s\-\s\[
    (\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d)    # grab the date into $3
    \s\-\d{4}\]\s"\w{1,4}\s
    ([\/|\w|\.|_]+)                       # grab the file path into $5
    /x
    and $ipHash{&UnixDate($3,"%s")} = [$1, $3, $5];
}

print join "\n",
    map {$ipHash{$_}[0] . " => " . $ipHash{$_}[1] . "\t" . $ipHash{$_}[2]}
    sort keys %ipHash;

__DATA__
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin HTTP/1.1" 301 322
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898
68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024
129.22.39.158 - - [17/Oct/2002:18:05:10 -0400] "OPTIONS / HTTP/1.1" 200 0
160.79.211.121 - - [17/Oct/2002:19:51:31 -0400] "GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0" 400 303
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 4440
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2962
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302

That code produces the following output:

209.36.83.252 => 17/Oct/2002:05:53:17	/scripts/..
68.9.44.75 => 17/Oct/2002:06:50:34	/phpMyAdmin/
68.9.44.75 => 17/Oct/2002:06:50:36	/phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31	/default.ida
129.22.82.8 => 17/Oct/2002:20:37:10	/index.php
129.22.82.8 => 17/Oct/2002:21:25:44	/manual/images/apache_pb.gif

Now, ideally, I would not want the IPs repeated. Rather, I just want to see the last file accessed by each IP. So, the output would look like:

209.36.83.252 => 17/Oct/2002:05:53:17	/scripts/..
68.9.44.75 => 17/Oct/2002:06:50:36	/phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31	/default.ida
129.22.82.8 => 17/Oct/2002:21:25:44	/manual/images/apache_pb.gif

But, to maintain my sorting by date, I key the hash by the Unix timestamp and not the IP, which is why the same IP can show up more than once. Would I need to set up some kind of dual-hash arrangement so I can sort by date but keep only one entry for each IP address? I just can't seem to wrap my head around this and wondered if any monks had some nifty ideas.

Thanks,
enoch

P.S. You gotta love those 'default.ida?NNNNNNN' entries.
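
One way to get both properties without two hashes might be to key the hash by IP, store the epoch time inside each record, and sort the records by that stored time at the end. The following is only a sketch under some assumptions: the sub names (apache_epoch, parse_line, latest_per_ip) are made up for illustration, Time::Local is swapped in for Date::Manip so that it runs with core modules only, and the sample lines are a subset of the data in the post. The method and path patterns follow the post's regex, so OPTIONS requests are still skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Turn "17/Oct/2002:05:53:17" plus a zone offset such as "-0400"
# into epoch seconds (UTC). Returns undef if the input does not parse.
sub apache_epoch {
    my ($date, $zone) = @_;
    my ($d, $mon, $y, $h, $m, $s) =
        $date =~ m{\A(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d)\z} or return;
    my ($sign, $oh, $om) = $zone =~ m{\A([+-])(\d\d)(\d\d)\z} or return;
    return unless exists $MONTH{$mon};
    my $offset = ($sign eq '-' ? -1 : 1) * ($oh * 3600 + $om * 60);
    return timegm($s, $m, $h, $d, $MONTH{$mon}, $y) - $offset;
}

# Parse one log line into [ip, epoch, date, path]; undef if it does not match.
sub parse_line {
    my ($line) = @_;
    my ($ip, $date, $zone, $path) = $line =~ m{
        \A ( \d{1,3} (?: \. \d{1,3} ){3} )    # client IP
        \s - \s - \s                          # identd and user fields
        \[ ( [^\s\]]+ ) \s ( [+-]\d{4} ) \]   # local date and zone offset
        \s " \w{1,4} \s                       # short methods only, as in the post
        ( [\/\w.]+ )                          # path up to the first other character
    }x or return;
    my $epoch = apache_epoch($date, $zone);
    return unless defined $epoch;
    return [$ip, $epoch, $date, $path];
}

# Keep only the newest record per IP, then order the survivors by time.
sub latest_per_ip {
    my @lines = @_;
    my %latest;    # ip => [ip, epoch, date, path]
    for my $line (@lines) {
        my $rec  = parse_line($line) or next;
        my $seen = $latest{ $rec->[0] };
        # ">=" lets the later line win when two requests share a second
        $latest{ $rec->[0] } = $rec if !$seen || $rec->[1] >= $seen->[1];
    }
    return sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] } values %latest;
}

my @sample = (
    '209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313',
    '68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898',
    '68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024',
    '129.22.39.158 - - [17/Oct/2002:18:05:10 -0400] "OPTIONS / HTTP/1.1" 200 0',
    '129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430',
    '129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673',
    '129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302',
);

printf "%s => %s\t%s\n", @{$_}[0, 2, 3] for latest_per_ip(@sample);
```

Because each IP owns exactly one slot in the hash, a newer request simply overwrites the older one, and the sort happens once over the surviving records rather than over the hash keys. For 68.9.44.75 this reports the 06:50:36 request for /phpMyAdmin/left.php, since that is the latest matching line for that IP. Reading from a real log would just mean passing the lines of the file handle instead of @sample.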

In reply to Parsing Apache Log to Get Most Recent File Access by enoch
