Parsing Apache logs with Regex

by TheGorf (Novice)
on Dec 31, 2008 at 19:54 UTC ( [id://733540]=perlquestion )

TheGorf has asked for the wisdom of the Perl Monks concerning the following question:

For a project here at work I have to comb through a lot of logfiles trying to detect some issues. I'm trying to scan through the Apache logfiles and split them into logical pieces line by line so I can mangle the data as needed.

So let's say I have this actual entry in my logfile:
67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"

I'm attempting to break the line up with this regex:

    my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

For some reason the code tag formats that funny. Anyway... that pattern never matches the logfile entry. I'm not hugely strong on regexes, so I am hoping I am just overlooking something obvious.

Thank you!

Replies are listed 'Best First'.
Re: Parsing Apache logs with Regex
by merlyn (Sage) on Dec 31, 2008 at 20:07 UTC
      I posted below... ParseLog only appears to create reports and such from the logfile that you hand it. I don't want that. I need something that lets me go through a logfile line by line and returns the sections of each log entry, so that I can insert them into a database and later crunch on them looking for some problems we have been having. Can ParseLog do that? Nothing in the documentation seems to suggest that it can.
Re: Parsing Apache logs with Regex
by gwadej (Chaplain) on Dec 31, 2008 at 20:09 UTC

    For complicated regexes, you should follow the same advice you would for complicated code: break it up. For something this large, I would definitely use the x modifier, which lets you put whitespace and comments inside the pattern. You also want to be more specific in your matches where possible.

        my $log_pattern = qr{
            ^
            ([\d.]+) \s      # match the IP address
            - \s - \s        # ignore these fields
            \[([^]]+)\]      # here's probably where your problem was
            ...
        }x;

    Following the lead above, you should be able to construct the rest of the expression.

    You might also want to check out Apache::LogRegex. I've never used it, but it looks like it might solve your problem.

    G. Wade
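One way the approach above might be carried through is sketched below. This is only a sketch under assumptions: it expects lines shaped like the sample entry in the question, and the names `$ip`, `$stamp`, `$quoted` and `parse_line` are made up for illustration and do not come from any module. The pattern stops after the user agent, so the trailing quoted IP address in the sample is simply ignored.

```perl
use strict;
use warnings;

# Small named pieces (hypothetical names), each easy to test on its own.
my $ip     = qr{ \d{1,3} (?: \. \d{1,3} ){3} }x;
my $stamp  = qr{ \[ ( [^\]]+ ) \] }x;    # bracketed timestamp
my $quoted = qr{ " ( [^"]* ) " }x;       # one quoted field, no escaped quotes

my $log_re = qr{
    ^ ( $ip )     \s    # client address
    \S+           \s    # identd, ignored
    \S+           \s    # user, ignored
    $stamp        \s    # date, time and zone
    $quoted       \s    # request line
    ( \d{3} )     \s    # status code
    ( \d+ | - )   \s    # bytes sent, or "-" for none
    $quoted       \s    # referrer
    $quoted             # user agent
}x;

# Returns a hash reference of fields, or nothing if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @f = $line =~ $log_re;
    return unless @f;
    my %rec;
    @rec{qw(ip time request status bytes referrer agent)} = @f;
    return \%rec;
}

# The sample entry from the question.
my $sample = q{67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"};

if ( my $rec = parse_line($sample) ) {
    print "$_: $rec->{$_}\n" for sort keys %$rec;
}
```

Using `[^"]*` for the quoted fields instead of `.*` keeps each capture inside its own pair of quotes, which is one example of being "more specific" in the matches.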
Re: Parsing Apache logs with Regex
by borisz (Canon) on Dec 31, 2008 at 22:16 UTC
    I inherit from Regexp::Log or Regexp::Log::Common.
        my $foo = Regexp::Log::Common->new(
            format  => ':common',
            capture => [qw( date req bytes )],
        );
        my @fields = $foo->capture;
        my $re     = $foo->regexp;

        while (<>) {
            my %data;
            @data{@fields} = /$re/;
            ...
        }
    Boris
Re: Parsing Apache logs with Regex
by atcroft (Abbot) on Dec 31, 2008 at 20:24 UTC

    My first suggestion would be to see if there is anything on CPAN that could handle the log entries in a way that would be of use to you. A quick search suggested things like Apache::Logmonster and Apache::ParseLog (among others), and you could look at the source of other modules to see how they may have done it.

    A while back, I looked at this kind of thing for my own curiosity. Looking at the code I was playing with, this might be of some use to you:

        # usual strict and warnings and such
        use Date::Parse;         # to parse the entry date
        use Text::ParseWords;    # to handle quoted entries

        # $logfile defined in skipped code
        open DF, $logfile or die $!;
        while (<DF>) {
            chomp;

            # the following conversion is so the date can be
            # captured intact
            s/(\[|\])/"/g;
            my @part = Text::ParseWords::quotewords( '\s+', 0, $_ );
            $part[3] = str2time( $part[3] );

            # deal with whatever part
            # of the log entry here you need
        }
        close DF;

    Hope that helps.

Re: Parsing Apache logs with Regex
by kyle (Abbot) on Dec 31, 2008 at 20:24 UTC

    Works fine for me.

        use strict;
        use warnings;
        #use diagnostics;

        my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';
        my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

        my @fields = ( $log_line =~ /$log_pattern/ );
        print "$_\n" for @fields;

        __END__
        67.60.185.31
        14/Jan/2008:02:25:54 -0800
        GET
        /display.cgi
        2643943|3334115
        1.1
        200
        55
        -
        Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
        67.60.185.31

    However, I'd probably write it this way:

        use strict;
        use warnings;

        my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

        my $ip_address = qr{ \d{1,3} (?: \. \d{1,3} ){3} }xms;
        my $log_pattern = qr{
            ( $ip_address )
            \s \S+ # user name
            \s \S+ # user group?
            \s
            \[
            (
                \d\d / # day
                (?: Jan | Feb | Mar | Apr | May | Jun
                  | Jul | Aug | Sep | Oct | Nov | Dec ) # month
                / \d{4} # year
                : \d\d : \d\d : \d\d # time
                \s+ \S+ # timezone
            )
            \]
            \s
            \" ( [A-Z]+ ) # method (GET, POST)
            \s+
            ( \S+ ) \? ( \S+ ) # URL parts
            \s+
            HTTP/( 1\.\d ) # protocol version
            \"
            \s
            ( \d+ ) # response code
            \s+
            ( \d+ ) # bytes of response
            \s
            \" ( .* ) \" # referrer
            \s
            \" ( .* ) \" # user agent
            \s+
            \" ( $ip_address ) \"
        }xms;

        my @fields = ( $log_line =~ /$log_pattern/ );
        print "$_\n" for @fields;

    Having written all that, now I'm betting there's a CPAN module that does this and more.

      Once you have a regex that works, I would urge you to add a little bit of code to watch out for lines that the regex does not cope with. If you've not catered for a rare form of line, or new forms of line are invented in the future, it's better to be told about them than to silently ignore them or quietly create rubbish entries in the database.
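The kind of check described above might look something like the following. It is a minimal sketch, not anyone's production code: `split_matches` is a hypothetical helper, and `$log_re` is a deliberately loose stand-in pattern used only to show the bookkeeping, so substitute whichever pattern you settle on.

```perl
use strict;
use warnings;

# Assumption: a loose stand-in pattern (host, timestamp, request, status, bytes).
my $log_re = qr{ ^ (\S+) \s \S+ \s \S+ \s \[ ([^\]]+) \] \s " ([^"]*) " \s (\d{3}) \s (\S+) }x;

# Hypothetical helper: returns the parsed records and the rejected lines.
sub split_matches {
    my (@lines) = @_;
    my ( @parsed, @rejected );
    my $n = 0;
    for my $line (@lines) {
        $n++;
        chomp( my $copy = $line );
        if ( my @f = $copy =~ $log_re ) {
            push @parsed, \@f;                # captures are only used after a successful match
        }
        else {
            push @rejected, [ $n, $copy ];    # keep the line number for the report
        }
    }
    return ( \@parsed, \@rejected );
}

# The sample entry from the question, plus one line that should be rejected.
my @sample = (
    q{67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"},
    q{this line is not an access-log entry},
);

my ( $ok, $bad ) = split_matches(@sample);
warn sprintf "line %d did not match: %s\n", @$_ for @$bad;
printf "%d parsed, %d rejected\n", scalar @$ok, scalar @$bad;
```

Collecting the rejects rather than dying on the first one lets a whole logfile be processed in one pass, with the odd lines reviewed afterwards.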

Re: Parsing Apache logs with Regex
by kennethk (Abbot) on Dec 31, 2008 at 20:17 UTC

    With the following code (unmodified from your post):

        use strict;
        use warnings;

        my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

        my $entry = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

        $entry =~ /$log_pattern/;

        print $1, "\n";
        print $2, "\n";
        print $3, "\n";
        print $4, "\n";
        print $5, "\n";
        print $6, "\n";
        print $7, "\n";
        print $8, "\n";
        print $9, "\n";
        print $10, "\n";
        print $11, "\n";

    I get the output

        67.60.185.31
        14/Jan/2008:02:25:54 -0800
        GET
        /display.cgi
        2643943|3334115
        1.1
        200
        55
        -
        Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
        67.60.185.31

    How are you calling your expression?

      OK, so I stepped back and used your example, but added in my file read, and now with this code:

          #!/usr/bin/perl -w

          use strict;
          use warnings;

          my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

          open (LOG, "< $ARGV[0]") or die "Cannot open file $ARGV[0]\n";
          my @log = <LOG>;
          close ( LOG );

          my $line;

          foreach $line (@log) {
              $line =~ /$log_pattern/;

              print $1."\n";
              print $2."\n";
              print $3."\n";
              print $4."\n";
              print $5."\n";
              print $6."\n";
              print $7."\n";
              print $8."\n";
              print $9."\n";
              print $10."\n";
              print $11."\n";
          }

          close(SEM);

      I get this:

          Use of uninitialized value in concatenation (.) or string at parselogs line 23.
          Use of uninitialized value in concatenation (.) or string at parselogs line 24.
          Use of uninitialized value in concatenation (.) or string at parselogs line 25.
          Use of uninitialized value in concatenation (.) or string at parselogs line 26.
          Use of uninitialized value in concatenation (.) or string at parselogs line 27.

      Now I am very confused.

        The concatenation warnings appear because nothing was captured into $7-$11, so those variables were never set; i.e., your regex failed to match. Are you sure the lines in your file match what you posted?

        In any case, the suggestions to use Apache::ParseLog are being given by very smart people. Unless there is a strong reason not to, I'd say do what they say.

Re: Parsing Apache logs with Regex
by hangon (Deacon) on Jan 01, 2009 at 07:10 UTC

    From my toolbox: I threw this together a while back to parse log files and load them into SQLite. It's not pretty, but it works. Feel free to use what you need.

        #!/usr/bin/perl
        #
        # Parses logfiles & loads to sqlite db

        use strict;
        use warnings;
        use DBI;

        #### CONFIG

        # FILES
        my $logfile = 'access_log';
        my $dbfile  = 'acclog.sdb';

        # TABLES
        my $newlog = 'logentries';
        my $oldlog = 'oldlog';

        # IF needed - creates new input table & renames old one
        # REMEMBER to edit the table names above
        create();

        #### END CONFIG

        my @names = qw(ip id user datime req status bytes referer agent);

        my @cols = qw( ip id user date time zone method bytes status
                       url proto type datime req referer agent);

        my $colstr = join( ',', @cols );

        my @places;
        for (@cols){push @places, '?'}
        my $places = join ',', @places;

        my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect fail';
        my $sql = qq(INSERT INTO `$newlog` ($colstr) values ( $places ) );
        my $sth = $dbh->prepare($sql);

        open my $FH, "$logfile" or die "cannot open file: $logfile\n";

        while (my $line = <$FH>){

            my %dat;
            my @fields = $line =~ m/("[^\"]*"|\[.*\]|[^\s]+)/g;

            $fields[3] =~ s/[\[\]]//g;
            ($dat{date}, $dat{time}) = split /:/, $fields[3], 2;
            ($dat{time}, $dat{zone}) = split / /, $dat{time}, 2;

            $fields[4] =~ s/"//g;    #"
            ($dat{method}, $dat{url}, $dat{proto}) = split / /, $fields[4];

            if ($dat{url} =~ /\/$/){
                $dat{type} = 'dir';
            }else{
                ($dat{type}) = $dat{url} =~ /(\.\w+)$/g;
            }
            $dat{type} = 'file' unless $dat{type};

            $fields[7] =~ s/"//g;    #"
            $fields[8] =~ s/"//g;    #"

            for (0..$#names){
                $dat{ $names[$_] } = $fields[$_];
            }

            my @insert;
            for (@cols){
                push @insert, $dat{$_};
            }

            $sth->execute(@insert);
        }

        close $FH;
        $sth->finish();
        $dbh->disconnect();

        sub create{
            my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect failed';
            my $zql = qq(ALTER TABLE $newlog RENAME TO $oldlog);
            $dbh->do($zql) or die 'rename failed';

            my @cols = qw( ip id user date time zone method bytes status
                           url proto type datime req referer agent);
            my $colstr = join( ',', @cols );

            my $sql = qq(CREATE TABLE $newlog (seq INTEGER PRIMARY KEY AUTOINCREMENT, $colstr) );
            $dbh->do($sql) or die 'create failed';
            $dbh->disconnect();
        }
      It may not be pretty but it works well and saved me a lot of time, thanks. Stuart - Webmaster Words
Re: Parsing Apache logs with Regex
by TheGorf (Novice) on Dec 31, 2008 at 20:53 UTC
    So I found this LogRegex thing here: http://search.cpan.org/~peterhi/Apache-LogRegex-1.5/lib/Apache/LogRegex.pm But does anyone know how to use it? The example code fails all over the place and doesn't do anything for my logfile line.

      I adapted the example given only slightly and had no real difficulty (other than you have to give it the Apache log format string to use):

          #!/usr/bin/perl

          use strict;
          use warnings;

          use Apache::LogRegex;
          use Data::Dumper;

          my $lr;
          my $log_format =
            q/%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"/;

          eval { $lr = Apache::LogRegex->new($log_format) };
          die "Unable to parse log line: $@" if ($@);

          my %data;

          open DF, $ARGV[0] or die $!;
          while ( my $line_from_logfile = <DF> ) {
              eval { %data = $lr->parse($line_from_logfile); };
              if (%data) {
                  print Data::Dumper->Dump( [ \$line_from_logfile, \%data ],
                      [qw(*line_from_logfile *data)] ), qq{\n};

                  # We have data to process
              }
              else {

                  # We could not parse this line
              }
          }
          close DF;

      With this, I got the following result (using Data::Dumper for output):

          $line_from_logfile = \'192.168.1.100 - - [07/Dec/2008:04:24:39 -0600] "GET /some/file/here.html HTTP/1.1" 304 - "http://www.some-referring-webserver.com/some/other/page.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)"
          ';
          %data = (
                    '%{Referer}i' => 'http://www.some-referring-webserver.com/some/other/page.html',
                    '%{User-Agent}i' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)',
                    '%t' => '[07/Dec/2008:04:24:39 -0600]',
                    '%r' => 'GET /some/file/here.html HTTP/1.1',
                    '%h' => '192.168.1.100',
                    '%b' => '-',
                    '%l' => '-',
                    '%u' => '-',
                    '%>s' => '304'
                  );

      Hope that helps.

      What did you write, because it works for me!
