Parsing Apache logs with Regex

TheGorf has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing Apache logs with Regex by merlyn (Sage) on Dec 31, 2008 at 20:07 UTC
I've been happy with Apache::ParseLog. Why work harder than you must? -- Randal L. Schwartz, Perl hacker
Re^2: Parsing Apache logs with Regex by TheGorf (Novice) on Dec 31, 2008 at 20:56 UTC
As I posted below, ParseLog only appears to create reports and such from the logfile that you hand it. I don't want that. I need something that lets me walk through a logfile line by line and returns the sections of each log entry, so that I can insert them into a database and later crunch on them to track down some problems we have been having. Can ParseLog do that? Nothing in the documentation suggests that it can.
Re^3: Parsing Apache logs with Regex by Anonymous Monk on Dec 31, 2008 at 21:40 UTC
No, but it has tested regexes you can copy/paste: http://search.cpan.org/src/AKIRA/Apache-ParseLog-1.02/ParseLog.pm
Re: Parsing Apache logs with Regex by gwadej (Chaplain) on Dec 31, 2008 at 20:09 UTC
For complicated regexes, follow the same advice you would for complicated code: break it up. For something this large, I would definitely use the `x` modifier so the pattern can contain whitespace and comments. You also want to be more specific in your matches where possible.

    my $log_pattern = qr{
        ^ ([\d.]+) \s    # match the IP address
        - \s - \s        # ignore these fields
        \[([^]]+)\]      # here's probably where your problem was
        ...
    }x;

Following the lead above, you should be able to construct the rest of the expression. You might also want to check out Apache::LogRegex. I've never used it, but it looks like it might solve your problem.

G. Wade
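One way the "break it up" advice might play out is sketched below. This is not the code from the post above: the piece names (`$ip`, `$stamp`, `$request`), the capture names, and the `parse_line` helper are all made up for illustration, and named captures need Perl 5.10 or later. It only covers the fields up to the response size.

```perl
use strict;
use warnings;

# Small named pieces, combined below; every name here is illustrative.
my $ip      = qr{ [\d.]+ }x;     # dotted address, loosely matched
my $stamp   = qr{ [^\]]+ }x;     # everything up to the closing bracket
my $request = qr{ [^"]* }x;      # everything up to the closing quote

my $log_pattern = qr{
    ^ (?<ip> $ip ) \s               # client address
    \S+ \s \S+ \s                   # next two fields, ignored here
    \[ (?<when> $stamp ) \] \s      # timestamp inside brackets
    " (?<request> $request ) " \s   # request line inside quotes
    (?<status> \d{3} ) \s           # response code
    (?<bytes> \d+ | - )             # response size, or a dash
}x;

# Returns a hash reference of the named fields, or nothing when the
# line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ $log_pattern;
    my %fields = %+;                # copy the named captures
    return \%fields;
}

my $entry = parse_line(
    '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] '
  . '"GET /display.cgi HTTP/1.1" 200 55 "-" "Mozilla/5.0"'
);
print "$_ => $entry->{$_}\n" for sort keys %$entry;
```

Because the result is a hash keyed by field name, it can be handed more or less directly to whatever does the database insert.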
Re: Parsing Apache logs with Regex by borisz (Canon) on Dec 31, 2008 at 22:16 UTC
I inherit from Regexp::Log or Regexp::Log::Common.

    use Regexp::Log::Common;

    my $foo = Regexp::Log::Common->new(
        format  => ':common',
        capture => [qw( date req bytes )],
    );
    my @fields = $foo->capture;
    my $re     = $foo->regexp;

    while (<>) {
        my %data;
        @data{@fields} = /$re/;
        ...
    }

Boris
Re: Parsing Apache logs with Regex by atcroft (Abbot) on Dec 31, 2008 at 20:24 UTC
My first suggestion would be to see if there is anything on CPAN that could handle the log entries in a way that would be of use to you. A quick search suggested things like Apache::Logmonster and Apache::ParseLog (among others), and you could look at the source of other modules to see how they may have done it.

A while back, I looked at this kind of thing for my own curiosity. Looking at the code I was playing with, this might be of some use to you:

    # usual strict and warnings and such
    use Date::Parse;         # to parse the entry date
    use Text::ParseWords;    # to handle quoted entries

    # $logfile defined in skipped code
    open DF, $logfile or die $!;
    while (<DF>) {
        chomp;

        # the following conversion is so the date can be
        # captured intact
        s/(\[|\])/"/g;
        my @part = Text::ParseWords::quotewords( '\s+', 0, $_ );
        $part[3] = str2time( $part[3] );

        # deal with whatever part
        # of the log entry here you need
    }
    close DF;

Hope that helps.
Re: Parsing Apache logs with Regex by kyle (Abbot) on Dec 31, 2008 at 20:24 UTC
Works fine for me.

    use strict;
    use warnings;
    #use diagnostics;

    my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

    my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

    my @fields = ( $log_line =~ /$log_pattern/ );

    print "$_\n" for @fields;

    __END__
    67.60.185.31
    14/Jan/2008:02:25:54 -0800
    GET
    /display.cgi
    2643943|3334115
    1.1
    200
    55
    -
    Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
    67.60.185.31

However, I'd probably write it this way:

    use strict;
    use warnings;

    my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

    my $ip_address = qr{ \d{1,3} (?: \. \d{1,3} ){3} }xms;

    my $log_pattern = qr{
        ( $ip_address )
        \s \S+                      # user name
        \s \S+                      # user group?
        \s \[ (
            \d\d /                  # day
            (?: Jan | Feb | Mar | Apr | May | Jun
              | Jul | Aug | Sep | Oct | Nov | Dec )   # month
            / \d{4}                 # year
            : \d\d : \d\d : \d\d    # time
            \s+ \S+                 # timezone
        ) \]
        \s \" ( [A-Z]+ )            # method (GET, POST)
        \s+ ( \S+ ) \? ( \S+ )      # URL parts
        \s+ HTTP/( 1\.\d )          # protocol version
        \"
        \s ( \d+ )                  # response code
        \s+ ( \d+ )                 # bytes of response
        \s \" ( .* ) \"             # referrer
        \s \" ( .* ) \"             # user agent
        \s+ \" ( $ip_address ) \"
    }xms;

    my @fields = ( $log_line =~ /$log_pattern/ );

    print "$_\n" for @fields;

Having written all that, now I'm betting there's a CPAN module that does this and more.
Re^2: Parsing Apache logs with Regex by gone2015 (Deacon) on Jan 01, 2009 at 12:34 UTC
Once you have a regex that works, I would urge you to add a little bit of code to watch out for lines that the regex does not cope with. If you've not catered for a rare form of line, or new forms of line are invented in the future, it's better to be told about them than to either silently ignore them or quietly create rubbish entries in the database.
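A minimal sketch of that kind of check, under stated assumptions: the pattern is a deliberately cut-down stand-in for whatever full pattern you settle on, and `parse_lines` is a hypothetical helper name. The point is only that a failed match lands somewhere visible instead of vanishing.

```perl
use strict;
use warnings;

# Simplified stand-in pattern for illustration only: client address,
# two ignored fields, then the bracketed timestamp.
my $log_pattern = qr{ ^ ([\d.]+) \s \S+ \s \S+ \s \[ ([^\]]+) \] }x;

# Sort lines into parsed records and rejects rather than dropping
# the ones the pattern does not cope with.
sub parse_lines {
    my @lines = @_;
    my ( @parsed, @rejected );
    for my $line (@lines) {
        if ( my ( $ip, $when ) = $line =~ $log_pattern ) {
            push @parsed, { ip => $ip, when => $when };
        }
        else {
            push @rejected, $line;
        }
    }
    return ( \@parsed, \@rejected );
}

my ( $parsed, $rejected ) = parse_lines(
    '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET / HTTP/1.1" 200 55',
    'this line is not in the expected format',
);

# Report the rejects on STDERR so they can be inspected later.
warn "Unparsed log line: $_\n" for @$rejected;
printf "%d parsed, %d rejected\n", scalar @$parsed, scalar @$rejected;
```

In a real loader the rejects might be written to their own file or table, so that a new line format shows up as a growing pile of rejects rather than as bad rows.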
Re: Parsing Apache logs with Regex by kennethk (Abbot) on Dec 31, 2008 at 20:17 UTC
With the following code (unmodified from your post):

    use strict;
    use warnings;

    my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};
    my $entry = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

    $entry =~ /$log_pattern/;

    print $1, "\n";
    print $2, "\n";
    print $3, "\n";
    print $4, "\n";
    print $5, "\n";
    print $6, "\n";
    print $7, "\n";
    print $8, "\n";
    print $9, "\n";
    print $10, "\n";
    print $11, "\n";

I get the output

    67.60.185.31
    14/Jan/2008:02:25:54 -0800
    GET
    /display.cgi
    2643943|3334115
    1.1
    200
    55
    -
    Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
    67.60.185.31

How are you calling your expression?
Re^2: Parsing Apache logs with Regex by TheGorf (Novice) on Dec 31, 2008 at 20:32 UTC
OK, so I stepped back and used your example, but added in my file read, and now with this code:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};

    open (LOG, "< $ARGV[0]") or die "Cannot open file $ARGV[0]\n";
    my @log = <LOG>;
    close ( LOG );

    my $line;
    foreach $line (@log) {
        $line =~ /$log_pattern/;
        print $1."\n";
        print $2."\n";
        print $3."\n";
        print $4."\n";
        print $5."\n";
        print $6."\n";
        print $7."\n";
        print $8."\n";
        print $9."\n";
        print $10."\n";
        print $11."\n";
    }
    close(SEM);

I get this:

    Use of uninitialized value in concatenation (.) or string at parselogs line 23.
    Use of uninitialized value in concatenation (.) or string at parselogs line 24.
    Use of uninitialized value in concatenation (.) or string at parselogs line 25.
    Use of uninitialized value in concatenation (.) or string at parselogs line 26.
    Use of uninitialized value in concatenation (.) or string at parselogs line 27.

Now I am very confused.
Re^3: Parsing Apache logs with Regex by kennethk (Abbot) on Dec 31, 2008 at 20:42 UTC
The concatenation warnings appear because nothing was captured into $7-$11, so those variables were never initialized; i.e., your regex failed to match. Are you sure the lines in your file match what you posted? In any case, the suggestions to use Apache::ParseLog are being given by very smart people. Unless there is a strong reason not to, I'd say do what they say.
Re^4: Parsing Apache logs with Regex by TheGorf (Novice) on Dec 31, 2008 at 20:48 UTC
Re^5: Parsing Apache logs with Regex by kennethk (Abbot) on Dec 31, 2008 at 20:59 UTC
Re: Parsing Apache logs with Regex by hangon (Deacon) on Jan 01, 2009 at 07:10 UTC
From my toolbox: I threw this together a while back to parse log files and load them into sqlite. It's not pretty but it works. Feel free to use what you need.

    #!/usr/bin/perl
    #
    # Parses logfiles & loads to sqlite db

    use strict;
    use warnings;
    use DBI;

    #### CONFIG

    # FILES
    my $logfile = 'access_log';
    my $dbfile  = 'acclog.sdb';

    # TABLES
    my $newlog = 'logentries';
    my $oldlog = 'oldlog';

    # IF needed - creates new input table & renames old one
    # REMEMBER to edit the table names above
    create();

    #### END CONFIG

    my @names = qw(ip id user datime req status bytes referer agent);
    my @cols  = qw(
        ip id user date time zone method bytes status
        url proto type datime req referer agent);
    my $colstr = join( ',', @cols );

    my @places;
    for (@cols){push @places, '?'}
    my $places = join ',', @places;

    my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect fail';
    my $sql = qq(INSERT INTO `$newlog` ($colstr) values ( $places ) );
    my $sth = $dbh->prepare($sql);

    open my $FH, "$logfile" or die "cannot open file: $logfile\n";

    while (my $line = <$FH>){
        my %dat;
        my @fields = $line =~ m/("[^\"]*"|\[.*\]|[^\s]+)/g;

        $fields[3] =~ s/[\[\]]//g;
        ($dat{date}, $dat{time}) = split /:/, $fields[3], 2;
        ($dat{time}, $dat{zone}) = split / /, $dat{time}, 2;

        $fields[4] =~ s/"//g; #"
        ($dat{method}, $dat{url}, $dat{proto}) = split / /, $fields[4];

        if ($dat{url} =~ /\/$/){
            $dat{type} = 'dir';
        }else{
            ($dat{type}) = $dat{url} =~ /(\.\w+)$/g;
        }
        $dat{type} = 'file' unless $dat{type};

        $fields[7] =~ s/"//g; #"
        $fields[8] =~ s/"//g; #"

        for (0..$#names){
            $dat{ $names[$_] } = $fields[$_];
        }

        my @insert;
        for (@cols){
            push @insert, $dat{$_};
        }
        $sth->execute(@insert);
    }

    close $FH;
    $sth->finish();
    $dbh->disconnect();

    sub create{
        my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect failed';
        my $zql = qq(ALTER TABLE $newlog RENAME TO $oldlog);
        $dbh->do($zql) or die 'rename failed';

        my @cols = qw(
            ip id user date time zone method bytes status
            url proto type datime req referer agent);
        my $colstr = join( ',', @cols );

        my $sql = qq(CREATE TABLE $newlog (seq INTEGER PRIMARY KEY AUTOINCREMENT, $colstr) );
        $dbh->do($sql) or die 'create failed';
        $dbh->disconnect();
    }
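The script above splits each line into tokens with a single global match instead of one long anchored pattern. A standalone sketch of that idea follows, in case it helps to see it in isolation. `tokenize` is a hypothetical helper name, the sample line is shortened, and the bracket alternative uses a negated character class so that a stray `]` later in the line cannot stretch the timestamp token; that last part is a variation on the posted code, not a copy of it.

```perl
use strict;
use warnings;

# A token is a double-quoted string, a bracketed timestamp, or a run
# of non-whitespace characters, tried in that order at each position.
sub tokenize {
    my ($line) = @_;
    return $line =~ m/("[^"]*"|\[[^\]]*\]|\S+)/g;
}

my @fields = tokenize(
    '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] '
  . '"GET /display.cgi?2643943 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh)"'
);

# Quotes and brackets are still attached to the tokens at this point;
# the script above strips them in a second step.
print "$_\n" for @fields;
```

The trade-off is that this approach validates nothing: any line yields some list of tokens, so checking the field count before inserting is probably worthwhile.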
Re^2: Parsing Apache logs with Regex by Anonymous Monk on Jan 24, 2009 at 18:46 UTC
It may not be pretty but it works well and saved me a lot of time, thanks. Stuart - Webmaster Words
Re: Parsing Apache logs with Regex by TheGorf (Novice) on Dec 31, 2008 at 20:53 UTC
So I found this LogRegex thing here: http://search.cpan.org/~peterhi/Apache-LogRegex-1.5/lib/Apache/LogRegex.pm But does anyone know how to use it? The example code fails all over the place and doesn't do anything for my logfile line.
Re^2: Parsing Apache logs with Regex by atcroft (Abbot) on Dec 31, 2008 at 22:21 UTC
I adapted the example given only slightly and had no real difficulty (other than that you have to give it the Apache log format string to use):

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Apache::LogRegex;
    use Data::Dumper;

    my $lr;
    my $log_format =
      q/%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"/;

    eval { $lr = Apache::LogRegex->new($log_format) };
    die "Unable to parse log line: $@" if ($@);

    my %data;

    open DF, $ARGV[0] or die $!;
    while ( my $line_from_logfile = <DF> ) {
        eval { %data = $lr->parse($line_from_logfile); };
        if (%data) {
            print Data::Dumper->Dump( [ \$line_from_logfile, \%data ],
                [qw(line_from_logfile data)] ), qq{\n};

            # We have data to process
        }
        else {

            # We could not parse this line
        }
    }
    close DF;

With this, I got the following result (using Data::Dumper for output):

    $line_from_logfile = \'192.168.1.100 - - [07/Dec/2008:04:24:39 -0600] "GET /some/file/here.html HTTP/1.1" 304 - "http://www.some-referring-webserver.com/some/other/page.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)"
    ';
    %data = (
              '%{Referer}i' => 'http://www.some-referring-webserver.com/some/other/page.html',
              '%{User-Agent}i' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)',
              '%t' => '[07/Dec/2008:04:24:39 -0600]',
              '%r' => 'GET /some/file/here.html HTTP/1.1',
              '%h' => '192.168.1.100',
              '%b' => '-',
              '%l' => '-',
              '%u' => '-',
              '%>s' => '304'
            );

Hope that helps.
Re^2: Parsing Apache logs with Regex by Anonymous Monk on Dec 31, 2008 at 21:42 UTC
What did you write? It works for me!

