Re: Parsing Apache logs with Regex
by merlyn (Sage) on Dec 31, 2008 at 20:07 UTC
|
| [reply] |
|
I posted below... ParseLog only appears to create reports and the like from the logfile that you hand it. I don't want that. I need something that lets me go through a logfile line by line and returns the individual fields of each log entry, so that I can insert them into a database and later crunch on them while tracking down some problems we have been having. Can ParseLog do that? Nothing in the documentation suggests that it can.
Re: Parsing Apache logs with Regex
by gwadej (Chaplain) on Dec 31, 2008 at 20:09 UTC
|
For complicated regexes, follow the same advice you would for complicated code: break it up. For something this large, I would definitely use the x modifier, which lets you add whitespace and comments to the pattern. You also want to be more specific in your matches where possible.
my $log_pattern = qr{
^
([\d.]+) \s # match the IP address
- \s - \s # ignore these fields
\[([^]]+)\] # here's probably where your problem was
...
}x;
Following the lead above, you should be able to construct the rest of the expression.
You might also want to check out Apache::LogRegex. I've never used it, but it looks like it might solve your problem.
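To illustrate what "the rest of the expression" might look like, here is one possible completion in the same style. It is only a sketch: it assumes the standard Apache "combined" log format, the sample line is the one posted elsewhere in this thread, and the choice of which fields to capture is mine.

```perl
use strict;
use warnings;

# Sample line as posted in this thread (it carries one extra trailing field).
my $line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

my $log_pattern = qr{
    ^
    ([\d.]+)       \s    # client IP address
    \S+ \s \S+     \s    # identd and user fields, not captured
    \[ ([^\]]+) \] \s    # timestamp, without the brackets
    " ([^"]*) "    \s    # request line
    (\d{3})        \s    # status code
    (\d+|-)        \s    # response size, or a dash when no body was sent
    " ([^"]*) "    \s    # referer
    " ([^"]*) "          # user agent
}x;

# In list context the match returns the captured fields.
my @fields = $line =~ $log_pattern;
print "$_\n" for @fields;
```

Using negated character classes such as `[^"]*` instead of `.*` keeps each capture inside its own pair of quotes, which is what "be more specific" buys you. The pattern is not anchored at the end, so trailing extra fields are ignored.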
Re: Parsing Apache logs with Regex
by borisz (Canon) on Dec 31, 2008 at 22:16 UTC
|
use Regexp::Log::Common;
my $foo = Regexp::Log::Common->new(
format => ':common',
capture => [qw( date req bytes )],
);
my @fields = $foo->capture;
my $re = $foo->regexp;
while (<>) {
my %data;
@data{@fields} = /$re/;
...
}
Re: Parsing Apache logs with Regex
by atcroft (Abbot) on Dec 31, 2008 at 20:24 UTC
|
My first suggestion would be to see if there is anything on CPAN that could handle the log entries in a way that would be of use to you. A quick search suggested things like Apache::Logmonster and Apache::ParseLog (among others), and you could look at the source of other modules to see how they may have done it.
A while back, I looked at this kind of thing for my own curiosity. Looking at the code I was playing with, this might be of some use to you:
# usual strict and warnings and such
use Date::Parse; # to parse the entry date
use Text::ParseWords; # to handle quoted entries
# $logfile defined in skipped code
open DF, $logfile or die $!;
while (<DF>) {
chomp;
# the following conversion is so the date can be
# captured intact
s/(\[|\])/"/g;
my @part = Text::ParseWords::quotewords( '\s+', 0, $_ );
$part[3] = str2time( $part[3] );
# deal with whatever part
# of the log entry here you need
}
close DF;
Hope that helps.
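To make the bracket-to-quote trick concrete, here is a small sketch that runs the same split on the sample line posted in this thread and prints each part with its index. It leaves out Date::Parse so that it depends only on core modules; the variable names are my own.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Sample line as posted in this thread.
my $line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';

# Turn the brackets around the date into quotes, so that the date and
# its timezone are treated as a single quoted field.
( my $copy = $line ) =~ s/[\[\]]/"/g;

# Split on whitespace, honouring quotes; the 0 drops the quote characters.
my @part = quotewords( '\s+', 0, $copy );

printf "%2d: %s\n", $_, $part[$_] for 0 .. $#part;
```

For this line the date ends up in `$part[3]` and the request in `$part[4]`. One caveat: Text::ParseWords also gives special meaning to single quotes and backslashes, so a line containing an apostrophe (in a user agent, say) may not split the way you expect.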
Re: Parsing Apache logs with Regex
by kyle (Abbot) on Dec 31, 2008 at 20:24 UTC
|
use strict;
use warnings;
#use diagnostics;
my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';
my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};
my @fields = ( $log_line =~ /$log_pattern/ );
print "$_\n" for @fields;
__END__
67.60.185.31
14/Jan/2008:02:25:54 -0800
GET
/display.cgi
2643943|3334115
1.1
200
55
-
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
67.60.185.31
However, I'd probably write it this way:
use strict;
use warnings;
my $log_line = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';
my $ip_address = qr{ \d{1,3} (?: \. \d{1,3} ){3} }xms;
my $log_pattern
= qr{
( $ip_address )
\s \S+ # user name
\s \S+ # user group?
\s
\[
(
\d\d / # day
(?: Jan | Feb | Mar | Apr | May | Jun
| Jul | Aug | Sep | Oct | Nov | Dec ) # month
/ \d{4} # year
: \d\d : \d\d : \d\d # time
\s+ \S+ # timezone
)
\]
\s
\"
( [A-Z]+ ) # method (GET, POST)
\s+
( \S+ ) \? ( \S+ ) # URL parts
\s+
HTTP/( 1\.\d ) # protocol version
\"
\s
( \d+ ) # response code
\s+
( \d+ ) # bytes of response
\s
\" ( .* ) \" # referrer
\s
\" ( .* ) \" # user agent
\s+
\" ( $ip_address ) \"
}xms;
my @fields = ( $log_line =~ /$log_pattern/ );
print "$_\n" for @fields;
Having written all that, now I'm betting there's a CPAN module that does this and more.
|
Once you have a regex that works, I would urge you to add a little bit of code to watch out for lines that the regex does not cope with. If you've not catered for a rare form of line, or new forms of line are invented in the future, it's better to be told about them than to either silently ignore them or quietly create rubbish entries in the database.
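A minimal sketch of that kind of check, with some assumptions: the pattern below is a deliberately loose stand-in for whatever regex you settle on, and the log is an in-memory string so that the example runs on its own. The point is only that the result of the match is tested before any captured field is used.

```perl
use strict;
use warnings;

# Hypothetical, loose pattern; substitute the one you settled on.
my $log_pattern = qr{^(\S+) \s \S+ \s \S+ \s \[([^\]]+)\] \s "([^"]*)" \s (\d{3}) \s (\d+|-)}x;

# In-memory stand-in for a log file.
my $sample = <<'END_LOG';
67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55
this line is not a log entry
192.168.1.100 - - [07/Dec/2008:04:24:39 -0600] "GET /some/file/here.html HTTP/1.1" 304 -
END_LOG

open my $fh, '<', \$sample or die "Cannot open sample: $!";

my ( @parsed, @rejected );
while ( my $line = <$fh> ) {
    chomp $line;
    if ( my @fields = $line =~ $log_pattern ) {
        push @parsed, \@fields;              # this is where a database insert would go
    }
    else {
        push @rejected, "line $.: $line";    # remember where the odd line was
    }
}
close $fh;

warn "Unparsed $_\n" for @rejected;
printf "%d parsed, %d rejected\n", scalar @parsed, scalar @rejected;
```

Testing the match in list context also avoids printing `$1` through `$11` after a failed match, when they do not hold values from the current line.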
Re: Parsing Apache logs with Regex
by kennethk (Abbot) on Dec 31, 2008 at 20:17 UTC
|
use strict;
use warnings;
my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};
my $entry = '67.60.185.31 - - [14/Jan/2008:02:25:54 -0800] "GET /display.cgi?2643943|3334115 HTTP/1.1" 200 55 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)" "67.60.185.31"';
$entry =~ /$log_pattern/;
print $1, "\n";
print $2, "\n";
print $3, "\n";
print $4, "\n";
print $5, "\n";
print $6, "\n";
print $7, "\n";
print $8, "\n";
print $9, "\n";
print $10, "\n";
print $11, "\n";
I get the output
67.60.185.31
14/Jan/2008:02:25:54 -0800
GET
/display.cgi
2643943|3334115
1.1
200
55
-
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko)
67.60.185.31
How are you calling your expression?
|
OK, so I stepped back and used your example, but added in my file read, and now with this code:
#!/usr/bin/perl -w
use strict;
use warnings;
my $log_pattern = q{(.*) \- \- \[(.*)\] \"(.*) (.*)\?(.*) HTTP\/(.*)\" ([0-9]*) ([0-9]*) \"(.*)\" \"(.*)\" \"(.*)\"};
open (LOG, "< $ARGV[0]") or die "Cannot open file $ARGV[0]\n";
my @log = <LOG>;
close ( LOG );
my $line;
foreach $line (@log)
{
$line =~ /$log_pattern/;
print $1."\n";
print $2."\n";
print $3."\n";
print $4."\n";
print $5."\n";
print $6."\n";
print $7."\n";
print $8."\n";
print $9."\n";
print $10."\n";
print $11."\n";
}
close(SEM);
I get this:
Use of uninitialized value in concatenation (.) or string at parselogs line 23.
Use of uninitialized value in concatenation (.) or string at parselogs line 24.
Use of uninitialized value in concatenation (.) or string at parselogs line 25.
Use of uninitialized value in concatenation (.) or string at parselogs line 26.
Use of uninitialized value in concatenation (.) or string at parselogs line 27.
Now I am very confused.
|
The concatenation warnings appear because $7 through $11 were undefined when you printed them, i.e. your regex failed to match. Are you sure the lines in your file match what you posted?
In any case, the suggestions to use Apache::ParseLog are being given by very smart people. Unless there is a strong reason not to, I'd say do what they say.
Re: Parsing Apache logs with Regex
by hangon (Deacon) on Jan 01, 2009 at 07:10 UTC
|
From my toolbox: I threw this together a while back to parse log files and load them into SQLite. It's not pretty, but it works. Feel free to use what you need.
#!/usr/bin/perl
#
# Parses logfiles & loads to sqlite db
use strict;
use warnings;
use DBI;
#### CONFIG
# FILES
my $logfile = 'access_log';
my $dbfile = 'acclog.sdb';
# TABLES
my $newlog = 'logentries';
my $oldlog = 'oldlog';
# IF needed - creates new input table & renames old one
# REMEMBER to edit the table names above
create();
#### END CONFIG
my @names = qw(ip id user datime req status bytes referer agent);
my @cols = qw( ip id user date time zone
method bytes status url proto type
datime req referer agent);
my $colstr = join( ',', @cols );
my @places;
for (@cols){push @places, '?'}
my $places = join ',', @places;
my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect fail';
my $sql = qq(INSERT INTO `$newlog` ($colstr) values ( $places ) );
my $sth = $dbh->prepare($sql);
open my $FH, "$logfile" or die "cannot open file: $logfile\n";
while (my $line = <$FH>){
my %dat;
my @fields = $line =~ m/("[^\"]*"|\[.*\]|[^\s]+)/g;
$fields[3] =~ s/[\[\]]//g;
($dat{date}, $dat{time}) = split /:/, $fields[3], 2;
($dat{time}, $dat{zone}) = split / /, $dat{time}, 2;
$fields[4] =~ s/"//g; #"
($dat{method}, $dat{url}, $dat{proto}) = split / /, $fields[4];
if ($dat{url} =~ /\/$/){
$dat{type} = 'dir';
}else{
($dat{type}) = $dat{url} =~ /(\.\w+)$/g;
}
$dat{type} = 'file' unless $dat{type};
$fields[7] =~ s/"//g; #"
$fields[8] =~ s/"//g; #"
for (0..$#names){
$dat{ $names[$_] } = $fields[$_];
}
my @insert;
for (@cols){
push @insert, $dat{$_};
}
$sth->execute(@insert);
}
close $FH;
$sth->finish();
$dbh->disconnect();
sub create{
my $dbh = DBI->connect("DBI:SQLite:$dbfile") or die 'connect failed';
my $zql = qq(ALTER TABLE $newlog RENAME TO $oldlog);
$dbh->do($zql) or die 'rename failed';
my @cols = qw( ip id user date time zone
method bytes status url proto type
datime req referer agent);
my $colstr = join( ',', @cols );
my $sql = qq(CREATE TABLE $newlog (seq INTEGER PRIMARY KEY AUTOINCREMENT, $colstr) );
$dbh->do($sql) or die 'create failed';
$dbh->disconnect();
}
|
It may not be pretty but it works well and saved me a lot of time, thanks. Stuart - Webmaster Words
Re: Parsing Apache logs with Regex
by TheGorf (Novice) on Dec 31, 2008 at 20:53 UTC
|
So I found this LogRegex thing here:
http://search.cpan.org/~peterhi/Apache-LogRegex-1.5/lib/Apache/LogRegex.pm
But does anyone know how to use it? The example code fails all over the place and doesn't do anything for my logfile line.
|
#!/usr/bin/perl
use strict;
use warnings;
use Apache::LogRegex;
use Data::Dumper;
my $lr;
my $log_format =
q/%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"/;
eval { $lr = Apache::LogRegex->new($log_format) };
die "Unable to parse log line: $@" if ($@);
my %data;
open DF, $ARGV[0] or die $!;
while ( my $line_from_logfile = <DF> ) {
eval { %data = $lr->parse($line_from_logfile); };
if (%data) {
print Data::Dumper->Dump(
[ \$line_from_logfile, \%data ],
[qw(*line_from_logfile *data)]
),
qq{\n};
# We have data to process
}
else {
# We could not parse this line
}
}
close DF;
With this, I got the following result (using Data::Dumper for output):
$line_from_logfile = \'192.168.1.100 - - [07/Dec/2008:04:24:39 -0600] "GET /some/file/here.html HTTP/1.1" 304 - "http://www.some-referring-webserver.com/some/other/page.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)"
';
%data = (
'%{Referer}i' => 'http://www.some-referring-webserver.com/some/other/page.html',
'%{User-Agent}i' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)',
'%t' => '[07/Dec/2008:04:24:39 -0600]',
'%r' => 'GET /some/file/here.html HTTP/1.1',
'%h' => '192.168.1.100',
'%b' => '-',
'%l' => '-',
'%u' => '-',
'%>s' => '304'
);
Hope that helps.
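Since the goal was to load the entries into a database, here is a rough sketch of how a hash like the one dumped above could be reshaped into column-friendly values. The hash literal is copied from that output rather than produced by the module, so the example runs without it, and the column names (`ip`, `datetime`, `method` and so on) are hypothetical.

```perl
use strict;
use warnings;

# Values copied from the Data::Dumper output above; in real use the
# hash would come from the parse() call.
my %data = (
    '%h'  => '192.168.1.100',
    '%t'  => '[07/Dec/2008:04:24:39 -0600]',
    '%r'  => 'GET /some/file/here.html HTTP/1.1',
    '%>s' => '304',
    '%b'  => '-',
);

# Hypothetical column names; pick whatever suits your table.
my %row = ( ip => $data{'%h'}, status => $data{'%>s'} );

# Strip the surrounding brackets from the timestamp.
( $row{datetime} = $data{'%t'} ) =~ s/^\[|\]$//g;

# Split the request line into method, URL and protocol.
@row{qw(method url protocol)} = split ' ', $data{'%r'}, 3;

# A dash means no body was sent; store that as 0.
$row{bytes} = $data{'%b'} eq '-' ? 0 : $data{'%b'};

print "$_ = $row{$_}\n" for sort keys %row;
```

The values in `%row` could then be passed as bind parameters to a prepared INSERT statement.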
|
What did you write? It works for me!