PerlMonks  

Re: Re: Multi-Format Log Parser - Version 2.0

by cjensen (Sexton)
on Jan 18, 2002 at 23:05 UTC


in reply to Re: Multi-Format Log Parser - Version 2.0
in thread Multi-Format Log Parser - Version 2.0

You're right about the improper use of quotes within a user agent string. Embedded quotes could cause the pattern match to fail, and the affected lines would be skipped. I'm thinking about adding an option that either prints log lines that don't match the currently selected format to STDERR or reports a count of the lines that didn't match. From using this on a fairly large web site, I know the patterns match our traffic fairly well, but it will be interesting to see how many lines don't match and why.

A few days ago I used this log parser to dump counts per unique user agent string for our QA department, and one day's worth of logs contained 82,279 unique user agent strings. Our QA guys are after percentages of traffic per browser and platform, and I don't relish their job of parsing all those user agent strings to get that information, since the strings don't follow any standardized format.
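The option described above might look roughly like the following. This is only a sketch: `count_matches`, `$pattern`, and the sample lines are made up for illustration and are not taken from the actual script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the currently selected format pattern
# (same shape as the 'common' format: h l u t r c b).
my $pattern = qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)};

# Returns (matched, unmatched) counts for a list of log lines.
# When $debug is true, each non-matching line is also echoed to STDERR.
sub count_matches {
    my ($lines, $debug) = @_;
    my ($matched, $unmatched) = (0, 0);
    for my $line (@$lines) {
        if ($line =~ $pattern) {
            $matched++;
        } else {
            $unmatched++;
            print STDERR $line if $debug;
        }
    }
    return ($matched, $unmatched);
}

# Made-up sample input: one well-formed line and one that fits no format.
my @sample = (
    qq{10.0.0.1 - - [18/Jan/2002:23:05:00 +0000] "GET / HTTP/1.0" 200 1234\n},
    qq{this line is not in any known log format\n},
);

my ($ok, $bad) = count_matches(\@sample, 0);
print "matched=$ok unmatched=$bad\n";   # matched=1 unmatched=1
```

Keeping the count separate from the STDERR output means a summary can be printed even when the debug output is switched off.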

Re: Re: Re: Multi-Format Log Parser - Version 2.0
by cjensen (Sexton) on Jan 23, 2002 at 06:08 UTC
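    I implemented a quick debug option that prints non-matching lines to STDERR. In testing I found a pattern bug in the byte-count field of 304 log entries: those entries log "-" instead of a number for the byte count, which (\d+) did not match. Both the new option and the pattern fix are in the following diff: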
    26c26
    < GetOptions (\%optctl, "type|t=s", "pattern|p=s");
    ---
    > GetOptions (\%optctl, "type|t=s", "pattern|p=s", "debug|d=i");
    30,32c30,32
    < 'common' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b)] ],
    < 'virtual' => [ qr{(\S+) (\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)}, [qw(v h l u t r c b)] ],
    < 'combined' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\"}, [qw(h l u t r c b R A)] ],
    ---
    > 'common' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)}, [qw(h l u t r c b)] ],
    > 'virtual' => [ qr{(\S+) (\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)}, [qw(v h l u t r c b)] ],
    > 'combined' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\"}, [qw(h l u t r c b R A)] ],
    35,36c35,36
    < 'extended' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b R A P T)] ],
    < 'custom' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\" (\d+)}, [qw(h l u t r c b A R T)] ],
    ---
    > 'extended' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b R A P T)] ],
    > 'custom' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\" (\d+)}, [qw(h l u t r c b A R T)] ],
    102a103,104
    > } elsif ($optctl{debug} == 1) {
    > print STDERR $_;
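To see the effect of the byte-count change in isolation, here is a small check of the old and new 'common' patterns against a 304 entry. The log line is a made-up sample, assuming the server writes "-" in the byte-count field for a response with no body.

```perl
use strict;
use warnings;

# 'common' patterns before and after the diff above.
my $old = qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)};
my $new = qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)};

# Made-up sample entry: status 304 with "-" where the byte count would be.
my $line = qq{10.0.0.1 - - [23/Jan/2002:06:08:00 +0000] "GET /logo.gif HTTP/1.0" 304 -};

# The old pattern needs digits in the last field, so it does not match.
my $old_ok = $line =~ $old ? 1 : 0;

# The new pattern accepts "-"; the seventh capture is the byte field.
my $bytes = ($line =~ $new)[6];

print "old pattern matched: $old_ok\n";   # old pattern matched: 0
print "new byte field: $bytes\n";         # new byte field: -
```

Anything that sums the byte field downstream would need to treat "-" as zero, since it is no longer guaranteed to be numeric.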

    With the new patterns, a quick match against 79154 lines from an access log of 'extended' format had 8 lines which didn't match. All of them were because of quotes in the request or the user agent strings.

    Here's a user agent that didn't match...
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; <HTML><A%20HREF="http://www.pghconnect.com/">www.pghconnect.com</a></HTML>)"
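A short sketch of why that agent string defeats the pattern: the `\"([^\"]*)\"` group stops at the first embedded quote, and the text after it is not the numeric field the pattern expects next. The fields around the agent and the `make_line` helper are made up for illustration, and the two trailing numbers are placeholders.

```perl
use strict;
use warnings;

# 'extended' pattern as patched in the diff above.
my $extended = qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\" (\d+) (\d+)};

# The user agent quoted above, including its embedded double quotes.
my $agent = q{Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; <HTML><A%20HREF="http://www.pghconnect.com/">www.pghconnect.com</a></HTML>)};

# Hypothetical helper: wraps an agent string in an otherwise valid log line.
sub make_line {
    my ($ua) = @_;
    return qq{10.0.0.1 - - [23/Jan/2002:06:08:00 +0000] "GET / HTTP/1.0" 200 1234 "-" "$ua" 42 1};
}

my $quoted_ok = make_line($agent) =~ $extended ? 1 : 0;
my $plain_ok  = make_line('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)') =~ $extended ? 1 : 0;

print "agent with embedded quotes matched: $quoted_ok\n";   # 0
print "plain agent matched: $plain_ok\n";                   # 1
```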
