http://qs321.pair.com?node_id=653300

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

As part of my company's PCI requirements, I have been assigned a task to parse a log file that is updated continually through the day. I figure that I can key off the date entry to get just the ones for the previous day, so that I don't duplicate anything, but my big issue is with extracting a specific piece of data. A small sample of the log is below:
2007-11-16 16:04:33 Local1.Alert 128.29.29.40 id=firewall time="2007-11-16 16:04:08" fw=WS2000-Store 29 pri=1 proto=6(tcp) src=128.29.29.200 dst=128.29.100.102 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4412 from EXT n/w agent=Firewall
2007-11-16 16:05:05 Local1.Alert 128.24.24.40 id=firewall time="2007-11-16 16:03:25" fw=WS2000-Store 24 pri=1 proto=6(tcp) src=128.24.24.200 dst=128.24.100.101 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4344 from EXT n/w agent=Firewall
2007-11-16 16:05:34 Local1.Alert 128.29.29.40 id=firewall time="2007-11-16 16:05:09" fw=WS2000-Store 29 pri=1 proto=6(tcp) src=128.29.29.200 dst=128.29.100.102 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4412 from EXT n/w agent=Firewall
2007-11-16 16:05:39 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:36" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
2007-11-16 16:05:40 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:36" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
2007-11-16 16:05:40 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:37" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
I need to extract the data starting with "msg=" and ending just before "Src". The code I am currently using to put the above data into a csv file, which I later import into an Excel spreadsheet, is below:
#!perl
use strict;

# "or" rather than "||": "||" binds to the filename string, so die could never fire
open INPUT, "<", "input_file.txt" or die "Can not open input_file: $!\n";
open CSV, ">", "OUTPUT.csv" or die "Can not open OUTPUT.csv: $!\n";
print CSV "Date,Time,WS 2000,FW Date,FW Time,Store,Src IP,Src Port,Dst IP,Dst Port,Type,Agent\n";
while(<INPUT>){
    my @line = split /\s+/;
    my $Date = $line[0];
    my $Time = $line[1];
    my $ws2k = $line[3];
    my $FW_Date = $line[5];
    my $FW_Time = $line[6];
    my $store = $line[8];
    my $src_ip = $line[11];
    my $dst_ip = $line[12];
    # the trailing fields are counted back from the end of the line
    my $src_prt = $line[$#line - 6];
    my $dst_prt = $line[$#line - 4];
    my $type = $line[$#line - 2];
    my $agent = $line[$#line];
    chomp $agent;
    $agent=~s/agent=//;
    $FW_Date=~s/time="//;
    $FW_Time=~s/"//;
    $src_ip=~s/src=//;
    $dst_ip=~s/dst=//;
    print CSV "$Date,$Time,$ws2k,$FW_Date,$FW_Time,$store,$src_ip,$src_prt,$dst_ip,$dst_prt,$type,$agent\n";
}
close INPUT;
close CSV;
I am guessing that I need to do something along the lines of a regular expression before I actually split the line out into the array. Would something like the following even be close to what I am looking for, or am I heading in the wrong direction with this?
while(<INPUT>){
    $_=~m/msg=(.*) Src/;
    my $msg=$1;
    my @line = split /\s+/;
    ## rest of code
As always, thanks for any suggestions.

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Replies are listed 'Best First'.
Re: Parsing a log file
by reasonablekeith (Deacon) on Nov 27, 2007 at 17:13 UTC
    Your data seems a bit too ambiguous to parse easily. You can't, as in your example, just split on white space, because your 'msg' field is an unquoted string (so you'll get a separate field for each word in the string). Your best option is to get the log file tab delimited (I'm guessing it isn't at the moment), and stop reading this post.

    However, if you don't have control of the data, you'll just have to take a stab. Assuming that the first four fields don't contain spaces, and that the values in the key/value data that follows don't contain any '=' characters, you could do something like this...

    #!/usr/bin/perl
    use Data::Dumper;

    while(<DATA>) {
        chomp;
        my ($log_date, $log_time, $something, $ip_address, $keyed_data_string) = split /\s+/, $_, 5;

        my %keyed_data_hash;
        # peel key=value pairs off the end of the string until none are left
        while( $keyed_data_string =~ s/\s*(\w+)=\s*([^=]*)$/ $keyed_data_hash{$1} = $2; ''/xsge) { 1 }

        print Dumper($log_date, $log_time, $something, $ip_address, \%keyed_data_hash);
    }
    __DATA__
    2007-11-16 16:05:40 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:37" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
    OUTPUT
    $VAR1 = '2007-11-16';
    $VAR2 = '16:05:40';
    $VAR3 = 'Local1.Alert';
    $VAR4 = '128.2.2.40';
    $VAR5 = {
              'msg' => 'TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w',
              'proto' => '6(tcp)',
              'time' => '"2007-11-16 16:03:37"',
              'src' => '128.2.2.200',
              'mtp' => '2',
              'mid' => '1013',
              'fw' => 'WS2000-Store 02',
              'agent' => 'Firewall',
              'id' => 'firewall',
              'pri' => '1',
              'dst' => '128.2.100.106'
            };
    ... which works by pulling off the four easy fields first, and then working backwards from the end of the remaining data, building a hash of key/value pairs.

    It's pretty ugly, but with your data, I don't see a way around it.

    ---
    my name's not Keith, and I'm not reasonable.
Re: Parsing a log file
by graff (Chancellor) on Nov 27, 2007 at 19:15 UTC
    You said:
    I need to extract the data starting with "msg=" and ending just before "Src".

    If that's really all you need to do, then a regex like this would extract just that portion:

    while (<INPUT>) {
        my ( $msg ) = ( /msg=(.*?) Src \d+ / );
        # do something with $msg...
    }
    If you need other portions of the log entry as well, there are many ways to approach the task... Something like this should be easy to maintain:
    while (<INPUT>) {
        next unless ( s/^([\d-]+)\s+([\d:]+)\s+(\S+)\s+// );
        chomp;
        my ( $date, $time, $ws2k ) = ( $1, $2, $3 );
        # remainder of log string contains an IP address followed by
        # a set of "key=value string " tuples of various sizes
        my ( $ip, %flds ) = split /\s+(\w+)=/;  # parens keep key strings in split output
        # do stuff with %flds and other vars...
    }
    Some of the hash values that end up in %flds may need further conditioning (e.g. removing quotes), but this while loop does a thorough parse of each log entry.

    UPDATE: The latter while loop will do the wrong thing if it ever turns out that one of the log field values happens to contain a substring that matches "\w+=" (the split condition). If that's a valid risk, you could assign the result from split to an array, then build the hash from the array, based on your own prior knowledge about what the key strings are supposed to be (and what order they are supposed to be in).
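
One way to reduce the risk described in the UPDATE is to split only on keys you already expect. The following is a minimal sketch, not graff's code: it swaps the generic \w+ in the split pattern for an alternation of known key names. The key list and the sub name parse_known_fields are assumptions, taken from the sample lines in this thread, and a value that happens to contain one of the listed keys followed by '=' would still be split.

```perl
use strict;
use warnings;

# Assumption: the keys seen in the sample log lines; adjust to your firewall.
my @KNOWN_KEYS = qw(id time fw pri proto src dst mid mtp msg agent);

# Hypothetical helper: takes the part of a log line that follows the date,
# time and facility, and returns the leading IP address plus a hash
# reference of key => value.
sub parse_known_fields {
    my ($rest) = @_;

    # Split only where whitespace is followed by an expected key and '='.
    my $keys_rx = join '|', map { quotemeta($_) } @KNOWN_KEYS;
    my ( $ip, @pairs ) = split /\s+($keys_rx)=\s*/, $rest;

    my %flds;
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        # split drops a trailing empty field, so the last value may be missing
        $flds{$key} = defined $value ? $value : '';
    }
    return ( $ip, \%flds );
}
```

With this pattern, a stray "note=x" inside the message stays part of the msg value, because "note" is not in the key list.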

Re: Parsing a log file
by Nkuvu (Priest) on Nov 27, 2007 at 17:03 UTC

    Two things I'd personally suggest.

    One, do a non-greedy match with .*? (although I'd wonder if there's no data -- perhaps use .+? instead). Not critical, just precautionary.

    And two, verify your match (this could probably be more elegant, but just to illustrate the point):

    my $msg;
    if (/msg=(.*?) Src/) {
        $msg = $1;
    }
    #else {
    # possible error message here, whatever is appropriate
    #}
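
To illustrate both precautions together, here is a minimal sketch that wraps the non-greedy match in a sub and reports a failed match by returning undef. The sub name extract_msg is hypothetical, and .+? is used on the assumption that an empty message should count as "no match".

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between "msg=" and the first " Src",
# or undef when the line does not have that shape.
sub extract_msg {
    my ($line) = @_;
    return $line =~ /msg=(.+?) Src/ ? $1 : undef;
}

# Because the match is non-greedy, it stops at the first " Src" even if
# "Src" appears again later in the line.
my $msg = extract_msg('mtp= 2 msg=dropping packet Src 23 Dst 4631 via Src 9');
print defined $msg ? "$msg\n" : "no msg field found\n";
```

A greedy (.*) on the same line would run on to the last " Src" and capture "dropping packet Src 23 Dst 4631 via" instead.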

Re: Parsing a log file
by johngg (Canon) on Nov 27, 2007 at 19:55 UTC
    Your lines have four non-space items followed by a series of this=that pairs where that could contain spaces. I would first split on whitespace using the third argument to limit the split to five fields. I would then use a global regex match to pull out the thises and thats from the fifth field as key/value pairs to populate a hash. The regex uses a look-ahead to avoid consuming the next pair. I use Data::Dumper here to show what has been parsed from the file.

    use strict;
    use warnings;

    use Data::Dumper;

    my $rxExtractFields = qr
        {(?x)
            \s*
            (\S+)
            =
            \s*
            (\S.*?)
            (?= \s*\S+= | \z )
        };

    open my $inFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n};
    2007-11-16 16:04:33 Local1.Alert 128.29.29.40 id=firewall time="2007-11-16 16:04:08" fw=WS2000-Store 29 pri=1 proto=6(tcp) src=128.29.29.200 dst=128.29.100.102 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4412 from EXT n/w agent=Firewall
    2007-11-16 16:05:05 Local1.Alert 128.24.24.40 id=firewall time="2007-11-16 16:03:25" fw=WS2000-Store 24 pri=1 proto=6(tcp) src=128.24.24.200 dst=128.24.100.101 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4344 from EXT n/w agent=Firewall
    2007-11-16 16:05:34 Local1.Alert 128.29.29.40 id=firewall time="2007-11-16 16:05:09" fw=WS2000-Store 29 pri=1 proto=6(tcp) src=128.29.29.200 dst=128.29.100.102 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4412 from EXT n/w agent=Firewall
    2007-11-16 16:05:39 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:36" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
    2007-11-16 16:05:40 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:36" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
    2007-11-16 16:05:40 Local1.Alert 128.2.2.40 id=firewall time="2007-11-16 16:03:37" fw=WS2000-Store 02 pri=1 proto=6(tcp) src=128.2.2.200 dst=128.2.100.106 mid= 1013 mtp= 2 msg=TCP connection request received is invalid, dropping packet Src 23 Dst 4631 from EXT n/w agent=Firewall
    END_OF_FILE

    my @parsedData = ();
    while ( <$inFH> )
    {
        chomp;
        my ( $date, $time, $type, $ip, $restOfLine ) = split m{\s+}, $_, 5;
        my %pairs = $restOfLine =~ m{$rxExtractFields}g;
        push @parsedData,
            {
                field1 => $date,
                field2 => $time,
                field3 => $type,
                field4 => $ip,
                %pairs,
            };
    }

    close $inFH or die qq{close: $!\n};

    print Data::Dumper->Dumpxs( [ \ @parsedData], [ q{*parsedData} ] );

    Here's the output.

    I hope this is of interest.

    Cheers,

    JohnGG

Re: Parsing a log file
by TStanley (Canon) on Nov 27, 2007 at 19:03 UTC
    And for curiosity's sake, here is the final script:
    #!perl
    use strict;

    # yesterday's date as YYYY-MM-DD: step back to roughly noon of the previous day
    my $DATE= do {
        my($y,$m,$d)= (localtime(time-60*60*(12+(localtime)[2]) ))[5,4,3];
        sprintf "%04d-%02d-%02d", 1900+$y, 1+$m, $d;
    };

    # "or" rather than "||": "||" binds to the filename string, so die could never fire
    open INPUT, "<", "SyslogCatchAll.txt" or die "Can not open input file: $!\n";
    open CSV, ">", "$DATE.csv" or die "Can not open $DATE.csv: $!\n";
    print CSV "Date,Time,WS 2000,FW Date,FW Time,Store,Src IP,Src Port,Dst IP,Dst Port,Type,Agent,Message\n";
    while(<INPUT>){
        $_=~m/msg=(.*) Src/;
        my $msg = $1;
        my @line = split /\s+/;
        my $Date = $line[0];
        next if ($Date ne $DATE);
        my $Time = $line[1];
        my $ws2k = $line[3];
        my $FWD = $line[5];
        my $FWT = $line[6];
        my $store = $line[8];
        my $src_ip = $line[11];
        my $dst_ip = $line[12];
        my $src_prt = $line[$#line - 6];
        my $dst_prt = $line[$#line - 4];
        my $type = $line[$#line - 2];
        my $agent = $line[$#line];
        chomp $agent;
        $agent=~s/agent=//;
        $FWD=~s/time="//;
        $FWT=~s/"//;
        $src_ip=~s/src=//;
        $dst_ip=~s/dst=//;
        $msg=~s/,/ /g;   # /g so every comma goes and the message stays in one CSV column
        print CSV "$Date,$Time,$ws2k,$FWD,$FWT,$store,$src_ip,$src_prt,$dst_ip,$dst_prt,$type,$agent,$msg\n";
    }
    close INPUT;
    close CSV;

    TStanley
    --------
    People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

      I have to admit, I'm confused.

      Why ask for suggestions, then disregard them all and make the script exactly like you had in the opening post? Obviously if it's working for you then that's great, but I'm missing the point of asking for advice just to ask.

      And why make a new note for the final script rather than just updating the original node with something like "update: went with the script as is" ?

Re: Parsing a log file
by ilcylic (Scribe) on Nov 27, 2007 at 18:58 UTC
    You could do:
    @logparts = split /msg=/, $line;
    $msg_bit = (split / Src/, $logparts[1])[0];
    Or do you need the rest of the line split up, too?

    Edit: Reading more carefully, I see you do need the rest of it. Still, the above should help you extract the msg= component, and you still have the rest of the pieces. Don't forget that split discards the separator it splits on, in case the literal 'msg=' or 'Src' is actually important to your report.
Re: Parsing a log file
by sundialsvc4 (Abbot) on Nov 27, 2007 at 23:06 UTC

    I know that this is a Perl neighborhood, but this is exactly the sort of task that might be handled very well by awk. There are many tools in the toolbox.

Re: Parsing a log file
by gamache (Friar) on Nov 27, 2007 at 16:38 UTC
    Your regex looks about right to me. Are you having problems with it?
Re: Parsing a log file
by grizzley (Chaplain) on Nov 29, 2007 at 14:41 UTC

    This is not an answer to the problem itself, but a hint about another way to do things.

    You could consider using the -n or -p switch to simplify your script. I use them all the time, and they save me a lot of thinking about file names, handles and other things such as error handling.

    1. Reading. Instead of

    #!perl
    use strict;
    open INPUT,...
    while(<INPUT>) {
        ...code inside loop...
    }
    and running like this:
    C:\>myscript.pl
    use this
    #!perl -n
    use strict;
    ...code inside loop...
    and run like this
    C:\>myscript.pl input_file.txt

    2. Writing. Instead of opening the CSV file and printing to it, print to STDOUT and redirect to a file like this:
    C:\>myscript.pl > OUTPUT.csv
    Combine the two (the -p switch will print $_ for you):

    #!perl -n
    use strict;
    ...code to do with every line of input stored in $_...
    print $_; # or just print
    or
    #!perl -p
    use strict;
    ...code to do with every line of input stored in $_...
    Regardless which switch you choose, use the script like this:
    C:\>myscript.pl input_file.txt > OUTPUT.csv
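
For comparison, this is roughly the loop that -n supplies for you (perlrun describes -n as wrapping the script in while (<>) { ... }, and -p as the same loop with a print added). The sketch below is only an illustration with the loop written out by hand: the sub name copy_lines_for_date and the date filter are assumptions, not part of the scripts above.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, keeping only those that
# start with the wanted date. Under -n the while loop would be implicit;
# under -p the print would be implicit as well.
sub copy_lines_for_date {
    my ( $in, $out, $date ) = @_;
    while ( my $line = <$in> ) {
        next unless index( $line, "$date " ) == 0;   # line begins "YYYY-MM-DD "
        print {$out} $line;
    }
    return;
}

# Read from STDIN, write to STDOUT, so the shell decides the file names:
#   perl myscript.pl < input_file.txt > OUTPUT.csv
copy_lines_for_date( \*STDIN, \*STDOUT, '2007-11-16' ) unless caller;
```

Passing the handles in keeps the filtering code independent of where the data comes from, which is the same separation the switches give you.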