virus log parser

by phaedo (Initiate)
on Jul 02, 2002 at 20:04 UTC ( [id://178983]=perlquestion )

phaedo has asked for the wisdom of the Perl Monks concerning the following question:

I'm a Perl novice... trying to expand my Perl knowledge beyond system administration. I have an anti-virus log file that I would like to eventually put into a MySQL database, but I'm having problems parsing it. The log file is a large flat file with "----" lines as record separators. Here's an example...

From: pminich@foo.com
To: esquared@foofoo.com
File: value.scr
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------
Date: 06/30/2002 00:01:21
From: mef@mememe.com
To: inet@microsoft.com
File: Nr.pif
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------
...
...

I'm trying to place each record on one line (for a SQL load). The list screams multi-dimensional array with hash references, but those weren't covered in "Learning Perl" and I'm having difficulty applying the examples in "The Perl Cookbook". I know I can set the input record separator to $/='---' to separate my records, but then what? I'm not sure what to do with each record or how to parse it beyond that. I'm currently building a non-multi-dimensional approach with simple RE conditionals...

open(LF,"$logFile") || die "Can't open $logFile: $!\n";
open(OF,">$outputFile") || die;
while ($line=<LF>) {
    chomp($line);
    if ($line =~ /^Date:\s+(\S+)\s(\S+)/) {
        $date=$1 . " " . $2;
    }
    if ($line=~ /^From:\s+(\S+)/) {
        $from=$1;
    }
    if ($line=~ /^To:\s+\S+/) {
        # Some "To:" lines have multiple ", "
        # delimited addresses
        my ($crap, @to);
        ($crap, @to)=split(/\s+/,$line);
        print OF "$date\t$from\t@to\n";
    }
}
close(LF);
close(OF);

... which is limited to simple if/then statements (the curse of predicate logic classes) and bad regular expressions. I'm really not sure what to do. I can pull my vars out here, but each record is just a flat line. Plus it doesn't give me the functionality I really need (I will eventually need to manipulate some of these fields). I was thinking of adding something like this (from the Perl Cookbook):

for $x (1 .. 10) {
    for $y (1 .. 10) {
        $LoL[$x][$y] = func($x, $y);
    }
}

for each var so I could build my matrices; but that would leave me with something like $foo[$a][$b][$c][$d][$e]; however, I don't know what I have here -- other than a headache. Any suggestions would be greatly appreciated, including more daily uses for references. -- Phaedo

Replies are listed 'Best First'.
Re: virus log parser
by Rhose (Priest) on Jul 02, 2002 at 20:44 UTC
    How about collecting the information, then printing the record when you get to one of the '-----' lines? (This assumes all records -- even the last one -- end with a '-----' line.)

    The following code reads from __DATA__ and writes its (tab delimited) records to the screen; you would probably want to open your log file for processing (open(LF,"$logFile")), and write to a results file (open(OF,">$outputFile")).

    #!/usr/bin/perl -w
    use strict;

    my $gCurRec;
    foreach(qw(name to file action virus)) { $gCurRec->{$_}=''; }

    while(<DATA>) {
        $gCurRec->{name}=$1   if (/^From:\s*(.+?)\s*$/);
        $gCurRec->{to}=$1     if (/^To:\s*(.+?)\s*$/);
        $gCurRec->{file}=$1   if (/^File:\s*(.+?)\s*$/);
        $gCurRec->{action}=$1 if (/^Action:\s*(.+?)\s*$/);
        $gCurRec->{virus}=$1  if (/^Virus:\s*(.+?)\s*$/);
        if (/^-----/) {
            print $gCurRec->{name},"\t",
                  $gCurRec->{to},"\t",
                  $gCurRec->{file},"\t",
                  $gCurRec->{action},"\t",
                  $gCurRec->{virus},"\n";
            foreach(qw(name to file action virus)) { $gCurRec->{$_}=''; }
        }
    }
    __DATA__
    From: pminich@foo.com
    To: esquared@foofoo.com
    File: value.scr
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    Date: 06/30/2002 00:01:21
    From: mef@mememe.com
    To: inet@microsoft.com
    File: Nr.pif
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
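    If a real log ever ends without that final '-----' line, one extra check after the loop would flush the dangling record; a minimal sketch, reusing the same field list and $gCurRec from above:

    # Only needed if the last record can lack its closing '-----' line
    # (an assumption -- the sample data always has one).
    if (grep { $gCurRec->{$_} ne '' } qw(name to file action virus)) {
        print join("\t", map { $gCurRec->{$_} } qw(name to file action virus)), "\n";
    }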

    Comment: one other thing I've come to like is the three-argument form of open. For example, instead of:

    open(OF,">$outputFile") || die;

    I use:

    open(OF,'>',$outputFile) || die;

    I hope this helps! *Smiles*

    Update:

    Now that I have re-read my code, I realize I should have made

    qw(name to file action virus)

    a constant so it was defined in just one place, and I should have made the field separator a constant as well. This would simplify changes to the code. (Not that it is critical on such a small program, but it is a good practice... well, for me at least.)
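    For example (just a sketch of that refactor, not tested against the full program), the field list and separator could live in constants:

    use constant FIELDS    => qw(name to file action virus);
    use constant SEPARATOR => "\t";

    # ...and the reset and print steps would then become:
    $gCurRec->{$_} = '' foreach FIELDS;
    print join(SEPARATOR, map { $gCurRec->{$_} } FIELDS), "\n";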

      I would like to add some random thoughts I had when I saw your code.

      First of all, the construct

      foreach(qw(name to file action virus)) { $gCurRec->{$_}=''; }
      can be expressed very succinctly using so-called hash slices, i.e.
      my @columns = qw(name to file action virus);
      @{ $gCurRec }{ @columns } = ('') x @columns;
      See for example this for a good introduction.

      Furthermore, why do you use a hash reference to store the data when a hash would be sufficient? (This is probably a matter of style.)

      Then, I usually consider multiple repeated lines with trivial differences like

      $gCurRec->{name}=$1 if (/^From:\s*(.+?)\s*$/);
      $gCurRec->{to}=$1   if (/^To:\s*(.+?)\s*$/);
      to be a sign that some kind of abstraction like a loop is needed. In this case, keying each datum by its header field
      /^(\w+):\s*(.+?)\s*$/ and $gCurRec->{$1} = $2;
      does so and furthermore removes the need to spell out the interesting header fields several times. This of course means that unknown fields like the Date: are ignored, but your code ignores them as well.

      So finally here is my attempt at implementing your algorithm:

      #!/usr/bin/perl -w
      use strict;

      my %gCurRec = ();
      while(<DATA>) {
          /^-+\s*$/ and do {
              print join("\t",
                         map { exists $gCurRec{$_} ? $gCurRec{$_} : '' }
                             qw(from to file action virus)
                    ) . "\n";
              %gCurRec = ();
              next;
          };
          /^(\w+):\s*(.+?)\s*$/ and $gCurRec{lc $1} = $2;
      }
      __DATA__
      From: pminich@foo.com
      To: esquared@foofoo.com
      File: value.scr
      Action: The uncleanable file is deleted.
      Virus: WORM_KLEZ.H
      ----------------------------------
      Date: 06/30/2002 00:01:21
      From: mef@mememe.com
      To: inet@microsoft.com
      File: Nr.pif
      Action: The uncleanable file is deleted.
      Virus: WORM_KLEZ.H
      ----------------------------------
Re: virus log parser
by joealba (Hermit) on Jul 03, 2002 at 04:17 UTC
    Here's a little gratuitous sample code for Parse::RecDescent, seeing as I just spent Monday in a class with TheDamian teaching me all about it. :)
    use strict;
    use Parse::RecDescent;
    use Data::Dumper;

    my $grammar = q{
        viruslog: message(s)
                  { %{$return} = map {@{$_}} (@{$item[1]}); }
        message:  /^(\w+):\s+ (.*)/x
                  { $return = [lc($1), $2]; }
    };

    my $parser = new Parse::RecDescent $grammar or die "Invalid grammar";

    foreach (split /---+/, join '', <DATA>) {
        my $record = $parser->viruslog($_);
        print Dumper($record) if defined $record;
    }
    __DATA__
    From: pminich@foo.com
    To: esquared@foofoo.com
    File: value.scr
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    Date: 06/30/2002 00:01:21
    From: mef@mememe.com
    To: inet@microsoft.com
    File: Nr.pif
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    Which prints:
    $VAR1 = {
              'file' => 'value.scr',
              'virus' => 'WORM_KLEZ.H',
              'to' => 'esquared@foofoo.com',
              'from' => 'pminich@foo.com',
              'action' => 'The uncleanable file is deleted.'
            };
    $VAR1 = {
              'date' => '06/30/2002 00:01:21',
              'file' => 'Nr.pif',
              'virus' => 'WORM_KLEZ.H',
              'to' => 'inet@microsoft.com',
              'from' => 'mef@mememe.com',
              'action' => 'The uncleanable file is deleted.'
            };
    Like the solutions above, this will give you a hash for each record, making it easy to insert into a database. But you'll notice that I do almost no work to achieve the result: there are really only two lines of Perl (the code blocks in the grammar) that actually do anything here, aside from the split! It will also handle any new message types if they are ever added to your log.

    And I'm sure it could be even simpler, but I don't think it's too bad for my first program with Parse::RecDescent. :)
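    Once each record is a hash like that, the database step is mostly mechanical. A rough sketch follows; the table name virus_log, the column list and the connection details are all placeholders, not anything defined above:

    use DBI;

    my @cols = qw(date from to file action virus);

    # one parsed record, as produced above (the first one has no date)
    my $record = {
        from   => 'pminich@foo.com',
        to     => 'esquared@foofoo.com',
        file   => 'value.scr',
        action => 'The uncleanable file is deleted.',
        virus  => 'WORM_KLEZ.H',
    };

    # placeholder connection details and table/column names
    my $dbh = DBI->connect('DBI:mysql:database=viruslog', 'user', 'pass',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare(
          'INSERT INTO virus_log (' . join(',', map { "`$_`" } @cols) . ')'
        . ' VALUES (' . join(',', ('?') x @cols) . ')'
    );
    $sth->execute( @{$record}{@cols} );   # missing fields (date here) become NULL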
Re: virus log parser
by jjohn (Beadle) on Jul 03, 2002 at 02:19 UTC

    I have an anti-virus log file that I would like to eventually put into a mysql database; but I'm having problems parsing it.

    This is exactly the class of problems for which Perl was designed. There are many ways to approach this problem, as has already been shown. I'd like to submit my quick and dirty version here. It reads through the log file (really anything on STDIN) and creates an array of hash references, suitable for sorting or iterating through to collect stats like most common email address or virus. This version is short and hopefully transparent.

    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    my (@log, %rec);
    while(<>){
        if( /^-/ ){
            push @log, { %rec };
            %rec = ();
            next;
        }
        chomp;
        my ($k, $v) = split /\s*:\s*/, $_, 2;
        $rec{ $k } = $v if $k;
    }
    push @log, { %rec } if keys %rec;
    print Dumper(\@log);

    While slurping input, each line is checked to see if it is an "end of record" marker, which is defined here as any line beginning with a dash. If this doesn't match your reality, you will need to tinker with this line. When the end of record is found, the hash that represents that record is stuffed into the @log array. Since arrays can only hold scalar values, a hash reference is needed. Unfortunately, we can't simply use a reference taken like this: \%rec, because that hash will be erased on the very next line! Instead, we create a brand new anonymous hash with { } and stuff that away. We then clear out the "global" hash and grab the next line of input.
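    A tiny, contrived illustration of why the copy matters:

    my (@bad, @good, %rec);
    %rec = ( virus => 'WORM_KLEZ.H' );
    push @bad,  \%rec;      # a reference to the one shared hash
    push @good, { %rec };   # a reference to a fresh copy
    %rec = ();              # "next record" -- this also empties what @bad points at
    print scalar(keys %{ $bad[0]  }), "\n";   # 0
    print scalar(keys %{ $good[0] }), "\n";   # 1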

    If the line of input isn't an end-of-record line, then the newline is removed and the very potent split operator is used to separate the key from the value. This assumes that the key and value are on the same line, of course. As a defensive measure, any whitespace around the colon is consumed. The often-neglected third argument to split indicates how many fields split should produce. Even if a colon appears somewhere in the value field, it will still end up as part of the $v variable. After creating a key and a value variable ($k and $v), the record hash is populated with these values, provided the key is a true value. This prevents silly things like blank or malformed lines from disturbing your hash.
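    For example (a made-up Action line, just to show the limit of 2 at work):

    my ($k, $v) = split /\s*:\s*/, 'Action: Cleaned: renamed to value.vir', 2;
    # $k is 'Action', $v is 'Cleaned: renamed to value.vir'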

    When the loop exits, you might not have pushed the last hash into the @log array (e.g. the last record separator might have gone missing on you). Therefore, a check is made to see whether %rec still has any keys; if it does, that final record is pushed into @log as well.

    I use Data::Dumper merely to show that @log has been populated correctly. If you aren't familiar with Data::Dumper, do make yourself acquainted. It can be a real lifesaver.

    I leave the writing of the analysis of @log as an exercise for the reader. If references and dereferences make your head spin, take a look at Mark-Jason Dominus's Understand References Today.
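    For what it's worth, that exercise might begin with something like this (a sketch, assuming @log as built above; note the keys keep the capitalization from the log):

    my %virus_count;
    $virus_count{ $_->{Virus} }++ for grep { defined $_->{Virus} } @log;

    for my $virus (sort { $virus_count{$b} <=> $virus_count{$a} } keys %virus_count) {
        printf "%-20s %d\n", $virus, $virus_count{$virus};
    }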

    Hope this helps.

Re: virus log parser
by thpfft (Chaplain) on Jul 03, 2002 at 03:14 UTC

    I'm not sure what you were planning with the matrices: if you want to work further with this data, or move it into a database, you're probably best off pulling it into a hash, or an array-of-hashes.

    If the file is very large, or memory is limited, you may have to read the file line by line, as others have suggested, insert each completed record into the database and then use that to perform whatever analyses made you want to put them there in the first place.

    If you're more interested in a quick scan - how much Klez this week? - then an AoH will be more fun. You should probably still use a cursor to read the file, though. It might be more dashing to do an enormous split on /-+/, but it wouldn't be wise, especially if you reset $/ to do it. I really wouldn't do that; it's a little too sweeping.

    If there was a unique identifier with each record, then a HoH would be more useful: a big hash in which the keys come from your unique field and each value is another hash containing the foo=bar pairs you've extracted. The main advantage would be that you share a key with the original file, allowing (for example) incremental updates of the database.

    But there doesn't seem to be a useful hook like that, unless the events are rare enough that you don't mind assuming the timestamp on each entry is unique. So everything would go into an array instead, and the array index could serve as a makeshift id. You could still use the dates to act on only part of the file, or just invoke your script from logrotate.
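    If you did decide to risk keying on the timestamp, the shape would be roughly this (sketch only; %record stands for one parsed entry, stubbed here with values from the sample log, and duplicate dates would silently overwrite each other):

    my %record = ( date => '06/30/2002 00:01:21', virus => 'WORM_KLEZ.H' );

    my %by_date;
    $by_date{ $record{date} } = { %record };

    # later: act on a date range, or just the newest entries
    for my $date (sort keys %by_date) {
        print "$date -> $by_date{$date}{virus}\n";
    }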

    I'll assume that you're putting everything in a database first and then working with it later. This is pretty hasty, but tested, and I've tried to keep it readable:

    #!/usr/bin/perl
    use strict;
    use DBI;
    use Data::Dumper;

    # decide which bits of the records you want to keep
    my @fields_to_store = qw(date from to file action virus);

    # turn that into a hash with which to screen regex matches
    my %field_ok = map { $_ => 1 } @fields_to_store;

    # and two strings for the database insert statement: one of column
    # names, one with the proper number of placeholders.
    my $field_list   = join(',', @fields_to_store);
    my $placeholders = join(',', ('?') x scalar(@fields_to_store));

    # connect to the database
    my $dsn = "DBI:mysql:database=xxxx;host=localhost";
    my $dbh = DBI->connect($dsn, 'xxxx', 'xxxx', { 'RaiseError' => 1 });

    # build the instruction that will be used to insert each record
    my $insert_handle = $dbh->prepare(
        "insert into xxxx ($field_list) values ($placeholders)");

    # read the file. this %gather basket is crude, but effective
    # enough, so I offer it in the spirit of tmtowtdi
    my %gather;
    while(<DATA>) {

        # match data line?
        if (m/^(\w+):\s*(.+?)\s*$/ && $field_ok{lc $1}) {
            die "overwriting $1 field: broken" if exists $gather{lc $1};
            $gather{lc $1} = $2;
        }

        # match dividing line?
        if (m/^-+\s*$/ && keys %gather) {

            # field order matters, of course, so use the fields_to_store array
            # in a map{} to order the contents of %gather, which would
            # otherwise be jumbled
            $insert_handle->execute( map { $gather{lc $_} } @fields_to_store );
            print Dumper \%gather;
            %gather = ();
        }
    }
    $insert_handle->finish;

    __DATA__
    ----------------------------------
    Date: 06/30/2002 00:01:21
    From: pminich@foo.com
    To: esquared@foofoo.com
    File: value.scr
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    Date: 06/30/2002 00:01:21
    From: mef@mememe.com
    To: inet@microsoft.com
    File: Nr.pif
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------

    For your database to be of much use you'd really need to split the email field and store that in a separate table, with another table in between that and the main one to hold the links between log entries and addresses. By that stage it would already be worth looking for something like Class::DBI to do the drudgery for you.
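    As a very rough sketch of that layout (every table and column name here is a placeholder, and $dbh is the handle from the script above):

    # one row per log entry, one row per address, and a link table between them
    $dbh->do('CREATE TABLE entry   (entry_id   INT AUTO_INCREMENT PRIMARY KEY,
                                    log_date   DATETIME,
                                    file VARCHAR(255), action VARCHAR(255),
                                    virus VARCHAR(64))');
    $dbh->do('CREATE TABLE address (address_id INT AUTO_INCREMENT PRIMARY KEY,
                                    email      VARCHAR(255) UNIQUE)');
    $dbh->do('CREATE TABLE entry_address (entry_id   INT,
                                          address_id INT,
                                          role       VARCHAR(4))');  # 'from' or 'to'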

Re: virus log parser
by flocto (Pilgrim) on Jul 02, 2002 at 21:40 UTC

    I would not try using the input separator for doing things like this. It's a lot easier to parse the file line by line:

    # open file, DB-connection, etc..
    my %data;
    while (my $line = <INPUT>) {
        chomp ($line);
        if ($line =~ m/^-+$/) {
            &save (%data);
            %data = ();
        }
        elsif ($line =~ m/^(\w+):\s(.+)$/) {
            $data{lc($1)} = $2;
        }
        elsif ($debug) {
            warn $line;
        }
    }

    sub save {
        # up to you :)
    }

    You should note that the regex is not optimal. If it were as easy to read as the one above, I would have written m/([^:]+):\s/ and used $1 and $'. Dig into perlre if you're interested. Another thing to note is that you should make sure all the keys of the hash you want to save to the database have well-defined values! Oh, and last but not least: the only reason I wrote &save was to demonstrate that it is not a built-in function.
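    One possible shape for that check inside save() (a sketch; the field list is an assumption, and the actual storage is still up to you):

    sub save {
        my (%data) = @_;
        for my $field (qw(date from to file action virus)) {
            unless (defined $data{$field}) {
                warn "record is missing '$field'\n";
                $data{$field} = '';
            }
        }
        # ...now hand %data to a DBI insert, or print it tab-delimited.
    }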

    Regards,
    -octo-

Re: virus log parser
by yodabjorn (Monk) on Jul 03, 2002 at 02:04 UTC
    Although you have a couple of approaches described here already, I decided to munge it myself. (After all, it's Perl and there's always more than one way!)

    This code uses an array of hashes to represent the records, and the left-hand identifier (From:, Date:, etc.) is dynamically used for the hash keys. This is IMHO more flexible. Here's the code:
    #!/usr/bin/perl
    use strict ;
    use warnings ;
    use Data::Dumper ;

    my @records ;
    my $count = 0 ;

    while (<DATA>) {
        next if ( /^\n/ ) ;               # skip newlines
        if (/^--/)                        # new record
        {
            $count++ ;
            next ;
        }
        my ( $field, @words ) = split ;   # get the 2 needed fields
        $field =~ s/://g ;                # drop the ":"
        my $data = join " ", @words ;     # make a string
        chomp $data ;                     # remove the newline
        $records[$count]{$field} = $data ;
    }

    print Dumper(\@records);              # easy way to unfold the structure

    __DATA__
    From: pminich@foo.com
    To: esquared@foofoo.com
    File: value.scr
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    Date: 06/30/2002 00:01:21
    From: mef@mememe.com
    To: inet@microsoft.com
    File: Nr.pif
    Action: The uncleanable file is deleted.
    Virus: WORM_KLEZ.H
    ----------------------------------
    You can now unfold the structure by looping through @records:
    foreach my $record (@records) {
        # $record is now a ref to the hash it contained

        # print a field from the record
        print "\nNew Record\n" ;
        print "FROM: $$record{From}, \n" ;

        # or loop through each key for the current record
        foreach my $key (sort keys %$record ) {
            print "$key => $$record{$key} \n" ;
        }
    }
    Hope it helps !
