How do I read a log file that contents recurring log messages those are separated by newline characters?

WantToBeJediInPerl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've been trying to think of a way to do this for a while and I'm stuck. I code in C but am pretty new to Perl.

This is what I am trying to achieve:

I have a log file that contains log messages separated by newline characters. I need to write a Perl program that finds the 8 most frequently repeated log messages. Please note that the log file might be too big to fit in memory at one time.

Assume the log file has a format similar to the Linux syslog format, as follows:

     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_read
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_write
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_mmap
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_sendfile
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: disagrees about version of symbol zone_table
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol zone_table
     Mar  9 08:15:05  gen-vcs11 kernel: kjslahdisagrees about version of symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol filemap_fdatawrite
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol find_or_create_page

And so on... Any help would be greatly appreciated! Thanks!


Replies are listed 'Best First'.
Re: How do I read a log file that contents recurring log messages those are separated by newline characters? by toolic (Bishop) on Oct 14, 2010 at 19:22 UTC
Store your messages as keys in a hash, and increment the count.

     use strict;
     use warnings;

     my %msgs;
     while (<DATA>) {
         s/^\s+//;
         chomp;
         $msgs{$_}++;
     }

     # Sort by number of occurrences and only show top 8:
     my $i = 0;
     for my $m (sort {$msgs{$b} <=> $msgs{$a}} keys %msgs) {
         print "$msgs{$m} $m\n";
         $i++;
         last if $i == 8;
     }

     __DATA__
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_read
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_write
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_mmap
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol generic_file_sendfile
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: disagrees about version of symbol zone_table
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol zone_table
     Mar  9 08:15:05  gen-vcs11 kernel: kjslahdisagrees about version of symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol filemap_fdatawrite
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol find_or_create_page

See also: perlintro and `perldoc -q sort`
Re^2: How do I read a log file that contents recurring log messages those are separated by newline characters? by TomDLux (Vicar) on Oct 15, 2010 at 14:09 UTC
You are including the time stamps in the key, so identical events a second apart increment separate counts. Or maybe you were showing the overall concept, and leaving the trimming as an exercise for the student?

It looks like gen-vcs11 kernel is a standard component of every line, so I would ignore it. Use split on a colon followed by whitespace (so the colons inside the time stamp are left alone) to extract the second and third components, and use those as a key: `my ( $code, $msg ) = ( split /:\s+/, $_, 3 )[1,2]; $msgs{$code}{$msg}++;`

It becomes even simpler if you only want to preserve the msg component. As Occam said: Entia non sunt multiplicanda praeter necessitatem ("Entities must not be multiplied beyond necessity").
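A minimal sketch of the nested-hash idea, for anyone who wants to see it end to end. It assumes the fields are separated by a colon followed by whitespace, which holds for the sample lines in the question but may not hold for every log. The helper names `code_and_msg` and `top_n_nested` are made up for this illustration, and the sample is read from an in-memory filehandle only so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: split one syslog-style line into (code, message).
# Assumption: fields end in a colon followed by whitespace, so the colons
# inside the time stamp are never treated as separators.
sub code_and_msg {
    my ($line) = @_;
    chomp $line;
    my ( undef, $code, $msg ) = split /:\s+/, $line, 3;
    return ( $code, $msg );
}

# Hypothetical helper: flatten the nested counts into [code, msg, count]
# rows, sort by descending count, and return at most $n rows.
sub top_n_nested {
    my ( $msgs, $n ) = @_;
    my @rows;
    for my $code ( keys %$msgs ) {
        for my $msg ( keys %{ $msgs->{$code} } ) {
            push @rows, [ $code, $msg, $msgs->{$code}{$msg} ];
        }
    }
    @rows = sort {
             $b->[2] <=> $a->[2]
          || $a->[0] cmp $b->[0]
          || $a->[1] cmp $b->[1]
    } @rows;
    splice @rows, $n if @rows > $n;
    return @rows;
}

{
    # Three sample lines taken from the question, with varied time stamps.
    my $sample = <<'END';
Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
Mar  9 08:15:06  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol zone_table
END
    open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";

    my %msgs;
    while ( my $line = <$fh> ) {
        my ( $code, $msg ) = code_and_msg($line);
        # Skip lines that do not have both fields.
        next unless defined $code && defined $msg;
        $msgs{$code}{$msg}++;
    }
    close $fh;

    printf "%d %s: %s\n", $_->[2], $_->[0], $_->[1]
        for top_n_nested( \%msgs, 8 );
}
```

Lines that lack the second separator, such as the "kjslahdisagrees" line in the sample data, are skipped here; whether to skip or count them under a catch-all key is a design choice.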
Re^3: How do I read a log file that contents recurring log messages those are separated by newline characters? by toolic (Bishop) on Oct 15, 2010 at 15:51 UTC
"or maybe you were showing the overall concept, and leaving the trimming as an exercise for the student?"

Yes.
Re: How do I read a log file that contents recurring log messages those are separated by newline characters? by pileofrogs (Priest) on Oct 14, 2010 at 19:16 UTC
This is what Perl is great for. Probably the most direct approach would be to parse each line with a regex to get the parts you care about (i.e., not the timestamp). Create a hash where the key is the relevant part of the line and the value is a number that you increment every time you find the same message. Notice you're not keeping the whole file in memory, just one instance of each distinct line and a number. --Pileofrogs
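The approach described above might look roughly like the following sketch. The timestamp regex is an assumption based on the sample lines in the question (three-letter month, day, hh:mm:ss), and the helper names `message_key`, `count_messages`, and `top_n` are invented for this example. The demo reads from an in-memory filehandle so that it runs as-is; for a real log you would pass a handle opened with `open my $fh, '<', $path`.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a log line to the part used as the hash key.
# Assumption: the line starts with a syslog-style time stamp such as
# "Mar  9 08:15:05"; lines without one are returned trimmed but unchanged.
sub message_key {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^\s+//;
    $line =~ s/^\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+//;
    return $line;
}

# Read the handle one line at a time; only the distinct keys and their
# counts are held in memory, never the whole file.
sub count_messages {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        my $key = message_key($line);
        next unless length $key;    # ignore blank lines
        $count{$key}++;
    }
    return \%count;
}

# Return up to $n keys, most frequent first (ties broken alphabetically).
sub top_n {
    my ( $count, $n ) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b }
        keys %$count;
    my $last = $n < @sorted ? $n - 1 : $#sorted;
    return @sorted[ 0 .. $last ];
}

{
    # Sample lines from the question, with varied time stamps.
    my $sample = <<'END';
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:07  gen-vcs11 kernel: kjslah: Unknown symbol unlock_page
     Mar  9 08:15:05  gen-vcs11 kernel: kjslah: Unknown symbol zone_table
END
    open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
    my $count = count_messages($fh);
    close $fh;

    printf "%6d  %s\n", $count->{$_}, $_ for top_n( $count, 8 );
}
```

Memory use grows with the number of distinct messages rather than with the file size, so this works well when messages repeat often. If nearly every line were unique, the hash itself could become large and a different strategy would be needed.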
Re: How do I read a log file that contents recurring log messages those are separated by newline characters? by wwe (Friar) on Oct 15, 2010 at 14:25 UTC
Maybe you want to visit this excellent site, http://www.loganalysis.org/, which discusses general strategies and also holds some (Perl) programs for log analysis.

