thezip has asked for the wisdom of the Perl Monks concerning the following question:
Hello all,
I have a log parsing problem, and I seek suggestions as to a reasonable Perlish solution -- I'm not really looking for any code per se, just algorithmic "advice".
First, I'll address the things that are given and cannot be changed.
Each day, I collect a dump from a logfile generator, which is the accumulation of all log entries since the beginning of that month. Each day, a new file is collected, and is theoretically at least as big as the previous day's file. I do not have the ability to directly control this "logfile source", so I must deal with the cumulative nature of the resulting files.
Occasionally, through magic processes that I also have no control over, there may be a purge of the "logfile source", which causes the next day's cumulative file size to restart from 0 bytes; that file then contains only what was collected after the purge.
My program must "reconstruct" all of the unique log entries for the given month for a given server.
Assumptions:
- A log entry is discrete, and can be uniquely identified by its timestamp (and potentially other key data if necessary)
- Time and space are of minimal consequence, although I'd like this to run in a reasonable amount of time
- There are 10 servers, each of which generates one log file per day of the month
- There are approximately 5000 log entries available at the end of each month for a server
- Each log entry is approximately 250 chars in length
In summary, there will be around 310 files, the largest of which will be somewhat over 1.2 MB -- nothing major. Each server will have its logs unique-ified into its own file.
Certainly, in Unix, I could do something like:
1) Concatenate files into a single file
2) Then do: `sort -u <concatfile> > <sortedfile>`
... but I suspect this will eventually live on a Windows box.
I thought that maybe I could do an MD5 digest for each log entry, and then use that as a hash key for subsequent collision checks (ie. ignore all subsequent redundancies).
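That digest idea could be sketched roughly like this (the sample entries are hypothetical, just to show the mechanism):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my @lines = (
    "20070201 00:01:02 server1 started",
    "20070201 00:05:00 job A finished",
    "20070201 00:01:02 server1 started",   # same entry, seen again in a later daily dump
);

my ( %seen, @unique );
for my $line (@lines) {
    my $key = md5_hex($line);              # digest of the whole entry as the hash key
    push @unique, $line unless $seen{$key}++;
}
print "$_\n" for @unique;
```

That said, at ~5000 entries of ~250 characters per server, using the line itself as the hash key costs barely any more memory and sidesteps even the (astronomically unlikely) digest collision.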
Thoughts?
Where do you want *them* to go today?
Re: Logfile parsing across redundant files
by BrowserUk (Patriarch) on Feb 02, 2007 at 06:45 UTC
|
Certainly, in Unix, I could do something like:
1) Concatenate files into a single file
2) Then do: `sort -u <concatfile> > <sortedfile>`
... but I suspect this will eventually live on a Windows box.
Get yourself the sort program from UnxUtils. It supports -u, which appears to be all you need to use your Unix solution on a Windows box:
c:\>u:sort --help
Usage: u:sort [OPTION]... [FILE]...
Write sorted concatenation of all FILE(s) to standard output.

Ordering options:

Mandatory arguments to long options are mandatory for short options too.
  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order       consider only blanks and alphanumeric characters
  -f, --ignore-case            fold lower case to upper case characters
  -g, --general-numeric-sort   compare according to general numerical value
  -i, --ignore-nonprinting     consider only printable characters
  -M, --month-sort             compare (unknown) < `JAN' < ... < `DEC'
  -n, --numeric-sort           compare according to string numerical value
  -r, --reverse                reverse the result of comparisons

Other options:
  -c, --check                  check whether input is sorted; do not sort
  -k, --key=POS1[,POS2]        start a key at POS1, end it at POS2 (origin 1)
  -m, --merge                  merge already sorted files; do not sort
  -o, --output=FILE            write result to FILE instead of standard output
  -s, --stable                 stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE       use SIZE for main memory buffer
  -t, --field-separator=SEP    use SEP instead of non- to whitespace transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or c:/temp;
                                 multiple options specify multiple directories
  -u, --unique                 with -c: check for strict ordering;
                                 otherwise: output only the first of an equal run
  -z, --zero-terminated        end lines with 0 byte, not newline
      --help                   display this help and exit
      --version                output version information and exit

POS is F[.C][OPTS], where F is the field number and C the character position
in the field. OPTS is one or more single-letter ordering options, which
override global ordering options for that key. If no key is given, use the
entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

Report bugs to <bug-textutils@gnu.org>.
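Concretely, folding one server's daily dumps into a single de-duplicated monthly file could look like this (file names hypothetical):

```shell
# Create two overlapping daily dumps, then merge them uniquely.
mkdir -p logs
printf 'entry b\nentry a\n'          > logs/server1-day01.log
printf 'entry a\nentry b\nentry c\n' > logs/server1-day02.log

# LC_ALL=C forces the traditional byte-value sort order, per the warning above.
LC_ALL=C sort -u logs/server1-day*.log > logs/server1-month.log
cat logs/server1-month.log
```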
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
#! perl -slw
use strict;

my $dir = $ARGV[ 0 ] || die 'Need a directory';

# Accumulate every distinct line across all of the daily dumps
my %hash;
while( my $file = <"$dir/*.log"> ) {
    open my $fh, '<', $file or die "$file : $!";
    while( <$fh> ) {
        $hash{ $_ } = 1;
    }
    close $fh;
}

# Write the unique lines, ordered by their leading timestamps
open my $fh, '>', "$dir/composite.log" or die $!;
print $fh $_ for sort keys %hash;
close $fh;
This assumes that all 31 log files from a particular server are located in a single directory, that no other files are in that directory, and that the lines can be ordered by a plain string sort -- e.g. each line carries a date/time stamp at the beginning, in some sensible form (YYYYMMDD HH:MM:SS) that sorts correctly.
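The stamp format matters because sort keys %hash is a plain string comparison. A quick illustration (stamps hypothetical): ISO-style YYYYMMDD stamps sort chronologically as strings, while MM/DD/YYYY stamps do not.

```perl
use strict;
use warnings;

# ISO-style stamps: string order == chronological order
my @iso = sort "20070110 02:00:00 restart", "20061225 23:59:00 shutdown";
# the December 2006 entry correctly sorts first

# US-style stamps: string order != chronological order
my @us = sort "01/10/2007 02:00:00 restart", "12/25/2006 23:59:00 shutdown";
# "01/..." now sorts before "12/...", putting January 2007 ahead of December 2006
```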
Hmm. I hadn't realised that the links didn't work. I just searched for them but didn't check them, since I've had my copies for years. A quick browse doesn't turn up any reason why, either.
If you want a copy, /msg me an email address and I'll forward it to you.
Re: Logfile parsing across redundant files
by roboticus (Chancellor) on Feb 02, 2007 at 12:04 UTC
thezip:
Perhaps you could simply keep track of the file position of the last log entry you've processed, and then process all entries found after that point. If the file is smaller than that saved position, take all the lines. Something like this (untested, off the top of my head):
use Fcntl qw(SEEK_SET);

# Read logfile names with last position read yesterday
my %logs;
open my $inf, "<", "loglist.txt" or die "loglist.txt: $!";
while (<$inf>) {
    chomp;
    my ($fname, $fpos) = split /\|/;
    $logs{$fname} = $fpos;
}
close $inf;

# Get new lines from each file
for my $fname (keys %logs) {
    open my $ouf, '>>', $fname . ".cumulative" or die "$fname.cumulative: $!";
    open my $inf, '<', $fname or die "$fname: $!";
    if ($logs{$fname} < (stat $inf)[7]) {
        # Continue from where we left off yesterday
        seek $inf, $logs{$fname}, SEEK_SET;
    }
    # else: file has shrunk (purged) -- start at beginning of file
    while (<$inf>) {
        print $ouf $_;
    }
    $logs{$fname} = tell $inf;
    close $inf;
    close $ouf;
}

# Rewrite list of files and positions
open my $ouf, ">", "loglist.txt" or die "loglist.txt: $!";
print $ouf join "\n", map { $_ . '|' . $logs{$_} } keys %logs;
close $ouf;
--roboticus
|
Doesn't that leave a hole in the logic where today's file was truncated, but grew to be larger than yesterday's file?
For example a busy Monday after a quiet Sunday?
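The hole can be demonstrated in a few lines (data hypothetical): the offset saved at the end of a quiet Sunday is smaller than Monday's purged-but-busier file, so the size test passes and the code "resumes" mid-file, silently skipping the first entries written after the purge.

```perl
use strict;
use warnings;

my $saved_pos = 20;   # offset saved at the end of a quiet Sunday
# Monday's file was purged, then grew past 20 bytes again:
my $monday = "fresh line 1\nfresh line 2\nfresh line 3\n";

# Same test as in the snippet above: saved position < current size
my $processed = $saved_pos < length($monday)
    ? substr( $monday, $saved_pos )   # "resumes" mid-file, losing post-purge entries
    : $monday;
print $processed;
```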