PerlMonks  

Logfile parsing across redundant files

by thezip (Vicar)
on Feb 02, 2007 at 06:08 UTC ( [id://597888] )

thezip has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have a log parsing problem, and I seek suggestions as to a reasonable Perlish solution -- I'm not really looking for any code per se, just algorithmic "advice".

First, I'll address the things that are given and cannot be changed.

Each day, I collect a dump from a logfile generator, which is the accumulation of all log entries since the beginning of that month. Each day, a new file is collected, and is theoretically at least as big as the previous day's file. I do not have the ability to directly control this "logfile source", so I must deal with the cumulative nature of the resulting files.

Occasionally, through magic processes that I also have no control over, there may be a purging of the "logfile source", which causes the next day's cumulative file to restart from 0 bytes and then contain only what was collected after the purge.

My program must "reconstruct" all of the unique log entries for the given month for a given server.

Assumptions:

  1. A log entry is discrete, and can be uniquely identified by its timestamp (and potentially other key data if necessary)
  2. Time and space are of minimal consequence, although I'd like this to run in a reasonable amount of time
  3. There are 10 servers, each of which will generate one log file per day of the month
  4. There are approximately 5000 log entries available at the end of each month for a server
  5. Each log entry is approximately 250 chars in length

In summary, there will be around 310 files (10 servers × 31 days), each somewhat over 1.2 MB -- nothing major. Each server will have its logs de-duplicated into its own output file.

Certainly, in Unix, I could do something like:

1) Concatenate the files into a single file
2) Then do: `sort -u <concatfile> > <sortedfile>`

... but I suspect this will eventually live on a Windows box.

I thought that maybe I could compute an MD5 digest for each log entry, and then use that as a hash key for subsequent collision checks (i.e. ignore all subsequent redundancies).
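Roughly what I have in mind is just a sketch like this, using the core Digest::MD5 module (the file list and output name here are only placeholders):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my @files = glob 'server1_*.log';   # placeholder: one server's daily dumps
    my %seen;                           # digest of every entry already written

    open my $out, '>', 'server1.unique' or die $!;
    for my $file (@files) {
        open my $in, '<', $file or die "$file: $!";
        while ( my $line = <$in> ) {
            next if $seen{ md5_hex( $line ) }++;   # redundant entry: skip it
            print $out $line;
        }
        close $in;
    }
    close $out;

At ~5000 entries per month I could probably just use the raw line as the hash key, but the digest keeps the keys small.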

Thoughts?

Where do you want *them* to go today?

Replies are listed 'Best First'.
Re: Logfile parsing across redundant files
by BrowserUk (Patriarch) on Feb 02, 2007 at 06:45 UTC
    Certainly, in Unix, I could do something like:

    1) Concatenate files into a single file

    2) Then do: `sort -u <concatfile> > <sortedfile>`

    ... but I suspect this will eventually live on a Windows box.

    Get yourself the sort program from UnxUtils. It supports -u, which appears to be all you need to use your Unix solution on a Windows box:

    c:\>u:sort --help
    Usage: u:sort [OPTION]... [FILE]...
    Write sorted concatenation of all FILE(s) to standard output.

    Ordering options:

    Mandatory arguments to long options are mandatory for short options too.
      -b, --ignore-leading-blanks  ignore leading blanks
      -d, --dictionary-order       consider only blanks and alphanumeric characters
      -f, --ignore-case            fold lower case to upper case characters
      -g, --general-numeric-sort   compare according to general numerical value
      -i, --ignore-nonprinting     consider only printable characters
      -M, --month-sort             compare (unknown) < `JAN' < ... < `DEC'
      -n, --numeric-sort           compare according to string numerical value
      -r, --reverse                reverse the result of comparisons

    Other options:

      -c, --check                  check whether input is sorted; do not sort
      -k, --key=POS1[,POS2]        start a key at POS1, end it at POS2 (origin 1)
      -m, --merge                  merge already sorted files; do not sort
      -o, --output=FILE            write result to FILE instead of standard output
      -s, --stable                 stabilize sort by disabling last-resort comparison
      -S, --buffer-size=SIZE       use SIZE for main memory buffer
      -t, --field-separator=SEP    use SEP instead of non- to whitespace transition
      -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or c:/temp
                                   multiple options specify multiple directories
      -u, --unique                 with -c: check for strict ordering
                                   otherwise: output only the first of an equal run
      -z, --zero-terminated        end lines with 0 byte, not newline
          --help                   display this help and exit
          --version                output version information and exit

    POS is F[.C][OPTS], where F is the field number and C the character position
    in the field. OPTS is one or more single-letter ordering options, which
    override global ordering options for that key. If no key is given, use the
    entire line as the key.

    SIZE may be followed by the following multiplicative suffixes:
    % 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

    With no FILE, or when FILE is -, read standard input.

    *** WARNING ***
    The locale specified by the environment affects sort order.
    Set LC_ALL=C to get the traditional sort order that uses native byte values.

    Report bugs to <bug-textutils@gnu.org>.
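    For example (file names made up), it also accepts multiple FILE arguments plus -o, so you can skip the concatenation step entirely:

    c:\>u:sort -u server1_day01.log server1_day02.log server1_day03.log -o server1_composite.log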

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Point well taken, TYVM.

      Now what's the Perlish way to solve this?


        On the basis of what you've said about the data, it could be as simple as this:

        #! perl -slw
        use strict;

        my $dir = $ARGV[ 0 ] || die 'Need a directory';

        my %hash;
        while( my $file = <"$dir/*.log"> ) {
            open my $fh, '<', $file or die "$file : $!";
            while( <$fh> ) {
                $hash{ $_ } = 1;
            }
            close $fh;
        }

        open my $fh, '>', "$dir/composite.log" or die $!;
        print $fh $_ for sort keys %hash;
        close $fh;

        This assumes that all 31 log files from a particular server are located in a single directory, that no other files are in that directory, and that the lines can be sorted alphanumerically -- e.g. each line carries a date/time stamp at the beginning, in some sensible form (YYYYMMDD HH:MM:SS) that sorts correctly.
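        A hypothetical invocation, assuming the script is saved as composite.pl and each server's logs sit in their own directory:

        c:\>perl composite.pl c:\logs\server01

        (On a re-run the composite.log it wrote will itself match the *.log glob; that's redundant but harmless, since its lines are already unique and the file is simply rewritten.)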


      Having hit this just last night I thought I'd ask: how long has UnxUtils actually been unavailable? Clicking either of the .zip links gets you "You don't have permission to access /UnxUpdates.zip on this server."

      I'd love to equip my Windows brethren with some basic tools (like 'wc') but don't want to ask them to do Cygwin.

        Hmm. I hadn't realised that the links didn't work. I just searched for them, but didn't check them, since I've had my own copies for years. A quick browse doesn't turn up any reason for it, either.

        If you want a copy, /msg me an email address and I'll forward it to you.


Re: Logfile parsing across redundant files
by roboticus (Chancellor) on Feb 02, 2007 at 12:04 UTC
    thezip:

    Perhaps you could simply keep track of the file position of the last log entry you've processed, and then process all entries found after that. If the file is smaller than it was the previous day, then take all the lines. Something like this (untested, off the top of my head):

    use Fcntl qw(SEEK_SET);

    # Read logfile names with the last position read yesterday
    my %logs;
    open my $inf, '<', 'loglist.txt' or die "loglist.txt: $!";
    while (<$inf>) {
        chomp;
        my ($fname, $fpos) = split /\|/;
        $logs{$fname} = $fpos;
    }
    close $inf;

    # Get new lines from each file
    for my $fname (keys %logs) {
        open my $ouf, '>>', "$fname.cumulative" or die "$fname.cumulative: $!";
        open my $inf, '<', $fname or die "$fname: $!";
        if ($logs{$fname} < (stat $inf)[7]) {
            # Continue from where we left off yesterday
            seek $inf, $logs{$fname}, SEEK_SET;
        }
        else {
            # File is smaller than yesterday (a purge happened):
            # start at the beginning of the file
        }
        while (<$inf>) {
            print $ouf $_;
        }
        $logs{$fname} = tell $inf;
        close $inf;
        close $ouf;
    }

    # Rewrite the list of files and positions
    open my $ouf, '>', 'loglist.txt' or die "loglist.txt: $!";
    print $ouf join("\n", map { "$_|$logs{$_}" } keys %logs), "\n";
    close $ouf;
    --roboticus
        BrowserUk:

        Good catch! I guess we'd have to add some code to cache the first line as well. If the first line is the same, do the same thing as above. If not, then cache the first line and take the whole file.
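        A sketch of that check, assuming loglist.txt grows a hypothetical third field holding yesterday's first line, and that $ouf is the .cumulative handle from the code above:

        use Fcntl qw(SEEK_SET);

        # $fname, $fpos and $first come from yesterday's "name|pos|firstline" record
        open my $inf, '<', $fname or die "$fname: $!";
        my $today_first = <$inf>;
        chomp $today_first if defined $today_first;

        if ( defined $today_first and defined $first
             and $today_first eq $first
             and $fpos <= (stat $inf)[7] ) {
            # Same file as yesterday: resume from where we stopped
            seek $inf, $fpos, SEEK_SET;
        }
        else {
            # The source was purged (or replaced): take the whole file
            seek $inf, 0, SEEK_SET;
            $first = $today_first;        # cache the new first line for tomorrow
        }
        while (<$inf>) { print $ouf $_; }
        $fpos = tell $inf;
        close $inf;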

        --roboticus
