http://qs321.pair.com?node_id=527058

matt.tovey has asked for the wisdom of the Perl Monks concerning the following question:

An application-programmer colleague of mine, noting that I program in Perl, asked me for a script to help him analyse log files. How could I refuse?

Problem is, my solution needs the whole logfile in memory, and these things can be quite large (100MB), so I'm wondering if there's a better way. No, strike that, I'm wondering what the better way is. :)

The logfile consists of the locking and unlocking of various mutexes; my colleague wants to disregard (i.e. remove) all references to locks which are later unlocked. I'm doing it like this:

my @lines;          # Array of lines to be kept for output.
push(@lines, '');   # Preload output array with null line 0 (for correct line counts).
my %locks;          # Hash of currently open locks.

while (<STDIN>) {
    my $count = push(@lines, $_) - 1;
    Log 2, "Analysing line $count";
    if ($_ =~ /Mutex\((.*)\)::(\w+)/) {    # Regexp to obtain address and action info from line.
        Log 2, "Address and action parsed correctly.";
        my $address = $1;
        my $action  = $2;
        if ($action eq 'locking') {
            Log 2, "Address $address locked at line $count";
            if (defined $locks{$address}) {
                Log 0, "ERROR: Address $address locked at line $count, but already locked at line $locks{$address}.";
            }
            $locks{$address} = $count;
        }
        if ($action eq 'unlocked') {
            Log 2, "Address $address unlocked at line $count";
            unless (defined $locks{$address}) {
                Log 0, "ERROR: Address $address not locked, but unlocked at line $count.";
            } else {
                $lines[$locks{$address}] = '';
                Log 1, "Found a match for address $address: locked $locks{$address}, unlocked $count. Removing from output.";
            }
            $locks{$address} = '';
            $lines[$count] = '';
        }
    }
}

foreach (@lines) {print}

Nearly all of the locks that are set will be cleared close to the time that they are set, so the memory-use could be significantly reduced by a solution that frees memory no longer required. I was (very optimistically!) hoping that setting array entries to '' would free the memory, but a bit of profiling shows me that it's not so.
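(For the curious: a rough way to see this for yourself is with Devel::Size -- the numbers are illustrative only. Blanking array entries keeps their storage allocated, whereas deleting hash entries hands it back to perl for reuse.)

use strict;
use warnings;
use Devel::Size qw(total_size);

# Blanking array entries keeps the underlying scalars (and their string
# buffers) allocated inside the array.
my @lines = ('x' x 100) x 10_000;
$_ = '' for @lines;
print "array after blanking: ", total_size(\@lines), " bytes\n";

# Deleting hash entries actually releases those scalars back to perl.
my %lines = map { $_ => 'x' x 100 } 1 .. 10_000;
delete $lines{$_} for keys %lines;
print "hash after deleting:  ", total_size(\%lines), " bytes\n";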

So, what would be a better solution? Here are some ideas that I had:

- splice lines no longer required out of @lines.
Con: I'd have to go through %locks, changing the line numbers each time I remove a line.

- grep backwards through @lines when I find an 'unlocked' message.
Cons: Probably much slower, and I'd no longer pick up on the 'Address already locked' error condition that I find in my solution.

Update:
Sorry, here's some example data:

DEBUG   : MUTEX       : : Mutex(0x30080002 + 0)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::unlocked
DEBUG   : MUTEX       : : Mutex(0x30080002 + 4)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::unlocked
DEBUG   : MUTEX       : : Mutex(0x30080002 + 4)::unlocked
In this case, all but the first line should be removed.

Update 2:
Thanks to everyone for your help - learning about '$.' alone has made this all worthwhile for me!

The contributions from hv and meetraz are much more efficient, and run much faster too. Thanks! However, I'll have to ask my colleague if he actually needs to see the lines which weren't removed (which I simplified somewhat in the sample data above - sorry!), or if such a report on the locks will suffice.

Update 3: :(
Merely replacing $count with $. increases my run-time by 30% (65MB testfile, lines stored in a hash, 17 -> 22 seconds)! Who'd have thought?

Re: Reducing memory usage while matching log entries
by duff (Parson) on Feb 01, 2006 at 14:53 UTC
    I'm wondering what the better way is. :)

    Rather than storing the whole file in memory, use a 2-pass solution. In the first pass, gather the locking information. In the second pass, remove the appropriate lines. Each pass will consist of just a while loop that only holds one line at a time in memory.

    Update: Another option would be to use Tie::File. It lets you treat the file as if it were an array but manages all of the gory details for you. I just looked at the man page and it has an option to limit the amount of memory it consumes.
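    A minimal, untested sketch of the Tie::File route (the copy's filename and the 20MB memory cap are illustrative; Tie::File edits the tied file in place, so you'd work on a copy of the log):

    use strict;
    use warnings;
    use Tie::File;

    # 'mutex.log.copy' is a hypothetical copy of the original log;
    # the memory option caps Tie::File's cache at roughly 20MB.
    tie my @log, 'Tie::File', 'mutex.log.copy', memory => 20_000_000
        or die "Can't tie logfile: $!";

    my %locks;
    for my $i (0 .. $#log) {
        next unless $log[$i] =~ /Mutex\((.*?)\)::(\w+)/;
        my ($address, $action) = ($1, $2);
        if ($action eq 'locking') {
            $locks{$address} = $i;
        }
        elsif ($action eq 'unlocked' and defined $locks{$address}) {
            # Blank both lines of the matched pair in the copy.
            # (Truly deleting them would need splice, which is slower.)
            $log[$i] = $log[ delete $locks{$address} ] = '';
        }
    }
    untie @log;

    The matched pairs end up as blank lines in the copy, which a final pass (or a grep -v '^$') can drop.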

      I like this idea but I'm having trouble getting a 2-pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to munge that list into some convenient form to use while reading the file again in the second pass...

      Tie::File looks quite handy, but since I don't want to alter the original log file, I'd have to copy it first and then reduce it, which could be a problem if disk-space is tight.

      So far, storing the whole file in a hash so that I can properly delete the lines no longer required seems like the winner.

      Salva: The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted values won't always be 'locking' followed by 'unlocked'. Plus the contents of the logfile need to remain in the correct order for analysis of the remaining contents... but thanks!

        I like this idea but I'm having trouble getting a 2-pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to munge that list into some convenient form to use while reading the file again in the second pass...

        I don't understand your difficulty. Pass #1 records line numbers, pass #2 writes all the lines that haven't been recorded. Here's some code based on your original but with a few tweaks:

        #!/usr/bin/perl

        use strict;
        use warnings;

        sub Log;    # predeclaration; supply your own Log() routine

        die "Usage: $0 <filename>\n" unless @ARGV == 1;
        my $logfile = shift;

        # Pass #1 : gather line numbers to be deleted.
        my %locks;           # Hash of currently open locks.
        my @unlock_lines;    # lines to rid ourselves of
        open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n";
        while (<LOGFILE>) {
            Log 2, "Analysing line $.";
            next unless /Mutex\((.*?)\)::(\w+)/;
            my ($address,$action) = ($1,$2);
            if ($action eq 'locking') {
                Log 2, "Address $address locked at line $.";
                if (defined $locks{$address}) {
                    Log 0, "ERROR: Address $address locked at line $., but already locked at line $locks{$address}.";
                }
                $locks{$address} = $.;
            }
            if ($action eq 'unlocked') {
                Log 2, "Address $address unlocked at line $.";
                unless (defined $locks{$address}) {
                    Log 0, "ERROR: Address $address not locked, but unlocked at line $..";
                } else {
                    push @unlock_lines, $., delete $locks{$address};
                }
            }
        }
        close LOGFILE;

        # Sort the line numbers that we've accumulated because we put them in
        # unordered. This allows us to make just one more pass through the file
        # to remove the lines.
        @unlock_lines = sort { $a <=> $b } @unlock_lines;

        # Pass #2: output all but the lines we're not interested in.
        my $rmline = shift @unlock_lines;
        open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n";
        while (<LOGFILE>) {
            if (defined $rmline && $. == $rmline) {
                $rmline = shift @unlock_lines;
                next;
            }
            print;
        }
        close LOGFILE;
        The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted values won't always be 'locking' followed by 'unlocked'.

        That shouldn't be a problem as long as you use a stable sort implementation, or alternatively use the line number or a timestamp as the secondary sorting key.

Re: Reducing memory usage while matching log entries
by tirwhan (Abbot) on Feb 01, 2006 at 15:06 UTC

    Take a look at Tie::File, this allows you to treat the file as an array without loading the entire file into memory.


    There are ten types of people: those that understand binary and those that don't.
Re: Reducing memory usage while matching log entries
by Fletch (Bishop) on Feb 01, 2006 at 14:54 UTC

    Yeah, without sample data it's hard to give advice. But the correct way to remove an element from a hash is with delete, not by assigning an empty string. And you're doing something really weird with @lines; if you get to line 100,000 it's going to allocate a 100,000-element array. Considering it's probably sparse, you should use a hash for that as well.

    If you're still running into memory issues you might look at using a tied hash (DB_File or BerkeleyDB) or even DBD::SQLite to store stuff you've seen somewhere easily retrieved. Those wouldn't keep much of anything in memory but fetching to see the state of a given address should still be pretty quick.
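    For example, a minimal (untested) sketch of the DB_File route -- the scratch filename is made up -- would keep %locks on disk rather than in RAM:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Hypothetical scratch file for the on-disk hash.
    my $dbfile = '/tmp/locks.db';
    tie my %locks, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Can't tie $dbfile: $!";

    while (<STDIN>) {
        next unless /Mutex\((.*?)\)::(\w+)/;
        my ($address, $action) = ($1, $2);
        if    ($action eq 'locking')  { $locks{$address} = $. }
        elsif ($action eq 'unlocked') { delete $locks{$address} }
    }

    untie %locks;
    unlink $dbfile;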

      And you're doing something really weird with @lines; if you get to line 100,000 it's going to allocate a 100,000-element array. Considering it's probably sparse, you should use a hash for that as well.

      Yes, that's the file being held in memory. :(

      I just tried putting that into a hash (also using 'delete' now - thanks!). It runs only slightly slower, and consumes quite a bit less memory, so success there!
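      For reference, a rough sketch of that hash-of-lines variant (line number => line text; error reporting and the Log calls omitted) might look like this:

      use strict;
      use warnings;

      my (%lines, %locks);    # line number => line text, address => line number
      while (<STDIN>) {
          $lines{$.} = $_;
          next unless /Mutex\((.*?)\)::(\w+)/;
          my ($address, $action) = ($1, $2);
          if ($action eq 'locking') {
              $locks{$address} = $.;
          }
          elsif ($action eq 'unlocked' and defined $locks{$address}) {
              delete $lines{ delete $locks{$address} };   # drop the 'locking' line
              delete $lines{$.};                          # drop the 'unlocked' line
          }
      }
      print $lines{$_} for sort { $a <=> $b } keys %lines;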

Re: Reducing memory usage while matching log entries
by salva (Canon) on Feb 01, 2006 at 15:27 UTC
    An easy way is to sort the file on disk using the lock id as the primary sorting key, with some CPAN module such as Sort::External or Sort::Key::Merger, or even the sort command.

    After that, assuming nested locks are not allowed, every lock entry should be followed by an unlock one, something very easy to parse!
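    A rough sketch of that idea using the plain sort command (the filenames, the zero-padded line numbers and the LC_ALL=C setting are my own choices, not part of the suggestion; it also assumes locks on the same id are not nested, as noted above):

    use strict;
    use warnings;

    $ENV{LC_ALL} = 'C';    # predictable byte-wise ordering from sort(1)

    # Decorate each mutex line as "<lock id>\t<zero-padded line number>\t<original line>",
    # then let the external sort order the whole thing on disk.
    open my $in,  '<',  'mutex.log'         or die $!;
    open my $out, '|-', 'sort > sorted.log' or die $!;
    while (<$in>) {
        next unless /Mutex\((.*?)\)::(\w+)/;
        printf {$out} "%s\t%012d\t%s", $1, $., $_;
    }
    close $out or die "external sort failed: $?\n";

    # Walk the sorted stream: a 'locking' line immediately followed by an
    # 'unlocked' line for the same id is a matched pair and gets dropped.
    open my $sorted, '<', 'sorted.log' or die $!;
    my $prev;
    while (<$sorted>) {
        my ($id, $lineno, $text) = split /\t/, $_, 3;
        if ($prev and $prev->[0] eq $id
                  and $prev->[2] =~ /::locking/ and $text =~ /::unlocked/) {
            $prev = undef;    # matched pair: discard both
            next;
        }
        printf "line %d: %s", $prev->[1], $prev->[2] if $prev;   # unmatched entry
        $prev = [$id, $lineno, $text];
    }
    printf "line %d: %s", $prev->[1], $prev->[2] if $prev;

    The unmatched entries come out grouped by lock id rather than in file order, so if the surviving lines have to stay in their original sequence a final numeric sort on the reported line numbers would still be needed.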

Re: Reducing memory usage while matching log entries
by neilwatson (Priest) on Feb 01, 2006 at 14:47 UTC
    Perhaps providing some sample data would help us help you.

    Neil Watson
    watson-wilson.ca

Re: Reducing memory usage while matching log entries
by hv (Prior) on Feb 01, 2006 at 18:28 UTC

    I don't understand why you are using the @lines array at all, and without that your memory requirements go way down. The logic would be something like:

    • %locks records currently open locks, with the key being the mutex address and the value being the input file line number where the lock was acquired
    • if a 'locking' line is seen for a locked mutex, report it and overwrite the %locks entry with the newer lock
    • if an 'unlocked' line is seen for a locked mutex, remove the %locks entry
    • if an 'unlocked' line is seen for an unlocked mutex, report it
    • when the input file is exhausted, report all remaining entries in %locks as "never unlocked"

    Your existing code is close to that: remove the references to @lines; replace the definition of $count with:

    my $count = $.; # input line number
    change the action for unlocking to:
    delete $locks{$address};
    and replace the final foreach (@lines) {print} with:
    for (sort keys %locks) {
        Log 0, "ERROR: address $_ locked at line $locks{$_} but never unlocked.";
    }

    Hope this helps,

    Hugo

Re: Reducing memory usage while matching log entries
by meetraz (Hermit) on Feb 01, 2006 at 18:30 UTC
    Try this:
    use strict;

    my %locks;
    while (defined(my $line = <STDIN>)) {
        chomp $line;
        next unless ($line =~ /Mutex(.*)::(.*)/);
        my ($mutex, $action) = ($1, $2);
        if ($action eq 'locking') {
            if ($locks{$mutex}) {
                printf("Mutex %s already locked on line %d, locked again on line %d\n",
                       $mutex, $locks{$mutex}, $.);
            } else {
                $locks{$mutex} = $.;
            }
        } elsif ($action eq 'unlocked') {
            if ($locks{$mutex}) {
                delete $locks{$mutex};
            } else {
                printf("Mutex %s unlocked on line %d, but wasn't locked.\n", $mutex, $.);
            }
        }
    }

    foreach my $mutex (keys %locks) {
        printf("Mutex %s locked on line %d, but wasn't unlocked\n",
               $mutex, $locks{$mutex});
    }
Re: Reducing memory usage while matching log entries
by meetraz (Hermit) on Feb 01, 2006 at 18:15 UTC
    BTW, you don't have to track the current line number yourself (or prepend an empty string to your array so it doesn't start with zero).
    Just use $. (yes, dollar sign followed by period); it contains the current line number of the last-read filehandle.
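    A tiny illustration (the in-memory filehandle is just for the demo):

    # three fake "log" lines in a string, opened as a filehandle
    my $fake_log = "first\nsecond\nthird\n";
    open my $fh, '<', \$fake_log or die $!;
    while (my $line = <$fh>) {
        print "$. : $line";    # $. holds the line number of the last read
    }

    Note that $. is kept per filehandle and is reset when a handle is explicitly close()d, so a 2-pass script that closes and reopens the log gets fresh line numbers on the second pass.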