http://qs321.pair.com?node_id=527058

matt.tovey has asked for the wisdom of the Perl Monks concerning the following question:

An application-programmer colleague of mine, noting that I program in Perl, asked me for a script to help him analyse log files. How could I refuse?

Problem is, my solution needs the whole logfile in memory, and these things can be quite large (100MB), so I'm wondering if there's a better way. No, strike that, I'm wondering what the better way is. :)

The logfile consists of lock and unlock events for various mutexes; my colleague wants to disregard (i.e. remove) all references to locks which are later unlocked. I'm currently doing it like this:

# Log(level, message) is a small logging helper defined elsewhere (not shown).
my @lines;          # Array of lines to be kept for output.
push(@lines, '');   # Preload output array with a null line 0 (for correct line counts).
my %locks;          # Hash of currently open locks (address => line number).

while (<STDIN>) {
    my $count = push(@lines, $_) - 1;
    Log 2, "Analysing line $count";
    if ($_ =~ /Mutex\((.*)\)::(\w+)/) {    # Regexp to obtain address and action info from line.
        Log 2, "Address and action parsed correctly.";
        my $address = $1;
        my $action  = $2;
        if ($action eq 'locking') {
            Log 2, "Address $address locked at line $count";
            if (defined $locks{$address}) {
                Log 0, "ERROR: Address $address locked at line $count, but already locked at line $locks{$address}.";
            }
            $locks{$address} = $count;
        }
        if ($action eq 'unlocked') {
            Log 2, "Address $address unlocked at line $count";
            unless (defined $locks{$address}) {
                Log 0, "ERROR: Address $address not locked, but unlocked at line $count.";
            }
            else {
                $lines[$locks{$address}] = '';
                Log 1, "Found a match for address $address: locked $locks{$address}, unlocked $count. Removing from output.";
            }
            delete $locks{$address};    # Forget the lock, so a later re-lock of this address isn't flagged as an error.
            $lines[$count] = '';
        }
    }
}
foreach (@lines) { print }

Nearly all of the locks that are set are cleared again soon afterwards, so memory use could be reduced significantly by a solution that frees memory as soon as it is no longer required. I was (very optimistically!) hoping that setting array entries to '' would free the memory, but a bit of profiling shows that it doesn't.

So, what would be a better solution? Here are some ideas that I had (a rough sketch of a third approach follows the list):

- splice lines no longer required out of @lines.
Con: I'd have to go through %locks, changing the line numbers each time I remove a line.

- grep backwards through @lines when I find an 'unlocked' message.
Cons: Probably much slower, and I'd no longer pick up on the 'Address already locked' error condition that I find in my solution.
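
A third possibility, sketched below (untested against the real log format, and using the same regexp as the script above): keep only the lines that are still pending in a hash keyed by the line number $., and delete both halves of a matched lock/unlock pair as soon as the 'unlocked' line arrives. Memory then only grows with the number of currently open locks plus any lines that are never cancelled out.

use strict;
use warnings;

my %keep;     # line number => line text, for lines not (yet) cancelled out
my %locks;    # mutex address => line number of its pending 'locking' line

while (my $line = <STDIN>) {
    if ($line =~ /Mutex\((.*)\)::(\w+)/) {
        my ($address, $action) = ($1, $2);
        if ($action eq 'locking') {
            warn "ERROR: Address $address locked at line $., but already locked at line $locks{$address}.\n"
                if exists $locks{$address};
            $locks{$address} = $.;
            $keep{$.} = $line;
        }
        elsif ($action eq 'unlocked') {
            if (exists $locks{$address}) {
                # Matched pair: forget the lock and free the stored 'locking' line.
                my $lock_line_no = delete $locks{$address};
                delete $keep{$lock_line_no};
            }
            else {
                warn "ERROR: Address $address unlocked at line $., but not locked.\n";
            }
            # The 'unlocked' line itself is never stored, so there is nothing to free here.
        }
    }
    else {
        $keep{$.} = $line;    # keep non-mutex lines untouched
    }
}

# Print the survivors in their original order.
print $keep{$_} for sort { $a <=> $b } keys %keep;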

Update:
Sorry, here's some example data:

DEBUG   : MUTEX       : : Mutex(0x30080002 + 0)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::unlocked
DEBUG   : MUTEX       : : Mutex(0x30080002 + 4)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking
DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::unlocked
DEBUG   : MUTEX       : : Mutex(0x30080002 + 4)::unlocked
In this case, all but the first line should be removed.
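
For reference, this is what the regexp in the script captures from one of these lines; the greedy .* keeps the "+ offset" part inside the address, which is what makes each slot a distinct hash key:

my $sample = 'DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking';
if ($sample =~ /Mutex\((.*)\)::(\w+)/) {
    print "address = '$1'\n";    # address = '0x30080002 + 2'
    print "action  = '$2'\n";    # action  = 'locking'
}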

Update 2:
Thanks to everyone for your help - learning about '$.' alone has made this worthwhile for me!

The contributions from hv and meetraz are much more efficient, and run much faster too. Thanks! However, I'll have to ask my colleague if he actually needs to see the lines which weren't removed (which I simplified somewhat in the sample data above - sorry!), or if such a report on the locks will suffice.
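
(hv's and meetraz's replies aren't reproduced here. But if a report on the locks does suffice, a pass that never stores any log lines at all could look something like the following rough sketch, again assuming the same regexp.)

use strict;
use warnings;

my %locks;    # mutex address => line number of the pending 'locking' line

while (<STDIN>) {
    next unless /Mutex\((.*)\)::(\w+)/;
    my ($address, $action) = ($1, $2);
    if    ($action eq 'locking')  { $locks{$address} = $.; }
    elsif ($action eq 'unlocked') { delete $locks{$address}; }
}

# Report only the locks that were never released, in log order.
for my $address (sort { $locks{$a} <=> $locks{$b} } keys %locks) {
    print "Lock on $address taken at line $locks{$address} was never released\n";
}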

Update 3: :(
Merely replacing $count with $. increases my run-time by 30% (65MB testfile, lines stored in a hash, 17 -> 22 seconds)! Who'd have thought?
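
(A quick way to measure the $.-versus-counter overhead in isolation, as a rough sketch using an in-memory filehandle instead of the real log file:)

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare maintaining a counter manually against reading the magic $. variable.
my $data = "DEBUG   : MUTEX       : : Mutex(0x30080002 + 2)::locking\n" x 100_000;

cmpthese(-2, {
    counter => sub {
        open my $fh, '<', \$data or die $!;
        my $count = 0;
        while (<$fh>) { my $n = ++$count; }
    },
    line_var => sub {
        open my $fh, '<', \$data or die $!;
        while (<$fh>) { my $n = $.; }
    },
});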