http://qs321.pair.com?node_id=527086


in reply to Re: Reducing memory usage while matching log entries
in thread Reducing memory usage while matching log entries

I like this idea, but I'm having trouble getting a 2-pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to munge that list into some convenient form to use while reading the file again in the second pass...

Tie::File looks quite handy, but since I don't want to alter the original log file, I'd have to copy it first and then reduce it, which could be a problem if disk space is tight.
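
For illustration only, a minimal sketch of that copy-then-reduce idea with Tie::File might look something like this (the filenames are hypothetical, and the Mutex regex is the one used elsewhere in this thread):

    use strict;
    use warnings;
    use File::Copy qw(copy);
    use Tie::File;

    # Work on a copy so the original log is never modified (hypothetical filenames).
    copy('mutex.log', 'mutex.reduced.log') or die "copy failed: $!";
    tie my @lines, 'Tie::File', 'mutex.reduced.log'
        or die "Can't tie mutex.reduced.log: $!";

    my %locks;      # address => index of its open 'locking' line
    my @to_delete;  # indices of matched locking/unlocked pairs

    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /Mutex\((.*?)\)::(\w+)/;
        my ($address, $action) = ($1, $2);
        if ($action eq 'locking') {
            $locks{$address} = $i;
        }
        elsif ($action eq 'unlocked' && defined $locks{$address}) {
            push @to_delete, $i, delete $locks{$address};
        }
    }

    # Delete from the highest index down so earlier indices stay valid.
    splice @lines, $_, 1 for sort { $b <=> $a } @to_delete;
    untie @lines;

Tie::File rewrites the copy in place, so it avoids holding the whole file in memory, but each splice shuffles the tail of the file, so this is simple rather than fast - and it still needs the disk space for the copy.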

So far, storing the whole file in a hash, so that I can properly delete the lines no longer required, seems like the winner.
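
As a rough sketch (assuming the same Mutex log format as the code later in this thread), that hash version might look like:

    use strict;
    use warnings;

    my %keep;   # line number => line text, for lines still of interest
    my %locks;  # address => line number of its open 'locking' entry

    while (my $line = <>) {
        if ($line =~ /Mutex\((.*?)\)::(\w+)/) {
            my ($address, $action) = ($1, $2);
            if ($action eq 'locking') {
                $locks{$address} = $.;
            }
            elsif ($action eq 'unlocked' && defined $locks{$address}) {
                # Matched pair: free the stored 'locking' line and
                # skip this 'unlocked' line entirely.
                delete $keep{ delete $locks{$address} };
                next;
            }
        }
        $keep{$.} = $line;
    }

    # Emit the surviving lines in their original order.
    print $keep{$_} for sort { $a <=> $b } keys %keep;

Deleting the hash entries as pairs are matched lets Perl reuse that memory for later lines, which is the point of keying on line number rather than pushing everything onto an array.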

Salva: The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted values won't always be 'locking' followed by 'unlocked'. Plus, the contents of the logfile need to remain in the correct order for analysis of the remaining contents... but thanks!

Re^3: Reducing memory usage while matching log entries
by duff (Parson) on Feb 01, 2006 at 23:08 UTC
    I like this idea, but I'm having trouble getting a 2-pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to munge that list into some convenient form to use while reading the file again in the second pass...

    I don't understand your difficulty. Pass #1 records line numbers, pass #2 writes all the lines that haven't been recorded. Here's some code based on your original but with a few tweaks:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Minimal stub for the OP's Log routine (assumed to be defined in the
    # original script); here, only level-0 (error) messages are printed.
    sub Log { my ($level, $msg) = @_; print STDERR "$msg\n" if $level == 0 }

    die "Usage: $0 <filename>\n" unless @ARGV == 1;
    my $logfile = shift;

    # Pass #1 : gather line numbers to be deleted.
    my %locks;          # Hash of currently open locks.
    my @unlock_lines;   # lines to rid ourselves of

    open(LOGFILE, $logfile) or die "Can't read $logfile - $!\n";
    while (<LOGFILE>) {
        Log 2, "Analysing line $.";
        next unless /Mutex\((.*?)\)::(\w+)/;
        my ($address, $action) = ($1, $2);
        if ($action eq 'locking') {
            Log 2, "Address $address locked at line $.";
            if (defined $locks{$address}) {
                Log 0, "ERROR: Address $address locked at line $., but already locked at line $locks{$address}.";
            }
            $locks{$address} = $.;
        }
        if ($action eq 'unlocked') {
            Log 2, "Address $address unlocked at line $.";
            unless (defined $locks{$address}) {
                Log 0, "ERROR: Address $address not locked, but unlocked at line $..";
            }
            else {
                push @unlock_lines, $., delete $locks{$address};
            }
        }
    }
    close LOGFILE;

    # Sort the line numbers that we've accumulated because we put them in
    # unordered. This allows us to make just one more pass through the file
    # to remove the lines.
    @unlock_lines = sort { $a <=> $b } @unlock_lines;

    # Pass #2: output all but the lines we're not interested in.
    my $rmline = shift @unlock_lines;
    open(LOGFILE, $logfile) or die "Can't read $logfile - $!\n";
    while (<LOGFILE>) {
        if (defined $rmline && $. == $rmline) {
            $rmline = shift @unlock_lines;
            next;
        }
        print;
    }
    close LOGFILE;
          I like this idea, but I'm having trouble getting a 2-pass solution to work nicely...

        I don't understand your difficulty...

      Sorry, I had to leave work about the time of my last message yesterday (well, actually 10 minutes _before_ my last message!), but didn't want to disappear without writing back. And I'm not good at coding under stress (and not fantastic the rest of the time either!).

      Anyway, thanks for taking the time to write out the code there. Strangely enough, I tried it this morning, and its memory usage is actually higher than the original, at least as Linux measures it! I'm processing a 65MB test file - after the first pass the script is consuming 100MB of memory, and after the sort, 200MB!
      With this test file, @unlock_lines ends up with 800,000 entries, but I was still surprised. My original script used 160MB (RSS & VSZ), and using a hash to store the file in memory (so as to properly free the deleted lines) brings it down to 110MB...
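
      An aside that nobody suggested in the thread: since the pass-1 result is really just a set of line numbers, a bit vector built with Perl's vec() stores one bit per line instead of one full scalar per entry, which uses a small fraction of the memory of an 800,000-element array. A tiny sketch:

          use strict;
          use warnings;

          my $delete_mask = '';   # packed bit string, one bit per line number

          # Pass 1 would call this instead of pushing onto @unlock_lines.
          sub mark_for_deletion { vec($delete_mask, $_[0], 1) = 1 }

          # Pass 2 would test each line like this - no sort needed.
          sub is_marked { vec($delete_mask, $_[0], 1) }

          # Example: mark two lines, then check a few.
          mark_for_deletion($_) for 7, 800_000;
          print is_marked(7)  ? "drop line 7\n"  : "keep line 7\n";
          print is_marked(42) ? "drop line 42\n" : "keep line 42\n";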

      Thanks also for '$.' - I didn't know about that one!

Re^3: Reducing memory usage while matching log entries
by salva (Canon) on Feb 02, 2006 at 10:55 UTC
    The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted values won't always be 'locking' followed by 'unlocked'.

    That shouldn't be a problem as long as you use a stable sort implementation, or alternatively use the line number or a timestamp as the secondary sorting key.
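
    For example (hypothetical record layout, reusing the Mutex pattern from the code above), the secondary-key version might look like:

        use strict;
        use warnings;

        # One record per Mutex line: [address, action, line number].
        my @records;
        while (<>) {
            push @records, [ $1, $2, $. ] if /Mutex\((.*?)\)::(\w+)/;
        }

        # Primary key: address; secondary key: line number, so each
        # 'locking' comes right before its own 'unlocked' even when the
        # same address is locked and unlocked several times.
        my @sorted = sort { $a->[0] cmp $b->[0] || $a->[2] <=> $b->[2] } @records;

        # Collect the line numbers of complete locking/unlocked pairs;
        # those are the lines a second pass would drop, leaving the log
        # itself in its original order.
        my @paired;
        for (my $i = 0; $i < $#sorted; $i++) {
            if (   $sorted[$i][0] eq $sorted[ $i + 1 ][0]
                && $sorted[$i][1] eq 'locking'
                && $sorted[ $i + 1 ][1] eq 'unlocked' )
            {
                push @paired, $sorted[$i][2], $sorted[ $i + 1 ][2];
                $i++;    # skip the 'unlocked' half of the pair
            }
        }

    With an explicit numeric secondary key the sort doesn't even need to be stable; stability only matters if you sort on the address alone and rely on the input order being preserved.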