http://qs321.pair.com?node_id=106904

claree0 has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently performing a lot of manipulation of log files, which grow by up to 100MB/day. The records in the files are of the format

date/time logentry

and I want to be able to trim out the older records. Currently, I do this by reading the file line-by-line, and writing out the lines after $arbitrary_data to a temporary file, which is then used to replace the current log file.

This takes a considerable amount of time to process, so any suggestions for a more efficient method would be very gratefully received.

Halve the difference
by tachyon (Chancellor) on Aug 22, 2001 at 15:39 UTC

    Further to my last post, here is an implementation of the halve-the-difference method. Uncomment the 3 lines at the start to generate an 8MB test.file. We find our desired reference point in 20 tries (worst-case scenario) in a few milliseconds and then dump the rest of the file (3 lines). Assuming you are going to have to work with dates, you will of course need to modify this so you can tell whether you are before or after your desired start, but the principle holds. The total run time should be only a fraction over the time it takes to write your output file. Rename it and you are done. You *will* get an infinite loop if your $find_this is not in the file, so we abort if $count > $max_tries. With $max_tries set to 100 you are OK for a file with up to 2**100 lines (10**30 in rough terms :-)

    my $file = 'c:/test.file';
    #open F, ">$file" or die $!;
    #print F "$_\n" for 1..1000000;
    #exit;
    my $find_this = 999997;
    my $file_size = -s $file;
    my $top = 0;
    my $bot = $file_size;
    my $count = 0;
    my $max_tries = 100;
    open OLD, $file or die $!;
    while (++$count) {
        my $middle = int(($top+$bot)/2);
        seek OLD, $middle, 0;
        my $partial = <OLD>;      # throw away the (probably) partial line
        my $full_line = <OLD>;    # this one is guaranteed to be complete
        chomp $full_line;
        if ($full_line eq $find_this) {
            print "Took $count tries\n";
            print while <OLD>;    # dump the rest of the file
            exit;
        }
        if ($full_line < $find_this) {
            $top = $middle;
        }
        else {
            $bot = $middle;
        }
        die "Ark, bailing out of infinite loop" if $count > $max_tries;
    }

    Let us know how you get on.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Well, I've made some mods to your sample code, and taken the trim time on my sample file from 2m26s to 0.4 seconds. Wow!

      In the code below, I haven't included the subroutine to calculate the epoch-second date of each line 'cos it's longer than the rest of the file!

      Thank you, Tachyon!

      #!/usr/local/perl -w
      use strict;

      my $file = 'current.txt';
      my $daystokeep = $ARGV[0];
      my $secs_to_keep = $daystokeep * 3600 * 24;
      my $now = time();
      my $earliest = $now - $secs_to_keep;
      my $file_size = -s $file;
      my $top = 0;
      my $bottom = $file_size;
      my $count = 0;
      my $max_tries = 100;
      open (OLD, "$file") or die $!;
      open (NEW, ">new.txt") or die $!;
      while (++$count) {
          my $middle = int (($top + $bottom) / 2);
          seek OLD, $middle, 0;
          my $partial = <OLD>;    # discard the (probably) partial line
          my $full = <OLD>;
          my $next = <OLD>;
          if ((linesecs($full) < $earliest) && (linesecs($next) > $earliest)) {
              print NEW $next;
              print NEW while <OLD>;
              exit;
          }
          if (linesecs($full) < $earliest) {
              $top = $middle;
          } else {
              $bottom = $middle;
          }
          die "bailing out of infinite loop" if $count > $max_tries;
      }
      close OLD;
      close NEW;
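      For illustration only, here is a minimal stand-in for the omitted linesecs() routine. It assumes each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp, which may well not match the real log format, and converts it to epoch seconds with Time::Local:

      use Time::Local;

      # Hypothetical linesecs(): returns epoch seconds for the timestamp at the
      # start of a line, or 0 if the line is missing or does not match the
      # assumed "YYYY-MM-DD HH:MM:SS" layout.
      sub linesecs {
          my $line = shift;
          return 0 unless defined $line;
          my ($y, $mon, $d, $h, $min, $s) =
              $line =~ /^(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})/
              or return 0;
          return timelocal($s, $min, $h, $d, $mon - 1, $y);
      }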

        Wow, 36500% faster. That's a worthwhile saving. Glad it helped. It's always good to use a geometric search rather than a linear one when you have any form of sorted data that you can run the split-the-difference algorithm on. The number of tries needed to find the desired position is given by:

        print "Num items Geom avg Lin avg Lin:Geom\n"; for ( my $num_items = 2; $num_items < 2<<20; $num_items <<= 1 ) { # geometric my $geom_max = int(log($num_items)/log(2))+1; my $geom_avg = int(log(($num_items/2))/log(2))+1; # linear my $lin_max = $num_items; my $lin_avg = $num_items/2; printf "%8d %8d %8d %8d\n", $num_items, $geom_avg, $lin_avg, $lin_avg/$geom_avg; }
        Should wrap it in a module one day :-)

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Very clever, tachyon ++. I first thought "How do you do binary search in a file with variable-length records?" Your answer is simple and effective.

      Please consider linking to this from Tutorials.

        I have wrapped the concept in a module with a variety of useful widgets. It's at File::Seek

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Removing old records from log files
by tachyon (Chancellor) on Aug 22, 2001 at 14:56 UTC

    The general method is as you state:

    my $find_this = qr/$arbitrary_data/;
    open OLD, $log or die $!;
    open NEW, ">$new_log" or die $!;
    while (<OLD>) {
        next unless /$find_this/;
        print NEW $_ while <OLD>;    # dump the rest of the file
    }
    close OLD;
    close NEW;
    rename $new_log, $log or die $!;

    With this code we use the quote-regex operator so our regex is compiled only once, which speeds things up. We next away until we find our condition and then dump the rest of the file using another while, so we never re-enter our outer loop.

    Assuming you are using something like this, you have some options:

    First, you could set up a cron job to automatically rename your logfile every X hours - presumably you are just grabbing the last X days/hours/minutes of data, so this saves you the need to trim at all.

    Alternatively, you could check the file size every X hours and record the number. You then use seek LOG, $position, 0 to blast straight to your starting point. The first line will (probably) be a partial, so you need to allow for that.
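    A rough sketch of that idea, with the offset kept in a hypothetical state file (names are illustrative, not your actual setup):

    use strict;

    my $log       = 'current.txt';     # assumed log file name
    my $statefile = 'last_size.txt';   # hypothetical place to record the size

    # read the size recorded on the previous run (default to 0)
    my $last = 0;
    if (open STATE, $statefile) {
        chomp($last = <STATE>);
        close STATE;
    }

    open LOG, $log or die $!;
    if ($last) {
        # back up one byte so this read always finishes on a line boundary,
        # whether or not the recorded size fell in the middle of a line
        seek LOG, $last - 1, 0;
        my $partial = <LOG>;
    }
    print while <LOG>;    # everything logged since the recorded position

    # remember how far we got for next time
    open STATE, ">$statefile" or die $!;
    print STATE tell(LOG);
    close STATE;
    close LOG;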

    As yet another option you can use seek LOG, $offset, $whence. If $whence is 0 then a positive offset of x gets you x bytes into the file immediately. If $whence is 2 then a negative offset -x gets you to x bytes before the end. Either way you read in a full line and then use the old halving-the-difference trick to find your start point. It's like guessing a number between 1 and 128: you start at 64 (higher), 96 (higher), 112 (lower), 104 (higher), 108 (higher), 110 (higher) - ta-da, it must be 111. We have found the number in 7 tries - a big gain on trying all 128 options. The bigger the file, the better this works, as it is a geometric method of finding the start rather than a linear one. The efficiency gain will be highest if the bit you want is a small % of the total file, as you still have to write out the new file.
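    As a small sketch of the $whence = 2 form (the 512KB figure is an arbitrary assumption, not a recommendation):

    use strict;

    my $log  = 'current.txt';    # assumed log file name
    my $tail = 512 * 1024;       # how far back from the end to start reading

    open LOG, $log or die $!;
    if (-s $log > $tail) {
        seek LOG, -$tail, 2;     # negative offset relative to the end of file
        my $partial = <LOG>;     # first line is almost certainly partial - discard it
    }
    print while <LOG>;           # the most recent chunk of the log
    close LOG;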

    Often when people post their real code we can offer suggestions that increase speed significantly. Hint {smile}

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Removing old records from log files
by idnopheq (Chaplain) on Aug 22, 2001 at 14:19 UTC
    Is there a specific reason you want to edit/truncate it in place? That seems like a lot of work, and risky at that. It may be easier to simply rotate the log throughout the day, compressing the older ones to save space. Have you checked out rotating files, is there a better way?? Are these files flat text, or something else? A search on Google may be a good idea - I seem to run across log rotation scripts every few weeks.

    NON-Perl: I recall in *nix copying the whole file into a new one (filename.0), then something like cat /dev/null > filename to empty the original. Finally, I'd gzip or bzip2 the logfile.0. YMMV.
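    If you wanted to do that from Perl rather than the shell, a rough equivalent might look like this (file names are illustrative, and note that anything logged between the copy and the truncate is lost):

    use strict;
    use File::Copy;

    my $log = 'current.txt';    # assumed log file name

    copy($log, "$log.0") or die "copy failed: $!";    # snapshot the log

    # empty the live log in place (the "> filename" step)
    open LOG, ">$log" or die $!;
    close LOG;

    # compress the snapshot, as in the original suggestion
    system('gzip', "$log.0") == 0 or warn "gzip failed: $?";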

    HTH
    --
    idnopheq
    Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.

      This log file is in fact an extract of a much larger log file. I do the extraction so that the bare minimum (!) is kept in this file for other processes to use, and archive the originals. It is not possible to rotate this 'extract' log file, but as it updates every 6 hours I do not run into problems with file locking while I trim it, as I ensure that the two jobs do not overlap.

Re: Removing old records from log files
by Zaxo (Archbishop) on Aug 22, 2001 at 14:21 UTC

    With that rate of logging, you are likely to need locks that the logger respects in order to modify the file in place. Given that, a binary search strategy will speed your search for the cutoff time. Iteratively estimate the location of the first record you want to keep.
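    For example, an advisory lock along these lines - though it only helps if the logger also uses flock, which is an assumption:

    use strict;
    use Fcntl qw(:flock);

    my $log = 'current.txt';    # assumed log file name

    open LOG, "+<$log" or die $!;    # read/write, no truncation
    flock(LOG, LOCK_EX) or die "cannot lock: $!";

    # ... search for the cutoff and rewrite the file here ...

    flock(LOG, LOCK_UN);
    close LOG;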

    Another approach would be to set up a cron job renaming the log file out of the way at set times. You may have a system utility called logrotate which specializes in that.

    Update: Oops :-)

    After Compline,
    Zaxo

      Luckily, locks are not an issue (see above). The binary search strategy, however, does sound like a good idea!

Re: Removing old records from log files
by BazB (Priest) on Jan 16, 2002 at 20:44 UTC

    Not to put a stopper on some rather nice code, but if you're running a UNIX system (which I suspect you are), what about just using the logrotate command?

    Rotate files on an hourly basis and be done with it! You can delete the old records, or just store them on disk.

    If you're on Windows, Perl might be easier.

    TMTOWTDI :-)

    Cheers.

    BazB.

Re: Removing old records from log files
by John M. Dlugosz (Monsignor) on Aug 23, 2001 at 10:41 UTC
Here's an orthogonal idea: when you excerpt the file, store the records backwards, newest first. Then you can delete the oldest by truncating the file.
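A hedged sketch of how that could look, assuming a newest-first excerpt file and a byte offset (found, say, by the same binary search discussed above) where the too-old records begin:

use strict;

my $file   = 'excerpt.txt';    # assumed newest-first excerpt file
my $cutoff = 1_000_000;        # hypothetical byte offset of the first record
                               # that is too old to keep

open F, "+<$file" or die $!;
truncate(F, $cutoff) or die "truncate failed: $!";
close F;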