http://qs321.pair.com?node_id=128925
Category: Text Processing
Author/Contact Info dws
Description: A demonstration of how to grep through huge files using a sliding window (buffer) technique. The code below has rough edges, but works for simple regular express fragments. Treat it as a starting point.

I've seen this done somewhere before, but couldn't find a working example, so I whipped this one up. A pointer to a more authoritative version will be appreciated.

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using on /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#

use strict;
use constant BLOCKSIZE => (8 * 1024);

&search("bighuge.log",
        sub { print $_[0], "\n" },
        "<img[^>]*>");

sub search {
    my ($file, $callback, @fragments) = @_;

    local *F;
    open(F, "<", $file) or die "$file: $!";
    binmode(F);

    # prime the window with two blocks (if possible)
    my $nbytes = read(F, my $window, 2 * BLOCKSIZE);

    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {

        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            &$callback($1);
        }
        my $pos = pos($window);

        # grab the next block
        $nbytes = read(F, my $block, BLOCKSIZE);
        last if $nbytes == 0;

        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }

    close(F);
}
Replies are listed 'Best First'.
Re: Matching in huge files
by Anonymous Monk on Oct 06, 2004 at 18:06 UTC
    This is very usefull for me, but, how can this be changed to correctly handle potential matches that overlap? Thanks very much

      how can this be changed to correctly handle potential matches that overlap?

      How would you handle overlapping matches in a smaller string? The easiest approach is to apply each pattern to the string in serial. The slightly more complicated fix is to "back up" (via pos()) the length of the match - 1, then proceed with matching all in parallel. Either approach would work here. It comes down to your specific requirements.

Re: Matching in huge files
by Anonymous Monk on Dec 08, 2004 at 16:58 UTC
    dws - this is great - really fast, however, I am having problems when the window happens to split the string you are searching for in two, so can never be found - how can this be fixed? e.g. searching for 'findme' window1 = "blah blah blah fin" window2 = "dme blah blah blah"

      I am having problems when the window happens to split the string you are searching for in two, so can never be found - how can this be fixed?

      The algorithm uses a sliding window, and matches strings that fall within that (sliding) window. If you're trying to match a string that doesn't fit in the window, make the window larger. Or if you think you've found a problem, post a test case that demonstrates the failure.

        Yep, it is very fast, but why is that better than this:
        open(F, "<", $file) or die "$file: $!"; binmode(F); undef $/; # switch off end-of-line separating # read file in large chunks while (<F>) { while ( m/$re/oigsm ) { print "$1\n"; } } $/ = '\n'; # switch back to line mode close(F);

        ?

        Thanks,
        Tamas
        Making the window large only limits the posibilities of having the string you are looking for cut in two, there is no real way to prevent this from happening unless you are looking for a fixed size string. In that case you could always keep that fixed size of the old window and append the new window to the old. This way if there was an intersection you have just undone it, make sure to move your position back accordingly.