
Here is a modified version (with my test parameters - please reset them to match your current ones).

This version adds a SECOND read for the log file, line-at-a-time, trading I/O for CPU, but should still be pretty fast.

It prints out the line number, and a bunch of other diagnostic/unnecessary info for where the match occurred.

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#

use strict;
use constant BLOCKSIZE => 20; ##(8 * 1024);

my @findoffset;
my $file =  "ascii-code.htm";
search( $file, #"bighuge.log",
        sub { print $_[0], " at offset $_[1]\n"; push @findoffset,$_[1]; },
       # "<img[^>]*>");
       "javasc");
       
# Re-read file as lines
my ($line, $offset, $idx) = (0, 0, 0);
open(my $F, "<", $file) or die "$file: $!";
binmode($F);  # count raw bytes, as search() does
while (<$F>){
   last if $idx > $#findoffset;  # also covers "no matches at all"
   $line++;
   my $len = length($_);
   # skip lines that end at or before the next recorded offset
   next unless (($offset+=$len) > $findoffset[$idx]);
   print "$line,$offset,$findoffset[$idx],$len:\t$_";
   $idx++;
}
close ($F);

#------------------------------------------
sub search {
    my ($file, $callback, @fragments) = @_;

    my $byteoffset = 0;
    
    open(my $F, "<", $file) or die "$file: $!";
    binmode($F);

    # prime the window with two blocks (if possible)
    my $nbytes = read($F, my $window, 2 * BLOCKSIZE);

    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {

        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any). $byteoffset is the file
        # offset of the window start, so adding $-[1]
        # gives the file offset where the match begins.
        while ( $window =~ m/$re/oigcs ) {
            $callback->($1, $byteoffset + $-[1]);
        }
        my $pos = pos($window);

        # grab the next block (read returns 0 at EOF, undef on error)
        $nbytes = read($F, my $block, BLOCKSIZE);
        last unless $nbytes;

        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $byteoffset += BLOCKSIZE;
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }

    close($F);
}

Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.
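
One way that special case might look is sketched below. It is only a sketch: lines_for_offsets is a hypothetical helper (not part of the script above), and it assumes the offsets are sorted ascending and are the byte positions where each match starts. The inner loop drains every offset that falls inside the current line, so two matches on one line both get reported against that line.

```perl
use strict;
use warnings;

# Hypothetical helper: map sorted byte offsets to the lines containing them.
# Returns a list of [line_number, offset, line_text] triples.
sub lines_for_offsets {
    my ($fh, @offsets) = @_;
    my ($line, $end, @hits) = (0, 0);
    while (defined(my $text = <$fh>)) {
        last unless @offsets;
        $line++;
        $end += length($text);    # file offset just past this line
        # drain every offset that falls inside this line
        while (@offsets and $offsets[0] < $end) {
            push @hits, [$line, shift(@offsets), $text];
        }
    }
    return @hits;
}

# Usage with an in-memory file; both offsets land on line 2.
my $data = "alpha\nfoo and foo\nomega\n";
open(my $fh, "<", \$data) or die "open: $!";
binmode($fh);
my @hits = lines_for_offsets($fh, 6, 14);
close($fh);
print "line $_->[0], offset $_->[1]: $_->[2]" for @hits;
```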

Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the First read, and buffer/capture/print the lines containing the text right at the spot.

In other words, in addition to passing the matched text ($1), the search sub should call back with the line of text, in context. This may require more sliding-window buffering, in case the "line" is split across buffers.
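
A rough sketch of that single-pass idea follows. It is not the sliding-window code from above but a simpler swapped-in technique: read fixed-size blocks, carry any trailing partial line over to the next block, and match only against complete lines. The name search_lines and the callback signature (match, line number, line text) are my own assumptions, and it assumes a match never spans a newline and that any single line fits in memory.

```perl
use strict;
use warnings;

# Tiny on purpose so that lines straddle blocks in the demo;
# something like 8 * 1024 would be more sensible for real files.
use constant BLOCKSIZE => 16;

# Hypothetical variant of search(): counts newlines on the first (only)
# read and hands the callback the matching line in context.
sub search_lines {
    my ($fh, $callback, @fragments) = @_;
    my $alt = join("|", @fragments);
    my $re  = qr/($alt)/i;
    my ($carry, $line) = ("", 0);

    while (1) {
        my $nbytes = read($fh, my $block, BLOCKSIZE);
        die "read: $!" unless defined $nbytes;
        $carry .= $block;

        # consume every complete line currently in the buffer
        while ($carry =~ s/\A([^\n]*\n)//) {
            my $text = $1;
            $line++;
            while ($text =~ /$re/g) {
                my $hit = $1;
                $callback->($hit, $line, $text);
            }
        }
        last if $nbytes == 0;
    }

    # a final line without a trailing newline
    if (length $carry) {
        $line++;
        while ($carry =~ /$re/g) {
            my $hit = $1;
            $callback->($hit, $line, $carry);
        }
    }
}

# Usage with an in-memory file
my $data = "first line\nsome javascript here\nlast javasc";
open(my $fh, "<", \$data) or die "open: $!";
binmode($fh);
my @found;
search_lines($fh, sub { push @found, [@_] }, "javasc");
close($fh);
for (@found) {
    my ($m, $n, $t) = @$_;
    chomp $t;
    print "$m on line $n: $t\n";
}
```

The carry buffer is what handles a line split across reads: nothing is matched until its newline (or end of file) has arrived.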

"How many times do I have to tell you again and again .. not to be repetitive?"

In reply to Re: Matching lines in 2+ GB logfiles. by NetWallah
in thread Matching lines in 2+ GB logfiles. by dbmathis
