
Re: Matching lines in 2+ GB logfiles.

by NetWallah (Canon)
on May 01, 2008 at 16:07 UTC (#683958)

in reply to Matching lines in 2+ GB logfiles.

Here is a modified version (with my test parameters - please reset them to match your current ones).

This version adds a SECOND read for the log file, line-at-a-time, trading I/O for CPU, but should still be pretty fast.

It prints out the line number, and a bunch of other diagnostic/unnecessary info for where the match occurred.

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;
use constant BLOCKSIZE => 20;   ##(8 * 1024);

my @findoffset;
my $file = "ascii-code.htm";
search( $file,  #"bighuge.log",
    sub {
        print $_[0], " at offset $_[1]\n";
        push @findoffset, $_[1];
    },
    # "<img[^>]*>");
    "javasc");

# Re-read file as lines
my ($line, $offset, $idx) = (0, 0, 0);
open(my $F, "<", $file) or die "$file: $!";
binmode($F);    # count raw bytes, as search() does
while (<$F>) {
    last if $idx > $#findoffset;    # also covers "no matches at all"
    $line++;
    my $len = length($_);
    # $offset is the file offset just past the current line
    next unless (($offset += $len) > $findoffset[$idx]);
    print "$line,$offset,$findoffset[$idx],$len:\t$_";
    $idx++;
}
close($F);

#------------------------------------------
sub search {
    my ($file, $callback, @fragments) = @_;
    my $byteoffset = 0;     # file offset of the first byte in the window
    open(my $F, "<", $file) or die "$file: $!";
    binmode($F);

    # prime the window with two blocks (if possible)
    my $nbytes = read($F, my $window, 2 * BLOCKSIZE) || 0;
    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            # $-[1] is where the match starts inside the window
            $callback->($1, $byteoffset + $-[1]);
        }
        my $pos = pos($window) || 0;

        # grab the next block
        $nbytes = read($F, my $block, BLOCKSIZE);
        last unless $nbytes;

        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $byteoffset += BLOCKSIZE;
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close($F);
}
Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.
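
One possible shape for that special-case code is sketched below. It is untested against the real logs and rests on an assumption: the offsets passed in are the exact byte offsets where each match starts, sorted ascending. The name lines_for_offsets is made up for illustration. The inner loop consumes every offset that falls inside the current line, so two matches on one line both map to that line instead of the second one spilling onto the next.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): map sorted byte
# offsets to the lines that contain them.
# Returns a list of [line_number, offset, line_text].
sub lines_for_offsets {
    my ($file, @offsets) = @_;
    my @found;
    open(my $fh, "<", $file) or die "$file: $!";
    binmode($fh);                 # count raw bytes
    my ($end, $idx) = (0, 0);     # $end = file offset just past the current line
    while (defined(my $text = <$fh>)) {
        last if $idx >= @offsets; # nothing left to locate
        $end += length $text;
        # consume every offset that falls inside this line
        while ($idx < @offsets and $offsets[$idx] < $end) {
            push @found, [$., $offsets[$idx], $text];
            $idx++;
        }
    }
    close($fh);
    return @found;
}
```

Calling lines_for_offsets($file, @findoffset) and printing each returned triple would stand in for the "Re-read file as lines" loop.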

Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the First read, and buffer/capture/print the lines containing the text right at the spot.

In other words, in addition to passing the Matching $1, the search sub should callback with the line of text, in context. There may be an issue requiring more sliding window buffering, in case the "line" is split across buffers.
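
A rough sketch of that single-pass idea follows; it is a guess at the shape, not a tuned implementation. The name search_lines, the three-argument callback, and the 8 KB block size are all assumptions. A partial line at the end of each block is carried over to the next read, which is the extra buffering mentioned above. It also assumes no fragment needs to match across a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant BLOCKSIZE => 8 * 1024;   # assumed block size

# Hypothetical helper: reads the file once in fixed-size blocks and calls
# $callback->($match, $line_number, $line_text) for every match.
sub search_lines {
    my ($file, $callback, @fragments) = @_;
    my $alt = join("|", @fragments);
    my $re  = qr/($alt)/i;

    open(my $fh, "<", $file) or die "$file: $!";
    binmode($fh);

    my $buf    = '';   # trailing partial line carried between reads
    my $lineno = 0;    # number of lines consumed so far

    while (1) {
        # append the next block to whatever was carried over
        my $nbytes = read($fh, $buf, BLOCKSIZE, length $buf);
        die "$file: $!" unless defined $nbytes;

        # At EOF everything left is searchable; otherwise only the
        # part up to and including the last newline is.
        my $end = $nbytes ? rindex($buf, "\n") + 1 : length $buf;
        if ($end > 0) {
            my $chunk = substr($buf, 0, $end, '');   # cut it out of $buf
            for my $text (split /^/, $chunk) {       # lines keep their "\n"
                $lineno++;
                while ($text =~ /$re/g) {
                    $callback->($1, $lineno, $text);
                }
            }
        }
        last unless $nbytes;   # EOF
    }
    close($fh);
}

# Example use: perl thisscript.pl bighuge.log
if (@ARGV) {
    my $file = shift @ARGV;
    search_lines($file, sub { print "$_[1]: $_[2]" }, "javasc");
}
```

A line longer than one block simply makes the carry-over buffer grow until its newline arrives, so memory use is bounded by the longest line rather than by the file size.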

     "How many times do I have to tell you again and again .. not to be repetitive?"

Re^2: Matching lines in 2+ GB logfiles.
by dbmathis (Scribe) on May 01, 2008 at 17:13 UTC

    This worked but was not any faster than egrep. I may just be stuck waiting 30 minutes for egrep to grep these huge files.

    After all this is over, all that will really have mattered is how we treated each other.
