dbmathis has asked for the wisdom of the Perl Monks concerning the following question:
Hi Everyone,
I have been searching through this site for any tips on matching lines in huge logfiles and I came across the following node. The script in this node works great and it's almost exactly what I need, but it only returns the text that I am searching for. When I modify it to fit my needs it slows down.
Ref
http://www.perlmonks.org/?node_id=128925
#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using on /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;
use constant BLOCKSIZE => (8 * 1024);
&search("bighuge.log",
        sub { print $_[0], "\n" },
        "<img[^>]*>");

sub search {
    my ($file, $callback, @fragments) = @_;
    local *F;
    open(F, "<", $file) or die "$file: $!";
    binmode(F);
    # prime the window with two blocks (if possible)
    my $nbytes = read(F, my $window, 2 * BLOCKSIZE);
    my $re = "(" . join("|", @fragments) . ")";
    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            &$callback($1);
        }
        my $pos = pos($window);
        # grab the next block
        $nbytes = read(F, my $block, BLOCKSIZE);
        last if $nbytes == 0;
        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close(F);
}
For example, the regex search doesn't search by line; it searches across the entire block and then prints out the matches.
I was searching for e-mail addresses in a 2 GB maillog file, and when it finds an e-mail address it just spits out the address itself.
So I modified:
while ( $window =~ m/$re/oigcs ) {
&$callback($1);
}
To look like this to capture the line (which is what I need):
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
&$callback($1);
}
And things slowed considerably. It went from 30 seconds to several minutes. How should I modify the code above to print the line in which the match was found without slowing down the search?
Here is a sample of the lines in the file:
Feb 24 04:03:47 server sendmail[]: khdkahsdad876sad8: to=<sample@collegeclub.com>, delay=1+13:12:11, xdelay=00:00:00, mailer=esmtp, pri=25672345, relay=collegeclub.com., dsn=4.0.0, stat=Deferred: Connection timed out with collegeclub.com.
Feb 24 04:03:47 server sendmail[31356]: madhksadkh5574: to=<sample@iit.edu>, delay=1+13:20:32, xdelay=00:00:00, mailer=esmtp, pri=26574dffd, relay=sample.iit.edu. [006.47.143.000], dsn=4.3.1, stat=Deferred: 452 sample 4.2.1 Mailbox temporarily disabled: sample@iit.edu
After all this is over, all that will really have mattered is how we treated each other.
Re: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 16:02 UTC
The reason for the slow execution is most likely the two .* in the regex, which result in a very high number of checks inside the regex engine. This is difficult to explain as long as you don't know what backtracking is and how it works.
For now just try this:
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}([^\n]+)\n/oigc && $1 =~ /$re/ ) {
    &$callback($1);
}
Precompiling $re using qr{} is recommended, or use the /o option.
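To see the two-step idea in isolation, here is a self-contained sketch (the buffer contents and pattern are invented for illustration): each line is isolated with a cheap [^\n]+ match, and only that line is tested against the precompiled fragment pattern, so the expensive alternation never backtracks across the whole block.

```perl
use strict;
use warnings;

# Illustrative two-step match: grab each line cheaply, then test it
# against a precompiled pattern. Buffer contents are made up.
my $window = "Feb 24 04:03:47 server stat=Sent\n"
           . "Feb 24 04:03:48 server stat=Deferred\n";
my $re = qr/Deferred/i;    # compiled once with qr//

my @matches;
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}([^\n]+)\n/gc ) {
    push @matches, $1 if $1 =~ $re;
}
print "$_\n" for @matches;
```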
Re: Matching lines in 2+ GB logfiles.
by linuxer (Curate) on May 01, 2008 at 15:28 UTC
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
you could try
while ( $window =~ m/\w\w\w\s\s?\d\d?.*$re.*\n/iogc ) {
\w\w\w should run faster than \w{3}; same with \d\d? instead of \d{1,2}.
Edit: and same with \s\s? vs. \s{1,2}. The direction should be clear.
Edit2: Maybe precompiling the regex with the qr// Operator might give another speedup.
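A minimal sketch of that qr// precompilation (the fragments and log line here are invented): the alternation is built and compiled once, outside the loop, instead of being recompiled from a string on every match.

```perl
use strict;
use warnings;

# Build the alternation once with qr// instead of recompiling the
# pattern string on every match. Fragments are illustrative.
my @fragments = ('sample@collegeclub.com', 'sample@iit.edu');
my $alt = join '|', map quotemeta, @fragments;
my $re  = qr/($alt)/i;

my $line = 'to=<sample@iit.edu>, delay=1+13:20:32, stat=Deferred';
my ($hit) = $line =~ $re;
print "$hit\n" if defined $hit;
```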
By the way, I can't remember that /c modifier; what is it for?
The /c modifier is always used together with the /g modifier and allows continued search after a failed /g match. Normally pos() is reset after a failed match.
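A tiny demonstration of the difference (the string and variable names are invented):

```perl
use strict;
use warnings;

# After a failed /g match, pos() is reset to undef; with /gc the
# position survives, so matching can resume where it left off.
my $s = "aab";
$s =~ /a/g;                  # succeeds, pos($s) is now 1
$s =~ /x/g;                  # fails, pos($s) reset to undef
my $pos_without_c = pos($s);

$s =~ /a/g;                  # succeeds again from the start, pos($s) is 1
$s =~ /x/gc;                 # fails, but /c preserves pos($s)
my $pos_with_c = pos($s);
```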
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: Matching lines in 2+ GB logfiles.
by NetWallah (Canon) on May 01, 2008 at 16:07 UTC
Here is a modified version (with my test parameters - please reset them to match your current ones).
This version adds a SECOND read for the log file, line-at-a-time, trading I/O for CPU, but should still be pretty fast.
It prints out the line number, and a bunch of other diagnostic/unnecessary info for where the match occurred.
#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using on /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;
use constant BLOCKSIZE => 20; ##(8 * 1024);
my @findoffset;
my $file = "ascii-code.htm";
search( $file,  # "bighuge.log",
        sub { print $_[0], " at offset $_[1]\n"; push @findoffset, $_[1]; },
        # "<img[^>]*>");
        "javasc");
# Re-read file as lines
$_ = 0 for my ($line, $offset, $prev, $idx);
open(my $F, "<", $file) or die "$file: $!";
while (<$F>) {
    $line++;
    my $len = length($_);
    next unless (($offset += $len) >= $findoffset[$idx]);
    print "$line,$offset,$findoffset[$idx],$len:\t$_";
    $idx++;
    last if $idx > $#findoffset;
}
close($F);
#------------------------------------------
sub search {
    my ($file, $callback, @fragments) = @_;
    my $byteoffset = 0;
    open(my $F, "<", $file) or die "$file: $!";
    binmode($F);
    # prime the window with two blocks (if possible)
    my $nbytes = read($F, my $window, 2 * BLOCKSIZE);
    my $re = "(" . join("|", @fragments) . ")";
    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            $callback->($1, $byteoffset);
        }
        my $pos = pos($window);
        # grab the next block
        $byteoffset += $nbytes;
        $nbytes = read($F, my $block, BLOCKSIZE);
        last if $nbytes == 0;
        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close($F);
}
Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.
Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the First read, and buffer/capture/print the lines containing the text right at the spot. In other words, in addition to passing the Matching $1, the search sub should callback with the line of text, in context. There may be an issue requiring more sliding window buffering, in case the "line" is split across buffers.
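One way to sketch that "line in context" idea (the window contents below are invented, and window-boundary handling is deliberately omitted): expand from the match offset out to the surrounding newlines before invoking the callback.

```perl
use strict;
use warnings;

# Expand a fragment match to its whole line using the match offset.
# A real version must also handle lines split across window
# boundaries, which this sketch ignores.
my $window = "one foo line\ntwo bar line\nthree foo again\n";
my @lines;
while ( $window =~ /foo/g ) {
    my $at    = $-[0];                           # match start offset
    my $start = rindex($window, "\n", $at) + 1;  # line start (or 0)
    my $end   = index($window, "\n", $at);       # line end
    $end = length($window) if $end < 0;
    push @lines, substr($window, $start, $end - $start);
}
print "$_\n" for @lines;
```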
"How many times do I have to tell you again and again .. not to be repetitive?"
Re: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 16:51 UTC
On modern hardware 2GB+ isn't really very big. Have you tried just reading it line-by-line with <F>? I don't know what your performance requirements are but most log-parsing jobs aren't terribly performance sensitive.
You might find that you don't have to tune your regex much once you switch to reading line-by-line. That's because each line will be much smaller than 8K, so the penalty for backtracking on a .* will consequently be much smaller.
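A line-by-line version of the search might look like this (the file name and pattern are illustrative; the sketch writes a tiny demo file first so it is self-contained):

```perl
use strict;
use warnings;

# Plain line-by-line scan; each line is small, so .* backtracking
# stays cheap. File name and pattern are illustrative.
open my $out, '>', 'demo.log' or die "demo.log: $!";
print $out "Feb 24 04:03:47 to=<sample\@iit.edu>, stat=Deferred\n";
print $out "Feb 24 04:03:48 to=<other\@example.com>, stat=Sent\n";
close $out;

my $re = qr/sample\@iit\.edu/i;
my @hits;
open my $fh, '<', 'demo.log' or die "demo.log: $!";
while ( my $line = <$fh> ) {
    push @hits, $line if $line =~ $re;
}
close $fh;
print @hits;
unlink 'demo.log';
```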
-sam
I am basically looking for something faster than grep
You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out lines that match. You may also be running into I/O limits. For example, you could try
time wc -l bigfile
which will effectively give you a lower bound (just reading the contents of the file and finding the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
Dave.
If you generally know what you are looking for ahead of time, one method is to keep a process always running that tails a log file. This process can then send everything it finds to another file, which can be searched instead.
If you need to beat grep, you can, but you have to do things that grep can't. This includes knowing how the files are laid out on disk (esp RAID), and how many CPUs you can take advantage of (i.e. lower transparency to raise performance). You can write a multithreaded (or multiprocess) script that will read through the file at specific offsets in parallel. This may require lots of tweaking though (e.g. performance depends on how the filesystem prefetches data, and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
Re: Matching lines in 2+ GB logfiles.
by educated_foo (Vicar) on May 01, 2008 at 16:40 UTC
Regarding the regex, I would suggest using ^ and $ along with the /m modifier instead of matching for "\n". On a tangential note, this kind of thing is much simpler if you use Sys::Mmap, like in the wide finder benchmark.
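For example (buffer contents invented), with /m the ^ and $ anchors match at every line boundary inside the buffer, so the pattern needs no literal \n:

```perl
use strict;
use warnings;

# With /m, ^ and $ anchor each match to line boundaries within the
# buffer, so no literal \n is needed in the pattern. Data is invented.
my $window = "Feb 24 first line\nMar 1 second line\n";
my @lines;
while ( $window =~ /^(\w{3}\s{1,2}\d{1,2}[^\n]*)$/mg ) {
    push @lines, $1;
}
print "$_\n" for @lines;
```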
Re: Matching lines in 2+ GB logfiles.
by Anonymous Monk on May 02, 2008 at 01:05 UTC
Has anyone here who is claiming that Perl can't outrun grep actually run the script I posted here, which dws wrote? This dws guy is on to something. I was finally able to modify it to work like grep, and it's 14 times faster than grep. I am working with a 484 MB maillog.
This could be more elegant, but this is my rookie solution:
while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
    $line = $1;
    if ( $1 =~ /$re/ ) {
        &$callback($line);
    }
}
ls -ltrh /var/log/syslog-ng/server2/ | grep maillog.2
-rw-r----- 1 root logs 484M Mar 11 11:13 maillog.2
-rw-r----- 1 root logs 230M Apr 1 04:10 maillog.2.gz
[dmathis@aus02syslog ~]$ date; ./jujuspeed; date
Thu May 1 19:27:57 CDT 2008
Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
Thu May 1 19:28:10 CDT 2008
Time taken: 13 Seconds
[dmathis@aus02syslog ~]$ date; egrep -i 'hidden@hotmail.com' /var/log/syslog-ng/server2/maillog.2; date
Thu May 1 19:28:48 CDT 2008
Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
Thu May 1 19:31:57 CDT 2008
Time Taken: 189 Seconds
Thanks for all of the help on here. I have learned a lot :)
while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
    $line = $1;
    if ( $1 =~ /$re/ ) {
        &$callback($line);
    }
}
This is very close to what mscharrer suggested before.