Hi Everyone,

I have been searching through this site for tips on matching lines in huge logfiles, and I came across the following node. The script in this node works great and is almost exactly what I need, but it returns only the text that I am searching for, not the line that contains it. When I modify it to fit my needs, it slows down.

Ref
http://www.perlmonks.org/?node_id=128925

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#

use strict;
use constant BLOCKSIZE => (8 * 1024);

&search("bighuge.log",
        sub { print $_[0], "\n" },
        "<img[^>]*>");

sub search {
    my ($file, $callback, @fragments) = @_;

    local *F;
    open(F, "<", $file) or die "$file: $!";
    binmode(F);

    # prime the window with two blocks (if possible)
    my $nbytes = read(F, my $window, 2 * BLOCKSIZE);

    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {

        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            &$callback($1);
        }
        my $pos = pos($window);

        # grab the next block
        $nbytes = read(F, my $block, BLOCKSIZE);
        last if $nbytes == 0;

        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }

    close(F);
}

For example, the regex does not search line by line; it searches across the entire block and then prints out the matches.

I was searching for e-mail addresses in a 2 GB maillog file, and when it finds an address it just prints the address by itself.

So I modified:

while ( $window =~ m/$re/oigcs ) {
    &$callback($1);
}

to look like this, so that it captures the whole line (which is what I need):

while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
    &$callback($1);
}

And things slowed down considerably: the run time went from 30 seconds to several minutes. How should I modify the code above to print the line in which the match was found, without slowing down the search?

Here is a sample of the lines in the file:

Feb 24 04:03:47 server sendmail[]: khdkahsdad876sad8: to=<sample@collegeclub.com>, delay=1+13:12:11, xdelay=00:00:00, mailer=esmtp, pri=25672345, relay=collegeclub.com., dsn=4.0.0, stat=Deferred: Connection timed out with collegeclub.com.
Feb 24 04:03:47 server sendmail[31356]: madhksadkh5574: to=<sample@iit.edu>, delay=1+13:20:32, xdelay=00:00:00, mailer=esmtp, pri=26574dffd, relay=sample.iit.edu. [006.47.143.000], dsn=4.3.1, stat=Deferred: 452 sample 4.2.1 Mailbox temporarily disabled: sample@iit.edu
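
One possible direction, offered as a rough sketch and not a tested fix for the script above: leave the fragment regex as it is, and recover the surrounding line afterwards with index() and rindex(), using the match offsets that Perl stores in @- and @+. The slowdown probably comes from the added .* parts, which make the regex engine scan and backtrack over whole lines, while index() and rindex() are plain substring searches. The helper name enclosing_line, the shortened sample lines, and the e-mail pattern below are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: given a reference
# to a buffer and the start/end offsets of a match, return the line that
# contains the match and the offset at which that line ends.
sub enclosing_line {
    my ($buf_ref, $start, $end) = @_;

    # Offset just past the last newline before the match (0 if none).
    my $bol = $start > 0 ? rindex($$buf_ref, "\n", $start - 1) + 1 : 0;

    # Offset of the first newline at or after the end of the match;
    # if there is none, the line runs to the end of the buffer.
    my $eol = index($$buf_ref, "\n", $end);
    $eol = length($$buf_ref) if $eol < 0;

    return (substr($$buf_ref, $bol, $eol - $bol), $eol);
}

# Shortened, made-up lines in the style of the sample above.
my $window =
      "Feb 24 04:03:47 server sendmail[31356]: to=<sample\@iit.edu>, stat=Deferred\n"
    . "Feb 24 04:03:48 server sendmail[31357]: to=<other\@example.com>, stat=Sent\n";

# Illustrative pattern only; substitute the real fragment regex here.
my $re = qr/[\w.+-]+\@[\w.-]+/;

my @lines;
while ( $window =~ /$re/gc ) {
    # $-[0] and $+[0] are the start and end offsets of the last match.
    my ($line, $eol) = enclosing_line(\$window, $-[0], $+[0]);
    push @lines, $line;

    # Resume at the end of this line so each line is reported once.
    pos($window) = $eol;
}

print "$_\n" for @lines;
```

Inside the sliding-window script, a line that starts in the block that has already been discarded, or that ends beyond the current window, would come back truncated. This sketch does not handle that case; the window would have to be slid on line boundaries first.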

After all this is over, all that will really have mattered is how we treated each other.

In reply to Matching lines in 2+ GB logfiles. by dbmathis
