dbmathis has asked for the wisdom of the Perl Monks concerning the following question:
Hi Everyone,
I have been searching through this site for any tips on matching lines in huge logfiles and I came across the following node. The script in this node works great and it's almost exactly what I need, but it only returns the text that I am searching for. When I modify it to fit my needs it slows down.
Ref
http://www.perlmonks.org/?node_id=128925
#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using on /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;
use constant BLOCKSIZE => (8 * 1024);
&search("bighuge.log",
        sub { print $_[0], "\n" },
        "<img[^>]*>");

sub search {
    my ($file, $callback, @fragments) = @_;
    local *F;
    open(F, "<", $file) or die "$file: $!";
    binmode(F);
    # prime the window with two blocks (if possible)
    my $nbytes = read(F, my $window, 2 * BLOCKSIZE);
    my $re = "(" . join("|", @fragments) . ")";
    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            &$callback($1);
        }
        my $pos = pos($window);
        # grab the next block
        $nbytes = read(F, my $block, BLOCKSIZE);
        last if $nbytes == 0;
        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close(F);
}
For example, the regex search doesn't search by line; it searches across the entire block and then prints out the matches.
I was searching for e-mail addresses in a 2 GB maillog file, and when it finds an e-mail address it just spits out the address itself.
So I modified:
while ( $window =~ m/$re/oigcs ) {
&$callback($1);
}
To look like this to capture the line (which is what I need):
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
&$callback($1);
}
And things slowed considerably. It went from 30 seconds to several minutes. How should I modify the code above to print the line in which the match was found without slowing down the search?
Here is a sample of the lines in the file:
Feb 24 04:03:47 server sendmail[]: khdkahsdad876sad8: to=<sample@collegeclub.com>, delay=1+13:12:11, xdelay=00:00:00, mailer=esmtp, pri=25672345, relay=collegeclub.com., dsn=4.0.0, stat=Deferred: Connection timed out with collegeclub.com.
Feb 24 04:03:47 server sendmail[31356]: madhksadkh5574: to=<sample@iit.edu>, delay=1+13:20:32, xdelay=00:00:00, mailer=esmtp, pri=26574dffd, relay=sample.iit.edu. [006.47.143.000], dsn=4.3.1, stat=Deferred: 452 sample 4.2.1 Mailbox temporarily disabled: sample@iit.edu
After all this is over, all that will really have mattered is how we treated each other.
Re: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 16:02 UTC
The reason for the slow execution is most likely the two .* in the regex, which result in a very high number of checks inside the regex engine. This is difficult to explain as long as you don't know what backtracking is and how it works.
For now just try this:
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}([^\n]+)\n/oigc && $1 =~ /$re/ ) {
    &$callback($1);
}
Precompiling $re using qr{} is recommended, or use the /o option.
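To see the two-step idea in isolation, here is a self-contained sketch (the buffer contents and pattern are invented for illustration): each line is isolated with a cheap [^\n]+ match, and only that line is tested against the precompiled fragment pattern, so the expensive alternation never backtracks across the whole block.

```perl
use strict;
use warnings;

# Illustrative two-step match: grab each line cheaply, then test it
# against a precompiled pattern. Buffer contents are made up.
my $window = "Feb 24 04:03:47 server stat=Sent\n"
           . "Feb 24 04:03:48 server stat=Deferred\n";
my $re = qr/Deferred/i;    # compiled once with qr//

my @matches;
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}([^\n]+)\n/gc ) {
    push @matches, $1 if $1 =~ $re;
}
print "$_\n" for @matches;
```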
Re: Matching lines in 2+ GB logfiles.
by linuxer (Curate) on May 01, 2008 at 15:28 UTC
while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
you could try
while ( $window =~ m/\w\w\w\s\s?\d\d?.*$re.*\n/iogc ) {
\w\w\w should run faster than \w{3}; same with \d\d? instead of \d{1,2}.
Edit: and same with \s\s? vs. \s{1,2}. The direction should be clear.
Edit2: Maybe precompiling the regex with the qr// Operator might give another speedup.
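A minimal sketch of that qr// precompilation (the fragments and log line here are invented): the alternation is built and compiled once, outside the loop, instead of being recompiled from a string on every match.

```perl
use strict;
use warnings;

# Build the alternation once with qr// instead of recompiling the
# pattern string on every match. Fragments are illustrative.
my @fragments = ('sample@collegeclub.com', 'sample@iit.edu');
my $alt = join '|', map quotemeta, @fragments;
my $re  = qr/($alt)/i;

my $line = 'to=<sample@iit.edu>, delay=1+13:20:32, stat=Deferred';
my ($hit) = $line =~ $re;
print "$hit\n" if defined $hit;
```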
By the way, I can't remember that /c modifier; what is it for?
The /c modifier is always used together with the /g modifier and allows continued search after a failed /g match. Normally pos() is reset after a failed match.
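A tiny demonstration of the difference (the string and variable names are invented):

```perl
use strict;
use warnings;

# After a failed /g match, pos() is reset to undef; with /gc the
# position survives, so matching can resume where it left off.
my $s = "aab";
$s =~ /a/g;                  # succeeds, pos($s) is now 1
$s =~ /x/g;                  # fails, pos($s) reset to undef
my $pos_without_c = pos($s);

$s =~ /a/g;                  # succeeds again from the start, pos($s) is 1
$s =~ /x/gc;                 # fails, but /c preserves pos($s)
my $pos_with_c = pos($s);
```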
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: Matching lines in 2+ GB logfiles.
by NetWallah (Canon) on May 01, 2008 at 16:07 UTC
Here is a modified version (with my test parameters - please reset them to match your current ones).
This version adds a SECOND read for the log file, line-at-a-time, trading I/O for CPU, but should still be pretty fast.
It prints out the line number, and a bunch of other diagnostic/unnecessary info for where the match occurred.
#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using on /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;
use constant BLOCKSIZE => 20; ##(8 * 1024);
my @findoffset;
my $file = "ascii-code.htm";
search( $file,  # "bighuge.log",
        sub { print $_[0], " at offset $_[1]\n"; push @findoffset, $_[1]; },
        # "<img[^>]*>");
        "javasc");
# Re-read file as lines
$_ = 0 for my ($line, $offset, $prev, $idx);
open(my $F, "<", $file) or die "$file: $!";
while (<$F>) {
    $line++;
    my $len = length($_);
    next unless (($offset += $len) >= $findoffset[$idx]);
    print "$line,$offset,$findoffset[$idx],$len:\t$_";
    $idx++;
    last if $idx > $#findoffset;
}
close($F);
#------------------------------------------
sub search {
    my ($file, $callback, @fragments) = @_;
    my $byteoffset = 0;
    open(my $F, "<", $file) or die "$file: $!";
    binmode($F);
    # prime the window with two blocks (if possible)
    my $nbytes = read($F, my $window, 2 * BLOCKSIZE);
    my $re = "(" . join("|", @fragments) . ")";
    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            $callback->($1, $byteoffset);
        }
        my $pos = pos($window);
        # grab the next block
        $byteoffset += $nbytes;
        $nbytes = read($F, my $block, BLOCKSIZE);
        last if $nbytes == 0;
        # slide the window by discarding the initial
        # block and appending the next. then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close($F);
}
Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.
Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the First read, and buffer/capture/print the lines containing the text right at the spot. In other words, in addition to passing the Matching $1, the search sub should callback with the line of text, in context. There may be an issue requiring more sliding window buffering, in case the "line" is split across buffers.
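One way to sketch that "line in context" idea (the window contents below are invented, and window-boundary handling is deliberately omitted): expand from the match offset out to the surrounding newlines before invoking the callback.

```perl
use strict;
use warnings;

# Expand a fragment match to its whole line using the match offset.
# A real version must also handle lines split across window
# boundaries, which this sketch ignores.
my $window = "one foo line\ntwo bar line\nthree foo again\n";
my @lines;
while ( $window =~ /foo/g ) {
    my $at    = $-[0];                           # match start offset
    my $start = rindex($window, "\n", $at) + 1;  # line start (or 0)
    my $end   = index($window, "\n", $at);       # line end
    $end = length($window) if $end < 0;
    push @lines, substr($window, $start, $end - $start);
}
print "$_\n" for @lines;
```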
"How many times do I have to tell you again and again .. not to be repetitive?"
Re: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 16:51 UTC
On modern hardware 2GB+ isn't really very big. Have you tried just reading it line-by-line with <F>? I don't know what your performance requirements are but most log-parsing jobs aren't terribly performance sensitive.
You might find that you don't have to tune your regex much once you switch to reading line-by-line. That's because each line will be much smaller than 8K, so the penalty for backtracking on a .* will consequently be much smaller.
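A line-by-line version of the search might look like this (the file name and pattern are illustrative; the sketch writes a tiny demo file first so it is self-contained):

```perl
use strict;
use warnings;

# Plain line-by-line scan; each line is small, so .* backtracking
# stays cheap. File name and pattern are illustrative.
open my $out, '>', 'demo.log' or die "demo.log: $!";
print $out "Feb 24 04:03:47 to=<sample\@iit.edu>, stat=Deferred\n";
print $out "Feb 24 04:03:48 to=<other\@example.com>, stat=Sent\n";
close $out;

my $re = qr/sample\@iit\.edu/i;
my @hits;
open my $fh, '<', 'demo.log' or die "demo.log: $!";
while ( my $line = <$fh> ) {
    push @hits, $line if $line =~ $re;
}
close $fh;
print @hits;
unlink 'demo.log';
```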
-sam
I am basically looking for something faster than grep
You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out lines that match. You may also be running into I/O limits. For example, you could try
time wc -l bigfile
which will effectively give you a lower bound (just reading the contents of the file and finding the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
Dave.
If you generally know what you are looking for ahead of time, one method is to keep a process always running that tails a log file. This process can then send everything it finds to another file, which can be searched instead.
If you need to beat grep, you can, but you have to do things that grep can't. This includes knowing how the files are laid out on disk (esp RAID), and how many CPUs you can take advantage of (i.e. lower transparency to raise performance). You can write a multithreaded (or multiprocess) script that will read through the file at specific offsets in parallel. This may require lots of tweaking though (e.g. performance depends on how the filesystem prefetches data, and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
Re: Matching lines in 2+ GB logfiles.
by educated_foo (Vicar) on May 01, 2008 at 16:40 UTC
Regarding the regex, I would suggest using ^ and $ along with the /m modifier instead of matching for "\n". On a tangential note, this kind of thing is much simpler if you use Sys::Mmap, like in the wide finder benchmark.
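For example (buffer contents invented), with /m the ^ and $ anchors match at every line boundary inside the buffer, so the pattern needs no literal \n:

```perl
use strict;
use warnings;

# With /m, ^ and $ anchor each match to line boundaries within the
# buffer, so no literal \n is needed in the pattern. Data is invented.
my $window = "Feb 24 first line\nMar 1 second line\n";
my @lines;
while ( $window =~ /^(\w{3}\s{1,2}\d{1,2}[^\n]*)$/mg ) {
    push @lines, $1;
}
print "$_\n" for @lines;
```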
Re: Matching lines in 2+ GB logfiles.
by Anonymous Monk on May 02, 2008 at 01:05 UTC
Has anyone here who is claiming that Perl can't outrun grep actually run the script I posted here, which dws wrote? This dws guy is on to something. I was finally able to modify it to work like grep, and it's 14 times faster than grep. I am working with a 484 MB maillog.
This could be more elegant, but this is my rookie solution:
while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
    $line = $1;
    if ( $1 =~ /$re/ ) {
        &$callback($line);
    }
}
ls -ltrh /var/log/syslog-ng/server2/ | grep maillog.2
-rw-r----- 1 root logs 484M Mar 11 11:13 maillog.2
-rw-r----- 1 root logs 230M Apr 1 04:10 maillog.2.gz
[dmathis@aus02syslog ~]$ date; ./jujuspeed; date
Thu May 1 19:27:57 CDT 2008
Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
Thu May 1 19:28:10 CDT 2008
Time taken: 13 Seconds
[dmathis@aus02syslog ~]$ date; egrep -i 'hidden@hotmail.com' /var/log/syslog-ng/server2/maillog.2; date
Thu May 1 19:28:48 CDT 2008
Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
Thu May 1 19:31:57 CDT 2008
Time Taken: 189 Seconds
Thanks for all of the help on here. I have learned a lot :)
while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
    $line = $1;
    if ( $1 =~ /$re/ ) {
        &$callback($line);
    }
}
This is very close to what mscharrer suggested before.