http://qs321.pair.com?node_id=683970


in reply to Re: Matching lines in 2+ GB logfiles.
in thread Matching lines in 2+ GB logfiles.

I am basically looking for something faster than grep. I am being forced to grep these huge maillogs that are around 2.5 GB each, and I have six of them to search through.

After all this is over, all that will really have mattered is how we treated each other.

Re^3: Matching lines in 2+ GB logfiles.
by dave_the_m (Monsignor) on May 01, 2008 at 17:55 UTC
    I am basically looking for something faster than grep
    You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out the lines that match. You may also be running into I/O limits. For example, you could try
    time wc -l bigfile
    which will effectively give you a lower bound (just read the contents of the file and find the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
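    If you want to see how Perl's raw read speed compares against that same floor, here's a minimal sketch (the filename is whatever you pass on the command line; it just reads lines and counts them, doing no matching):

        # count_lines.pl - measure the raw I/O floor that any search must pay
        use strict;
        use warnings;

        my $file = shift or die "usage: $0 file\n";
        open my $fh, '<', $file or die "open $file: $!\n";
        my $count = 0;
        $count++ while <$fh>;             # read every line, match nothing
        print "$count\n";

    Run it under time, just like the wc example above.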

    Dave.

Re^3: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 18:14 UTC
Re^3: Matching lines in 2+ GB logfiles.
by bluto (Curate) on May 01, 2008 at 19:07 UTC
    If you generally know what you are looking for ahead of time, one method is to keep a process running that tails the log file and appends every matching line to another, smaller file, which can then be searched instead.
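    A minimal sketch of that idea in Perl - the log path, output path, and /interesting/ pattern are all placeholders for whatever you expect to need later:

        # prefilter.pl - follow a growing log, copy matching lines to a smaller file
        use strict;
        use warnings;
        use Fcntl qw(:seek);
        use IO::Handle;

        # placeholder paths and pattern - substitute your own
        open my $log, '<',  '/var/log/maillog'             or die "open log: $!\n";
        open my $out, '>>', '/var/log/maillog.interesting' or die "open out: $!\n";
        $out->autoflush(1);
        seek $log, 0, SEEK_END;           # start at the end, like tail -f

        while (1) {
            while (my $line = <$log>) {
                print $out $line if $line =~ /interesting/;
            }
            sleep 1;                      # wait for new data
            seek $log, 0, SEEK_CUR;       # clear EOF so the next read retries
        }

    This doesn't handle log rotation; File::Tail on CPAN does, if you need that.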

    If you need to beat grep, you can, but you have to do things that grep can't. That means knowing how the files are laid out on disk (especially on RAID) and how many CPUs you can take advantage of (i.e. trading transparency for performance). You can write a multithreaded (or multiprocess) script that reads through the file at specific offsets in parallel. This may require a lot of tweaking, though (e.g. performance depends on how the filesystem prefetches data and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
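    A rough sketch of the forking approach (the chunk arithmetic is illustrative; each worker skips the partial line at the start of its byte range, and the previous worker finishes the line that straddles the boundary, so no line is scanned twice):

        # pgrep.pl - search a file with several forked workers, one byte range each
        use strict;
        use warnings;

        my ($pattern, $file, $workers) = @ARGV;
        $workers ||= 4;
        die "usage: $0 pattern file [workers]\n" unless defined $file;

        my $size  = -s $file or die "cannot stat $file\n";
        my $chunk = int($size / $workers) + 1;
        my $re    = qr/$pattern/;

        for my $i (0 .. $workers - 1) {
            defined(my $pid = fork) or die "fork: $!\n";
            next if $pid;                         # parent: spawn the next worker

            open my $fh, '<', $file or die "open: $!\n";
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            seek $fh, $start, 0;
            <$fh> if $start;                      # discard partial first line
            while (<$fh>) {
                print if /$re/;
                last if tell($fh) > $end;         # past our range: stop
            }
            exit 0;
        }
        wait for 1 .. $workers;                   # reap the children

    Whether this beats plain grep depends entirely on whether your storage can serve several readers at once; on a single spindle it may well be slower.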

Re^3: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 18:16 UTC
    Perl is more flexible and powerful than grep but definitely not faster. Also, AFAIK grep (or was it egrep?) uses a finite state machine, not an infinite one like Perl, so it is much faster but much less flexible, i.e. it doesn't support back-tracking, etc.

    Try to optimise your regex to speed things up. In Perl you can use use re 'debug'; to see how much backtracking your regex causes.
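    For example, a tiny script like this (the pattern and sample line are just placeholders) dumps the engine's compilation and match trace to STDERR:

        use strict;
        use warnings;
        use re 'debug';     # trace regex compilation and execution

        "May  1 12:00:00 host postfix/smtpd[123]: connect" =~ /postfix.*connect/;

    The same trace is available from the command line via perl -Mre=debug -e '...'.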

      Perl's regular expression engine may be powerful, but it doesn't yet use an "infinite" state machine! I think the terms you're looking for are NFA (Nondeterministic Finite Automaton, like Perl) and DFA (Deterministic Finite Automaton, like egrep - though in practice egrep is a hybrid).

      -sam

        Yes, you are right, that was exactly what I meant. I confused some terms here. Thanks for pointing this out.