http://qs321.pair.com?node_id=683967


in reply to Matching lines in 2+ GB logfiles.

On modern hardware 2GB+ isn't really very big. Have you tried just reading it line-by-line with <F>? I don't know what your performance requirements are but most log-parsing jobs aren't terribly performance sensitive.

You might find that you don't have to tune your regex much once you switch to reading line-by-line. That's because each line will be much smaller than 8K, so the penalty for backtracking on a .* will be correspondingly much smaller.
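
For example, a minimal line-by-line scan might look like this (the file name and pattern are placeholders, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $pattern = qr/relay=example\.com/;        # hypothetical pattern
    open my $fh, '<', 'maillog' or die "open maillog: $!";
    while (my $line = <$fh>) {                   # one line in memory at a time
        print $line if $line =~ $pattern;
    }
    close $fh;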

-sam

Re^2: Matching lines in 2+ GB logfiles.
by dbmathis (Scribe) on May 01, 2008 at 17:08 UTC

    I am basically looking for something faster than grep. I am being forced to grep these huge maillogs that are around 2.5 GB each and I have 6 of these to search through.

    After all this is over, all that will really have mattered is how we treated each other.
      I am basically looking for something faster than grep
      You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out lines that match. You may also be running into IO limits. For example, you could try
      time wc -l bigfile
      Which will effectively give you a lower bound (just read the contents of the file, and find the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
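
      If you want the same kind of baseline from inside Perl (the file name is a placeholder), a rough sketch is to read and count lines while doing no matching at all, and time that:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Time::HiRes qw(time);

          my $start = time;
          open my $fh, '<', 'maillog' or die "open maillog: $!";
          my $lines = 0;
          $lines++ while <$fh>;                  # read lines, do nothing else
          close $fh;
          printf "%d lines in %.2f seconds\n", $lines, time - $start;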

      Dave.

      If you generally know what you are looking for ahead of time, one method is to keep a process always running that tails a log file. This process can then send everything it finds to another file, which can be searched instead.
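
      A bare-bones sketch of that idea (the pattern, file names and one-second poll interval are all made up) keeps reading past end-of-file and appends any hits to a smaller file; File::Tail on CPAN packages up the same approach:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use IO::Handle;

          my $pattern = qr/relay=example\.com/;             # hypothetical pattern
          open my $in,  '<',  'maillog'      or die "open maillog: $!";
          open my $out, '>>', 'maillog.hits' or die "open maillog.hits: $!";
          $out->autoflush(1);
          seek $in, 0, 2;                                   # start at the current end of the log

          while (1) {
              while (my $line = <$in>) {
                  print {$out} $line if $line =~ $pattern;
              }
              sleep 1;                                      # wait for new data
              seek $in, 0, 1;                               # clear EOF so readline tries again
          }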

      If you need to beat grep, you can, but you have to do things that grep can't. This includes knowing how the files are laid out on disk (esp RAID), and how many CPUs you can take advantage of (i.e. lower transparency to raise performance). You can write a multithreaded (or multiprocess) script that will read through the file at specific offsets in parallel. This may require lots of tweaking though (e.g. performance depends on how the filesystem prefetches data, and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
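
      A rough multi-process sketch along those lines (the worker count, file name and pattern are assumptions, and matched lines come out in no particular order) forks one child per byte range of the file:

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $file    = 'maillog';               # hypothetical file
          my $pattern = qr/relay=example\.com/;  # hypothetical pattern
          my $workers = 4;                       # tune to your CPUs and disks

          my $size  = -s $file or die "can't stat $file: $!";
          my $chunk = int($size / $workers) + 1;

          my @pids;
          for my $i (0 .. $workers - 1) {
              my $pid = fork();
              die "fork failed: $!" unless defined $pid;
              if ($pid == 0) {                           # child: scan one byte range
                  open my $fh, '<', $file or die "open $file: $!";
                  my $start = $i * $chunk;
                  my $end   = $start + $chunk;
                  seek $fh, $start, 0;
                  <$fh> if $start > 0;                   # skip the partial line at the boundary
                  while (<$fh>) {
                      print if /$pattern/;
                      last if tell($fh) > $end;          # the next worker owns the rest
                  }
                  exit 0;
              }
              push @pids, $pid;
          }
          waitpid $_, 0 for @pids;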

      Perl is more flexible and powerful than grep but definitely not faster. Also AFAIK grep (or was it egrep?) uses a finite state machine, not an infinite one like perl, so it is much faster, but much less flexible, i.e. doesn't support back-tracking, etc.

      Try to optimise your regex to speed things up. In perl you can use use re 'debug'; to show how many permutations your regex causes.
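
      For example (the sample line and pattern here are invented), this dumps the compiled pattern and every step of the match to STDERR, which makes backtracking on a leading .* easy to spot:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use re 'debug';      # trace compilation and every match step to STDERR

          my $line = 'May  1 12:00:00 mx1 postfix/smtpd[1234]: connect from unknown';
          print "matched\n" if $line =~ /.*smtpd.*connect/;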

        Perl's regular expression engine may be powerful but it doesn't yet use an "infinite" state machine! I think the terms you're looking for are NFA (Nondeterministic Finite Automaton, like Perl) and DFA (Deterministic Finite Automaton, like egrep, though in practice it's often a hybrid).

        -sam