http://qs321.pair.com?node_id=683970


in reply to Re: Matching lines in 2+ GB logfiles.
in thread Matching lines in 2+ GB logfiles.

I am basically looking for something faster than grep. I am being forced to grep these huge maillogs that are around 2.5 GB each, and I have six of them to search through.

After all this is over, all that will really have mattered is how we treated each other.

Re^3: Matching lines in 2+ GB logfiles.
by dave_the_m (Monsignor) on May 01, 2008 at 17:55 UTC
    I am basically looking for something faster than grep
    You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out the lines that match. You may also be running into I/O limits. For example, you could try
    time wc -l bigfile
    which will effectively give you a lower bound (just read the contents of the file and find the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
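    If you want to see how Perl's raw read speed compares against that same floor, here's a minimal sketch (the filename is whatever you pass on the command line; it just reads lines and counts them, doing no matching):

        # count_lines.pl - measure the raw I/O floor that any search must pay
        use strict;
        use warnings;

        my $file = shift or die "usage: $0 file\n";
        open my $fh, '<', $file or die "open $file: $!\n";
        my $count = 0;
        $count++ while <$fh>;             # read every line, match nothing
        print "$count\n";

    Run it under time, just like the wc example above.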

    Dave.

Re^3: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 18:14 UTC
Re^3: Matching lines in 2+ GB logfiles.
by bluto (Curate) on May 01, 2008 at 19:07 UTC
    If you generally know what you are looking for ahead of time, one method is to keep a process running that tails the log file and appends every matching line to another, smaller file, which can then be searched instead.
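    A minimal sketch of that idea in Perl - the log path, output path, and /interesting/ pattern are all placeholders for whatever you expect to need later:

        # prefilter.pl - follow a growing log, copy matching lines to a smaller file
        use strict;
        use warnings;
        use Fcntl qw(:seek);
        use IO::Handle;

        # placeholder paths and pattern - substitute your own
        open my $log, '<',  '/var/log/maillog'             or die "open log: $!\n";
        open my $out, '>>', '/var/log/maillog.interesting' or die "open out: $!\n";
        $out->autoflush(1);
        seek $log, 0, SEEK_END;           # start at the end, like tail -f

        while (1) {
            while (my $line = <$log>) {
                print $out $line if $line =~ /interesting/;
            }
            sleep 1;                      # wait for new data
            seek $log, 0, SEEK_CUR;       # clear EOF so the next read retries
        }

    This doesn't handle log rotation; File::Tail on CPAN does, if you need that.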

    If you need to beat grep, you can, but you have to do things that grep can't. That means knowing how the files are laid out on disk (especially on RAID) and how many CPUs you can take advantage of (i.e. trading transparency for performance). You can write a multithreaded (or multiprocess) script that reads through the file at specific offsets in parallel. This may require a lot of tweaking, though (e.g. performance depends on how the filesystem prefetches data and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
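    A rough sketch of the forking approach (the chunk arithmetic is illustrative; each worker skips the partial line at the start of its byte range, and the previous worker finishes the line that straddles the boundary, so no line is scanned twice):

        # pgrep.pl - search a file with several forked workers, one byte range each
        use strict;
        use warnings;

        my ($pattern, $file, $workers) = @ARGV;
        $workers ||= 4;
        die "usage: $0 pattern file [workers]\n" unless defined $file;

        my $size  = -s $file or die "cannot stat $file\n";
        my $chunk = int($size / $workers) + 1;
        my $re    = qr/$pattern/;

        for my $i (0 .. $workers - 1) {
            defined(my $pid = fork) or die "fork: $!\n";
            next if $pid;                         # parent: spawn the next worker

            open my $fh, '<', $file or die "open: $!\n";
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            seek $fh, $start, 0;
            <$fh> if $start;                      # discard partial first line
            while (<$fh>) {
                print if /$re/;
                last if tell($fh) > $end;         # past our range: stop
            }
            exit 0;
        }
        wait for 1 .. $workers;                   # reap the children

    Whether this beats plain grep depends entirely on whether your storage can serve several readers at once; on a single spindle it may well be slower.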

Re^3: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 18:16 UTC
    Perl is more flexible and powerful than grep but definitely not faster. Also, AFAIK grep (or was it egrep?) uses a finite state machine, not an infinite one like Perl, so it is much faster but much less flexible, i.e. it doesn't support back-tracking, etc.

    Try to optimise your regex to speed things up. In Perl you can use use re 'debug'; to see how much backtracking your regex causes.
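    For example, a tiny script like this (the pattern and sample line are just placeholders) dumps the engine's compilation and match trace to STDERR:

        use strict;
        use warnings;
        use re 'debug';     # trace regex compilation and execution

        "May  1 12:00:00 host postfix/smtpd[123]: connect" =~ /postfix.*connect/;

    The same trace is available from the command line via perl -Mre=debug -e '...'.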

      Perl's regular expression engine may be powerful, but it doesn't yet use an "infinite" state machine! I think the terms you're looking for are NFA (Nondeterministic Finite Automaton, like Perl) and DFA (Deterministic Finite Automaton, like egrep - though in practice egrep is a hybrid).

      -sam

        Yes, you are right, that was exactly what I meant. I confused some terms here. Thanks for pointing this out.