PerlMonks  

Re^2: Matching lines in 2+ GB logfiles.

by dbmathis (Scribe)
on May 01, 2008 at 17:08 UTC ( #683970=note )


in reply to Re: Matching lines in 2+ GB logfiles.
in thread Matching lines in 2+ GB logfiles.

I am basically looking for something faster than grep. I am being forced to grep these huge maillogs that are around 2.5 GB each and I have 6 of these to search through.

After all this is over, all that will really have mattered is how we treated each other.

Replies are listed 'Best First'.
Re^3: Matching lines in 2+ GB logfiles.
by dave_the_m (Monsignor) on May 01, 2008 at 17:55 UTC
    I am basically looking for something faster than grep
    You're unlikely to find anything much faster than grep - it's a program written in C, optimised to scan through a text file printing out the lines that match. You may also be running into I/O limits. For example, you could try
    time wc -l bigfile
    which will effectively give you a lower bound (just read the contents of the file and find the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.
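    The comparison above can be sketched like this (filename and pattern are made up; your maillogs stand in for the sample file). If the two timings are close, the job is I/O-bound and a faster search tool won't help:

    ```shell
    # Build a small stand-in file; in practice this is your 2.5 GB maillog.
    printf 'to=<user@example.com> sent\nto=<other@example.net> bounced\n' > /tmp/sample.log

    time wc -l /tmp/sample.log                      # lower bound: just read and count \n's
    time grep -c 'user@example.com' /tmp/sample.log # the actual search
    ```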

    Dave.

Re^3: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 18:14 UTC
Re^3: Matching lines in 2+ GB logfiles.
by bluto (Curate) on May 01, 2008 at 19:07 UTC
    If you generally know what you are looking for ahead of time, one method is to keep a process always running that tails a log file. This process can then send everything it finds to another file, which can be searched instead.
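    A rough sketch of that idea (paths and pattern hypothetical; in real use the pipeline runs indefinitely against the live maillog rather than being killed at the end):

    ```shell
    # Keep a tail | grep pipeline running so matching lines accumulate in a
    # small side file; then search that file instead of the 2.5 GB original.
    LOG=/tmp/maillog.demo            # stands in for the real maillog
    : > "$LOG"
    : > "$LOG.interesting"
    tail -n 0 -F "$LOG" 2>/dev/null \
      | grep --line-buffered 'user@example.com' >> "$LOG.interesting" &
    sleep 1                          # let the follower start up

    # Simulate the mail server appending to the log.
    printf 'to=<user@example.com> status=sent\n'     >> "$LOG"
    printf 'to=<other@example.net> status=bounced\n' >> "$LOG"
    sleep 2                          # give the pipeline time to catch up

    kill %1 2>/dev/null || true      # demo only; normally this keeps running
    grep -c 'user@example.com' "$LOG.interesting"
    ```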

    If you need to beat grep, you can, but you have to do things that grep can't. This includes knowing how the files are laid out on disk (esp RAID), and how many CPUs you can take advantage of (i.e. lower transparency to raise performance). You can write a multithreaded (or multiprocess) script that will read through the file at specific offsets in parallel. This may require lots of tweaking though (e.g. performance depends on how the filesystem prefetches data, and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.
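    A minimal sketch of the multiprocess approach (all filenames, the pattern, and the worker count are hypothetical; the generated sample file stands in for a 2.5 GB maillog). Each worker handles only the lines that *start* inside its byte range, so every line is counted exactly once:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-in data so the sketch runs; odd-numbered lines match.
    my $file = '/tmp/sample_maillog';
    open my $out, '>', $file or die $!;
    print $out ($_ % 2 ? "to=<user\@example.com> status=sent\n"
                       : "to=<other\@example.net> status=bounced\n") for 1 .. 2000;
    close $out;

    my $pat   = qr/user\@example\.com/;
    my $nproc = 4;
    my $size  = -s $file;
    my $chunk = int($size / $nproc) + 1;

    for my $i (0 .. $nproc - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        next if $pid;                        # parent: spawn the next worker
        open my $fh, '<', $file or die $!;
        if ($i) {                            # skip the line the previous
            seek $fh, $i * $chunk - 1, 0;    # worker already owns
            <$fh>;
        }
        my $count = 0;
        while (defined(my $line = <$fh>)) {
            # Stop once a line starts beyond the end of our byte range.
            last if tell($fh) - length($line) >= ($i + 1) * $chunk;
            $count++ if $line =~ $pat;
        }
        open my $res, '>', "$file.count.$i" or die $!;
        print $res "$count\n";
        exit 0;
    }
    wait for 1 .. $nproc;                    # reap all workers

    my $total = 0;
    for my $i (0 .. $nproc - 1) {
        open my $res, '<', "$file.count.$i" or die $!;
        $total += <$res>;
    }
    print "matches: $total\n";               # prints "matches: 1000"
    ```

    Whether this actually beats a single grep depends on the disk layout: on one spindle the seeks can easily make it slower, while on RAID or with the file in cache the parallelism can pay off.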

Re^3: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 18:16 UTC
    Perl is more flexible and powerful than grep but definitely not faster. Also AFAIK grep (or was it egrep?) uses a finite state machine, not an infinite one like perl, so it is much faster but much less flexible, i.e. it doesn't support back-tracking, etc.

    Try to optimise your regex to speed things up. In Perl you can add use re 'debug'; to show how much work your regex causes.
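    A tiny illustration of that pragma (the sample log line is made up): use re 'debug' dumps the regex compilation and the matching trace to STDERR, which shows how much backtracking a pattern costs.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $matched;
    {
        use re 'debug';   # trace output goes to STDERR; lexically scoped
        $matched = ('May  1 17:08:01 mailhost postfix/smtpd[123]: connect'
                    =~ m{postfix/\w+});
    }
    print $matched ? "matched\n" : "no match\n";   # prints "matched"
    ```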

      Perl's regular expression engine may be powerful but it doesn't yet use an "infinite" state machine! I think the terms you're looking for are NFA (Nondeterministic Finite Automaton, like Perl) and DFA (Deterministic Finite Automaton, like egrep, though in practice egrep is sometimes a hybrid of the two).

      -sam

        Yes, you are right, that was exactly what I meant. I confused some terms here. Thanks for pointing this out.
