PerlMonks  

Best way to search large files in Perl

by ccmadd (Initiate)
on May 11, 2016 at 20:08 UTC ( [id://1162801] )

ccmadd has asked for the wisdom of the Perl Monks concerning the following question:

At work, we have a large zipped log file (over 500 MB) on a Linux server, and a Perl script that generates a report from information in that log file. I'm new to Perl and I inherited the script. Everything ran fine (processing took about 35 seconds) until we increased volume; now the same script takes 90 minutes because of the amount of data. I'm using several calls to the Linux grep command from within my script. Is there a faster way to do this using only Perl rather than the Linux command, or is this the best approach?
Some additional detail: I first get a list of unique things I'm interested in, similar to a product ID (the list contains about 7,000 unique items). Then I iterate over the big log file once, using regexes to find the lines I need for gathering additional data about each product ID, and I write those lines out to a few new (smaller) files. Then I loop through the product ID list once and execute several different grep commands against the new, smaller files I created. Again, I'm using the Linux grep command, not Perl's grep, like this:

my @productArray = `zgrep 'search term' logfile.log.gz | awk -F:: '{print \$2}'`;  # \$2 is escaped so Perl's own $2 is not interpolated into the command

Thanks

Replies are listed 'Best First'.
Re: Best way to search large files in Perl
by Laurent_R (Canon) on May 11, 2016 at 22:59 UTC
    Again, I'm using the Linux grep command, not Perl's grep
    The Unix grep and Perl's grep don't really do the same thing anyway; the only thing they have in common is the idea of filtering data. And they don't filter the same type of data: Unix grep filters lines of a file, while Perl's grep filters elements of an array or a list (this may be slightly simplified, but that's the idea).
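    To make the distinction concrete, here is a minimal sketch (the file name and pattern are made up): the shell command filters lines of a file on disk, while Perl's grep filters a list that is already in memory.

    # Shell: grep 'ERROR' logfile.log      -- filters lines of a file
    # Perl: grep filters elements of a list already read into memory
    open my $fh, '<', 'logfile.log' or die $!;
    my @lines  = <$fh>;
    close $fh;
    my @errors = grep { /ERROR/ } @lines;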

    Next, the conditions you are reporting are not very clear. You've got a script which apparently processed a 500 MB log in 35 seconds (not bad), and then you say you are now at 90 minutes (roughly 150 times longer), but with no indication of the new size. Is your new file really 150 times larger? In brief, what are the differences between the 35-second and the 90-minute runs?

    Using calls to Linux grep from Perl is usually not considered a great idea. It may be a real problem (each time you do it, you fire up a new shell, or two if you pipe commands) or it may be completely negligible, depending on how many times you do it relative to the size of the data, and on how exactly you use the Linux grep. But if you launch the Linux grep 7,000 times over the file, it is quite likely to be very inefficient.

    This is even more the case if you uncompress the file for each search. It would almost certainly be better to uncompress the file only once, and then look for your data in the uncompressed version.
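    A minimal sketch of that idea, assuming the file name from the original post and using the core IO::Uncompress::Gunzip module (the per-line matching is only a placeholder):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    # Decompress once, up front, instead of once per zgrep call.
    gunzip 'logfile.log.gz' => 'logfile.log'
        or die "gunzip failed: $GunzipError\n";

    open my $fh, '<', 'logfile.log' or die "open failed: $!";
    while (my $line = <$fh>) {
        # ... do all the per-line matching here, in a single pass ...
    }
    close $fh;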

    But the bottom line is that we would need much more information to really suggest a better solution. Basically, you should show us the code and a small sample of the data.

Re: Best way to search large files in Perl
by graff (Chancellor) on May 12, 2016 at 02:46 UTC
    Based on this part of your description:

    I first get a list of unique things I'm interested in, similar to a product ID (the list contains about 7,000 unique items). Then I iterate over the big log file once, using regexes to find the lines I need for gathering additional data about each product ID, and I write those lines out to a few new (smaller) files. Then I loop through the product ID list once and execute several different grep commands against the new, smaller files I created.

    I can't really tell: (a) how many different primary input files you have, or (b) how many times you read each primary input file from beginning to end. If you are reading a really big file many, many times in order to get matches on a number of different patterns, then you might be able to speed things up by doing a slightly more complicated set of matches on a single pass over the data.
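    For instance (a hedged sketch with made-up variable names; the word-boundary match is an assumption about how the IDs appear in the log): build one combined pattern from the ~7,000 product IDs and collect everything in a single pass, instead of re-reading the file once per ID. Recent versions of Perl optimize a long alternation of literal strings into a trie, so this is usually faster than it looks.

    # @product_ids is assumed to hold the ~7,000 IDs gathered earlier,
    # and $log_fh is assumed to be an already-open handle on the log.
    my $id_pattern = join '|', map { quotemeta } @product_ids;
    my %lines_for;                        # product ID => matching lines

    while (my $line = <$log_fh>) {
        next unless $line =~ /\b($id_pattern)\b/;
        push @{ $lines_for{$1} }, $line;  # one pass, all IDs at once
    }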

    RonW gave you a really useful suggestion: PerlIO::gzip -- I second that. Read the gzip data directly into your perl script so you can use regex matches on each (uncompressed) line, because (1) Perl gives you a lot more power in matching things efficiently and flexibly, (2) you can use the matches directly in your script to build useful data structures, and (3) you save some overhead time by not launching sub-shells to run unix commands.

Re: Best way to search large files in Perl
by RonW (Parson) on May 11, 2016 at 21:07 UTC

    This comes to mind (not tested):

    #!perl
    use strict;
    use warnings;
    use PerlIO::gzip;               # provides the :gzip I/O layer

    my @productArray;               # declaration added so the snippet compiles under strict

    open my $fh, '<:gzip', 'file.gz' or die $!;
    while (<$fh>) {
        next unless /search term/;  # keep only the lines of interest
        my @f = split '::';         # split the current line on '::'
        push @productArray, $f[1];  # field 2, matching the original awk '{print $2}'
    }
    close $fh;
Re: Best way to search large files in Perl
by Eily (Monsignor) on May 11, 2016 at 20:59 UTC

    Your post isn't very clear; you should write a small example so that we can better understand what data you are processing. See How do I post a question effectively?. And you should give more information about the sizes and complexity you have to deal with (how big is the big file, and how many smaller files do you process?).

    The first thing you need to do is identify which part (or parts) is taking so long. Since you are talking about minutes, you can make a pretty rough measurement (in seconds) with code like:

    my $start_time = time;
    # Big step here
    print "First step ended after ", time - $start_time, " seconds\n";
    # Other big step
    print "Second step ended after ", time - $start_time, " seconds\n";
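    If whole seconds turn out to be too coarse, a variant using the core Time::HiRes module gives sub-second resolution (a sketch; the step comments are placeholders):

    use Time::HiRes qw(time);   # replaces time() with a floating-point version

    my $t0 = time;
    # ... first big step here ...
    printf "First step took %.2f seconds\n", time - $t0;

    $t0 = time;
    # ... second big step here ...
    printf "Second step took %.2f seconds\n", time - $t0;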
    Try to find out exactly which part is taking most of the time, and then we can focus on that :).

    Edit s/you're/your/ :D

Re: Best way to search large files in Perl
by BillKSmith (Monsignor) on May 11, 2016 at 21:06 UTC

    First, remember that you do not need the fastest possible code. It only has to be "fast enough".

    You may have to try several implementations. For my first pure Perl approach, I would create a hash whose keys are the possible product IDs. Read the file one line at a time. Ignore (next if ...;) lines that would not have been kept by your first grep. Extract the product ID from each remaining line (use split, unpack, substr, or a regex, depending on the format of your data). Process the line for that ID only if the ID exists in the hash.
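    A hedged sketch of that approach (the '::' separator, the field position, and the variable names are assumptions; adjust them to the real log format):

    # Hash whose keys are the possible product IDs (@product_ids assumed to exist).
    my %is_wanted = map { ($_ => 1) } @product_ids;

    open my $fh, '<', 'logfile.log' or die $!;
    while (my $line = <$fh>) {
        next unless $line =~ /search term/;   # same filter as the first grep
        my $id = (split /::/, $line)[1];      # extract the product ID (field 2 assumed)
        next unless defined $id && $is_wanted{$id};
        # ... process the line for this ID ...
    }
    close $fh;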

    Bill
Re: Best way to search large files in Perl
by LanX (Saint) on May 12, 2016 at 02:19 UTC
    I've never used zgrep but I think the file has to be unzipped completely and cached before being processed.

    You are most likely running into some kind of swapping issue because one of the many pipe buffers is being exceeded.

    Easiest solution: break up the log into several smaller ones.

    Not a perl question and rather common sense.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!
