PerlMonks  

Best way to search large files in Perl

by ccmadd (Initiate)
on May 11, 2016 at 20:08 UTC ( [id://1162801] )

ccmadd has asked for the wisdom of the Perl Monks concerning the following question:

At work, we have a large zipped log file (over 500 MB) on a Linux server, and a Perl script that generates a report from information in that log file. I'm new to Perl and I inherited the script. Everything ran fine (processing took about 35 seconds) until we increased volume; now the same script takes 90 minutes because of the amount of data. I'm using several calls to the Linux grep command from within my script. Is there a faster way to do this using only Perl rather than the Linux command, or is this the best approach?
Some additional detail: I first get a list of unique things I'm interested in, similar to a product ID (the list contains about 7,000 unique items). Then I iterate over the big log file once, using regexes to find the lines I need for gathering additional data about each product ID, and I write those lines out to a few new (smaller) files. Then I loop through the product ID list once and execute several different grep commands against the new, smaller files I created. Again, I'm using the Linux grep command, not Perl's grep, like this:

my @productArray = `zgrep 'search term' logfile.log.gz | awk -F:: '{print \$2}'`;  # \$2 is escaped so Perl's own $2 is not interpolated into the command

Thanks

Replies are listed 'Best First'.
Re: Best way to search large files in Perl
by Laurent_R (Canon) on May 11, 2016 at 22:59 UTC
    Again, I'm using the Linux grep command, not Perl's grep
    The Unix grep and Perl's grep don't really do the same thing anyway; the only thing they have in common is the idea of filtering data. And they don't filter the same type of data: Unix grep filters lines of a file, while Perl's grep filters elements of an array or a list (this may be slightly simplified, but that's the idea).
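    To make the distinction concrete, here is a minimal sketch (the file name and pattern are made up): the shell command filters lines of a file on disk, while Perl's grep filters a list that is already in memory.

    # Shell: grep 'ERROR' logfile.log      -- filters lines of a file
    # Perl: grep filters elements of a list already read into memory
    open my $fh, '<', 'logfile.log' or die $!;
    my @lines  = <$fh>;
    close $fh;
    my @errors = grep { /ERROR/ } @lines;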

    Next, the conditions you are reporting are not very clear. You've got a script which apparently processed a 500 MB log in 35 seconds (not bad), and then you say you are now at 90 minutes (roughly 150 times longer), but with no indication of the new size. Is your new file really 150 times larger? In brief, what are the differences between the 35-second and the 90-minute runs?

    Using calls to Linux grep from Perl is usually not considered a great idea. It may be a real problem (each time you do it, you fire up a new shell, or two if you pipe commands) or it may be completely negligible, depending on how many times you do it relative to the size of the data, and on how exactly you use the Linux grep. But if you launch the Linux grep 7,000 times over the file, it is quite likely to be very inefficient.

    This is even more the case if you uncompress the file for each search. It would almost certainly be better to uncompress the file only once, and then look for your data in the uncompressed version.
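    A minimal sketch of that idea, assuming the file name from the original post and using the core IO::Uncompress::Gunzip module (the per-line matching is only a placeholder):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    # Decompress once, up front, instead of once per zgrep call.
    gunzip 'logfile.log.gz' => 'logfile.log'
        or die "gunzip failed: $GunzipError\n";

    open my $fh, '<', 'logfile.log' or die "open failed: $!";
    while (my $line = <$fh>) {
        # ... do all the per-line matching here, in a single pass ...
    }
    close $fh;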

    But the bottom line is that we would need much more information to really suggest a better solution. Basically, you should show us the code and a small sample of the data.

Re: Best way to search large files in Perl
by graff (Chancellor) on May 12, 2016 at 02:46 UTC
    Based on this part of your description:

    I first get a list of unique things I'm interested in, similar to a product ID (the list contains about 7,000 unique items). Then I iterate over the big log file once, using regexes to find the lines I need for gathering additional data about each product ID, and I write those lines out to a few new (smaller) files. Then I loop through the product ID list once and execute several different grep commands against the new, smaller files I created.

    I can't really tell: (a) how many different primary input files you have, or (b) how many times you read each primary input file from beginning to end. If you are reading a really big file many, many times in order to get matches on a number of different patterns, then you might be able to speed things up by doing a slightly more complicated set of matches on a single pass over the data.
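    For instance (a hedged sketch with made-up variable names; the word-boundary match is an assumption about how the IDs appear in the log): build one combined pattern from the ~7,000 product IDs and collect everything in a single pass, instead of re-reading the file once per ID. Recent versions of Perl optimize a long alternation of literal strings into a trie, so this is usually faster than it looks.

    # @product_ids is assumed to hold the ~7,000 IDs gathered earlier,
    # and $log_fh is assumed to be an already-open handle on the log.
    my $id_pattern = join '|', map { quotemeta } @product_ids;
    my %lines_for;                        # product ID => matching lines

    while (my $line = <$log_fh>) {
        next unless $line =~ /\b($id_pattern)\b/;
        push @{ $lines_for{$1} }, $line;  # one pass, all IDs at once
    }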

    RonW gave you a really useful suggestion: PerlIO::gzip -- I second that. Read the gzip data directly into your perl script so you can use regex matches on each (uncompressed) line, because (1) Perl gives you a lot more power in matching things efficiently and flexibly, (2) you can use the matches directly in your script to build useful data structures, and (3) you save some overhead time by not launching sub-shells to run unix commands.

Re: Best way to search large files in Perl
by RonW (Parson) on May 11, 2016 at 21:07 UTC

    This comes to mind (not tested):

    #!perl
    use strict;
    use warnings;
    use PerlIO::gzip;               # provides the :gzip I/O layer

    my @productArray;               # declaration added so the snippet compiles under strict

    open my $fh, '<:gzip', 'file.gz' or die $!;
    while (<$fh>) {
        next unless /search term/;  # keep only the lines of interest
        my @f = split '::';         # split the current line on '::'
        push @productArray, $f[1];  # field 2, matching the original awk '{print $2}'
    }
    close $fh;
Re: Best way to search large files in Perl
by Eily (Monsignor) on May 11, 2016 at 20:59 UTC

    Your post isn't very clear; you should write a small example so that we can better understand what data you are processing. See How do I post a question effectively?. And you should give more information about the sizes and complexity you have to deal with (how big is the big file, and how many smaller files do you process?).

    The first thing you need to do is identify which part (or parts) is taking so long. Since you are talking about minutes, you can make a pretty rough measurement (in seconds) with code like:

    my $start_time = time;
    # Big step here
    print "First step ended after ", time - $start_time, " seconds\n";
    # Other big step
    print "Second step ended after ", time - $start_time, " seconds\n";
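    If whole seconds turn out to be too coarse, a variant using the core Time::HiRes module gives sub-second resolution (a sketch; the step comments are placeholders):

    use Time::HiRes qw(time);   # replaces time() with a floating-point version

    my $t0 = time;
    # ... first big step here ...
    printf "First step took %.2f seconds\n", time - $t0;

    $t0 = time;
    # ... second big step here ...
    printf "Second step took %.2f seconds\n", time - $t0;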
    Try to find out exactly which part is taking most of the time, and then we can focus on that :).

    Edit s/you're/your/ :D

Re: Best way to search large files in Perl
by BillKSmith (Monsignor) on May 11, 2016 at 21:06 UTC

    First, remember that you do not need the fastest possible code. It only has to be "fast enough".

    You may have to try several implementations. For my first pure Perl approach, I would create a hash whose keys are the possible product IDs. Read the file one line at a time. Ignore (next if ...;) lines that would not have been kept by your first grep. Extract the product ID from each remaining line (use split, unpack, substr, or a regex, depending on the format of your data). Process the line for that ID only if the ID exists in the hash.
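    A hedged sketch of that approach (the '::' separator, the field position, and the variable names are assumptions; adjust them to the real log format):

    # Hash whose keys are the possible product IDs (@product_ids assumed to exist).
    my %is_wanted = map { ($_ => 1) } @product_ids;

    open my $fh, '<', 'logfile.log' or die $!;
    while (my $line = <$fh>) {
        next unless $line =~ /search term/;   # same filter as the first grep
        my $id = (split /::/, $line)[1];      # extract the product ID (field 2 assumed)
        next unless defined $id && $is_wanted{$id};
        # ... process the line for this ID ...
    }
    close $fh;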

    Bill
Re: Best way to search large files in Perl
by LanX (Saint) on May 12, 2016 at 02:19 UTC
    I've never used zgrep but I think the file has to be unzipped completely and cached before being processed.

    You are most likely running into some kind of swapping issue because one of the many pipe buffers is being exceeded.

    Easiest solution: break up the log into several smaller ones.

    Not a perl question and rather common sense.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!
