http://qs321.pair.com?node_id=722991

lukka has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am a newbie to Perl. I need to do something like the Unix grep in Perl. I have a large file, BUFFER.dat (which has 322129 lines), and another large array, @validbufs (which can have, say, 200000 elements). What I need to do is print out the lines in BUFFER.dat whose first column exactly matches any element of @validbufs. The problem is that this lookup seems to take far too much time. I am only showing the few lines of code here that I think are the problematic, slow section:
my %linecontainsbuf = ();
while ($line = <BUFFER>) {
    @fields      = split /\'/, $line;
    $searchfield = $fields[1];
    $linecontainsbuf{$searchfield} = $line for @validbufs;
}
foreach $validbuf (@validbufs) {
    print $linecontainsbuf{$validbuf};
}
It seems to work fine when the file and the array are small: if BUFFER.dat has 10000 lines and @validbufs has 28 elements, it finishes in 2 minutes. However, as soon as BUFFER.dat has a large number of lines (e.g. 322129) and @validbufs has 200000 elements, it seems to take hours!

Please note: @validbufs is a unique list of strings, and in BUFFER.dat the first column is also always unique, so there are no duplicates in the input data (either in BUFFER.dat or in @validbufs). @validbufs can have anywhere between about 50 and 200000 elements. So if @validbufs has 50 elements, the script should print out just the 50 lines in BUFFER.dat that match them; if it has 200000 elements, the script should print out the 200000 matching lines.

I tried splitting the big BUFFER.dat into chunks of 1000 lines each and doing the lookup on the split files, but even that is very slow (it takes hours). Can you please suggest a fast way to do this lookup?
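Would building a lookup hash from @validbufs up front, as in the rough sketch below, be the right direction? This is only a sketch, not my working script: the literal @validbufs values and the hard-coded file name are placeholders so it runs on its own, and it assumes the same single-quote-delimited layout as the code above.

use strict;
use warnings;

# Hypothetical stand-ins so the sketch runs on its own: in the real
# script @validbufs is built elsewhere and BUFFER.dat is the real file.
my @validbufs = ('buf_0001', 'buf_0002', 'buf_0003');

# Build a lookup hash once: one key per valid buffer name.
my %wanted = map { $_ => 1 } @validbufs;

my %linecontainsbuf;
open my $buffer_fh, '<', 'BUFFER.dat' or die "Cannot open BUFFER.dat: $!";
while ( my $line = <$buffer_fh> ) {
    my @fields      = split /\'/, $line;
    my $searchfield = $fields[1];
    next unless defined $searchfield;

    # Store the line only when its first column is a wanted key,
    # and store it just once (no loop over @validbufs per line).
    $linecontainsbuf{$searchfield} = $line if $wanted{$searchfield};
}
close $buffer_fh;

# Print the matching lines in the same order as @validbufs.
for my $validbuf (@validbufs) {
    print $linecontainsbuf{$validbuf} if exists $linecontainsbuf{$validbuf};
}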