http://qs321.pair.com?node_id=11110379

beherasan has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to match a set of about 7,000 short sequences against a much larger set of sequences stored in a hash (about 100 GB in size). Because this is a huge task, I thought of using multi-threading to search 40 sequences at a time. The script does create the threads, but I run into RAM problems once 40 threads are running. It seems that each thread takes its own copy of the hash (about 100 GB each), so the program gets killed partway through. I am using a system with 512 GB of RAM and 88 threads.

Is there a memory-efficient way of doing this?

Thank you,
Santosh

#!/usr/bin/perl
use strict;
use warnings;
use threads;

my $num_of_threads = 40;

## Store the peptides to search in an array
my @peptides;
open(my $in, '<', 'peptides.txt') or die "Could not open the file: $!\n";
while (<$in>) {
    chomp;
    s/\r//g;
    push @peptides, $_;
}
close $in;

my %hashNR;     ## Store the sequences in this hash
my %hashRes;    ## Store the matched results
my $nrid = "";
open(my $ref, '<', 'NR.fasta') or die "Could not open the file: $!\n";
while (<$ref>) {
    chomp;
    s/\r//g;
    if (/^>/) {
        $nrid = (split /\s/)[0];    ## ID is the header up to the first whitespace
    }
    else {
        $hashNR{$nrid} .= $_;
    }
}
close $ref;

my @allIDS = keys %hashNR;
print "Reference Reading Completed\n";

my $j = 0;
while ($j < @peptides) {
    ## Start up to $num_of_threads threads, one peptide per thread
    my @batch;
    while (@batch < $num_of_threads && $j < @peptides) {
        my $pep = $peptides[$j++];
        my $thr = threads->create(\&doOperation, $pep);
        push @batch, [ $pep, $thr ];
    }
    ## Collect results in the parent; changes a thread makes to %hashRes
    ## would only affect that thread's private copy
    foreach my $job (@batch) {
        my ($pep, $thr) = @$job;
        my $ids = $thr->join();
        $hashRes{$pep} = $ids if length $ids;
    }
}

open(my $out, '>', 'Outfile.txt') or die "Could not create the file: $!\n";
foreach my $k (keys %hashRes) {
    print $out "$k\t$hashRes{$k}\n";
}
close $out;

###############################################
## Task run by each thread: return the IDs of all sequences matching the peptide
sub doOperation {
    my ($pep) = @_;
    my $ids = "";
    foreach my $id (@allIDS) {
        $ids .= ",$id" if $hashNR{$id} =~ /$pep/;
    }
    return $ids;
}
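
For comparison, here is a minimal sketch of one possible memory-lean direction: instead of loading the whole reference into a hash, read NR.fasta one record at a time and test every peptide against the current sequence, so that only one sequence is held in memory at once. This is only an illustration under assumptions, not a tested solution for the real data. The helper name scan_fasta and the demo data are made up, and it assumes the peptides are plain literal strings, so index() can stand in for the regex match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: scan FASTA records from a filehandle one at a time
# and collect, for each peptide, the IDs of the sequences that contain it.
# Assumes peptides are literal strings (no regex metacharacters intended).
sub scan_fasta {
    my ($fh, $peptides) = @_;
    my %hits;
    my ($id, $seq) = ('', '');

    # Test all peptides against the record read so far
    my $flush = sub {
        return unless length $seq;
        for my $pep (@$peptides) {
            push @{ $hits{$pep} }, $id if index($seq, $pep) >= 0;
        }
    };

    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;
        if ($line =~ /^>/) {
            $flush->();                                   # finish previous record
            ($id, $seq) = ((split /\s/, $line)[0], '');   # start a new one
        }
        else {
            $seq .= $line;
        }
    }
    $flush->();                                           # last record in the file
    return \%hits;
}

# Tiny in-memory demo with made-up data
my $fasta = ">seq1 demo\nMKTAYIAK\nQRQISF\n>seq2 demo\nGGGPEPTIDE\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $hits = scan_fasta($fh, ['AKQR', 'PEPTIDE', 'XXXX']);
close $fh;

for my $pep (sort keys %$hits) {
    print "$pep\t", join(',', @{ $hits->{$pep} }), "\n";
}
```

With the real files, the filehandle would come from opening NR.fasta and the peptide list from peptides.txt. Memory use then depends on the longest single sequence and the size of the result lists rather than on the size of the whole reference.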