I have to find all ready that match with no mismatch
If you're only looking for exact matches, something like this does the job in ~ 7 seconds (including building the index):
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
our $MIN //= 5;
our $SHORT //= 20;
our $NSHORTS //= 10e3;
our $BIG //= 5e6;
our $S //= 0;
$S and srand( $S );
sub rndDna{ my $dna =''; $dna .= ( qw[A C G T] )[ rand 4 ] for 1 .. $_
+[0]; return $dna; }
my $bigDna = rndDna( $BIG );
my @shorts = map rndDna( $SHORT ), 1 .. $NSHORTS;
my $start = time;
my %idx; push @{ $idx{ substr $bigDna, $_, $MIN } }, $_ for 0 .. lengt
+h( $bigDna ) - $MIN;
printf STDERR "Indexing took %.9f secs\n", time() -$start ;
$start = time;
my $found = 0;
for my $short ( @shorts ) {
for my $pos ( @{ $idx{ my $pre = substr $short, 0, $MIN } } ) {
if( substr( $bigDna, $pos, $SHORT ) eq $short ) {
print "$short found at $pos";
++$found;
}
}
}
print STDERR "Found: ", $found;
printf STDERR "Lookup of $NSHORTS $SHORT-nt short reads against a $BIG
+-nt sequence took %.9f secs\n", time() -$start ;
__END__
C:\test>1120869-2 -BIG=5e6 -MIN=5 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 3.365217924 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 25.56
+8705082 secs
182
C:\test>1120869-2 -BIG=5e6 -MIN=6 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 3.351562023 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 6.666
+512966 secs
182
C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 4.609364986 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.699
+807882 secs
182
C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 4.519516945 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.682
+117939 secs
182
C:\test>1120869-2 -BIG=5e6 -MIN=8 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 6.175763845 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.424
+978018 secs
182
C:\test>1120869-2 -BIG=5e6 -MIN=10 -SHORT=14 -NSHORTS=10000 -S=1 | wc
+-l
Indexing took 9.214743853 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.058
+255196 secs
182
Note: I'm using 14nt short reads because any longer and none are found in my randomly generated test sequence.
You can reduce the lookup time by increasing number of characters you index; with the trade off that the indexing take longer and uses more memory: 1 x 5e6nt sequence; 10e3 x 14nt short reads;
Index prefix: indexing(secs): lookup(secs): found: memory(MB):
5 3.37 25.57 182 200
6 3.38 6.66 182 210
7 4.06 1.69 182 230
8 6.17 0.42 182 244
10 9.21 0.05 182 450
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
|