BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
Update: Realisation dawned
The solution is:
#! perl -slw use strict; sub fact{ my $n = shift; my $f = $n; $f *= $n while --$n; return $f; } sub nCr{ my( $n, $r ) = @_; return fact( $n ) / ( fact( $r ) * fact( $n - $r ) ); } sub NwithM { my $f = 4-1; my( $L, $M ) = @_; my $T = 1; for my $m ( 1 .. $M ) { $T += nCr( $L, $m ) * $f**$m; } return $T; } print "8 with $_ mismatch(es): ", NwithM( 8, $_ ) for 1 .. 4; print "9 with $_ mismatch(es): ", NwithM( 9, $_ ) for 1 .. 4;
Produces:
C:\test\humanGenome>fuzzyDNAcalc.pl 8 with 1 mismatch(es): 25 8 with 2 mismatch(es): 277 8 with 3 mismatch(es): 1789 8 with 4 mismatch(es): 7459 9 with 1 mismatch(es): 28 9 with 2 mismatch(es): 352 9 with 3 mismatch(es): 2620 9 with 4 mismatch(es): 12826
All my searches turned up algorithms for generating these sequences, but I haven't found, or been able to derive a formula for calculating them.
Given a string of (a|c|g|t)x8; and allowing for up to 2 mismatched characters, how many possible matches are there.
There are 4^8 = 65536 possible 8-base sequences.
By experiment I know that allowing for
- M=2 mismatches, there are 277 possible matches for any given 8-base string.
- M=3 mismatches, there are 1789 possible matches for any given 8-base string.
- M-4 mismatches, there are 7459 possible matches for any given 8-base string.
I know I am probably missing the obvious, but I'm not seeing the formula?
If it helps, for L=9, total sequences is 4^9 = 262144; and the number of matches for various values of M is:
- L=9, M=2, T=352
- L=9, M=3, T=2620
- L=9, M=4, T=12826
Do you know, or are you able to derive the formula in terms of T = f( L, M )?
>Thanks.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Calculating the number of possible matches with N mismatches for DNA? [Solved!]
by hdb (Monsignor) on May 27, 2015 at 07:22 UTC | |
by BrowserUk (Patriarch) on May 27, 2015 at 07:35 UTC |