http://qs321.pair.com?node_id=911993

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, Thanks again for the input.

I have a bit of an interesting research project on my hands and I was looking to get some advice on the best ( and most efficient way) to accomplish modeling the clonal expansion of a virus.

So here's the set up. When a virus clonally expands it does so from an initial DNA sequence (its genome). Each virus has its own intrinsic error rate that results in mutations of that initial DNA in each round of copy. Think of a mutation as being string1=ATG to string2 ATC. Mutations can only be to another nucleotide.. thus only A C T or G. Now I want to start with one DNA string, mutate it based on a rate ( represented in perl as a probability of substituting any given position in the string with A C T G), and then perform this same function on each succession group of the "progeny of the virus".

In the simplest model lets say the initial string should create five successive mutant strings, then each of these mutant strings each create 5 of their own mutant strings giving us 5*5=25 final sequences. Picture a growing tree where the tip of each branch, then becomes a node where five new branches form.

Now I am trying to think about the most efficient way to implement this because a genome is usually a string of 1000+ characters. My initial thought was to make a mutlidimensional array, where each successive round of replication would be represented as values in the new array dimension. Does anyone know how to have perl dynamically add new dimensions to arrays, so that I only have as many final dimensions as the specified replication rounds. I though about perhaps implementing hash of hash of hash of array etc... but I am running into the same problem...if I want to do say 1000 replications it will just take to long to write that out myself.

My apologies I forgot to specify that after each replication you have to select a subgroup to then go on to the next generation. You can think of this biologically as negative selection... ie a virus infects a person and lets say the person dies( meaning the virus dies too), or say the person gets better without infecting anyone else, or that the virus mutates something critical that it needs for function(infectivity,,receptorbinding... stability...etc). Regardless, what this effectively means is although you replicate 1000 times you are are only picking a random number...lets say 3....5 form the total new generation. You're all totally correct that 5^1000 is absurdly large, i'm an idiot for not specifying haha.

here is my code so far
#!/usr/bin/perl -w use strict; use List::Util 'shuffle'; use List::Util 'shuffle'; #g=generation #G=Generation # my @g0; my %allgen; my $i; my $j; @g0= ("ATTGTACATGTAGTTCTA"); #Initial generation "seeding generation" %allgen=(0 =>\@g0); for($i=1;$i<20;$i++) # This $i determines the number of replic +ations { my @new_gen=(); my @last_gen=@{$allgen{$i-1}}; foreach my $seq_string(@last_gen) { chomp $seq_string; my @nucleotide_array = split(//,$seq_string); my $rand_virus=(int rand(0+10)); # this will randomly select a + number of viruses that will seed the new generation from the pool of + the last generation for($j=0;$j<$rand_virus;$j++) #This loop should allow o +nly $rand_virus sequnces from the new generation to be selected for"m +ultiplying" { my $new_string; foreach my $nucleotide(@nucleotide_array) { chomp $nucleotide; my($T,$G,$C,$A); #These mutations rates are just for testing no +w they dont reflect accurate viral mutation rates yet if ($nucleotide eq 'A'){ # $T=.1;$G=.2;$C=.3;$A=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=000 && $score<=.099){ +$nucleotide=~s/A/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/A/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/A/C/g;} # if( $score>=.300 && $score<=1.00) +{ Do NOTHING } } } elsif ($nucleotide eq 'T'){ # $A=.1;$G=.2;$C=.3;$T=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/T/A/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/T/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/T/C/g;} #if( $score>=.300 && $score<=1.00) +{ DO NOTHING } } } elsif ($nucleotide eq 'C'){ # $T=.1;$G=.2;$A=.3;$C=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/C/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/C/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/C/A/g;} #if( $score>=.300 && $score<=1.00) +{ #No mutation} } elsif ($nucleotide eq 'G'){ # $T=.1;$A=.2;$C=.3;$G=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/G/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/G/A/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/G/C/g;} #if( $score>=.300 && $score<=1.00) +{ #No mutation} } $new_string= $new_string . $nucleotide; } push(@new_gen,$new_string); } } @new_gen=shuffle(@new_gen); #print $allgen{0}[0] . "\n"; $allgen{$i}=\@new_gen; } for my $ii(0..20){ for (my $jj=0;$jj<(length($jj));$jj++){ print "$ii\t$allgen{$ii}[$jj] \n"; } }
Any thoughts on this implementation.... or anyways to make this more efficient? Ideally I'd like to run this on whole genome (much larger input DNA string), and have a replication for everyday in a ~7 month period..which roughly correlates with the seasonality of the virus I am looking at.