http://qs321.pair.com?node_id=912011


in reply to Virus toy model^2

a probability of substituting any given position in the string with A C T G
Look at it as not only a project related to Perl since this posts an intersection of math and statistics (error rate, transition rates, transversion rates...etc), biology (simulating a molecular process of replication) and programming (implementing the computational algorithm using a programming language)

So, in each clone, what's the error (mutation) limit?, does the mutation position hold any significance? do you need to filter your genomes afterwards to select only those still-functional ones that were not corrupted by the mutations but where the mutation may present phenotypic differences?

My initial thought was to make a mutlidimensional array, where each successive round of replication would be represented as values in the new array dimension
If I understand you correctly, this type of array can be an arbitrarily-depth depending on the recursion of replication rounds, in Perl, it is not recommended you add or remove to an array while performing operations on it so you may have to initiate an array a-priori as indicated by the number of replications to be performed then populate that.

In light of the above, hidden markov models can use the transition probabilities while performing the mutations (maybe difficult when wanting to consider all the different cases out there), your best bet is to grab a statistician or to keep one handy. HMMs have been extensively used before in sequence alignment and detecting indels hence this scenario here is related.

A simpler approach is to use randomization to simulate mutations, check this discussion and its implementation or to consult the feasibility of using an already available solution.

UPDATE:Scanning the simulation program in the links I gave you; it seems to have a loose criteria on what base can be replaced with what so I followed another approach by introducing conditions on mutation types (favoring transitions rather than fatal transversions) and giving you control over the number of replication iterations then showing you the sequences aligned in proximity for good visuals

# a program that generates a 50 bp random DNA seq #and introduces 1 mutation per DNA replication use strict; use warnings; use Data::Dumper; #generating a 50 bp seq one base at a time. #you can read sets of seqs from a file instead my ($base,$virusGenome); my @nucleotide = ('a','t','g','c'); for(1..50){ $base = $nucleotide[rand(3)]; $virusGenome .=$base; } my @seqArray; #a hash table of the permissible mutations #no freqs rates, note the base case. my %mutationBounds=('a'=>'T','c'=>'G','g'=>'C','t'=>'A'); print "how many replications?"; chomp(my $rounds=<>); mutate(times=>$rounds, seq=>$virusGenome); sub mutate{ my %params=@_; print "original seq=> $params{seq}\n"; for my $i (0..$params{times}-1){ #convert seq from string to array @seqArray = split("",$params{seq}); #pick a base from a random position my $randPos = int(rand(length($params{seq}))); my $toReplace=$seqArray[$randPos]; #mutate based on %mutationBounds # assign to a seq the altered $virusGenome. my $seqMute = $params{seq}; substr ($seqMute, $randPos,1,$mutationBounds{$toReplace}); + print "mutant seq",$i+1,"=> ", $seqMute,"\n"; #which rep cycle is done. What was replaced where. #print "Cycle " ,$i+1, " $toReplace replaced w $mutationBounds +{$toReplace} @ $randPos\n"; } }; #print Dumper(\%mutationBounds); #print Dumper(\%params); #print Dumper(\@seqArray); ##OUTPUT hisham@bioslayer:~/Desktop$ perl virusToy.pl how many replications?3 original seq=> gaataggaataatagggagattgtaagagagtgggttagtatagttaata mutant seq1=> gaataggaataatagggagattgtaagagagtgggttagtatagttaaAa mutant seq2=> gaTtaggaataatagggagattgtaagagagtgggttagtatagttaata mutant seq3=> gaataggaataataCggagattgtaagagagtgggttagtatagttaata

David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...

Replies are listed 'Best First'.
Re^2: Virus toy model
by ZWcarp (Beadle) on Jun 30, 2011 at 00:01 UTC
    I just now saw this script that you added. Your knowledge of biology tells me you work in a related field. Most of the parameters such as rate, and a mutation probability table I will be taking directly from wet lab/ primary literature findings. I have some good estimates in mind. I am aware of the many issues with modeling biological systems, but in this case I have several natural test sets. This toy model is to be used only to assess the effect of uneven and perhaps biased sequence sampling. I will go over this script you sent tonight it looks much better than the one I made haha.
      my script overlooked the mutation probability scores per base (unperlish laziness and lack of intuition at that moment) whereas yours has focused mainly on that, my approach was to provide room for the more common mutations rather than consulting with lesser common ones that may or may not be categorized as nonsense. While the user has a say at providing the number of replications to be conducted (a more prudent approach will have to consider that as the replication is repeated over and over there are more chances for errors to show, but my program and so is yours it seems were more obsessed with making mutations happen at the first level of replication and not based on the hierarchy that you indicated in your OP that each seq may generate more seq in a tree fashion)...

      I find your incorporation of the mutation table so marvelous yet I have some comments on the implementation however that I'll share after running your program. Combining these two programs though is a worthwhile learning opportunity.

      p.s. Yes, I work as a bioinformatician


      David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...