Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Virus toy model^2

by ZWcarp (Beadle)
on Jun 29, 2011 at 16:36 UTC ( [id://911993]=perlquestion: print w/replies, xml ) Need Help??

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, Thanks again for the input.

I have a bit of an interesting research project on my hands and I was looking to get some advice on the best ( and most efficient way) to accomplish modeling the clonal expansion of a virus.

So here's the set up. When a virus clonally expands it does so from an initial DNA sequence (its genome). Each virus has its own intrinsic error rate that results in mutations of that initial DNA in each round of copy. Think of a mutation as being string1=ATG to string2 ATC. Mutations can only be to another nucleotide.. thus only A C T or G. Now I want to start with one DNA string, mutate it based on a rate ( represented in perl as a probability of substituting any given position in the string with A C T G), and then perform this same function on each succession group of the "progeny of the virus".

In the simplest model lets say the initial string should create five successive mutant strings, then each of these mutant strings each create 5 of their own mutant strings giving us 5*5=25 final sequences. Picture a growing tree where the tip of each branch, then becomes a node where five new branches form.

Now I am trying to think about the most efficient way to implement this because a genome is usually a string of 1000+ characters. My initial thought was to make a mutlidimensional array, where each successive round of replication would be represented as values in the new array dimension. Does anyone know how to have perl dynamically add new dimensions to arrays, so that I only have as many final dimensions as the specified replication rounds. I though about perhaps implementing hash of hash of hash of array etc... but I am running into the same problem...if I want to do say 1000 replications it will just take to long to write that out myself.

My apologies I forgot to specify that after each replication you have to select a subgroup to then go on to the next generation. You can think of this biologically as negative selection... ie a virus infects a person and lets say the person dies( meaning the virus dies too), or say the person gets better without infecting anyone else, or that the virus mutates something critical that it needs for function(infectivity,,receptorbinding... stability...etc). Regardless, what this effectively means is although you replicate 1000 times you are are only picking a random number...lets say 3....5 form the total new generation. You're all totally correct that 5^1000 is absurdly large, i'm an idiot for not specifying haha.

here is my code so far
#!/usr/bin/perl -w use strict; use List::Util 'shuffle'; use List::Util 'shuffle'; #g=generation #G=Generation # my @g0; my %allgen; my $i; my $j; @g0= ("ATTGTACATGTAGTTCTA"); #Initial generation "seeding generation" %allgen=(0 =>\@g0); for($i=1;$i<20;$i++) # This $i determines the number of replic +ations { my @new_gen=(); my @last_gen=@{$allgen{$i-1}}; foreach my $seq_string(@last_gen) { chomp $seq_string; my @nucleotide_array = split(//,$seq_string); my $rand_virus=(int rand(0+10)); # this will randomly select a + number of viruses that will seed the new generation from the pool of + the last generation for($j=0;$j<$rand_virus;$j++) #This loop should allow o +nly $rand_virus sequnces from the new generation to be selected for"m +ultiplying" { my $new_string; foreach my $nucleotide(@nucleotide_array) { chomp $nucleotide; my($T,$G,$C,$A); #These mutations rates are just for testing no +w they dont reflect accurate viral mutation rates yet if ($nucleotide eq 'A'){ # $T=.1;$G=.2;$C=.3;$A=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=000 && $score<=.099){ +$nucleotide=~s/A/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/A/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/A/C/g;} # if( $score>=.300 && $score<=1.00) +{ Do NOTHING } } } elsif ($nucleotide eq 'T'){ # $A=.1;$G=.2;$C=.3;$T=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/T/A/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/T/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/T/C/g;} #if( $score>=.300 && $score<=1.00) +{ DO NOTHING } } } elsif ($nucleotide eq 'C'){ # $T=.1;$G=.2;$A=.3;$C=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/C/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/C/G/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/C/A/g;} #if( $score>=.300 && $score<=1.00) +{ #No mutation} } elsif ($nucleotide eq 'G'){ # $T=.1;$A=.2;$C=.3;$G=.7; my $score = (int rand(0+1000))/100 +0; if( $score>=0 && $score<=.099){ $n +ucleotide=~s/G/T/g;} if( $score>=.100 && $score<=.199){ + $nucleotide=~s/G/A/g;} if( $score>=.200 && $score<=.299){ + $nucleotide=~s/G/C/g;} #if( $score>=.300 && $score<=1.00) +{ #No mutation} } $new_string= $new_string . $nucleotide; } push(@new_gen,$new_string); } } @new_gen=shuffle(@new_gen); #print $allgen{0}[0] . "\n"; $allgen{$i}=\@new_gen; } for my $ii(0..20){ for (my $jj=0;$jj<(length($jj));$jj++){ print "$ii\t$allgen{$ii}[$jj] \n"; } }
Any thoughts on this implementation.... or anyways to make this more efficient? Ideally I'd like to run this on whole genome (much larger input DNA string), and have a replication for everyday in a ~7 month period..which roughly correlates with the seasonality of the virus I am looking at.

Replies are listed 'Best First'.
Re: Virus toy model
by biohisham (Priest) on Jun 29, 2011 at 18:41 UTC
    a probability of substituting any given position in the string with A C T G
    Look at it as not only a project related to Perl since this posts an intersection of math and statistics (error rate, transition rates, transversion rates...etc), biology (simulating a molecular process of replication) and programming (implementing the computational algorithm using a programming language)

    So, in each clone, what's the error (mutation) limit?, does the mutation position hold any significance? do you need to filter your genomes afterwards to select only those still-functional ones that were not corrupted by the mutations but where the mutation may present phenotypic differences?

    My initial thought was to make a mutlidimensional array, where each successive round of replication would be represented as values in the new array dimension
    If I understand you correctly, this type of array can be an arbitrarily-depth depending on the recursion of replication rounds, in Perl, it is not recommended you add or remove to an array while performing operations on it so you may have to initiate an array a-priori as indicated by the number of replications to be performed then populate that.

    In light of the above, hidden markov models can use the transition probabilities while performing the mutations (maybe difficult when wanting to consider all the different cases out there), your best bet is to grab a statistician or to keep one handy. HMMs have been extensively used before in sequence alignment and detecting indels hence this scenario here is related.

    A simpler approach is to use randomization to simulate mutations, check this discussion and its implementation or to consult the feasibility of using an already available solution.

    UPDATE:Scanning the simulation program in the links I gave you; it seems to have a loose criteria on what base can be replaced with what so I followed another approach by introducing conditions on mutation types (favoring transitions rather than fatal transversions) and giving you control over the number of replication iterations then showing you the sequences aligned in proximity for good visuals

    # a program that generates a 50 bp random DNA seq #and introduces 1 mutation per DNA replication use strict; use warnings; use Data::Dumper; #generating a 50 bp seq one base at a time. #you can read sets of seqs from a file instead my ($base,$virusGenome); my @nucleotide = ('a','t','g','c'); for(1..50){ $base = $nucleotide[rand(3)]; $virusGenome .=$base; } my @seqArray; #a hash table of the permissible mutations #no freqs rates, note the base case. my %mutationBounds=('a'=>'T','c'=>'G','g'=>'C','t'=>'A'); print "how many replications?"; chomp(my $rounds=<>); mutate(times=>$rounds, seq=>$virusGenome); sub mutate{ my %params=@_; print "original seq=> $params{seq}\n"; for my $i (0..$params{times}-1){ #convert seq from string to array @seqArray = split("",$params{seq}); #pick a base from a random position my $randPos = int(rand(length($params{seq}))); my $toReplace=$seqArray[$randPos]; #mutate based on %mutationBounds # assign to a seq the altered $virusGenome. my $seqMute = $params{seq}; substr ($seqMute, $randPos,1,$mutationBounds{$toReplace}); + print "mutant seq",$i+1,"=> ", $seqMute,"\n"; #which rep cycle is done. What was replaced where. #print "Cycle " ,$i+1, " $toReplace replaced w $mutationBounds +{$toReplace} @ $randPos\n"; } }; #print Dumper(\%mutationBounds); #print Dumper(\%params); #print Dumper(\@seqArray); ##OUTPUT hisham@bioslayer:~/Desktop$ perl virusToy.pl how many replications?3 original seq=> gaataggaataatagggagattgtaagagagtgggttagtatagttaata mutant seq1=> gaataggaataatagggagattgtaagagagtgggttagtatagttaaAa mutant seq2=> gaTtaggaataatagggagattgtaagagagtgggttagtatagttaata mutant seq3=> gaataggaataataCggagattgtaagagagtgggttagtatagttaata

    David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...
      I just now saw this script that you added. Your knowledge of biology tells me you work in a related field. Most of the parameters such as rate, and a mutation probability table I will be taking directly from wet lab/ primary literature findings. I have some good estimates in mind. I am aware of the many issues with modeling biological systems, but in this case I have several natural test sets. This toy model is to be used only to assess the effect of uneven and perhaps biased sequence sampling. I will go over this script you sent tonight it looks much better than the one I made haha.
        my script overlooked the mutation probability scores per base (unperlish laziness and lack of intuition at that moment) whereas yours has focused mainly on that, my approach was to provide room for the more common mutations rather than consulting with lesser common ones that may or may not be categorized as nonsense. While the user has a say at providing the number of replications to be conducted (a more prudent approach will have to consider that as the replication is repeated over and over there are more chances for errors to show, but my program and so is yours it seems were more obsessed with making mutations happen at the first level of replication and not based on the hierarchy that you indicated in your OP that each seq may generate more seq in a tree fashion)...

        I find your incorporation of the mutation table so marvelous yet I have some comments on the implementation however that I'll share after running your program. Combining these two programs though is a worthwhile learning opportunity.

        p.s. Yes, I work as a bioinformatician


        David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...
Re: Virus toy model
by BrowserUk (Patriarch) on Jun 29, 2011 at 20:16 UTC

    Even if every star in every galaxy in the universe had a duplicate earth, there wouldn't be enough disks & tape on all of them combined to store all the progeny of 1000 generations of a single 1000-codon sequence.

    Taking a 1000-codon sequence through one possible lineage for 1000 generations takes maybe a millisecond, so you could fill a disk with possible end points very quickly.

    But without some selection criteria, any single lineage of random mutations is as likely as any other, so there is nothing to be learnt from statistically sampling the final generation. All you'd be testing is the distribution of your source of randomness.

    So the question becomes, what were you hoping to learn from the outcome of this experiment? If we knew what you were hoping to investigate, we might be able to suggest plausible approaches.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Virus toy model
by jethro (Monsignor) on Jun 29, 2011 at 16:53 UTC

    You want to do 1000 replications? If I understand you correctly that would amount to 5^1000 sequences, a number too big for my calculator. 5^100 is already a number with 69 digits

    I hope you have a plan B ;-)

    Earnestly, what do you want to do with those sequences eventually ? If you want some statistics out of them, maybe you can get away with math instead of storing all those sequences

Re: Virus toy model
by MidLifeXis (Monsignor) on Jun 29, 2011 at 17:20 UTC

    Do you care about the individual nodes, or just each current genome sequence (not a geneticist, so pardon any terminology mess-ups) and the number of each in existence?

    You are dealing with large numbers of values at the mutation rates and generations in your example. As was previously mentioned, 5**1000 is quite large. Is there any dieoff rate or fitness test at each generation that could be used to prune the dataset?

    Perhaps a more manageable structure (although not at these rates and generations) would be a Hash or HoH (hash-of-hashes), where each key is the ACTG sequence, and the value is a hash with the current number of virii with that sequence, as well as some other bookkeeping information. If you don't need other bookkeeping information, perhaps a simple hash would do. Worst storage case on this (if I am thinking through this correctly) should be no worse (given a scaling factor) than an array of all of the virii (not that this value is small). If each mutation must work in groups of 3 ACGT values, perhaps you could even store your string as a bitstring of some sort (4**3 == 64).

    Death of a virus would be a -- operation, and a successful mutation would be a ++ operation.

    --MidLifeXis

Re: Virus toy model
by toolic (Bishop) on Jun 30, 2011 at 00:52 UTC
    There is something wrong with the way I am passing the arrays to the hash .. it will output ie ARRAY(0x1f512580), not sure where exactly the problem is.
    You need to deference the array. Change:
    my @last_gen=$allgen{$i-1};
    to:
    my @last_gen=@{$allgen{$i-1}};
    perldsc
      Thankyou dearly. I spent way too much time trying to figure that out! Also why is it that %hash=(0=>A) must be followed by $hash{1}=B ( for next key value pair you set I mean. ) rather than using the first format twice? I can't find this anywhere in the camel book or online.

        I guess you are asking why you can't do this:

        %hash=(0=>A); %hash=(1=>B);

        Both these operations operate on the whole hash, initializing it with the data on the right side of the equation. The equation sign "=" means literally "the left side is made equal to the right side", not "the right side is added to or combined with the left side". So after the second equation any previous values in the hash will be lost

        While when you use $hash{1}=B you only manipulate one value inside the hash, leaving the rest of the hash unchanged.

        Thankyou dearly. I spent way too much time trying to figure that out! Also why is it that %hash=(0=>A) must be followed by $hash{1}=B ( for next key value pair you set I mean. ) rather than using the first format twice? I can't find this anywhere in the camel book or online.

        It does not have to be followed. This is covered in perlintro under Perl variable types, or if you're new to programming, http://learn.perl.org/books/beginning-perl/

Re: Virus toy model
by zentara (Archbishop) on Jun 29, 2011 at 16:56 UTC
    Does anyone know how to have perl dynamically add new dimensions to arrays, so that I only have as many final dimensions as the specified replication rounds

    IIUC< Why not use AoA, array of arrays? Maybe an idea can be had in Making little array babies

Re: Virus toy model
by Anonymous Monk on Jun 30, 2011 at 01:01 UTC
    Just when you update a node please mark it accordingly so the development history of the node falls in prospect... Data Dumper can allow you to see whether or not your data structures have what you expect them to have.

    ARRAY(0x1f512580) is a pointer in memory referring to the contents held in ARRAY, so to access these contents you need to dereference the reference References quick reference and Referencing in advanced data structures and of course the documentation perlref and perlreftut

    [Tutorials] section at the monastery has many useful teachings too
Re: Virus toy model
by dcmertens (Scribe) on Jun 29, 2011 at 22:04 UTC
      Not really cuz not every biology problem is a BioPerl problem and not every BioPerl individual module can only be applied solely to biology problems. Biology in IT requires other tools from mother Perl such as PDL and Math::* modules for example.

      Read Perl and Bioinformatics

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://911993]
Approved by Corion
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-04-25 10:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found