Reduce RAM required

by onlyIDleft (Scribe)
on Jan 08, 2019 at 10:14 UTC ( [id://1228191] )

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, exalted monks. I've had a multi-year break from coding and seek help in modifying a script I just wrote. It processes an input file with the following format, alternating lines of IDs and sequences:

>ID1
atgagcagctag
>ID2
tggacgagctgaca
.
.
>IDn
tgactagacggacatac

The goal here is to perform the following steps:

1. concatenate all these sequences,

2. shuffle substrings of it, each of length l, as non-overlapping windows, where l is user pre-determined (hard-coded to 10^6 here)

3. concatenate the shuffled sequences,

4. split the result back into sequences with the same length distribution as the input

5. write the original IDs, but with shuffled sequences, to the output file in the same format as the input

For the example input above, ONE example output of randomly shuffled sequences would be:

>ID1
actggctagaag
>ID2
cggaccagggacta
.
.
>IDn
ctgaaaaaggagccttt

The code I've written is quite infantile, but it works OK for files ~100-150 MB in size. The problem is that for input files that are even modestly sized, ~500 MB, a run even on my university computer cluster runs out of ~40 GB of RAM.

So I seek tips on modifying the logic and/or syntax so that the script runs to completion even for an input file of ~2 GB (the largest I need to process), preferably even on a local machine (mine has 8 GB of RAM).

Thanks in advance!

use strict;
use warnings;
use List::Util 'shuffle'; # Idea from http://www.perlmonks.org/?node_id=199901

##########################################################################
my $start_time  = time;
my $input       = shift @ARGV;
my $destination = shift @ARGV;

open(IN, '<', $input) or die "Can't read multifasta input genomic DNA file $input : $!\n";

my $window = 1000000; # hard coded for shuffle window to be 1MB i.e. 10^6
print "The length of the window shuffled is 1MB (i.e. 10^6 bp), it is hard coded, you may change it in line 27 of this script. Thank you! \n";

my (@IDs, @lengths, @IDsLens, @output);
my $concat_seq   = "";
my $total_length = 0;

##########################################################################
while (<IN>) {
    chomp;
    if ($_ =~ m/\>/) {
        my $ID1 = my $ID2 = $_;
        push @IDs, $ID1;
        push @IDsLens, $ID2;
    }
    elsif ($_ !~ m/\>/) {
        my $len1 = my $len2 = length($_);
        push @lengths, $len1;
        $total_length = $total_length + $len1;
        push @IDsLens, $len2;
        # concatenating the entire genome into one long sequence
        # before breaking it for shuffling
        $concat_seq = $concat_seq . $_;
    }
}

my %IDLen_hash = @IDsLens;

##########################################################################
my $cat_shuf_full_seq = "";

# perform this operation below every 1MB, see line 21 above
for (my $i = 1; $i <= $total_length; $i = $i + $window) {
    my $s   = substr($concat_seq, $i - 1, $window);
    my $rev = reverse $s;
    my @temp_seq_array;
    # throw in some reverse sequence alternately to shuffle it more randomly
    # (NB: $i starts at 1 and steps by 10^6, so it is always odd and the
    # reverse branch never actually runs)
    if    ($i % 2 == 0) { @temp_seq_array = split //, $rev; }
    elsif ($i % 2 != 0) { @temp_seq_array = split //, $s; }
    my @rand_seq_array = shuffle @temp_seq_array; # using the List::Util module
    my $rand_shuffled_substr = join('', @rand_seq_array);
    # concatenates the shuffled DNA seq to the 3' end of the previous 1MB fragment
    $cat_shuf_full_seq = $cat_shuf_full_seq . $rand_shuffled_substr;
}

my @shuffle_concat_array = split("", $cat_shuf_full_seq);

##########################################################################
foreach my $ID (@IDs) {
    my $contig_len = $IDLen_hash{$ID};
    # this progressively shortens @shuffle_concat_array, so the final splice
    # takes the length of the final contig in the original sequence, which
    # should equal whatever remains in the array
    my @shuffle_concat_splice_seq = splice @shuffle_concat_array, 0, $contig_len;
    my $cat_shuf_splice_final_subseq = join('', @shuffle_concat_splice_seq);
    print $ID, "\n";
    push @output, $ID, "\n", $cat_shuf_splice_final_subseq, "\n";
}

##########################################################################
#my @input_name_split = split(".fa.shIDscleaned-up", $input);
#my $destination = $input_name_split[0] . "_cat_shuf1MB_frag.fasta";
open(OUT, '>', $destination) or die "Can't write to file $destination: $!\n";
print OUT @output;
close OUT;

my $end_time = time;
my $duration = ($end_time - $start_time) / 60;
my $rounded  = sprintf("%.1f", $duration);
print "Finished processing input, written output to ", $destination, " in ", $rounded, " minutes \n";
##########################################################################

Replies are listed 'Best First'.
Re: Reduce RAM required
by Corion (Patriarch) on Jan 08, 2019 at 10:44 UTC

    Some real low-hanging fruit that should roughly halve your memory requirement:

    You accumulate all output in @output and only write it out at the end. Eliminate the variable @output and, instead of pushing onto it, write to the output file directly.

    You keep the arrays @IDs, @lengths, @IDsLens and the values in them around for the whole runtime of your program. Instead, empty out those arrays as soon as you don't need them anymore. This is usually easiest done by restricting the scope of such variables to the minimal necessary scope. For example, the @IDsLens array is not needed after the assignment to the hash.

    For further improvement, instead of building up @IDsLens and then assigning it to the hash, eliminate the variable @IDsLens completely and do the hash assignment in place:

    #push @IDsLens, $ID2;
    #push @IDsLens, $len2;
    $IDLen_hash{ $ID2 } = length( $ID2 );
    ...
    #my %IDLen_hash = @IDsLens;

    You should test whether storing all the lengths in a hash is worth the improvement, compared to directly calculating the length of the string each time, because you only use that hash once to retrieve the length again.
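    As an untested sketch of the direct-write idea, the final loop of the original script might look like this, with @output gone (the sample data and output filename here are made up so the fragment runs standalone):

    use strict;
    use warnings;

    # Hypothetical stand-ins for the data the real script builds up:
    my $destination = 'shuffled.fasta';
    my @IDs         = ('>ID1', '>ID2');
    my %IDLen_hash  = ('>ID1' => 12, '>ID2' => 14);
    my @shuffle_concat_array = split //, 'actggctagaagcggaccagggacta';

    # Print each record as soon as it is assembled, instead of pushing
    # onto @output and printing everything at the end.
    open my $out, '>', $destination or die "Can't write to file $destination: $!\n";
    for my $ID (@IDs) {
        my $seq = join '', splice @shuffle_concat_array, 0, $IDLen_hash{$ID};
        print $out $ID, "\n", $seq, "\n";
    }
    close $out;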

      Thank you! I tried several of these changes (like undef-ing several array and scalar variables), but my coding is quite rusty: the modified script still exits with error code 9 on both my local Mac and the UNIX cluster, apparently running out of RAM.

      I have to brush up on Perl coding quite a bit it seems :)

Re: Reduce RAM required
by hippo (Bishop) on Jan 08, 2019 at 10:36 UTC

    There is very little scoping used here. The @IDsLens array is very large and is apparently only used to construct the hash. Why? I can't see that this is necessary but even if you think that it is, the scope should be limited to this construction phase.

    I would be inclined to refactor the code into subs. It would make the purpose clearer, help to enforce some scoping and allow for better profiling.

    While it probably won't affect the RAM footprint, your code is verbose to the extent that it becomes harder to read, not easier. In the construction loop, for example you have this:

    if ($_ =~ m/\>/) {
        # ...
    }
    elsif ($_ !~ m/\>/) {
        # ...
    }

    which could equally be written as

    if (/>/) {
        # ...
    }
    else {
        # ...
    }

    While the way you have written it will work, it takes longer and more effort for the programmer to read and parse (and wonder why there is an elsif in there).

    I hope these tips are of some use to you.

      Thank you for those suggestions.

Re: Reduce RAM required
by Eily (Monsignor) on Jan 08, 2019 at 11:03 UTC

    Aren't the four letters in your sequences basically equiprobable ('a', 't', 'g' and 'c' each appearing roughly as often as the others)? If so, do you need the output file to have exactly the same number of occurrences of each letter as the input? I'm not a biologist, but the arbitrary - yet quite large - window size and shuffling make me doubt the data is supposed to be meaningful in any way. Besides:

    # throw in some reverse sequence alternatively to shuffle it more randomly
    is useless at best. Trying to increase the randomness of some data without external input is either going to have no effect on the probabilities, or most likely make the output less random.

    If you don't care about matching the occurrence counts of each letter exactly, your program just becomes "replace each sequence by a random sequence of the same length", which can be coded in a few lines. If you do care, hdb's answer might be the way to go.
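    For what it's worth, here is a short, untested sketch of that few-line version (the filenames are made up): ID lines pass through unchanged, and every sequence line is replaced by a uniformly random acgt string of the same length, so only one line is ever held in memory.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @bases = qw(a c g t);

    open my $in,  '<', 'genome.fasta'   or die "open: $!";   # hypothetical input
    open my $out, '>', 'shuffled.fasta' or die "open: $!";   # hypothetical output

    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>/) {
            print $out $line, "\n";                          # keep IDs as-is
        }
        else {
            # emit one uniformly random base per input base
            print $out map($bases[int rand 4], 1 .. length $line), "\n";
        }
    }
    close $in;
    close $out;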

      1. ATGC frequencies are never equal except by fluke

      2. DNA sequences have periodicities or features at different size ranges

      3. Sequence read in forward and reverse orientations almost always yield different biological meanings

      But how alternate reversing, changing the shuffle window size, etc. would change the signal-to-noise ratio would be a multi-week study by itself

      So your advice on keeping it simple and easy in terms of the coding is well taken :) Thank you!

Re: Reduce RAM required
by bliako (Monsignor) on Jan 08, 2019 at 11:26 UTC

    I have exactly the same suggestion as hdb. To put it more formally: build the statistical distribution of your data and then sample it. If you need a more complex model, one that depends not only on the frequency of single letters but also on the frequency of pairs, triplets and so on, then build a multi-dimensional distribution or use a Markov chain. The latter is the tool used for random text production, which more or less does what you want: create new words from a corpus respecting their neighbouring properties as well as the frequency of individual letters. So it does one step more than what you do, as it also counts the properties of neighbours. For example, with your method "TTTTTTT" is as probable as "ATCGACACGT" in the shuffled sub-sequences, but in the real world the first one is a freak sequence and the second one just a normal-looking one...
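    To make the Markov chain idea concrete, here is a rough, untested sketch (the training string is made-up sample data): count how often each base follows each base, then generate a new sequence by repeatedly sampling the successor distribution of the last emitted base.

    use strict;
    use warnings;

    my $seq = 'atgagcagctagtggacgagctgaca';   # stand-in for the real genome
    my %next;                                 # $next{$a}{$b} = count of pair "ab"
    $next{ substr $seq, $_, 1 }{ substr $seq, $_ + 1, 1 }++ for 0 .. length($seq) - 2;

    # Pick the next base in proportion to how often it followed $base in the
    # input (assumes every base occurs somewhere with a successor).
    sub sample_next {
        my ($base) = @_;
        my $counts = $next{$base};
        my $total  = 0;
        $total    += $_ for values %$counts;
        my $n      = rand $total;
        for my $b (sort keys %$counts) {
            return $b if ($n -= $counts->{$b}) < 0;
        }
    }

    my $generated = substr $seq, 0, 1;        # seed with the first base
    $generated .= sample_next(substr $generated, -1) for 2 .. length $seq;
    print $generated, "\n";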

    Comments on your code: I think reversing the array will not add anything to the shuffle (assuming shuffle() is good, which it is). Secondly, you can eliminate a lot of split()/join() (after the shuffle) and just use substr() to break the whole genome into sub-sequences. Thirdly, refactoring your code to use subs, as hippo suggested, will be good because then you can assess the efficiency of each step (RAM/CPU-wise as well as statistically) and you can try different methods by writing a new sub and plugging it in.

    Update: Eily also makes the same comment about statistics.

    bw, bliako

      Thanks for your response. I don't think the bioinformatics tool HMMER3 (making and using HMMs) can generate a whole shuffled genome sequence based on reading in its intact version.

      But if you know of any software package that can do this, that would be good to know

      About frequencies of 1 letter, 2 letters, etc.: REALLY good question - but I think I am happy with simple shuffling, without consideration for frequencies of 2 letters or higher order, because there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences. But I might run that experiment in the future, for curiosity's sake :)

        I have done an investigation of my own (on file ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh38.p12_chr20.fa.gz).

        Unless my method for calculating the frequencies is wrong or my script has a bug (quite likely, actually), empirically at least, bases like to be next to some bases more than others, and the longer the grouping the fussier they become - almost a tenfold difference! Also note that this is for just one chromosome of H. sapiens. Maybe overall the differences even out, which I doubt.

        Results when examining pairs of letters:

        CG => 0.0121
        TT => 0.0902

        Results when examining triplets of letters:

        TAC => 0.0098
        TGA => 0.0195

        I will post the script I used shortly on the Meditations section as it may be of more general use.

        bw, bliako

        @bliako - thank you for your investigation.

        I think I misspoke when I said "there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences"

        What I should have said instead is that shuffles which completely destroy the periodicities and abundance biases of di-, tri-, or multi-mer sequences are more likely to be truly random, compared with shuffles that retain the frequencies of di-mers, tri-mers, etc. This is obvious from your own empirical observation from one human chromosome.

        Anyway, I have an old script that uses shuffle from the List::Util Perl module, and I remember it never ran out of RAM, so I think I will use that. Though I would have preferred something that concatenates the entire genome and then shuffles and fragments it as per the length distribution of chromosomes in the input - but the various Perl scripts that do this run out of RAM even on my cluster. I need to take a weekend break to figure out the RAM usage and fix it.

Re: Reduce RAM required
by hdb (Monsignor) on Jan 08, 2019 at 10:35 UTC

    It looks to me as if - within each $window - you only need to count the occurrences of each of the 4 letters and then write a random sequence of those letters with the correct frequency of each. This only needs a bit of bookkeeping when reading and writing the files, but should consume far less memory: on the order of 4 counts times the number of windows.

    But then I may not really have understood your shuffling requirements...
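    For illustration, a tiny, untested sketch of this counting idea for a single window (sample data; assumes lowercase acgt input): keep just four counters, then rebuild the window by drawing letters without replacement.

    use strict;
    use warnings;

    my $window_seq = 'atgagcagctagtggacgagctgaca';   # stand-in for one 1MB window
    my %count = (
        a => ($window_seq =~ tr/a//),
        c => ($window_seq =~ tr/c//),
        g => ($window_seq =~ tr/g//),
        t => ($window_seq =~ tr/t//),
    );
    my $remaining = length $window_seq;

    # Draw letters without replacement: each draw picks a letter with
    # probability proportional to how many of it are left.
    my $shuffled = '';
    while ($remaining) {
        my $n = rand $remaining;
        for my $b (qw(a c g t)) {
            if (($n -= $count{$b}) < 0) {
                $count{$b}--;
                $remaining--;
                $shuffled .= $b;
                last;
            }
        }
    }
    print $shuffled, "\n";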

      I agree with your approach of counting letters and using their frequencies

      About the frequency of A/T/C/G, I agree again - but I would much prefer to calculate it across ALL sequences, rather than just one sequence at a time (corresponding to one $ID), so that this frequency-compliant sequence randomization is based on global rather than local frequencies. I suspect it will result in an even more reduced signal-to-noise ratio, though I have not tested that yet... Thank you!

Re: Reduce RAM required
by tybalt89 (Monsignor) on Jan 09, 2019 at 00:58 UTC

    Something like this?

    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1228191
    use strict;
    use warnings;

    my $window = 1e6;
    my $A = my $C = my $G = my $T = my $all = 0;
    my (@sizes, $tmp, $start);

    sub letter {
        my $n = rand $all--;
        $n < $A           ? ($A--, return 'a') :
        $n < $A + $C      ? ($C--, return 'c') :
        $n < $A + $C + $G ? ($G--, return 'g') :
                            ($T--, return 't');
    }

    sub output {
        for my $count ( @sizes ) {
            print ">ID", $start++, "\n";
            print map(letter(), 1 .. $count), "\n";
        }
        @sizes = ();
    }

    while ( <DATA> ) {
        $start //= s/\D+//gr;
        $_ = <DATA>;
        $A += tr/a//;
        $C += tr/c//;
        $G += tr/g//;
        $T += tr/t//;
        $all += $tmp = tr/acgt//;
        push @sizes, $tmp;
        $all >= $window and output();
    }
    $all and output();

    __DATA__
    >ID1
    atgagcagctag
    >ID2
    tggacgagctgaca
    >ID3
    tgactagacggacatac

      Thank you, tybalt89. This should work. As I mentioned in my original post, my scripting is very rusty after nearly a 4-year gap, and I understand only bits and pieces of your code.

      One guess is that your script requires IDs to end in numbers, which you rely on to process and split the input data? Yes, no, maybe? I'd prefer the distinction between ID line and sequence line to be based on whether the line starts with a ">" symbol or not, for ID and sequence respectively, please. How should I modify the script for that?

      It appears to me that the script is counting A/T/G/C for each sequence individually, correct? If not, please skip this. But if yes, then how much more RAM-hungry would a modification be where the A/T/G/C frequencies across ALL sequences are first calculated BEFORE generating sequences that match those frequencies?

      Finally, how do I accept input through a filehandle and write output to a new filehandle? I tried several mods to your script, but only a few worked out. Hence this request for your additional assistance. Thanks a ton!

        Changed to use '>'.

        It counts A/T/G/C for each "window", not each sequence. That seemed to be what your program was doing. Its random generation over a "window" is equivalent to the shuffle over a "window" you were doing.

        Uses file handles.

        #!/usr/bin/perl
        # https://perlmonks.org/?node_id=1228191
        use strict;
        use warnings;

        my $window = 1e6;
        my $A = my $C = my $G = my $all = 0;
        my (@sizes, $tmp, $start);

        my $inputfile  = 'd.1228191';
        my $outputfile = 'd.out.1228191';

        open my $in,  '<', $inputfile  or die "$! opening $inputfile";
        open my $out, '>', $outputfile or die "$! opening $outputfile";

        sub letter {
            my $n = int rand $all--;
            $n < $A           ? ($A--, return 'a') :
            $n < $A + $C      ? ($C--, return 'c') :
            $n < $A + $C + $G ? ($G--, return 'g') :
                                return 't';
        }

        sub output {
            for my $count ( @sizes ) {
                print $out ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
            }
            @sizes = ();
        }

        while ( <$in> ) {
            if ( /^>/ ) {
                $start //= s/\D+//gr;
            }
            elsif ( /^[acgt]/ ) {
                $A += tr/a//;
                $C += tr/c//;
                $G += tr/g//;
                $all += $tmp = tr/acgt//;
                push @sizes, $tmp;
                $all >= $window and output();
            }
        }
        $all and output();

        close $in;
        close $out;
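        On the earlier "count across ALL sequences first" question: one possible route (an untested sketch, not tybalt89's code) is a two-pass version that counts a/c/g/t over the whole file and then generates every record from the global proportions. Note it samples with replacement, so exact per-letter totals are no longer preserved; the filenames are the same placeholder names as above. RAM use stays at one line regardless of file size; the cost is reading the input twice.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $inputfile  = 'd.1228191';     # placeholder names, as above
        my $outputfile = 'd.out.1228191';

        # Pass 1: count every base in the whole file (assumes lowercase acgt).
        my %n = (a => 0, c => 0, g => 0, t => 0);
        open my $in, '<', $inputfile or die "$! opening $inputfile";
        while (<$in>) {
            next if /^>/;
            $n{a} += tr/a//; $n{c} += tr/c//; $n{g} += tr/g//; $n{t} += tr/t//;
        }
        close $in;
        my $total = $n{a} + $n{c} + $n{g} + $n{t};

        # Sample one base from the global distribution (with replacement).
        sub letter {
            my $x = rand $total;
            for my $b (qw(a c g)) { return $b if ($x -= $n{$b}) < 0; }
            return 't';
        }

        # Pass 2: copy IDs through, replace each sequence line.
        open $in, '<', $inputfile or die "$! opening $inputfile";
        open my $out, '>', $outputfile or die "$! opening $outputfile";
        while (<$in>) {
            if (/^>/) { print $out $_; }
            else      { chomp; print $out map(letter(), 1 .. length), "\n"; }
        }
        close $in;
        close $out;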
Re: Reduce RAM required
by swl (Parson) on Jan 08, 2019 at 22:53 UTC

      Thank you! I need to learn how to trace and prevent memory leaks, so those links will be useful in the future.

Re: Reduce RAM required
by jimpudar (Pilgrim) on Jan 08, 2019 at 22:22 UTC

    Maybe I'm not understanding your requirement, but it seems to me you could just read in one window's worth of sequences, shuffle them, and write them to the file. Then you could empty the window buffer and repeat until you reach EOF. This way, your memory will never grow larger than the window size.

    The algorithm is a little trickier that way because you need to deal with sequences that overflow the window, but I don't think it should be too hard.

    EDIT: After reading hdb's comment, I would also suggest you go that route. That seems even better than what I was describing.

    πάντων χρημάτων μέτρον έστιν άνθρωπος. ("Man is the measure of all things.")

      I want to obliterate signal by introducing ample noise via random shuffling. But think about it: if I bring distant regions of the sequences together and then shuffle them (or shuffle them and then bring them together), then it is global rather than local shuffling, which is what I prefer. Do you agree with my approach?
