onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:
Greetings, exalted monks. I've had a multi-year break from coding and seek help in modifying a script I just wrote. It processes an input file in the following format: alternating lines of IDs and sequences
>ID1
atgagcagctag
>ID2
tggacgagctgaca
.
.
>IDn
tgactagacggacatac
The goal here is to perform the following steps:
1. concatenate all these sequences,
2. shuffle the result in non-overlapping windows, each of length l, where l is pre-determined by the user (hard-coded to 10^6 here),
3. concatenate the shuffled windows,
4. split the result into the same length distribution of sequences as in the input,
5. write the original IDs, but the shuffled sequences, to the output file, in the same format as the input.
For the example input above, ONE example output of randomly shuffled sequences would be:
>ID1
actggctagaag
>ID2
cggaccagggacta
.
.
>IDn
ctgaaaaaggagccttt
The code I've written is quite infantile but works OK for files ~100-150 MB in size. However, for input files that are even modestly sized (~500 MB), even a run on my university computer cluster runs out of ~40 GB of RAM.
So I seek tips to modify the logic and/or syntax so the script executes to completion even for an input file of ~2 GB (the largest one I need to process), preferably even on a local machine (mine has 8 GB of RAM).
Thanks in advance!
use strict;
use warnings;
use List::Util 'shuffle'; # Idea from http://www.perlmonks.org/?node_id=199901
##########################################################################
my $start_time = time;
my $input = shift @ARGV;
my $destination = shift @ARGV;
open(IN, '<', $input) or die "Can't read multifasta input genomic DNA file $input : $!\n";
my $window = 1000000; # hard coded for shuffle window to be 1MB i.e. 10^6
print "The length of the window shuffled is 1MB (i.e. 10^6 bp), it is hard coded, you may change it in line 27 of this script. Thank you! \n";
my (@IDs, @lengths, @IDsLens, @output);
my $concat_seq = "";
my $total_length = 0;
##########################################################################
while (<IN>) {
    chomp;
    if ($_ =~ m/\>/) {
        my $ID1 = my $ID2 = $_;
        push @IDs, $ID1;
        push @IDsLens, $ID2;
    }
    elsif ($_ !~ m/\>/) {
        my $len1 = my $len2 = length($_);
        push @lengths, $len1;
        $total_length = $total_length + $len1;
        push @IDsLens, $len2;
        $concat_seq = $concat_seq.$_; # concatenating the entire genome into one long sequence before breaking it for shuffling
    }
}
my %IDLen_hash = @IDsLens;
##########################################################################
my $i = 0;
my $cat_shuf_full_seq = "";
for (my $i = 1; $i <= $total_length; $i = $i + $window) { # perform this operation below every 1MB, see line 21 above
    my $s = substr($concat_seq, $i - 1, $window);
    my $rev = reverse $s;
    my @temp_seq_array;
    if ($i % 2 == 0) { @temp_seq_array = split //, $rev; } # throw in some reversed sequence alternately to shuffle it more randomly
    elsif ($i % 2 != 0) { @temp_seq_array = split //, $s; }
    my @rand_seq_array = shuffle @temp_seq_array; # using the List::Util module
    my $rand_shuffled_substr = join('', @rand_seq_array);
    $cat_shuf_full_seq = $cat_shuf_full_seq.$rand_shuffled_substr; # concatenates the shuffled DNA seq to the 3' end of the previous 1MB fragment
}
my @shuffle_concat_array = split("", $cat_shuf_full_seq);
##########################################################################
foreach my $ID (@IDs) {
    my $contig_len = $IDLen_hash{$ID};
    my @shuffle_concat_splice_seq = splice @shuffle_concat_array, 0, $contig_len;
    # this will progressively reduce the length of @shuffle_concat_array
    # until the final splice operation will be for the length of the final contig in the original sequence
    # which should also be the same as the length of the remaining array assigned to this array variable
    my $cat_shuf_splice_final_subseq = join('', @shuffle_concat_splice_seq);
    print $ID, "\n";
    push @output, $ID, "\n", $cat_shuf_splice_final_subseq, "\n";
}
##########################################################################
#my @input_name_split = split(".fa.shIDscleaned-up", $input);
#my $destination = $input_name_split[0]."_cat_shuf1MB_frag.fasta";
open(OUT, '>', $destination) or die "Can't write to file $destination: $!\n";
print OUT @output;
close OUT;
my $end_time = time;
my $duration = ($end_time - $start_time)/60;
my $rounded = sprintf("%.1f", $duration);
print "Finished processing input, written output to ", $destination, " in ", $rounded, " minutes \n";
##########################################################################
Re: Reduce RAM required
by Corion (Patriarch) on Jan 08, 2019 at 10:44 UTC
Some real low-hanging fruit that should roughly halve your memory requirement:
You accumulate all output in @output and only write it out at the end. Instead, eliminate the variable @output and instead of pushing there, write to the output file directly.
You keep the arrays @IDs, @lengths and @IDsLens, and the values in them, around for the whole runtime of your program. Instead, empty out those arrays as soon as you don't need them anymore. This is usually easiest done by restricting the scope of such variables to the minimal necessary scope. For example, the @IDsLens array is not needed after the assignment to the hash.
For further improvement, instead of building up @IDsLens and then assigning it to the hash, eliminate the variable @IDsLens completely and do the hash assignment in place:
#push @IDsLens, $ID2;   # in the ID branch, just remember the current ID:
$current_ID = $_;
#push @IDsLens, $len2;  # in the sequence branch, record the length directly:
$IDLen_hash{ $current_ID } = length( $_ );
...
#my %IDLen_hash = @IDsLens;
You should test whether storing all the lengths in a hash is worth the improvement compared to directly calculating the length of the string each time, because you only use that hash once to retrieve the length again.
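As a sketch of that in-place hash build (my own illustration, not Corion's code; a string filehandle stands in for the input file, and the strictly alternating ID/sequence layout from the original post is assumed):

```perl
use strict;
use warnings;

# Build %IDLen_hash directly while reading, with no intermediate @IDsLens
# array. The in-memory $fasta string and the variable names are illustrative.
my $fasta = ">ID1\natgagcagctag\n>ID2\ntggacgagctgaca\n";
open my $in, '<', \$fasta or die $!;

my (%IDLen_hash, $current_id);
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) {
        $current_id = $line;                      # remember the most recent ID
    } else {
        $IDLen_hash{$current_id} += length $line; # record its sequence length
    }
}
close $in;

print "$_ => $IDLen_hash{$_}\n" for sort keys %IDLen_hash;
```

Applying the same idea on the output side, i.e. replacing `push @output, ...` with an immediate `print OUT ...`, removes the second large buffer.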
Thank you! I tried several of these changes (like undef-ing several array and scalar variables), but my coding is quite rusty: the modified script still exits with error code 9 on both my local Mac and the UNIX cluster, running out of RAM, I think.
I have to brush up on Perl coding quite a bit it seems :)
Re: Reduce RAM required
by hippo (Bishop) on Jan 08, 2019 at 10:36 UTC
There is very little scoping used here. The @IDsLens array is very large and is apparently only used to construct the hash. Why? I can't see that this is necessary but even if you think that it is, the scope should be limited to this construction phase.
I would be inclined to refactor the code into subs. It would make the purpose clearer, help to enforce some scoping and allow for better profiling.
While it probably won't affect the RAM footprint, your code is verbose to the extent that it becomes harder to read, not easier. In the construction loop, for example you have this:
if ($_ =~ m/\>/) {
# ...
}
elsif ($_ !~ m/\>/) {
# ...
}
which could equally be written as
if (/>/) {
# ...
}
else {
# ...
}
While the way you have written it will work, it takes longer and more effort for the programmer to read and parse (and wonder why there is an elsif in there).
I hope these tips are of some use to you.
Re: Reduce RAM required
by Eily (Monsignor) on Jan 08, 2019 at 11:03 UTC
Aren't the four letters in your sequences basically equiprobable (do 'a', 't', 'g' and 'c' each appear roughly as often as the others)? If so, do you need the output file to have exactly the same number of occurrences of each letter as the input? I'm not a biologist, but the arbitrary - yet quite large - window size and shuffling makes me doubt the data is supposed to be meaningful in any way. Besides:
# throw in some reverse sequence alternatively to shuffle it more randomly
is useless at best. Trying to increase the randomness of some data without external input is either going to have no effect on the probabilities, or most likely make the output less random.
If you don't care about matching the exact occurrences of each letter, your program just becomes "replace each sequence by a random sequence of the same length", which can be coded in a few lines. If you do care, hdb's answer might be the way to go.
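The "random sequence of the same length" variant might be sketched like this (my own illustration; the in-memory input stands in for real files). It streams line by line, so memory use stays constant regardless of file size:

```perl
use strict;
use warnings;

# Replace every sequence line with uniformly random a/c/g/t of the same
# length; ID lines pass through unchanged.
my @bases = qw(a c g t);

sub randomize_fasta {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>/) {
            print $out "$line\n";   # keep the ID line as-is
        }
        else {
            print $out join('', map { $bases[ int rand 4 ] } 1 .. length $line), "\n";
        }
    }
}

# Demonstration with in-memory filehandles; real use would open actual files.
my $fasta = ">ID1\natgagcagctag\n";
open my $in,  '<', \$fasta  or die $!;
open my $out, '>', \my $res or die $!;
randomize_fasta($in, $out);
print $res;
```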
1. ATGC frequencies are never equal except by fluke
2. DNA sequences have periodicities or features at different size ranges
3. Sequence read in forward and reverse orientations almost always yield different biological meanings
But how alternate reversing, changing the shuffle window size, etc. would change the signal / noise ratio would be a multi-week study by itself.
So your advice on keeping it simple and easy in terms of the coding is well taken :) Thank you!
Re: Reduce RAM required
by bliako (Monsignor) on Jan 08, 2019 at 11:26 UTC
I have exactly the same suggestion as hdb. To put it more formally, build the statistical distribution of your data and then sample it. If you need a more complex model which will also depend not only on frequency of single letters but also on frequency of pairs and triplets and so on, then build a multi-dimensional distribution or use a markov-chain. The latter is the tool being used for random text production which more or less does what you want: create new words from a corpus respecting their neighbouring properties as well as frequency of individual letters. So, it does one step more than what you do as it counts also the properties of neighbours. For example, with your method, "TTTTTTT" is as probable as "ATCGACACGT" in the shuffled sub-sequences, but in the real world the first one is a freak sequence and the second one just a normal looking one...
Comment on your code: I think reversing the array will not add anything to the shuffle (if shuffle() is good, which it is). Secondly, you can eliminate a lot of split()/join() (after the shuffle) and just use substr() to break the whole genome into sub-sequences. Thirdly, refactoring your code to use subs, as hippo suggested, will be good because then you can assess the efficiency of each step (RAM/CPU-wise as well as statistically) and you can try different methods by writing a new sub and plugging it in.
Update: Eily also makes the same comment about statistics.
bw, bliako
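A rough illustration of the Markov-chain idea described above (my own sketch, not bliako's code): a first-order model counts which base follows which in the input, then samples new bases conditioned on the previous one.

```perl
use strict;
use warnings;

# Train a first-order Markov model on a short sequence, then generate a new
# sequence of 20 bases. The training string and names are illustrative.
my $train = 'atgagcagctagtggacgagctgaca';

my %next;   # $next{$prev}{$cur} = transition count
for my $i (0 .. length($train) - 2) {
    $next{ substr $train, $i, 1 }{ substr $train, $i + 1, 1 }++;
}

# Weighted draw of a successor for $prev according to the counted transitions.
sub sample_next {
    my ($prev) = @_;
    my $counts = $next{$prev};
    my $total  = 0;
    $total    += $_ for values %$counts;
    my $r      = rand $total;
    for my $base (sort keys %$counts) {
        return $base if ($r -= $counts->{$base}) < 0;
    }
    return (sort keys %$counts)[-1];   # guard against float round-off
}

my $seq = substr $train, 0, 1;         # seed with the first training base
$seq .= sample_next(substr $seq, -1, 1) for 2 .. 20;
print "$seq\n";
```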
Thanks for your response. I don't think the bioinformatics tool HMMER3 (making and using HMMs) can generate a whole shuffled genome sequence based on reading in its intact version.
But if you know of any software package that can do this, that would be good to know
About the frequencies of single letters, pairs of letters, etc.: REALLY good question - but I think I am happy with simple shuffling, without consideration for the frequencies of pairs or higher orders, because there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences. But I might run that experiment in the future, for curiosity's sake :)
I have done an investigation of my own (on file ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh38.p12_chr20.fa.gz).
Unless my method for calculating the frequencies is wrong or my script has a bug (quite likely, actually), empirically at least, bases like to be next to some bases more than others, and the more they gather the fussier they become. Almost a tenfold difference! Also note that this is for just 1 chromosome of HS. Maybe overall the differences even out, which I don't believe.
Results when examining pairs of letters:
CG => 0.0121
TT => 0.0902
Results when examining triplets of letters:
TAC => 0.0098
TGA => 0.0195
I will post the script I used shortly on the Meditations section as it may be of more general use.
bw, bliako
@bliako - thank you for your investigation.
I think I misspoke when I said "there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences".
What I should have said instead is: "shuffles that completely destroy the periodicities and abundance biases of di-, tri-, or multi-mer sequences are more likely to be truly random than shuffles that retain di-mer, tri-mer, etc. frequencies". This is obvious from your own empirical observation from one human chromosome.
Anyway, I have an old script that uses shuffle from the List::Util Perl module, and I remember it never ran out of RAM, so I think I will use that. Though I would have preferred something that concatenates the entirety of the genome and then shuffles and fragments it as per the length distribution of chromosomes in the input - the various Perl scripts that do this run out of RAM even on my cluster. I need to take a weekend break to figure out the RAM usage and fix it.
Re: Reduce RAM required
by hdb (Monsignor) on Jan 08, 2019 at 10:35 UTC
It looks to me as if - within each $window - you only need to count the occurrences of each of the 4 letters and then write a random sequence of those letters with the correct frequency of each. This only needs a bit of bookkeeping when reading and writing the files but should consume far less memory, on the order of 4 times the number of windows.
But then I may not really have understood your shuffling requirements...
I agree with your approach of counting letters and using their frequencies
About the frequencies of A/T/C/G, agreed again - but I would much prefer to calculate them across ALL sequences, rather than just one sequence at a time (corresponding to one $ID), so that this frequency-compliant sequence randomization is based on global rather than local frequencies. I suspect it will result in an even more reduced signal / noise ratio, though I have not tested that yet... Thank you!
Re: Reduce RAM required
by tybalt89 (Monsignor) on Jan 09, 2019 at 00:58 UTC
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1228191
use strict;
use warnings;
my $window = 1e6;
my $A = my $C = my $G = my $T = my $all = 0;
my (@sizes, $tmp, $start);
sub letter
{
my $n = rand $all--;
$n < $A ? ($A--, return 'a') :
$n < $A + $C ? ($C--, return 'c') :
$n < $A + $C + $G ? ($G--, return 'g') :
($T--, return 't');
}
sub output
{
for my $count ( @sizes )
{
print ">ID", $start++, "\n";
print map(letter(), 1 .. $count), "\n";
}
@sizes = ();
}
while( <DATA> )
{
$start //= s/\D+//gr;
$_ = <DATA>;
$A += tr/a//;
$C += tr/c//;
$G += tr/g//;
$T += tr/t//;
$all += $tmp = tr/acgt//;
push @sizes, $tmp;
$all >= $window and output();
}
$all and output();
__DATA__
>ID1
atgagcagctag
>ID2
tggacgagctgaca
>ID3
tgactagacggacatac
Thank you @tybalt89. This should work. As I mentioned in my original post, my scripting is very rusty after a nearly 4-year gap, and I understand only bits and pieces of your code.
One guess is that your script requires IDs to end in numbers, which you rely on to process and split the input data? Yes, no, maybe? I'd prefer the distinction between ID line and sequence line to be based on whether the line starts with a ">" symbol or not, for ID and sequence respectively. How should I modify the script for that?
It appears to me that the script is counting A/T/G/C for each sequence individually, correct? If not, please skip this. But if yes, then how much more RAM-hungry would a modification be where the A/T/G/C frequencies across ALL sequences are first calculated BEFORE generating sequences that match those frequencies?
Finally, how do I accept input through a filehandle and output to a new filehandle? I tried several mods to your script, but only a few worked out. Hence this request for your additional assistance. Thanks a ton!
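For the "global frequencies" question, a two-pass variant could be sketched as follows (my own illustration, not tybalt89's code): pass 1 counts a/c/g/t over ALL sequences, pass 2 re-reads the input and draws letters without replacement from the global pool, so the output preserves the exact global letter counts. Extra RAM is only the four counters.

```perl
use strict;
use warnings;

# In-memory input stands in for a real file opened twice.
my $fasta = ">ID1\natgagcagctag\n>ID2\ntggacgagctgaca\n";

# Pass 1: global letter counts.
my %count = (a => 0, c => 0, g => 0, t => 0);
my $all = 0;
open my $in, '<', \$fasta or die $!;
while (my $line = <$in>) {
    next if $line =~ /^>/;
    for my $b (qw(a c g t)) {
        my $n = () = $line =~ /$b/g;   # occurrences of $b on this line
        $count{$b} += $n;
        $all       += $n;
    }
}
close $in;

# Weighted draw without replacement from the global pool.
sub draw {
    my $n = rand $all--;
    for my $b (qw(a c g t)) {
        if ($n < $count{$b}) { $count{$b}--; return $b }
        $n -= $count{$b};
    }
    return 't';   # guard against float round-off
}

# Pass 2: re-read and emit same-length random sequences.
open $in, '<', \$fasta or die $!;
open my $out, '>', \my $res or die $!;
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) { print $out "$line\n" }
    else { print $out join('', map { draw() } 1 .. length $line), "\n" }
}
print $res;
```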
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1228191
use strict;
use warnings;
my $window = 1e6;
my $A = my $C = my $G = my $all = 0;
my (@sizes, $tmp, $start);
my $inputfile = 'd.1228191';
my $outputfile = 'd.out.1228191';
open my $in, '<', $inputfile or die "$! opening $inputfile";
open my $out, '>', $outputfile or die "$! opening $outputfile";
sub letter
{
my $n = int rand $all--;
$n < $A ? ($A--, return 'a') :
$n < $A + $C ? ($C--, return 'c') :
$n < $A + $C + $G ? ($G--, return 'g') :
return 't';
}
sub output
{
for my $count ( @sizes )
{
print $out ">ID", $start++, "\n", map(letter(), 1 .. $count), "\n";
}
@sizes = ();
}
while( <$in> )
{
if( /^>/ )
{
$start //= s/\D+//gr;
}
elsif( /^[acgt]/ )
{
$A += tr/a//;
$C += tr/c//;
$G += tr/g//;
$all += $tmp = tr/acgt//;
push @sizes, $tmp;
$all >= $window and output();
}
}
$all and output();
close $in;
close $out;
Re: Reduce RAM required
by swl (Parson) on Jan 08, 2019 at 22:53 UTC
Re: Reduce RAM required
by jimpudar (Pilgrim) on Jan 08, 2019 at 22:22 UTC
Maybe I'm not understanding your requirement, but it seems to me you could just read in one window's worth of sequences, shuffle them, and write them to the file. Then you could empty the window buffer and repeat until you reach EOF. This way, your memory will never grow larger than the window size.
The algorithm is a little trickier that way because you need to deal with sequences that overflow the window, but I don't think it should be too hard.
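A streaming per-window shuffle along those lines might look like this sketch (my own illustration; it assumes the alternating ID/sequence layout from the original post). Unlike the frequency-based approaches, it permutes each window's letters exactly, so letter counts are preserved; memory never exceeds roughly one window.

```perl
use strict;
use warnings;
use List::Util 'shuffle';

# Accumulate sequence into a buffer; once it reaches $window bases, shuffle
# the buffer and emit it split back into the original per-record lengths.
my $window = 8;   # tiny for illustration; the OP uses 1e6

sub flush {
    my ($out, $buf, $ids, $lens) = @_;
    my @shuffled = shuffle split //, $$buf;
    for my $i (0 .. $#$ids) {
        print $out "$ids->[$i]\n", join('', splice @shuffled, 0, $lens->[$i]), "\n";
    }
    ($$buf, @$ids, @$lens) = ('');   # reset buffer and bookkeeping
}

my $fasta = ">ID1\natgagc\n>ID2\ntggacg\n";
open my $in,  '<', \$fasta  or die $!;
open my $out, '>', \my $res or die $!;

my ($buf, @ids, @lens) = ('');
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) { push @ids, $line }
    else {
        $buf .= $line;
        push @lens, length $line;
        flush($out, \$buf, \@ids, \@lens) if length($buf) >= $window;
    }
}
flush($out, \$buf, \@ids, \@lens) if @ids;   # flush any partial last window
print $res;
```

One caveat for real data: sequences longer than a window force the buffer past $window before flushing, so memory is bounded by window size plus the longest single record.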
EDIT: After reading hdb's comment, I would also suggest you go that route. That seems even better than what I was describing.
πάντων χρημάτων μέτρον έστιν άνθρωπος.
I want to obliterate signal by introducing ample noise via random shuffling. But think about it: if I bring distant regions of the sequences together and then shuffle them (or shuffle them and then bring them together), then it is global rather than local shuffling, which is what I prefer. Do you agree with my approach?