I would have preferred something that concats the entirety of the genome and then shuffles and fragments as per length distribution of chromosomes in input
Why? Why not consume only as much data as required per current "length" (btw, are they really constant (1e6)?), and then dispose of data no longer needed as you move along? Reversing every second fragment before shuffling is kind of mad science... (well, in positive sense)
Here's a take, that, while "concatenating in entirety", tries hard not to allocate any more memory. And even less if id_lines can be re-built during output, i.e. not stored. Though I didn't profile.
The problem I found interesting for myself to look into was this: if there's huge chunk of bytes to, e.g., shuffle, why split or make copies or construct huge Perl lists etc. I hope code below shuffles "in place". C code is more or less ripped from PDL::API. It's puzzling to me why is it that "wrapping existing data into piddle in-place" exists (for 20+ years?) as somewhat obscure part of documentation and not implemented in pure Perl.
The RNG ('taus') was chosen arbitrarily because of synopsis, there are plenty of others to choose from, more fun than simple "fragment reversing", so have a look :)
use strict;
use warnings;
use PDL;
use Inline with => 'PDL';
use PDL::GSL::RNG;
my $genome = '';
my @id_lines;
my @runs;
##########################
# Read
##########################
while ( <DATA> ) {
chomp;
if ( /^>/ ) {
push @id_lines, $_
}
else {
push @runs, length;
$genome .= $_
}
}
##########################
# Shuffle
##########################
my $rng = PDL::GSL::RNG-> new( 'taus' );
$rng-> set_seed( time );
my $start = 0;
my $stop = length( $genome ) - 1;
my $window = 3;
while ( $start < $stop ) {
my $len = $start + $window > $stop
? $stop - $start
: $window;
my $p = mkpiddle( $genome, $start, $len );
$rng-> ran_shuffle( $p );
$start += $window
}
##########################
# Output
##########################
$start = 0;
for ( 0 .. $#runs ) {
print $id_lines[ $_ ], "\n",
substr( $genome, $start, $runs[ $_ ]), "\n";
$start += $runs[ $_ ]
}
##########################
# Guts
##########################
use Inline C => <<'END_OF_C';
static void default_magic( pdl *p, int pa ) { p-> data = 0; }
pdl* mkpiddle( char* data, int ofs, int len ) {
PDL_Indx dims[] = { len };
pdl* npdl = PDL-> pdlnew();
PDL-> setdims( npdl, dims, 1 );
npdl-> datatype = PDL_B;
npdl-> data = data + ofs;
npdl-> state |= PDL_DONTTOUCHDATA | PDL_ALLOCATED;
PDL-> add_deletedata_magic( npdl, default_magic, 0 );
return npdl;
}
END_OF_C
__DATA__
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>Chr2
TATGACGTTTAGGGACGATCTTAATGACGTTTAGGGTTTTATCGATCAGCGACGTAGGGA
>Chr3
GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTT
>Chr4
AACAAGGTACTCTCATCTCTTTACTGGGTAAATAACATATCAACTTGGACCTCATTCATA
>Chr5
AACATGATTCACACCTTGATGATGTTTTTAGAGAGTTCTCGTGTGAGGCGATTCTTGAGG