PerlMonks
Re^4: Reduce RAM required

by vr (Curate)
on Jan 11, 2019 at 10:51 UTC [id://1228388]


in reply to Re^3: Reduce RAM required
in thread Reduce RAM required

I would have preferred something that concats the entirety of the genome and then shuffles and fragments as per length distribution of chromosomes in input

Why? Why not consume only as much data as required for the current "length" (by the way, are the lengths really constant (1e6)?), and then dispose of data no longer needed as you move along? Reversing every second fragment before shuffling is kind of mad science... (well, in a positive sense)

Here's a take that, while "concatenating in entirety", tries hard not to allocate any more memory than that. It would use even less if the id_lines could be re-built during output, i.e. not stored. Though I didn't profile.

The problem I found interesting to look into was this: if there's a huge chunk of bytes to, e.g., shuffle, why split it, make copies, or construct huge Perl lists, etc.? I hope the code below shuffles "in place". The C code is more or less ripped from PDL::API. It's puzzling to me that "wrapping existing data into a piddle in-place" has existed (for 20+ years?) as a somewhat obscure part of the documentation and is not implemented in pure Perl.

The RNG ('taus') was chosen arbitrarily, straight from the synopsis; there are plenty of others to choose from, more fun than simple "fragment reversing", so have a look :)

use strict;
use warnings;
use PDL;
use Inline with => 'PDL';
use PDL::GSL::RNG;

my $genome = '';
my @id_lines;
my @runs;

##########################
# Read
##########################

while ( <DATA> ) {
    chomp;
    if ( /^>/ ) { push @id_lines, $_ }
    else        { push @runs, length; $genome .= $_ }
}

##########################
# Shuffle
##########################

my $rng = PDL::GSL::RNG-> new( 'taus' );
$rng-> set_seed( time );

my $start  = 0;
my $stop   = length( $genome );   # full length, so the final byte is shuffled too
my $window = 3;

while ( $start < $stop ) {
    my $len = $start + $window > $stop ? $stop - $start : $window;
    my $p = mkpiddle( $genome, $start, $len );
    $rng-> ran_shuffle( $p );
    $start += $window;
}

##########################
# Output
##########################

$start = 0;
for ( 0 .. $#runs ) {
    print $id_lines[ $_ ], "\n",
          substr( $genome, $start, $runs[ $_ ] ), "\n";
    $start += $runs[ $_ ];
}

##########################
# Guts
##########################

use Inline C => <<'END_OF_C';

static void default_magic( pdl *p, int pa ) { p-> data = 0; }

pdl* mkpiddle( char* data, int ofs, int len ) {
    PDL_Indx dims[] = { len };
    pdl* npdl = PDL-> pdlnew();
    PDL-> setdims( npdl, dims, 1 );
    npdl-> datatype = PDL_B;
    npdl-> data = data + ofs;
    npdl-> state |= PDL_DONTTOUCHDATA | PDL_ALLOCATED;
    PDL-> add_deletedata_magic( npdl, default_magic, 0 );
    return npdl;
}

END_OF_C

__DATA__
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>Chr2
TATGACGTTTAGGGACGATCTTAATGACGTTTAGGGTTTTATCGATCAGCGACGTAGGGA
>Chr3
GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTT
>Chr4
AACAAGGTACTCTCATCTCTTTACTGGGTAAATAACATATCAACTTGGACCTCATTCATA
>Chr5
AACATGATTCACACCTTGATGATGTTTTTAGAGAGTTCTCGTGTGAGGCGATTCTTGAGG

Replies are listed 'Best First'.
Re^5: Reduce RAM required
by etj (Deacon) on May 02, 2022 at 18:41 UTC
    It's puzzling to me that "wrapping existing data into a piddle in-place" has existed (for 20+ years?) as a somewhat obscure part of the documentation and is not implemented in pure Perl.
    The easy answer to your question is that a Perl-accessible way of putting data into an ndarray has been around for decades: https://metacpan.org/pod/PDL::Core#get_dataref.

    Zooming out and wondering how to document that in a findable way seems quite tricky, and we're very open to suggestions!

      Is it "wrapping in-place"? Even if, because of COW, no data is moved in memory with this assignment:

      ${ $pdl-> get_dataref } = $scalar_eg_5_Gb_long

      it still requires the pre-existence of a 5 Gb ndarray, zero- or garbage-filled, or am I wrong? At some point during execution, RAM usage would peak at 10 Gb.

        That's a great point! Currently, there would indeed be an instant where both the ndarray and the input SV need the full amount allocated, in part because get_dataref physicalises its ndarray (i.e. allocates its memory). COW semantics open a bit of a can of worms, because PDL currently assumes it has full ownership of the block of memory pointed at by the PV. I am going to just hope that it all works fine.

        PDL's File::Map-using code (via PDL::IO::FastRaw) is an alternative approach, which would avoid loading the data into RAM entirely.

        This being a long-standing issue suggests to me that there isn't huge demand for it. However, I am very open to adding a PDL->from_sv method that does what your code does (and would also set the datasv member to the passed-in SV and SvREFCNT_inc it, rather than use a char *) - it would also need to deal with the COW situation correctly, which I don't know how to do. Would that help? I think it would actually provide a more generalised implementation of the File::Map stuff in any case.
