in reply to Re: RFC: 100 PDL Exercises (ported from numpy)
in thread RFC: 100 PDL Exercises (ported from numpy)
Hi, bliako, thank you so much for detailed answer, my statistics skills were (hopefully) autovivified :). After following your links and code in earnest, I felt brave enough to make some experiments and write a comment, but in the process I discovered something strange ;).

First, my impression is that solutions to exercices were supposed to be simple (as, KISS). So, perhaps to translate, almost verbatim, Python solution to PDL, answer to #100 can be:
use strict; use warnings; use feature 'say'; use PDL; my $n = 100; # input sample size my $m = 1000; # number of bootstrap repeats my $r = $n; # resample size my $x = random $n; my $idx = random $r, $m; $idx *= $n; say $x> index( $idx ) > avgover > pctover( pdl 0.05, 0.95 ); __END__ [ 0.4608755 0.55562806]
Interesting, here, PDL DWIMs for me  no need to floor an index to thread over a piddle (just as with Perl's array indices). I also stand corrected in "floor converts to Long inplace"  it rounds inplace, but piddle stays Double.
This 'never to explicitly loop in vectorized language' answer, unfortunately, hides the ugly truth that for very large data we can end with huge R x M matrices of random indices and equally huge (equally unnecessary) matrices of all resamplings, and thus die because of 'Out of memory!'.
I was experimenting with this or that (PDL's automatic parallelization, in particular), which I'm skipping now, because next is something weird.
Consider this version of the above, which avoids 2dimensional index matrix and results of resamplings, but is still unparallel:
use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # resample size my $x = random $n; my $avg = zeroes $m; for ( 0 .. $m  1 ) { my $idx = random $r; $idx *= $n; $avg> set( $_, $x> index( $idx )> avg ) } say $avg> pctover( pdl 0.05, 0.95 ); say time  $time; __END__ [0.49384165 0.49941814] 6.11959099769592
Next is solution where I'm starting to try to parallelize, but because of selected parameters (single thread) I'm not only expecting no gain, but due to overhead it must be slower. And yet:
use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; use PDL::Parallel::threads qw/ share_pdls retrieve_pdls /; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # resample size my $x = random $n; my $avg = zeroes $m; share_pdls x => $x, avg => $avg; threads> create( sub { my ( $x, $avg ) = retrieve_pdls qw/ x avg /; for ( 0 .. $m  1 ) { my $idx = random $r; $idx *= $n; $avg> set( $_, $x> index( $idx )> avg ) } }); $_> join for threads> list; say $avg> pctover( pdl 0.05, 0.95 ); say time  $time; __END__ [0.49384165 0.49941814] 4.57857203483582
Why is that? :) I tried to insert
use PDL::Parallel::threads qw/ share_pdls retrieve_pdls /; share_pdls x => $x, avg => $avg; ( $x, $avg ) = retrieve_pdls qw/ x avg /;
into nothreads solution (does retrieve_pdls set any flags that speed things up? Nope.)
$ perl v This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64li +nuxthreadmulti (with 1 registered patch, see perl V for more detail) $ perl MPDL E 'say $PDL::VERSION' 2.019


Replies are listed 'Best First'.  

Re^3: RFC: 100 PDL Exercises (ported from numpy)
by bliako (Prior) on May 12, 2018 at 13:57 UTC  
by vr (Curate) on May 14, 2018 at 10:25 UTC  
Re^3: RFC: 100 PDL Exercises (ported from numpy)
by marioroy (Vicar) on Sep 03, 2019 at 05:15 UTC  
by marioroy (Vicar) on Sep 03, 2019 at 05:25 UTC 