Re: RFC: 100 PDL Exercises (ported from numpy)

in reply to RFC: 100 PDL Exercises (ported from numpy)

Good idea!

Here is my take on #100 (TODO)

The long version as a standalone program:

#!/usr/bin/env perl

use strict;
use warnings;

use PDL;
use PDL::GSL::RNG;
use PDL::Stats::Basic;
use PDL::Ufunc;

####
# program to get an input data vector (1D)
# and calculate its statistics (mean, stdev)
# and confidence intervals using Bootstrap.
# For demonstration the input data is generated randomly
# with specified mean/stdev from Gaussian distribution.
# Alternatively one can populate the input as sees fit.
# Improving accuracy means increasing the re-sample size
# as well as the number of bootstrap iterations (R and M)
#
# Bootstrap is a non-parametric method to estimate sample
# statistics (unlike a t-test for example which ASSUMES that our
# input data comes from a Gaussian distribution) and it is roughly bas
+ed
# on re-sampling our data many times (this is where the computer comes
# in and why this method is popular today) and each time
# calculating the statistics (mean, stdev, etc).
# Q: What is a confidence interval (CI)?
# Without getting into too much detail, a CI (as a range of sample val
+ues)
# of Y% is a property
# of our sample which says that if you draw random items from
# your sample, their value will be within the CI, Y% of the times.
# See also examples here: https://www.investopedia.com/terms/c/confide
+nceinterval.asp
# more info: http://www.stat.wmich.edu/s160/book/node48.html
#
# author: bliako
# date: 03/05/2018
####

my $time_started = time;

# RNG seed or use time_started
my $seed = 1234; # $time_started;

# size of my input sample / data
my $N = 10000;
# number of bootstrap repeats (re-samplings)
my $M = 1000;
# number of re-sample size set as the same as input size
my $R = $N;

print "$0 : starting, N=$N, M=$M, R=$R\n";
# confidence intervals to calculate:
my $confidence_intervals_levels = [5/100, 95/100];

# ad-hoc mean and stdev of our input data
my $dist_mean = pdl(10.35);
# increasing stdev will expand CI
my $dist_stdev = pdl(1.231);

# produce the Random Number Generator to generate random data
# to populate our input sample (just for demo):
my $rng = PDL::GSL::RNG->new('taus')->set_seed($seed);

# produce a random sample from a Gaussian (Normal) distribution
# with specified stdev and mean:
my $X = zeroes($N);
$rng->ran_gaussian($dist_stdev, $X);
$X += $dist_mean;

my $input_data_statistics = [
    PDL::Ufunc::avg($X),
    PDL::Stats::Basic::stdv($X)
];
print "input data statistics:"
    ."\n\tmean: ".$input_data_statistics->[0]
    ."\n\tstdev: ".$input_data_statistics->[1]
."\n";

# Now we will re-sample our input $M times and each time
# we will record the statistic of interest,
# e.g. for demo this will be MEAN (=average)

# arrays to store the results of each iteration wrt mean and stdev:
my $means = zeroes($M);
my $stdevs = zeroes($M);
for(my $i=0;$i<$M;$i++){
    # get a re-sample from original data with size R
    # use our RNG to do the re-sampling:
    my $asample = resample($X, $rng, $R);
    $means->set($i, PDL::Ufunc::avg($asample));
    $stdevs->set($i, PDL::Stats::Basic::stdv($asample));
}
my $estimated_statistics = [
    PDL::Ufunc::avg($means),
    PDL::Ufunc::avg($stdevs)
];
my $discrepancies = [
    100 * abs($estimated_statistics->[0] - $input_data_statistics->[0]
+)/$input_data_statistics->[0],
    100 * abs($estimated_statistics->[1] - $input_data_statistics->[1]
+)/$input_data_statistics->[1]
];

print "estimated statistics by bootstrap (after $M re-samplings):"
    ."\n\tmean: ".$estimated_statistics->[0]." (discrepancy: ".$discre
+pancies->[0]." %)"
    ."\n\tstdev: ".$estimated_statistics->[1]." (discrepancy: ".$discr
+epancies->[1]." %)"
."\n";

# now sort the means vector and pick the elements at
# the confidence intervals specified, e.g. 5%
my $sorted_means = PDL::Ufunc::qsort($means);
my $confidence_intervals_values = [
    $sorted_means->at(int($confidence_intervals_levels->[0]*$M)),
    $sorted_means->at(int($confidence_intervals_levels->[1]*$M))
];

print "confidence intervals after bootstraping $M times with a sample 
+size of $N each time:\n"
    ."\t".$confidence_intervals_levels->[0]." => ".$confidence_interva
+ls_values->[0]."\n"
    ."\t".$confidence_intervals_levels->[1]." => ".$confidence_interva
+ls_values->[1]."\n"
;

print "done in ".(time-$time_started)." seconds.\n";

# get a random sample with replacement from input vector
# 2nd param is a RNG object which will select N random
# elements (with replacement) from input vector
# 3rd param is OPTIONAL: the size of the sample,
# default is the size of the input vector
sub    resample {
    my $original_sample = $_[0];
    my $arng = $_[1];
    my $No = $_[2] || $original_sample->getdim(0);

    my $indices = ($No * $rng->get_uniform($No))->floor();
    my $newsample = $original_sample->slice($indices);
    return $newsample;
}
[download]

The short version:

# get a random sample with replacement from input vector
# 2nd param is a RNG object which will select N random
# elements (with replacement) from input vector
# 3rd param is OPTIONAL: the size of the sample,
# default is the size of the input vector
sub    resample {
    my $original_sample = $_[0];
    my $arng = $_[1];
    my $No = $_[2] || $original_sample->getdim(0);

    my $indices = ($No * $rng->get_uniform($No))->floor();
    my $newsample = $original_sample->slice($indices);
    return $newsample;
}

my $M = 1000; # num bootstrap resamplings
my $X = ... ; # input data piddle (1D)
my $R = $X->getdim(0); # size of bootstrap resamples
my $rng = PDL::GSL::RNG->new('taus')->set_seed($seed);
my $means = zeroes($M);
for(my $i=0;$i<$M;$i++){
    # get a re-sample from original data with size R
    # use our RNG to do the re-sampling:
    my $asample = resample($X, $rng, $R);
    $means->set($i, PDL::Ufunc::avg($asample));
}
# now sort the means vector and pick the elements at
# the confidence intervals specified, e.g. 5%
my $sorted_means = PDL::Ufunc::qsort($means);
my $confidence_intervals_values = [
    $sorted_means->at(int(0.05*$M)),
    $sorted_means->at(int(0.95*$M))
];
[download]

EDIT 1:

The above script can easily be parallelised. Input data is readonly (X). Each worker writes results (mean, stdev) to the same piddle but at different array locations guaranteed. So, there is no need for locking afaics.

Unfortunately I can not get it to parallelise with threads because PDL::GSL::RNG seems not to like threads. The RNG is needed in order to get a random sample from the original data on every bootstrap iteration. In fact each worker/thread can have its own RNG, not a copy but a different RNG local to each thread. However, even like this I get the dreaded pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug

Any ideas?

EDIT 2: the parallelised version as a standalone program:

#!/usr/bin/env perl

use strict;
use warnings;

use threads;
use threads::shared;

use PDL;
use PDL::Stats::Basic;
use PDL::Ufunc;
use PDL::Parallel::threads;
use Thread::Queue;
# MT random number generator for thread-safe operation
# not just Math::Random::MT (which gave me problems)
# but Math::Random::MT::Auto (btw neither Math::Random::MTwist/OO work
+s - maybe my fault?)
use Math::Random::MT::Auto;

use Getopt::Long;

####
# program to get an input data vector (1D)
# and calculate its statistics (mean, stdev)
# and confidence intervals using Bootstrap.
# For demonstration the input data is generated randomly
# with specified mean/stdev from Gaussian distribution.
# Alternatively one can populate the input as sees fit.
# Improving accuracy means increasing the re-sample size
# as well as the number of bootstrap iterations (R and M)
#
# Bootstrap is a non-parametric method to estimate sample
# statistics (unlike a t-test for example which ASSUMES that our
# input data comes from a Gaussian distribution) and it is roughly bas
+ed
# on re-sampling our data many times (this is where the computer comes
# in and why this method is popular today) and each time
# calculating the statistics (mean, stdev, etc).
# Q: What is a confidence interval (CI)?
# Without getting into too much detail, a CI (as a range of sample val
+ues)
# of Y% is a property
# of our sample which says that if you draw random items from
# your sample, their value will be within the CI, Y% of the times.
# See also examples here: https://www.investopedia.com/terms/c/confide
+nceinterval.asp
# more info: http://www.stat.wmich.edu/s160/book/node48.html
#
# author: bliako
# date: 03/05/2018
####

my $input_filename = undef;
# default confidence intervals to calculate:
my @confidence_intervals_levels = ();
my $num_bootstrap_repeats = 1000;
my $bootstrap_sample_size = undef;
my $max_num_threads = 3;
my @demo_params;
if( ! Getopt::Long::GetOptions(
    'input|i=s' => \$input_filename,
    'confidence-intervals|c=f{2}' => \@confidence_intervals_levels,
    'num-bootstrap-repeats|R=i' => \$num_bootstrap_repeats,
    'bootstrap-sample-size|S=i' => \$bootstrap_sample_size,
    'max-num-threads=i' => \$max_num_threads,
    'demo=f{3}' => \@demo_params,
    'help' => sub { print usage($0); exit(0) }
) ){ print STDERR usage($0) . "\n\n$0 : something wrong with command l
+ine parameters.\n"; exit(1) }

my $time_started = time;
if( scalar(@confidence_intervals_levels) == 0 ){ @confidence_intervals
+_levels = (5/100, 95/100) }

my ($i, $N, $line);
# my input data piddle
my $X;

# produce the Random Number Generator to generate random data
# to populate our input sample (just for demo):
my $rngMT = Math::Random::MT::Auto->new();
if( ! defined($rngMT) ){ print STDERR "$0 : call to ".'Math::Random::M
+T::Auto->new()'." has failed.\n"; exit(1) }

if( defined($input_filename) ){
    # read data from file:
    if( ! open(INP, '<', $input_filename) ){ print STDERR "$0 : could 
+not open input file '$input_filename', $!\n"; exit(1) }
    $i = 0;
    while( $line = <INP> ){
        chomp($line);
        if( ! ($line =~ /^\s*$/) ){ $i++; }
    }
    seek(INP, 0, 0); # rewind filehandle
    $i = 0;
    while( $line = <INP> ){
        chomp($line);
        if( ! ($line =~ /^\s*$/) ){ $X->set($i++, $line) }
    }
    close(INP);
} elsif( scalar(@demo_params) == 3 ){
    # produce a random sample from a Gaussian (Normal) distribution
    # with specified stdev and mean for demo
    $N = $demo_params[2];
    print "$0 : creating random data of $N items with mean = ".$demo_p
+arams[0]." and stdev = ".$demo_params[1]." ...\n";
    $X = zeroes($N);
    # populates above piddle with rands from gaussian dist with specif
+ied mean/stdev
    # using the RNG obj provided
    gaussian_rand({
        'mean' => $demo_params[0],
        'stdev' => $demo_params[1],
        'piddle' => $X,
        'rng' => $rngMT
    });
} else {
    print STDERR usage($0) . "\n$0 : either an input file must be spec
+ified (--input) or use the demo switch as : --demo MEAN STDEV SAMPLE_
+SIZE\n";
    exit(1)
}
# size of input data
$N = $X->getdim(0);
$bootstrap_sample_size = $N unless defined($bootstrap_sample_size);

print "$0 : starting with\n\tinput sample size: $N\n\tbootstrap repeat
+s: $num_bootstrap_repeats\n\tbootstrap sample size: $bootstrap_sample
+_size\n\tparallelising over $max_num_threads threads.\n";

my $input_data_statistics = [
    PDL::Ufunc::avg($X),
    PDL::Stats::Basic::stdv($X)
];
print "$0 : input data statistics:"
    ."\n\tmean: ".$input_data_statistics->[0]
    ."\n\tstdev: ".$input_data_statistics->[1]
."\n";

# Now we will re-sample our input $bootstrap_sample_size times and eac
+h time
# we will record the statistic of interest,
# e.g. for demo this will be MEAN (=average)

# arrays to store the results of each iteration wrt mean and stdev:
# size will be equal to number of repeats
my $means = zeroes($num_bootstrap_repeats);
my $stdevs = zeroes($num_bootstrap_repeats);
# we need to share these
$X->share_as('X');
$means->share_as('means');
$stdevs->share_as('stdevs');
my $Q = Thread::Queue->new();
sub    worker {
    my $tid = threads->tid;
    my $worker_rngMT = Math::Random::MT::Auto->new();
    if( ! defined($worker_rngMT) ){ print STDERR "$0 : call to ".'Math
+::Random::MT::Auto->new()'." has failed for worker $tid.\n"; exit(1) 
+}
    while( defined(my $item = $Q->dequeue())){
        my $shallow_copy_of_X = PDL::Parallel::threads::retrieve_pdls(
+'X');
        my $shallow_copy_of_means = PDL::Parallel::threads::retrieve_p
+dls('means');
        my $shallow_copy_of_stdevs = PDL::Parallel::threads::retrieve_
+pdls('stdevs');
        my $asample = resample(
            $shallow_copy_of_X,
            $worker_rngMT,
            $bootstrap_sample_size
        );
        my $amean = PDL::Ufunc::avg($asample);
        my $astdev = PDL::Stats::Basic::stdv($asample);
        $shallow_copy_of_means->set($item, $amean);
        $shallow_copy_of_stdevs->set($item, $astdev);
        #print "I am worker $tid and getting index=$item : $amean, $as
+tdev\n";
    }
    print "worker $tid is ending\n";
} # worker sub

my @tpool = map{
    threads->create(\&worker, $Q)
} 1 .. $max_num_threads;

for(my $i=0;$i<$num_bootstrap_repeats;$i++){
    $Q->enqueue($i);
}
$Q->end();
$_->join for @tpool;
#exit(0);
my $estimated_statistics = [
    PDL::Ufunc::avg($means),
    PDL::Ufunc::avg($stdevs)
];
=pod
my $discrepancies = [
    100 * abs($estimated_statistics->[0] - $input_data_statistics->[0]
+)/$input_data_statistics->[0],
    100 * abs($estimated_statistics->[1] - $input_data_statistics->[1]
+)/$input_data_statistics->[1]
];

print "estimated statistics by bootstrap (after $M re-samplings):"
    ."\n\tmean: ".$estimated_statistics->[0]." (discrepancy: ".$discre
+pancies->[0]." %)"
    ."\n\tstdev: ".$estimated_statistics->[1]." (discrepancy: ".$discr
+epancies->[1]." %)"
."\n";
=cut

# now sort the means vector and pick the elements at
# the confidence intervals specified, e.g. 5%
my $sorted_means = PDL::Ufunc::qsort($means);
my @confidence_intervals_values = (
    $sorted_means->at(int($confidence_intervals_levels[0]*$num_bootstr
+ap_repeats)),
    $sorted_means->at(int($confidence_intervals_levels[1]*$num_bootstr
+ap_repeats))
);

print "confidence intervals after bootstraping $num_bootstrap_repeats 
+times with a sample size of $N each time:\n"
    ."\t".$confidence_intervals_levels[0]." => ".$confidence_intervals
+_values[0]."\n"
    ."\t".$confidence_intervals_levels[1]." => ".$confidence_intervals
+_values[1]."\n"
;

print "done in ".(time-$time_started)." seconds.\n";

exit(0);

sub    usage {
    return "Usage : ".$_[0]." --input file.txt | --demo MEAN STDEV SAM
+PLE_SIZE [--confidence-intervals LEFT RIGHT (e.g. 0.05 0.95 for 5% an
+d 95%)] [--num-bootstrap-repeats N] [--bootstrap-sample-size S (if no
+t specified it will be the same as input sample size)] [--max-num-thr
+eads P]\nExample:\n".$_[0]." --demo 10.32 1.23 1000 --max-num-threads
+ 4 --confidence-intervals 0.075 0.925\n"
}
# get a random sample with replacement from input vector
# 2nd param is a RNG object which will select N random
# elements (with replacement) from input vector
# 3rd param is OPTIONAL: the size of the sample,
# default is the size of the input vector
sub    resample {
    my $original_sample = $_[0];
    my $arng = $_[1]; # RNG to give us random indices to sample
    my $No = $_[2] || $original_sample->getdim(0); # size of sample

    my $newsample = zeroes($No);
    for(my $i=$No;$i-->0;){ $newsample->set($i, $original_sample->at(f
+loor($No * $arng->rand()))) }
    return $newsample;
}
# returns n random numbers from the gaussian (normal) distribution
# with mean and stdev optionally specified. Default values are mean=0,
+ stdev=1
# An rng object which exports rand([$n]) is required
sub    gaussian_rand {
    my $params = $_[0];
    my $rngobj;
    if( ! defined($rngobj=$params->{'rng'}) ){
        # now this is weird:
        # my $rngobj = $params->{'rng'} or die ... fails once in a whi
+le!
        die "rng is required, params were: ".join(",",keys %$params)."
+ and rng was ".$params->{'rng'}."\n";
    }
    my $piddle_to_return_results = $params->{'piddle'};
    my $array_to_return_results = $params->{'array'};
    my $n = undef;
    if( defined($piddle_to_return_results) ){
        $n = $piddle_to_return_results->getdim(0);
    } else {
        if( defined($array_to_return_results) ){
            $n = scalar(@$array_to_return_results);
        } else {
            $n = $params->{'n'} or die "n is required if no array or p
+iddle references are specified.";
            $array_to_return_results = [ (0)x$n ];
        }
    }
    my ($mean, $stdev);
    if( ! defined($mean=$params->{'mean'}) ){ $mean = 0.0 }
    if( ! defined($stdev=$params->{'stdev'}) ){ $stdev = 1.0 }

    # following code is from "Perl Cookbook" (http://www.oreilly.com/c
+atalog/perlckbk2/)
    my ($u1, $u2);    # uniformly distributed random numbers
    my $w;        # variance, then a weight
    my ($g1, $g2);    # gaussian-distributed numbers

    my $i;
    for($i=0;$i<($n-1);$i+=2){
        do {
            $u1 = 2 * $rngobj->rand() - 1;
            $u2 = 2 * $rngobj->rand() - 1;
            $w = $u1*$u1 + $u2*$u2;
        } while ( $w >= 1 );
        $w = sqrt( (-2 * log($w)) / $w );
        $g2 = ($u1 * $w) * $stdev + $mean;
        $g1 = ($u2 * $w) * $stdev + $mean;
        if( defined($array_to_return_results) ){
            $array_to_return_results->[$i] = $g1;
            $array_to_return_results->[$i+1] = $g2;
        } else {
            $piddle_to_return_results->set($i, $g1);
            $piddle_to_return_results->set($i+1, $g2);
        }
    }
    if( $n % 2 == 1 ){
        $i = $n-1;
        do {
            $u1 = 2 * $rngobj->rand() - 1;
            $u2 = 2 * $rngobj->rand() - 1;
            $w = $u1*$u1 + $u2*$u2;
        } while ( $w >= 1 );
        $w = sqrt( (-2 * log($w)) / $w );
        $g2 = ($u1 * $w) * $stdev + $mean;
        $g1 = ($u2 * $w) * $stdev + $mean;
        if( defined($array_to_return_results) ){
            $array_to_return_results->[$i] = $g1;
        } else {
            $piddle_to_return_results->set($i, $g1);
        }
    }
    return $array_to_return_results;
}
[download]

corrections welcome (especially on good practices about how to parallelise)

Comment on Re: RFC: 100 PDL Exercises (ported from numpy) Select or Download Code

Replies are listed 'Best First'.
Re^2: RFC: 100 PDL Exercises (ported from numpy) by vr (Curate) on May 08, 2018 at 16:58 UTC
Hi, bliako, thank you so much for detailed answer, my statistics skills were (hopefully) auto-vivified :). After following your links and code in earnest, I felt brave enough to make some experiments and write a comment, but in the process I discovered something strange ;). ---------- First, my impression is that solutions to exercices were supposed to be simple (as, KISS). So, perhaps to translate, almost verbatim, Python solution to PDL, answer to #100 can be: `use strict; use warnings; use feature 'say'; use PDL; my $n = 100; # input sample size my $m = 1000; # number of bootstrap repeats my $r = $n; # re-sample size my $x = random $n; my $idx = random $r, $m; $idx = $n; say $x-> index( $idx ) -> avgover -> pctover( pdl 0.05, 0.95 ); __END__ [ 0.4608755 0.55562806]` [download] Interesting, here, PDL DWIMs for me -- no need to `floor` an index to thread over a piddle (just as with Perl's array indices). I also stand corrected in "`floor` converts to Long in-place" -- it rounds in-place, but piddle stays Double. This 'never to explicitly loop in vectorized language' answer, unfortunately, hides the ugly truth that for very large data we can end with huge R x M matrices of random indices and equally huge (equally unnecessary) matrices of all re-samplings, and thus die because of 'Out of memory!'. I was experimenting with this or that (PDL's automatic parallelization, in particular), which I'm skipping now, because next is something weird. Consider this version of the above, which avoids 2-dimensional index matrix and results of re-samplings, but is still un-parallel: `use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # re-sample size my $x = random $n; my $avg = zeroes $m; for ( 0 .. $m - 1 ) { my $idx = random $r; $idx = $n; $avg-> set( $_, $x-> index( $idx )-> avg ) } say $avg-> pctover( pdl 0.05, 0.95 ); say time - $time; __END__ [0.49384165 0.49941814] 6.11959099769592` [download] Next is solution where I'm starting to try to parallelize, but because of selected parameters (single thread) I'm not only expecting no gain, but due to overhead it must be slower. And yet: use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; use PDL::Parallel::threads qw/ share_pdls retrieve_pdls /; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # re-sample size my $x = random $n; my $avg = zeroes $m; share_pdls x => $x, avg => $avg; threads-> create( sub { my ( $x, $avg ) = retrieve_pdls qw/ x avg /; for ( 0 .. $m - 1 ) { my $idx = random $r; $idx *= $n; $avg-> set( $_, $x-> index( $idx )-> avg ) } }); $_-> join for threads-> list; say $avg-> pctover( pdl 0.05, 0.95 ); say time - $time; __END__ [0.49384165 0.49941814] 4.57857203483582 [download] Why is that? :) I tried to insert `use PDL::Parallel::threads qw/ share_pdls retrieve_pdls /; share_pdls x => $x, avg => $avg; ( $x, $avg ) = retrieve_pdls qw/ x avg /;` [download] into no-threads solution (does `retrieve_pdls` set any flags that speed things up? Nope.) `$ perl -v This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-li +nux-thread-multi (with 1 registered patch, see perl -V for more detail) $ perl -MPDL -E 'say $PDL::VERSION' 2.019` [download]	[reply] [d/l] [select]
Re^3: RFC: 100 PDL Exercises (ported from numpy) by bliako (Monsignor) on May 12, 2018 at 13:57 UTC
vr your code is superior than the long code I have posted! If I may add: using `oddpctover()` might be preferred because it does not interpolate when there is no data at the exact percentile position. Regarding the time difference when running with and without "use threads", I have discovered that avg() is the culprit. If you use `x-> index( $idx )->at(0)` rather than `x-> index( $idx )->avg` the performance is the same (which means idx() is also excluded as possible cause).	[reply] [d/l] [select]
Re^4: RFC: 100 PDL Exercises (ported from numpy) by vr (Curate) on May 14, 2018 at 10:25 UTC
Regarding the time difference when running with and without "use threads", I have discovered that avg() is the culprit. Hm-m, it's not what I'm observing here. Setting `$m = 30_000` and replacing `avg` with `at(0)`, these are results of 3 runs without thread and in a thread: `14.9562268257141 14.9582891464233 14.8853561878204 11.9686307907104 12.0527169704437 12.0850310325623` [download] And then replacing all 3 lines of loop block with just simple `my $y = $x-> index( sequence $n )-> at( 0 );` [download] `7.74871516227722 7.80155396461487 7.71721601486206 4.92977499961853 4.87044596672058 4.87968802452087` [download] So I'd say it's something strange with `index` going on. I'm not appealing to anyone for investigation :), observable speed difference may depend very much on hardware, let it be a murky PDL mystery. And another entertaining (but dangerous, maybe) bit of PDL trivia: above I sketched a parallelization solution with threads and `random`. Except, if there were more than 1 worker thread, it won't work as expected. The `random` documentation says Perl's `srand` can be used to seed, and one may assume that `random` relies on Perl's RNG, including it (automatically) seeds in each thread that's started. Consider: `use strict; use warnings; use feature 'say'; use threads; use PDL; PDL::no_clone_skip_warning; srand; async sub{ say random( 1 )-> at( 0 )} for 1 .. 5; $_-> join for threads-> list; say rand; __END__ 0.851411183904023 0.851411183904023 0.851411183904023 0.851411183904023 0.851411183904023 0.851411183904023` [download] That's some randomness. Try to `say rand` instead. So, one must to explicitly call `srand` at the start of a thread, if using threads and PDL's `random`.	[reply] [d/l] [select]
Re^3: RFC: 100 PDL Exercises (ported from numpy) by marioroy (Prior) on Sep 03, 2019 at 05:15 UTC
Hi, vr Tonight came across your post and modified your demonstration to run with 4 threads. # https://www.perlmonks.org/?node_id=1214227 use strict; use warnings; use feature 'say'; use PDL; use PDL::Parallel::threads qw(retrieve_pdls); use threads; use MCE::Shared; use Time::HiRes 'time'; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # re-sample size my $x = random( $n ); $x->share_as('x'); my $avg = zeroes( $m ); $avg->share_as('avg'); my $seq = MCE::Shared->sequence( 0, $m - 1 ); sub parallel_task { srand; my ( $x, $avg ) = retrieve_pdls('x', 'avg'); while ( defined ( my $seq_n = $seq->next() ) ) { my $idx = random $r; $idx = $n; $avg->set( $seq_n, $x->index( $idx )->avg ); } } threads->create( \&parallel_task ) for 1 .. 4; # ... do other stuff ... $_->join() for threads->list(); say $avg->pctover( pdl 0.05, 0.95 ); say time - $time, ' seconds'; __END__ # Output [0.49395242 0.49936752] 1.28744792938232 seconds [download] Afterwards, re-validated PDL with MCE and released 1.847. The effort is mainly for folks running Perl lacking threads support. Here it is, PDL and MCE::Shared running similarly. # https://www.perlmonks.org/?node_id=1214227 use strict; use warnings; use feature 'say'; use PDL; # must load PDL before MCE::Shared use MCE::Hobo; use MCE::Shared 1.847; use Time::HiRes 'time'; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # re-sample size # On Windows, the non-shared piddle ($x) is unblessed in threads. # Therefore, constructing the piddle inside the worker. # UNIX platforms benefit from copy-on-write. Thus, one copy. my $x = ( $^O eq 'MSWin32' ) ? undef : random( $n ); my $avg = MCE::Shared->pdl_zeroes( $m ); my $seq = MCE::Shared->sequence( 0, $m - 1 ); sub parallel_task { $x = random( $n ) unless ( defined $x ); while ( defined ( my $seq_n = $seq->next() ) ) { my $idx = random $r; $idx = $n; # $avg is a shared piddle which resides inside the shared- # manager process or thread. The piddle is accessible via the # OO interface only. $avg->set( $seq_n, $x->index( $idx )->avg ); } } MCE::Hobo->create( \&parallel_task ) for 1 .. 4; # ... do other stuff ... MCE::Hobo->wait_all(); # MCE sets the seed of the base generator uniquely between workers. # Unfortunately, it requires running with one worker for predictable # results (i.e. no guarantee in the order which worker computes the # next input chunk). say $avg->pctover( pdl 0.05, 0.95 ); say time - $time, ' seconds'; __END__ # Output [0.49387191 0.49937053] 1.29038286209106 seconds [download] Regards, Mario	[reply] [d/l] [select]
Re^4: RFC: 100 PDL Exercises (ported from numpy) by marioroy (Prior) on Sep 03, 2019 at 05:25 UTC
Here is the same thing using MCE. Workers obtain the next sequence number without involving the manager process. Thus, the reason why it runs faster. I had to think about it when I saw the run time. # https://www.perlmonks.org/?node_id=1214227 use strict; use warnings; use feature 'say'; use PDL; # must load PDL before MCE::Shared use MCE 1.847; use MCE::Shared 1.847; use Time::HiRes 'time'; srand( 123 ); my $time = time; my $n = 30000; # input sample size my $m = 10000; # number of bootstrap repeats my $r = $n; # re-sample size # On Windows, the non-shared piddle ($x) is unblessed in threads. # Therefore, constructing the piddle inside the worker. # UNIX platforms benefit from copy-on-write. Thus, one copy. my $x = ( $^O eq 'MSWin32' ) ? undef : random( $n ); my $avg = MCE::Shared->pdl_zeroes( $m ); MCE->new( max_workers => 4, sequence => [ 0, $m - 1 ], chunk_size => 1, user_begin => sub { $x = random( $n ) unless ( defined $x ); }, user_func => sub { my $idx = random $r; $idx *= $n; # $avg is a shared piddle which resides inside the shared- # manager process or thread. The piddle is accessible via the # OO interface only. $avg->set( $_, $x->index( $idx )->avg ); } )->run; # MCE sets the seed of the base generator uniquely between workers. # Unfortunately, it requires running with one worker for predictable # results (i.e. no guarantee in the order which worker computes the # next input chunk). say $avg->pctover( pdl 0.05, 0.95 ); say time - $time, ' seconds'; __END__ # Output [0.49387106 0.4993768] 1.09556317329407 seconds [download] Thank you, vr. I had no idea that PDL random is not unique between threads. MCE already sets the seed of the base generator, but did not do so for workers spawned as threads. This is resolved in MCE 1.847. Regards, Mario	[reply] [d/l]
Re^5: RFC: 100 PDL Exercises (ported from numpy) by etj (Deacon) on Mar 16, 2022 at 21:00 UTC
Re^3: RFC: 100 PDL Exercises (ported from numpy) by etj (Deacon) on May 01, 2022 at 16:50 UTC
This 'never to explicitly loop in vectorized language' answer, unfortunately, hides the ugly truth that for very large data we can end with huge R x M matrices of random indices and equally huge (equally unnecessary) matrices of all re-samplings, and thus die because of 'Out of memory!'. This can be a real issue! Keeping such index-sets around for exactly the right amount of time will help, which in turn might be assisted by making little functions that do individual operations on subsets (which a captured index-set ndarray which goes out of scope on finishing). Another thing that will help is the forthcoming loop fusion, discussed at https://github.com/PDLPorters/pdl/issues/349: non-slice operations will become lazy, and on evaluation will potentially get put together into a new, loop-fused operation. Something else that would help here, as also discussed on #349, is more generalised first-class "index operations". Ideas and contributions here, on the GitHub issue, on the PDL mailing lists, or any other means are most welcome! One other thought is that, for larger ndarrays (because POSIX threads have a startup cost), the use of vectorised operations is the way to harness multiple cores for free (for operations that support this), which a Perl for-loop cannot achieve.	[reply]
Re^2: RFC: 100 PDL Exercises (ported from numpy) by mxb (Pilgrim) on May 04, 2018 at 07:39 UTC
Hi While I cannot assist with your parallisation issue, many thanks for your contribution. I hope someone else with more experience may be able to assist. Thanks	[reply]

In Section Meditations