Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)

in reply to Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Hi vr,

Lots of PDL goodness. Wow! I do not know how this will perform unless taking a moment or two and give it a try. So here is a parallel version based on your 2nd example.

use strict;
use warnings;
use feature 'say';
use PDL;
use MCE::Flow;
use MCE::Candy;

use Time::HiRes 'time';
my $t = time;

# PDL Quick Reference
# https://www.perlmonks.org/?node_id=1214437

sub collatz_seq {
    my ( $chunk_id, $seq_beg, $seq_end ) = @_;
    my $max = $seq_end - $seq_beg + 2;
    my $top = $max < 20 ? $max : 20;

    my $seqs = pdl( longlong, 1, $seq_beg..$seq_end );
    $seqs-> setbadat( 0 );
    $seqs-> badvalue( 2 );

    my $lengths = 1 + ones( short, $max );
    $lengths-> set( 0, 1 );

    while ( any my $good_mask = $seqs-> isgood ) {

        my ( $seqs_odd, $lengths_odd_masked )
            = where( $seqs, $lengths, $seqs & 1 );

        $lengths_odd_masked ++;
        $lengths-> where( $good_mask ) ++;
        ( $seqs_odd *= 3 ) ++;
        $seqs >>= 1;
    }

    my $sorted_i = $lengths-> qsorti;
    my $sorted   = $lengths-> index( $sorted_i );
    my $value    = $sorted-> at( $max - $top );
    my $pos      = vsearch_insert_leftmost( $value, $sorted );
    my $top_i    = $sorted_i-> slice([ $max - 1 , $pos ]);

    ( my $result = $lengths
        -> index( $top_i )
        -> longlong
        -> bitnot
        -> cat( $top_i + 1 )
        -> transpose
        -> qsortvec
        -> slice([], [ 0, $top - 1 ])

    )-> slice([ 0 ], [])
     -> inplace
     -> bitnot;

    # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ],
    my $str = $result->string;
    $str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;

    my $ret = eval $str;
    $_->[0] = $_->[0] + $seq_beg - 2 for @$ret;

    MCE->gather( $chunk_id, @$ret );
}

my $size = shift || 1e6;
$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

my $chunk_size = $size >> 5;
my @seqs;

mce_flow_s {
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => $chunk_size > 100000 ? 100000 : $chunk_size,
    bounds_only => 1,
    gather      => MCE::Candy::out_iter_array(\@seqs),
}, sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    collatz_seq( $chunk_id, @{ $chunk_ref } );
}, 2, $size;

MCE::Flow->finish;

@seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @seqs;

say {*STDERR} time - $t;
[download]

Results:

I was expecting for mce_pdl2 using 1 core to be closer to vr_pdl2 than vr_pdl3. Maybe benefitting from CPU L2/L3 cache. Chunking seems to be the reason why mce_pdl2 (non-caching) ran nearly as fast as vr_pdl3 (caching). Below, includes 1 core testing for 1e7 with various chunk sizes.

1e7 testing
     vr_pdl3:    46.328s  cache
     vr_pdl2:  1m16.058s  non-cache
    mce_pdl2:  1m15.583s  chunk_size => $size
    mce_pdl2:  1m03.709s  chunk_size => $size >> 1
    mce_pdl2:    52.352s  chunk_size => $size >> 2
    mce_pdl2:    51.323s  chunk_size => $size >> 3
    mce_pdl2:    49.396s  chunk_size => $size >> 4
    mce_pdl2:    48.195s  chunk_size => $size >> 5
    mce_pdl2:    48.369s  chunk_size => $size >> 6
    mce_pdl2:    48.501s  chunk_size => $size >> 7

chunk_size => 300000
    mce_pdl2:    48.195s   1 core
    mce_pdl2:    25.311s   2 cores
    mce_pdl2:    14.085s   4 cores
    mce_pdl2:     7.650s   8 cores
    mce_pdl2:     4.517s  16 cores
    mce_pdl2:     3.721s  32 cores

chunk_size => 100000
    mce_pdl2:    48.395s   1 core
    mce_pdl2:    25.402s   2 cores
    mce_pdl2:    13.163s   4 cores
    mce_pdl2:     6.860s   8 cores
    mce_pdl2:     3.850s  16 cores
    mce_pdl2:     2.347s  32 cores

Output
    Collatz(8400511) has sequence length of 686 steps
    Collatz(8865705) has sequence length of 668 steps
    Collatz(6649279) has sequence length of 665 steps
    Collatz(9973919) has sequence length of 663 steps
    Collatz(6674175) has sequence length of 621 steps
    Collatz(7332399) has sequence length of 616 steps
    Collatz(7532665) has sequence length of 616 steps
    Collatz(5649499) has sequence length of 613 steps
    Collatz(8474249) has sequence length of 611 steps
    Collatz(6355687) has sequence length of 608 steps
    Collatz(8847225) has sequence length of 606 steps
    Collatz(9533531) has sequence length of 606 steps
    Collatz(6635419) has sequence length of 603 steps
    Collatz(9953129) has sequence length of 601 steps
    Collatz(7464846) has sequence length of 598 steps
    Collatz(7464847) has sequence length of 598 steps
    Collatz(3732423) has sequence length of 597 steps
    Collatz(5598635) has sequence length of 595 steps
    Collatz(8397953) has sequence length of 593 steps
    Collatz(6298465) has sequence length of 590 steps
[download]

1e8 testing
     vr_pdl3:  9m44.667s  cache
     vr_pdl2: 16m12.467s  non-cache

chunk_size => 300000
    mce_pdl2:  9m06.078s   1 core
    mce_pdl2:  4m43.529s   2 cores
    mce_pdl2:  2m33.136s   4 cores
    mce_pdl2:  1m21.434s   8 cores
    mce_pdl2:    45.266s  16 cores
    mce_pdl2:    36.925s  32 cores

chunk_size => 100000
    mce_pdl2:  9m09.950s   1 core
    mce_pdl2:  4m39.677s   2 cores
    mce_pdl2:  2m24.230s   4 cores
    mce_pdl2:  1m13.353s   8 cores
    mce_pdl2:    37.923s  16 cores
    mce_pdl2:    20.099s  32 cores

Output
    Collatz(63728127) has sequence length of 950 steps
    Collatz(95592191) has sequence length of 948 steps
    Collatz(96883183) has sequence length of 811 steps
    Collatz(86010015) has sequence length of 798 steps
    Collatz(98110761) has sequence length of 749 steps
    Collatz(73583070) has sequence length of 746 steps
    Collatz(73583071) has sequence length of 746 steps
    Collatz(36791535) has sequence length of 745 steps
    Collatz(55187303) has sequence length of 743 steps
    Collatz(56924955) has sequence length of 743 steps
    Collatz(82780955) has sequence length of 741 steps
    Collatz(85387433) has sequence length of 741 steps
    Collatz(63101607) has sequence length of 738 steps
    Collatz(64040575) has sequence length of 738 steps
    Collatz(93128574) has sequence length of 736 steps
    Collatz(93128575) has sequence length of 736 steps
    Collatz(94652411) has sequence length of 736 steps
    Collatz(96060863) has sequence length of 736 steps
    Collatz(46564287) has sequence length of 735 steps
    Collatz(69846431) has sequence length of 733 steps
[download]

Specifying 50000 for chunk size may run faster on 4/6/8 core machines.

8 cores
    mce_pdl2 1e8  16.660s  chunk_size => 300000
    mce_pdl2 1e8  10.471s  chunk_size => 100000
    mce_pdl2 1e8   9.773s  chunk_size =>  50000
[download]

Regards, Mario

Comment on Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun) Select or Download Code

Replies are listed 'Best First'.
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by vr (Curate) on Apr 26, 2020 at 00:03 UTC
Sorry for delay with reply/update, first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :). My "caching" PDL solution (I'd prefer "looking-up" to "caching" or "memoization" in this case) is slow because, if vector is processed as a whole, cache hits happen less frequently i.e. after quite a few more useless steps. To illustrate with sequence from original challenge: 23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1 Number 70 reaches 35 in only one step but 35 itself is yet very far from completion. Simplifying, if 1..70 range was divided into e.g. 2 chunks processed independently one after another, then it would alleviate the problem. This "chunking" is very easy to add, just wrap the main loop into one external loop (labelled "CHUNKS"), which updates chunk boundaries and sets "views" into piddles, while internal loop (labelled "ITERATIONS") does exactly the same as in previous version. The `$current` piddle can now be of chunk size. The very first chunk includes dummy 0th element (size is CHUNK + 1), the final one can theoretically be as short as 1 single element. For 1e7 numbers, the computation time drops by factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes ~2e4 .. ~2e5. Next, the idea was, that as soon as chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get Collatz `$lengths` for even numbers in range 22..40, also marking these `$seqs` elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, then it follows, we can simply set, in bulk and at the very beginning, each even `$seqs` element outside 1st chunk as BAD, even if CLs are unknown yet. 2 fragments marked with double '##' do exactly what previous paragraph says. They can be freely disabled for experiments; though the speed-up is very modest, just ~0.5s. Further "ideas"/considerations didn't result in palpable improvements (nor were pursued systematically). If it's cheap to shift-left indexes and increment CLs for each chunk as we go, what if we also populate and keep sparse (even only) CL LUT in range MAX..2MAX? Well, this range starts being populated relatively late, and if there was benefit of occasional additional cache hits, it seems to be cancelled out. No gain, no loss. The MAXLEN was introduced to check this. I tried to add masks/slices/bads for `$current`, so that e.g. the line `$current ++;` applies to valid subset only, especially since now all elements at even positions are dead weight from the very start, in chunks 2+. Well, unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling "dead weight". I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up. use strict; use warnings; use feature 'say'; use PDL; use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 8e4, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; my $t = time; my $seqs = sequence( longlong, 1 + MAX ); $seqs-> setbadat( 0 ); $seqs-> setbadat( 1 ); $seqs-> badvalue( 2 ); $seqs-> slice([ CHUNK + 2, MAX, 2]) .= 2 ## if CHUNK + 2 <= MAX; ## my $lengths = ones( short, 1 + MAXLEN ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); CHUNKS: for ( my $from = my $to = 0; $to != MAX; $from = $to + 1 ) { $to = _min( $from + CHUNK, MAX ); # "_c" is for "chunk" my $seqs_c = $seqs-> slice([ $from, $to ]); my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); ITERATIONS: while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN ## } # same finale $lengths-> badflag( 0 ); $lengths = $lengths-> slice([ 1, MAX ]); my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX - 1, $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say time - $t; __END__ [download] Edit (bug fix). :(( With "dummy 0th element prepended", my intention for chunks was e.g. "0-10,11-20,21-30, ..., i.e. 1st is one longer. Then I fooled with "better presentation", from "infinite" while loop, to while loop with explicit condition, to for loop, with slightly different formulas for from,to,from_e,to_e, and messed up. Some even lengths aren't calculated, as seen with MAX and CHUNK e.g. 10 and 4. Easiest fix now is to leave chunks all equal (CHUNK+1); one LOC fixed (see "fixed"), above. Sorry.	[reply] [d/l] [select]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 26, 2020 at 07:46 UTC
Hi vr and all, This is a parallel version for UNIX and Windows. The lengths piddle is shared using PDL::IO::FastRaw. Update 1: Construct seqs_c from inside workers to consume lesser memory consumption. Update 2: Applied vr's bug fix. Plus added 2nd example that runs parallel on Windows. Update 3: Output not 100% consistent (1e7 diff). Not suited for parallelism due to cache miss. The non-PDL solutions here and here handle cache miss. But not yet here. `9,12c9,12 < [ 616 7532665] # correct < [ 613 5649499] < [ 611 8474249] < [ 608 6355687] --- > [ 615 7532665] # wrong > [ 612 5649499] > [ 610 8474249] > [ 607 6355687] 14c14 < [ 606 9533531] # correct --- > [ 605 9533531] # wrong` [download] Update 4: Output now 100% consistent, possible with MCE::relay. UNIX: use strict; use warnings; use feature 'say'; BEGIN { # Does not work on Windows unfortunately. die "Sorry, this script requires a UNIX based OS, exiting...\n" if $^O eq 'MSWin32'; } use PDL; use File::Map; # ensure that Perl has File::Map before loading FastRaw use PDL::IO::FastRaw; use MCE::Signal '$tmp_dir'; use MCE::Flow; { no warnings 'once'; $PDL::BIGPDL = 1; eval q{ PDL::set_autopthread_targ(1) }; } use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => shift \|\| 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 40000, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; # create a raw file for lengths writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" ); # memory map the raw file my $lengths = mapfraw( "$tmp_dir/lengths" ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); my $t = time; mce_flow_s { max_workers => MCE::Util::get_ncpu(), chunk_size => CHUNK + 1, bounds_only => 1, init_relay => 1, }, sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; my ( $from, $to ) = @{ $chunk_ref }; my $seqs_c = $from + sequence( longlong, $to - $from + 1 ); if ( $chunk_id == 1 ) { $seqs_c-> setbadat( 0 ); $seqs_c-> setbadat( 1 ); $seqs_c-> badvalue( 2 ); } else { $seqs_c-> setbadat( $from % 2 ? 1 : 0 ); $seqs_c-> slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2 +; $seqs_c-> badvalue( 2 ); } my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## MCE::relay { ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN; ## }; ## }, 0, MAX; MCE::Flow->finish; say {STDERR} "compute time: ", time - $t; # same finale $lengths-> badflag( 0 ); $lengths = $lengths-> slice([ 1, MAX ]); my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX - 1, $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say {STDERR} "total time: ", time - $t; [download] Windows: The following script works on Windows with up to 8 workers. Specifying higher than 8 workers causes PDL to emit, "PDL::Internal Error: data structure recursion limit exceeded (max 1000 levels)". I also tested on Linux. No problems there including 32 workers. use strict; use warnings; use feature 'say'; use PDL; use File::Map; # ensure that Perl has File::Map before loading FastRaw use PDL::IO::FastRaw; use MCE::Signal '$tmp_dir'; use MCE::Flow; { no warnings 'once'; $PDL::BIGPDL = 1; eval q{ PDL::set_autopthread_targ(1) }; } use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => shift \|\| 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 40000, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; # create a raw file for lengths writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" ); my $max_workers = $^O eq 'MSWin32' ? 8 : MCE::Util::get_ncpu(); my $t = time; my $lengths; MCE::Flow->init( max_workers => _min( $max_workers, MCE::Util::get_ncpu() ), chunk_size => CHUNK + 1, bounds_only => 1, init_relay => 1, user_begin => sub { $lengths = mapfraw( "$tmp_dir/lengths" ); if ( MCE->wid == 1 ) { $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); } MCE->sync; }, ); mce_flow_s sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; my ( $from, $to ) = @{ $chunk_ref }; my $seqs_c = $from + sequence( longlong, $to - $from + 1 ); if ( $chunk_id == 1 ) { $seqs_c-> setbadat( 0 ); $seqs_c-> setbadat( 1 ); $seqs_c-> badvalue( 2 ); } else { $seqs_c-> setbadat( $from % 2 ? 1 : 0 ); $seqs_c-> slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2 +; $seqs_c-> badvalue( 2 ); } my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## MCE::relay { ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN; ## }; ## }, 0, MAX; MCE::Flow->finish; say {STDERR} "compute time: ", time - $t; # same finale $lengths = mapfraw( "$tmp_dir/lengths" ); $lengths-> badflag( 0 ); $lengths = $lengths-> slice([ 1, MAX ]); my $sorted_i = $lengths-> qsorti; my $sorted = $lengths-> index( $sorted_i ); my $value = $sorted-> at( MAX - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ MAX - 1, $pos ]); ( my $result = $lengths -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + 1 ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; say $result; say {STDERR} "total time: ", time - $t; [download] Results: Compute time excludes initial piddle creation and final sorting/output as these run serially. Total time is captured by running time perl script.pl. 1e7: serial total 15.272s, compute 14.674s 1 core parallel total 8.327s, compute 7.478s 2 cores parallel total 4.726s, compute 3.912s 4 cores parallel total 2.855s, compute 2.025s 8 cores parallel total 1.911s, compute 1.085s 16 cores parallel total 1.484s, compute 0.835s 32 cores [ [ 686 8400511] [ 668 8865705] [ 665 6649279] [ 663 9973919] [ 621 6674175] [ 616 7332399] [ 616 7532665] [ 613 5649499] [ 611 8474249] [ 608 6355687] [ 606 8847225] [ 606 9533531] [ 603 6635419] [ 601 9953129] [ 598 7464846] [ 598 7464847] [ 597 3732423] [ 595 5598635] [ 593 8397953] [ 590 6298465] ] 1e8: parallel total 11.779s, compute 5.631s 32 cores [download] Regards, Mario	[reply] [d/l] [select]
Re^5: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 26, 2020 at 22:14 UTC
Greetings, The prior demonstration involves sorting (qsorti) using one core, taking ~ 5 extra seconds for 1e8. That led me to try parallel sorting inside user_end. Code for both UNIX and Windows are provided. Update 1: Output not 100% consistent. Not suited for parallelism due to cache miss. The non-PDL solutions here and here handle cache miss. But not yet here. Update 2: Output now 100% consistent, possible with MCE::relay. UNIX: use strict; use warnings; use feature 'say'; BEGIN { # Does not work on Windows unfortunately. die "Sorry, this script requires a UNIX based OS, exiting...\n" if $^O eq 'MSWin32'; } use PDL; use File::Map; # ensure that Perl has File::Map before loading FastRaw use PDL::IO::FastRaw; use MCE::Signal '$tmp_dir'; use MCE::Flow; use MCE::Candy; { no warnings 'once'; $PDL::BIGPDL = 1; eval q{ PDL::set_autopthread_targ(1) }; } use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => shift \|\| 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 40000, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; my $t = time; # create a raw file for lengths writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" ); # memory map the raw file my $lengths = mapfraw( "$tmp_dir/lengths" ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); my @top_seqs; MCE::Flow->init( max_workers => MCE::Util::get_ncpu(), chunk_size => CHUNK + 1, bounds_only => 1, init_relay => 1, gather => MCE::Candy::out_iter_array(\@top_seqs), user_end => sub { # wait for any remaining workers to complete processing MCE->sync; my $size = MAX / MCE->max_workers + 1; my $from = ( MCE->wid - 1 ) * $size + 1; my $to = $from + $size; $from++ if $from > 1; $to = MAX if $to > MAX; my $lengths_c = $lengths-> slice([ $from, $to ]); $lengths_c-> badflag( 0 ); my $sorted_i = $lengths_c-> qsorti; my $sorted = $lengths_c-> index( $sorted_i ); my $value = $sorted-> at( $to - $from - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ $to - $from, $pos ]); ( my $result = $lengths_c -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + $from ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ], my $str = $result->string; $str =~ s/(\d+)\s+(\d+)(.)/$2,$1$3,/g; my $ret = eval $str; MCE->gather( MCE->wid, @$ret ); }, ); mce_flow_s sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; my ( $from, $to ) = @{ $chunk_ref }; my $seqs_c = $from + sequence( longlong, $to - $from + 1 ); if ( $chunk_id == 1 ) { $seqs_c-> setbadat( 0 ); $seqs_c-> setbadat( 1 ); $seqs_c-> badvalue( 2 ); } else { $seqs_c-> setbadat( $from % 2 ? 1 : 0 ); $seqs_c-> slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2 +; $seqs_c-> badvalue( 2 ); } my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from * 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## MCE::relay { ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN; ## }; ## }, 0, MAX; MCE::Flow->finish; @top_seqs = ( sort { $b->[1] <=> $a->[1]} @top_seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @top_seqs; say {STDERR} time - $t; [download] Windows:* The following script works on Windows with up to 8 workers. Specifying higher than 8 workers causes PDL to emit, "PDL::Internal Error: data structure recursion limit exceeded (max 1000 levels)". I also tested on Linux. No problems there including 32 workers. use strict; use warnings; use feature 'say'; use PDL; use File::Map; # ensure that Perl has File::Map before loading FastRaw use PDL::IO::FastRaw; use MCE::Signal '$tmp_dir'; use MCE::Flow; use MCE::Candy; { no warnings 'once'; $PDL::BIGPDL = 1; eval q{ PDL::set_autopthread_targ(1) }; } use List::Util; BEGIN { _min = \&List::Util::min; # collision _max = \&List::Util::max } # with PDL use constant MAX => shift \|\| 1e7; use constant TOP => _min( 20, MAX ); use constant CHUNK => _min( 40000, MAX ); # but keep it even use constant MAXLEN => MAX * 1; # ?? # x(1..2) use Time::HiRes 'time'; my $t = time; # create a raw file for lengths writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" ); my $max_workers = $^O eq 'MSWin32' ? 8 : MCE::Util::get_ncpu(); my @top_seqs; my $lengths; MCE::Flow->init( max_workers => _min( $max_workers, MCE::Util::get_ncpu() ), chunk_size => CHUNK + 1, bounds_only => 1, init_relay => 1, gather => MCE::Candy::out_iter_array(\@top_seqs), user_begin => sub { $lengths = mapfraw( "$tmp_dir/lengths" ); if ( MCE->wid == 1 ) { $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); } MCE->sync; }, user_end => sub { # wait for any remaining workers to complete processing MCE->sync; my $size = MAX / MCE->max_workers + 1; my $from = ( MCE->wid - 1 ) * $size + 1; my $to = $from + $size; $from++ if $from > 1; $to = MAX if $to > MAX; my $lengths_c = $lengths-> slice([ $from, $to ]); $lengths_c-> badflag( 0 ); my $sorted_i = $lengths_c-> qsorti; my $sorted = $lengths_c-> index( $sorted_i ); my $value = $sorted-> at( $to - $from - TOP ); my $pos = vsearch_insert_leftmost( $value, $sorted ); my $top_i = $sorted_i-> slice([ $to - $from, $pos ]); ( my $result = $lengths_c -> index( $top_i ) -> longlong -> bitnot -> cat( $top_i + $from ) -> transpose -> qsortvec -> slice([], [ 0, TOP - 1 ]) )-> slice([ 0 ], []) -> inplace -> bitnot; # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ], my $str = $result->string; $str =~ s/(\d+)\s+(\d+)(.)/$2,$1$3,/g; my $ret = eval $str; MCE->gather( MCE->wid, @$ret ); }, ); mce_flow_s sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; my ( $from, $to ) = @{ $chunk_ref }; my $seqs_c = $from + sequence( longlong, $to - $from + 1 ); if ( $chunk_id == 1 ) { $seqs_c-> setbadat( 0 ); $seqs_c-> setbadat( 1 ); $seqs_c-> badvalue( 2 ); } else { $seqs_c-> setbadat( $from % 2 ? 1 : 0 ); $seqs_c-> slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2 +; $seqs_c-> badvalue( 2 ); } my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } # "_e" is for "at even positions, ahead" ## ## # my $from_e = _max( $from * 2, $to ) + 2; # bug ## my $from_e = $from == 0 ? $to + 2 : $from * 2; # fixed ## my $to_e = _min( $to * 2, MAXLEN ); ## ## MCE::relay { ## ( $lengths-> slice([ $from_e, $to_e, 2 ]) ## .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++ ## if $from_e <= MAXLEN; ## }; ## }, 0, MAX; MCE::Flow->finish; @top_seqs = ( sort { $b->[1] <=> $a->[1]} @top_seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @top_seqs; say {STDERR} time - $t; [download] Results:* time perl script.pl 1e7: serial 15.311s 1 core parallel 7.898s 2 cores parallel 4.229s 4 cores parallel 2.244s 8 cores parallel 1.265s 16 cores parallel 0.815s 32 cores 1e8: serial 2m38.645s 1 core parallel 11.779s 32 cores before: serial qsorti parallel 6.652s 32 cores after : parallel qsorti parallel 2.656s 32 cores non-PDL solution parallel 1.644s 32 cores non-PDL solution with Inline::C https://perlmonks.org/?node_id=11115780 Collatz(63728127) has sequence length of 950 steps Collatz(95592191) has sequence length of 948 steps Collatz(96883183) has sequence length of 811 steps Collatz(86010015) has sequence length of 798 steps Collatz(98110761) has sequence length of 749 steps Collatz(73583070) has sequence length of 746 steps Collatz(73583071) has sequence length of 746 steps Collatz(36791535) has sequence length of 745 steps Collatz(55187303) has sequence length of 743 steps Collatz(56924955) has sequence length of 743 steps Collatz(82780955) has sequence length of 741 steps Collatz(85387433) has sequence length of 741 steps Collatz(63101607) has sequence length of 738 steps Collatz(64040575) has sequence length of 738 steps Collatz(93128574) has sequence length of 736 steps Collatz(93128575) has sequence length of 736 steps Collatz(94652411) has sequence length of 736 steps Collatz(96060863) has sequence length of 736 steps Collatz(46564287) has sequence length of 735 steps Collatz(69846431) has sequence length of 733 steps [download] Regards, Mario	[reply] [d/l] [select]
Re^6: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on Apr 22, 2022 at 13:11 UTC
Re^7: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 23, 2022 at 16:15 UTC
Some notes below your chosen depth have not been shown here

In Section Meditations