http://qs321.pair.com?node_id=11115544

Dearest Monks of the Monastery,

Recently, I tried computing the longest Collatz progression here. I was pleasantly surprised by File::Map's performance. Our fellow monk Laurent_R posted an update to his original code for computing the Collatz sequences. And what a speedup it is.

Here, I want to try Laurent's code and run parallel. Yes, with caching. The first thing I do typically is apply optimizations to the serial implementation. Because you know, just think of any domino impact running parallel might have. See my update to Laurent's code. That went well and so will take collatz3_e.pl there and use File::Map here. This is exciting for me because this is a great use case for File::Map for running the algorithm in parallel. But with all things, a serial version using File::Map is needed for comparison.

Update 1: Map using map_anonymous, previously map_file.
Update 2: Use 16-bit signed integer for pack/unpack.

Note: The OS must have ~ 3.8 GiB of available memory to compute size 1e9.

collatz3_filemap:

#!/usr/bin/env perl use strict; use warnings; use File::Map qw/map_anonymous unmap/; my $size = shift || 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum map_anonymous my $cache, $size * 2, 'shared'; # init cache with -1's, then set 0, 1, 2 substr($cache, 0, $size * 2, ( my $neg1 = pack('s', -1) ) x $size); substr($cache, $_ * 2, 2, pack('s', $_)) for 0..2; my @seqs; sub collatz_seq { my ( $seq_beg, $seq_end ) = @_; my ( $n, $steps, $tmp ); for my $input ( $seq_beg..$seq_end ) { $n = $input, $steps = 0; while ( $n != 1 ) { $steps += unpack('s', $tmp), last if ($n < $size && ($tmp = substr($cache, $n * 2, 2)) n +e $neg1); $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ); } substr($cache, $input * 2, 2, pack('s', $steps)) if $input < $ +size; push @seqs, [ $input, $steps ] if $steps > 400; } } collatz_seq(2, $size); unmap $cache; @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

collatz3_parallel:

This is the serial implementation converted to run parallel. The collatz_seq function is identical, no changes there.

#!/usr/bin/env perl use strict; use warnings; use File::Map qw/map_anonymous unmap/; use MCE::Flow; use MCE::Candy; my $size = shift || 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum map_anonymous my $cache, $size * 2, 'shared'; # init cache with -1's, then set 0, 1, 2 substr($cache, 0, $size * 2, ( my $neg1 = pack('s', -1) ) x $size); substr($cache, $_ * 2, 2, pack('s', $_)) for 0..2; # local to workers and the manager process my @seqs; sub collatz_seq { my ( $seq_beg, $seq_end ) = @_; my ( $n, $steps, $tmp ); for my $input ( $seq_beg..$seq_end ) { $n = $input, $steps = 0; while ( $n != 1 ) { $steps += unpack('s', $tmp), last if ($n < $size && ($tmp = substr($cache, $n * 2, 2)) n +e $neg1); $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ); } substr($cache, $input * 2, 2, pack('s', $steps)) if $input < $ +size; push @seqs, [ $input, $steps ] if $steps > 400; } } my $chunk_size; $chunk_size = int( $size / MCE::Util::get_ncpu() / 80 + 1 ); $chunk_size += 1 if $chunk_size % 2; mce_flow_s { max_workers => MCE::Util::get_ncpu(), chunk_size => $chunk_size, bounds_only => 1, gather => MCE::Candy::out_iter_array(\@seqs), }, sub { my ($mce, $chunk_ref, $chunk_id) = @_; collatz_seq(@{ $chunk_ref }); @seqs > 20 ? MCE->gather($chunk_id, ( sort { $b->[1] <=> $a->[1] } @seqs +)[ 0..19 ]) : MCE->gather($chunk_id, @seqs); @seqs = (); }, 2, $size; MCE::Flow->finish; unmap $cache; @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

Results:

Caching using File::Map obviously will have overhead plus having to serialize/unserialize using pack/unpack.

$ time perl collatz3_a.pl 1e7 AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + less caching collatz3_filemap 8.889s 1 core collatz3_parallel 8.982s 1 core collatz3_parallel 4.548s 2 cores collatz3_parallel 2.378s 4 cores collatz3_parallel 1.233s 8 cores collatz3_parallel 0.661s 16 cores collatz3_parallel 0.408s 32 cores Collatz(8400511) has sequence length of 686 steps Collatz(8865705) has sequence length of 668 steps Collatz(6649279) has sequence length of 665 steps Collatz(9973919) has sequence length of 663 steps Collatz(6674175) has sequence length of 621 steps Collatz(7332399) has sequence length of 616 steps Collatz(7532665) has sequence length of 616 steps Collatz(5649499) has sequence length of 613 steps Collatz(8474249) has sequence length of 611 steps Collatz(6355687) has sequence length of 608 steps Collatz(8847225) has sequence length of 606 steps Collatz(9533531) has sequence length of 606 steps Collatz(6635419) has sequence length of 603 steps Collatz(9953129) has sequence length of 601 steps Collatz(7464846) has sequence length of 598 steps Collatz(7464847) has sequence length of 598 steps Collatz(3732423) has sequence length of 597 steps Collatz(5598635) has sequence length of 595 steps Collatz(8397953) has sequence length of 593 steps Collatz(6298465) has sequence length of 590 steps

Some will say, let's add cores. Some will say, let's improve the algorithm. Few might say, let's try both. It turns out that caching and parallel are possible. It's unbelievable, TBH. Processors are equipped with many CPU cores. I made the time to try and retry. Mainly, for future Monks to the Monastery, way after I'm gone. Years ago the saying was, "IO and Parallel" isn't possible. Input IO in MCE is sequential, not random.

What I have witnessed tonight is that Meta::Cpan is a treasure box. In other words, a big gigantic box of Legos. I opened the box and picked out File::Map, MCE::Flow and then went over to this wonderful Monastery. There I looked for Laurent_R's code.

I tried this not knowing what to expect. This is my first time using File::Map with many workers.

Regards, Mario

Replies are listed 'Best First'.
Re: MCE::Flow + Caching via File::Map
by marioroy (Prior) on Apr 19, 2020 at 06:44 UTC

    Greetings,

    I am pleased to provide the final work. Computing collatz_seq (top 20) for 1e9 requires ~ 3.8 GiB of available memory. Try 5e8 for lesser memory consumption.

    This is a parallel demonstration using MCE::Flow and File::Map for caching. The 2nd example uses Inline::C for counting the number of steps. Unlike the other examples, workers working on chunks 2 and higher require results from prior chunks. This is accounted for. Obviously the worker processing chunk 1 doesn't need prior results. Once the workers being processing (i.e. after the mapped cache creation), it doesn't take long before full CPU utilization kicks in.

    Q. Why does this work, especially when subsequent chunks need results from prior chunks?

    A. The magic lies with using a smaller chunk size set to 2500. The initial ramp up is a one time occurrence. One cool thing about MCE is that input IO is sequential. This applies to number sequences as well. A worker obtaining chunk 1 begins processing immediately (there's no wasting time). The worker obtaining the next chunk begins processing but may need to pause a little here and there. Eventually, the chunks as far as timing goes (starting) are spread out where workers need to pause left often. This can be seen by looking at the CPU utilization. The Power of Randomness kicks in at some point with CPU utilization near 100% until completion.

    Credit to PerlMonks choroba, Laurent_R, 1nickt, rjt, and vr. See this thread. Worthy mention goes to Leon Timmermans, author of File::Map. Wow!

    Credit for the caching technique used here is based on the caching demonstration by iM71, a response in that thread. My first attempt at parallelization failed. I tried again by maximizing on MCE's strengths described above. It's surreal :)

    Credit for reducing the number of loop iterations was from watching Notation and compressed dynamics, one minute into the video (i.e. the T(x) notation).

    Below, the minimum and maximum argument (size) is 1e6 and 1e9 respectively. The two scripts will set to limit quietly if exceeded.

    See also this thread for computing the longest Collatz.

    Cache miss update:

    Cache miss is less than 1%. Therefore, it is faster to compute for $n than waiting for the result.

    Final update:

    #!/usr/bin/env perl use strict; use warnings; use File::Map qw/map_anonymous unmap/; use Time::HiRes qw/usleep/; use MCE::Flow; use MCE::Candy; my $size = shift || 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum map_anonymous my $cache, $size * 2 + 2, 'shared'; # fill cache with zeroes substr($cache, 0, $size * 2 + 2, pack('s', 0) x ( $size + 1 )); # local to workers and the manager process my @seqs; sub collatz_seq { my ( $chunk_id, $seq_beg, $seq_end ) = @_; my ( $n, $steps, $tmp ); for my $input ( $seq_beg..$seq_end ) { $n = $input, $steps = 0; $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ) while $n != 1 && $n >= $input; $tmp = unpack('s', substr($cache, $n * 2, 2)); # another worker with a lesser chunk_id is not yet # completed processing $n, so pause a little # if ( $tmp == 0 && $chunk_id > 1 ) { # do { # usleep 100; # $tmp = unpack('s', substr($cache, $n * 2, 2)); # } while ( $tmp == 0 ); # } # do this instead (faster): compute $n if cache miss $tmp = _collatz($n) if $tmp == 0 && $chunk_id > 1; substr($cache, $input * 2, 2, pack('s', $steps += $tmp)); push @seqs, [ $input, $steps + 1 ] if $steps > 400; } } sub _collatz { my ( $input ) = @_; my ( $n, $steps ) = ( $input, 0 ); $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 ) : ( $steps += 1, $n = $n >> 1 ) while $n != 1 && $n >= $input; my $tmp = unpack('s', substr($cache, $n * 2, 2)); $tmp = _collatz($n) if $tmp == 0; substr($cache, $input * 2, 2, pack('s', $steps += $tmp)); return $steps } my $chunk_size; $chunk_size = int( $size / MCE::Util::get_ncpu() / 80 + 1 ); $chunk_size += 1 if $chunk_size % 2; MCE::Flow->init( max_workers => MCE::Util::get_ncpu(), chunk_size => $chunk_size, # specify 2500 if pausing above bounds_only => 1, gather => MCE::Candy::out_iter_array(\@seqs), ); mce_flow_s sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; collatz_seq($chunk_id, @{ $chunk_ref }); @seqs > 20 ? MCE->gather($chunk_id, ( sort { $b->[1] <=> $a->[1] } @seqs +)[ 0..19 ]) : MCE->gather($chunk_id, @seqs); @seqs = (); }, 2, $size; MCE::Flow->finish; unmap $cache; @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

    Count steps via Inline C:

    #!/usr/bin/env perl use strict; use warnings; use File::Map qw/map_anonymous unmap/; use Time::HiRes qw/usleep/; use MCE::Flow; use MCE::Candy; use Inline C => Config => CCFLAGSEX => '-O2 -fomit-frame-pointer'; use Inline C => <<'END_OF_C_CODE'; #include <stdint.h> void num_steps_c( SV* _n, SV* _s ) { uint64_t n, input; int steps = 0; n = input = SvUV(_n); while ( n != 1 && n >= input ) { n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 ) : ( steps += 1, n = n >> 1 ); } sv_setuv(_n, n); sv_setiv(_s, steps); return; } END_OF_C_CODE my $size = shift || 1e6; $size = 1e6 if $size < 1e6; # minimum $size = 1e9 if $size > 1e9; # maximum map_anonymous my $cache, $size * 2 + 2, 'shared'; # fill cache with zeroes substr($cache, 0, $size * 2 + 2, pack('s', 0) x ( $size + 1 )); # local to workers and the manager process my @seqs; sub collatz_seq { my ( $chunk_id, $seq_beg, $seq_end ) = @_; my ( $n, $steps, $tmp ); for my $input ( $seq_beg..$seq_end ) { num_steps_c($n = $input, $steps); $tmp = unpack('s', substr($cache, $n * 2, 2)); # another worker with a lesser chunk_id is not yet # completed processing $n, so pause a little # if ( $tmp == 0 && $chunk_id > 1 ) { # do { # usleep 100; # $tmp = unpack('s', substr($cache, $n * 2, 2)); # } while ( $tmp == 0 ); # } # do this instead (faster): compute $n if cache miss $tmp = _collatz($n) if $tmp == 0 && $chunk_id > 1; substr($cache, $input * 2, 2, pack('s', $steps += $tmp)); push @seqs, [ $input, $steps + 1 ] if $steps > 400; } } sub _collatz { my ( $input ) = @_; num_steps_c( my $n = $input, my $steps ); my $tmp = unpack('s', substr($cache, $n * 2, 2)); $tmp = _collatz($n) if $tmp == 0; substr($cache, $input * 2, 2, pack('s', $steps += $tmp)); return $steps } my $chunk_size; $chunk_size = int( $size / MCE::Util::get_ncpu() / 80 + 1 ); $chunk_size += 1 if $chunk_size % 2; MCE::Flow->init( max_workers => MCE::Util::get_ncpu(), chunk_size => $chunk_size, # specify 2500 if pausing above bounds_only => 1, gather => MCE::Candy::out_iter_array(\@seqs), ); mce_flow_s sub { my ( $mce, $chunk_ref, $chunk_id ) = @_; collatz_seq($chunk_id, @{ $chunk_ref }); @seqs > 20 ? MCE->gather($chunk_id, ( sort { $b->[1] <=> $a->[1] } @seqs +)[ 0..19 ]) : MCE->gather($chunk_id, @seqs); @seqs = (); }, 2, $size; MCE::Flow->finish; unmap $cache; @seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ]; printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

    Results: Unix time (i.e. time perl script.pl 1e9).

    1e9, 32 cores collatz_seq_before 0m34.211s collatz_seq_final 0m26.340s collatz_seq_inline_c 0m15.673s Collatz(670617279) has sequence length of 987 steps Collatz(848749995) has sequence length of 977 steps Collatz(954843745) has sequence length of 972 steps Collatz(954843751) has sequence length of 972 steps Collatz(716132809) has sequence length of 969 steps Collatz(537099606) has sequence length of 966 steps Collatz(537099607) has sequence length of 966 steps Collatz(268549803) has sequence length of 965 steps Collatz(805649409) has sequence length of 964 steps Collatz(805649410) has sequence length of 964 steps Collatz(805649411) has sequence length of 964 steps Collatz(805649415) has sequence length of 964 steps Collatz(402824705) has sequence length of 963 steps Collatz(604237057) has sequence length of 961 steps Collatz(604237058) has sequence length of 961 steps Collatz(604237059) has sequence length of 961 steps Collatz(302118529) has sequence length of 960 steps Collatz(906355586) has sequence length of 959 steps Collatz(906355587) has sequence length of 959 steps Collatz(906355588) has sequence length of 959 steps

    Regards, Mario