http://qs321.pair.com?node_id=11115540


in reply to Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Hi Laurent_R and fellow monks,

Nice speedup! I applied 3 optimizations in preparation for a follow-up post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array). So, I asked myself can anything be done to further improve the serial implementation. And if yes, is it helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible and will be converting collatz3_d.pl to use File::Map for the parallel demonstration.

Update: Further improvements, Step 4.

1. Replaced division by 2.

$n >> 1;

2. Removed one level of branching.

while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1; $result++; $cache[$n] = $cache[$new_n] + 1 if defined $cache[$new_n] and $n < $max; $n = $new_n; }

3. Then reduced the number of loop iterations. Credit for reducing the # of loop iterations was from watching Notation and compressed dynamics, one minute into it (i.e. the T(x) notation).

while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1) if defined $cache[$new_n] and $n < $max; $n = $new_n; }

4. Finally, less caching.

while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; }

The final code optionally takes an argument.

#!/usr/bin/perl use strict; use warnings; use feature qw/say/; my $max = shift || 1e6; $max = 1e6 if $max < 1e6; my @cache = (0, 1, 2); sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; my $new_n; while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; } $cache[$input] = $result if $input < $max; return $result; } my @long_seqs; for my $num (1..$max) { my $seq_length = collatz_seq $num; push @long_seqs, [ $num, $seq_length ] if $seq_length > 400; } @long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs; say "$_->[0]: $_->[1]" for @long_seqs[0..19];

Results:

$ time perl collatz3_a.pl 1e7 # Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 32.291s (a) original code, accepts argument collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations collatz3_e.pl 1e7 19.357s (e) d + less caching # AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original code, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + less caching 8400511: 686 8865705: 668 6649279: 665 9973919: 663 6674175: 621 7332399: 616 7532665: 616 5649499: 613 8474249: 611 6355687: 608 8847225: 606 9533531: 606 6635419: 603 9953129: 601 7464846: 598 7464847: 598 3732423: 597 5598635: 595 8397953: 593 6298465: 590

Regards, Mario

Replies are listed 'Best First'.
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 16, 2020 at 12:57 UTC
    Hi Mario,

    thanks a lot for your comments and suggestions.

    I did not expect such micro-optimizations to bring such a significant performance improvement, that's interesting. Especially replacing the division by 2 by a bit shift is the type of thing that I had stopped doing decades ago, when I figured that the C compiler I was using at the time was doing this type of optimization at least as well and often better than I was able to do. Good to be reminded that the optimizations you suggested can also be quite useful. Thank you.

    Following this very interesting thread and your new thread on MCE::Flow + Caching via File::Map, I made a new blog post on the Collatz sequence, trying to summarize some of the findings: http://blogs.perl.org/users/laurent_r/2020/04/revisiting-the-collatz-sequence-pwc-54.html. I did not try to explain your demonstration using File::Map for caching with parallel processing (your other thread) because I'm not sure to fully understand everything and was afraid of mis-representing your work. People can follow the link and read your own words.

    Thank you very much for your challenging ideas.

      Hi Laurent,

      Thank you, for your caching algorithm. It has kept me busy after hours. I added Step 4 for less caching. Basically, commenting 2 lines.

      I updated my posts here and here.

      sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; my $new_n; while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); # $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1) # if defined $cache[$new_n] and $n < $max; $n = $new_n; } $cache[$input] = $result if $input < $max; return $result; }

      A new member collatz3_e was added to the list.

      # Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 32.291s (a) original, accepts argument collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations collatz3_e.pl 1e7 19.357s (e) d + caching less # AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + caching less

      The 32-core machine reaches below 0.5 seconds for size 1e7. That includes the time to launch Perl, load modules, spin up and reap workers.

      The Collatz Conjenture took over me. And finally am able to move on.

      Kind regards, Mario