Re: Optimizing with Caching vs. Parallelizing (MCE::Map)

Hi dear fellow monks,

as a follow-up to my previous post, I implemented my new caching strategy, consisting in storing in the cache the length of the sequences rather than the full sequences.

On the computer where I'm running my tests right now, my original program has the following timing:

$ time perl collatz.pl
837799: 525
626331: 509
[lines omitted for brevity]
906175: 445
922524: 445
922525: 445

real    1m37,551s
user    1m9,375s
sys     0m21,031s
[download]

My laptop is obviously much slower than Nick's computer (where this program took 22 seconds).

This is now my first implementation with the new caching strategy:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my %cache = (2 => 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (exists $cache{$n}) {
            $result += $cache{$n};
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache{$n} = $cache{$new_n} + 1 
                if defined $cache{$new_n} and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs;
say  "$_->[0]: $_->[1]" for @long_seqs[0..19];
[download]

This program produces the same outcome, but is nearly 3 times faster:

real    0m34,207s
user    0m34,108s
sys     0m0,124s
[download]

But we now end up with a cache having essentially one entry per input number in the 1..1_000_000 range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup).

This is the code for this new implementation:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my @cache = (0, 1, 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (defined $cache[$n]) {
            $result += $cache[$n];
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache[$n] = $cache[$new_n] + 1 
                if defined $cache[$new_n] and $n < MAX;
            $n = $new_n;
        }
    }
    $cache[$input] = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}

@long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs;
say  "$_->[0]: $_->[1]" for @long_seqs[0..19];
[download]

With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache):

$ time perl collatz3.pl
837799: 525
626331: 509
[Lines omitted for brevity]
922524: 445
922525: 445

real    0m1,755s
user    0m1,687s
sys     0m0,061s
[download]

I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it.

So, it is true that throwing more CPU cores at the problem makes the solution faster (although to a limited extent with my computer that has only 4 cores). But using a better algorithm can often be a better solution.

Comment on Re: Optimizing with Caching vs. Parallelizing (MCE::Map) Select or Download Code

Replies are listed 'Best First'.
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 14, 2020 at 23:21 UTC
Hi Laurent_R and fellow monks, Nice speedup! I applied 3 optimizations in preparation for a follow-up post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array). So, I asked myself can anything be done to further improve the serial implementation. And if yes, is it helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible and will be converting collatz3_d.pl to use File::Map for the parallel demonstration. Update: Further improvements, Step 4. 1. Replaced division by 2. `$n >> 1;` [download] 2. Removed one level of branching. `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1; $result++; $cache[$n] = $cache[$new_n] + 1 if defined $cache[$new_n] and $n < $max; $n = $new_n; }` [download] 3. Then reduced the number of loop iterations. Credit for reducing the # of loop iterations was from watching Notation and compressed dynamics, one minute into it (i.e. the T(x) notation). `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1) if defined $cache[$new_n] and $n < $max; $n = $new_n; }` [download] 4. Finally, less caching. `while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; }` [download] The final code optionally takes an argument. #!/usr/bin/perl use strict; use warnings; use feature qw/say/; my $max = shift \|\| 1e6; $max = 1e6 if $max < 1e6; my @cache = (0, 1, 2); sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; my $new_n; while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); $n = $new_n; } $cache[$input] = $result if $input < $max; return $result; } my @long_seqs; for my $num (1..$max) { my $seq_length = collatz_seq $num; push @long_seqs, [ $num, $seq_length ] if $seq_length > 400; } @long_seqs = sort { $b->[1] <=> $a->[1]} @long_seqs; say "$_->[0]: $_->[1]" for @long_seqs[0..19]; [download] Results: $ time perl collatz3_a.pl 1e7 # Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 32.291s (a) original code, accepts argument collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations collatz3_e.pl 1e7 19.357s (e) d + less caching # AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original code, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + less caching 8400511: 686 8865705: 668 6649279: 665 9973919: 663 6674175: 621 7332399: 616 7532665: 616 5649499: 613 8474249: 611 6355687: 608 8847225: 606 9533531: 606 6635419: 603 9953129: 601 7464846: 598 7464847: 598 3732423: 597 5598635: 595 8397953: 593 6298465: 590 [download] Regards, Mario	[reply] [d/l] [select]
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) by Laurent_R (Canon) on Apr 16, 2020 at 12:57 UTC
Hi Mario, thanks a lot for your comments and suggestions. I did not expect such micro-optimizations to bring such a significant performance improvement, that's interesting. Especially replacing the division by 2 by a bit shift is the type of thing that I had stopped doing decades ago, when I figured that the C compiler I was using at the time was doing this type of optimization at least as well and often better than I was able to do. Good to be reminded that the optimizations you suggested can also be quite useful. Thank you. Following this very interesting thread and your new thread on MCE::Flow + Caching via File::Map, I made a new blog post on the Collatz sequence, trying to summarize some of the findings: http://blogs.perl.org/users/laurent_r/2020/04/revisiting-the-collatz-sequence-pwc-54.html. I did not try to explain your demonstration using `File::Map` for caching with parallel processing (your other thread) because I'm not sure to fully understand everything and was afraid of mis-representing your work. People can follow the link and read your own words. Thank you very much for your challenging ideas.	[reply] [d/l]
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 17, 2020 at 00:00 UTC
Hi Laurent, Thank you, for your caching algorithm. It has kept me busy after hours. I added Step 4 for less caching. Basically, commenting 2 lines. I updated my posts here and here. `sub collatz_seq { my $input = shift; my $n = $input; my $result = 0; my $new_n; while ($n != 1) { $result += $cache[$n], last if defined $cache[$n]; $n % 2 ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 ) : ( $result += 1, $new_n = $n >> 1 ); # $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1) # if defined $cache[$new_n] and $n < $max; $n = $new_n; } $cache[$input] = $result if $input < $max; return $result; }` [download] A new member collatz3_e was added to the list. # Intel i7 laptop, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 32.291s (a) original, accepts argument collatz3_b.pl 1e7 30.134s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 28.503s (c) b + removed 1 level of branching collatz3_d.pl 1e7 21.464s (d) c + reduced loop iterations collatz3_e.pl 1e7 19.357s (e) d + caching less # AMD 3970x, Docker Container, Ubuntu + Perlbrew Perl 5.30.1 collatz3_a.pl 1e7 13.130s (a) original, accepts argument collatz3_b.pl 1e7 12.394s (b) a + replaced division with >> 1 collatz3_c.pl 1e7 12.261s (c) b + removed 1 level of branching collatz3_d.pl 1e7 9.170s (d) c + reduced loop iterations collatz3_e.pl 1e7 7.681s (e) d + caching less [download] The 32-core machine reaches below 0.5 seconds for size 1e7. That includes the time to launch Perl, load modules, spin up and reap workers. The Collatz Conjenture took over me. And finally am able to move on. Kind regards, Mario	[reply] [d/l] [select]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks