Re: Optimizing with Caching vs. Parallelizing (MCE::Map)

in reply to Optimizing with Caching vs. Parallelizing (MCE::Map)

Great discussions! I'm sorry I missed all this; I haven't had much time to visit here recently.

I'm the one who contributed the task, so I have some interest in this beyond my own personal fascination with the Collatz sequence.

My own solution uses memoization and a couple of other optimizations. The full code is at https://github.com/manwar/perlweeklychallenge-club/blob/master/challenge-054/ryan-thompson/perl/ch-2.pl, and my "review" of my solution is at https://perlweeklychallenge.org/blog/review-challenge-054/#ryan-thompson2. That page also contains links to, and reviews of, every solution submitted by the participants in the challenge that week. None of them used MCE::Map, which is a shame!

Data structures and bookkeeping were key to good performance on this one, for me:

use Getopt::Long;

my @seqlen = (-1,1);    # Memoize sequence length
my $top    = 20;        # Report this many of the top sequences
my @top    = [ -1,-1 ]; # Top $top sequences
my $upper  = 1e6;       # Upper limit starting term
my $mintop = 0;         # Lowest value in @top

GetOptions('top=i' => \$top, 'upper=i' => \$upper);

Here is the main loop:

for (my $start = 3; $start < $upper; $start += 2) {
    my ($n, $len) = ($start, 0);
    while (! defined $seqlen[$n]) {
        $len += 1 + $n % 2;
        $n = $n % 2 ? (3*$n + 1)/2 : $n / 2;
    }
    $len += $seqlen[$n];
    $seqlen[$start] = $len if $start < $upper * 2; # Cache
    top($start => $len)            if $len > $mintop  and     $start <= $upper;
    top($n * 2 => $seqlen[$n] + 1) if   $n < $upper/2 and $seqlen[$n] > $mintop;
}

I avoid function call overhead by making sure everything is done iteratively. As another optimization, instead of simply doing 3n+1 for odd numbers, I do (3n+1)/2 and increment the sequence length by two instead of one; 3n+1 is always even when n is odd, so the halving step that follows can be folded in. And finally, I'm able to skip even-numbered starts with a little creative arithmetic in my second call to top().
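To make the (3n+1)/2 shortcut concrete, here is a minimal sketch (not the code from the challenge solution) that counts sequence terms both ways and checks that they agree. The sub names collatz_len_plain and collatz_len_shortcut are made up for illustration, and lengths count terms, so the length for 1 is 1, matching the @seqlen = (-1,1) seed above.

```perl
use strict;
use warnings;

# Plain version: one term per step, whether the step is n/2 or 3n+1.
sub collatz_len_plain {
    my ($n) = @_;
    my $len = 1;
    while ($n != 1) {
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        $len++;
    }
    return $len;
}

# Shortcut version: for odd n, 3n+1 is always even, so jump straight
# to (3n+1)/2 and count two terms at once.
sub collatz_len_shortcut {
    my ($n) = @_;
    my $len = 1;
    while ($n != 1) {
        if ($n % 2) { $n = (3 * $n + 1) / 2; $len += 2 }
        else        { $n = $n / 2;           $len += 1 }
    }
    return $len;
}

# Both versions must agree for every starting number tried.
for my $n (1 .. 1000) {
    die "mismatch at $n"
        unless collatz_len_plain($n) == collatz_len_shortcut($n);
}
print "27 has length ", collatz_len_shortcut(27), "\n";
```

The shortcut can never jump over 1, because (3n+1)/2 equals 1 only for n = 1/3, so the loop condition stays valid.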

I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself. Maintaining it naively by calling sort on a million elements at the end takes longer than the above loop, and sorting a 20-item list repeatedly is even worse. Maintaining essentially a priority queue is much faster:

use List::Util qw(first);

sub top {
    my ($n, $len) = @_;

    my $idx = first { $top[$_][1] < $len } 0..$#top;
    splice @top, $idx, 0, [ $n, $len ];

    pop @top if @top > $top;
    $mintop = $top[-1][1];
}

The above sub is O(n) in the length of @top (at most $top entries), so it's not as good as a heap implementation, but it's only called when there is definitely a new element to be inserted, thanks to a bit of bookkeeping in $mintop, so I opted to keep it simple.
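For anyone who wants to play with the idea in isolation, here is a small self-contained sketch of the same bounded sorted-insert approach. It is not the code from the solution: the names offer, @best, $limit and $floor are hypothetical, and it adds an explicit fallback for the case where first finds no shorter entry.

```perl
use strict;
use warnings;
use List::Util qw(first);

my $limit = 3;   # hypothetical: keep only the best 3 entries
my @best;        # [ number, length ] pairs, longest first
my $floor = 0;   # length a candidate must beat once @best is full

sub offer {
    my ($n, $len) = @_;
    return if @best >= $limit and $len <= $floor;   # cheap rejection

    # Index of the first entry with a strictly shorter length;
    # if there is none, append at the end.
    my $idx = first { $best[$_][1] < $len } 0 .. $#best;
    $idx = @best unless defined $idx;
    splice @best, $idx, 0, [ $n, $len ];

    pop @best if @best > $limit;
    $floor = $best[-1][1] if @best >= $limit;
}

# Starting numbers with their sequence lengths (terms, including the final 1).
offer(@$_) for [ 3, 8 ], [ 6, 9 ], [ 7, 17 ], [ 9, 20 ], [ 12, 10 ], [ 18, 21 ];
print join(' ', map { "$_->[0]:$_->[1]" } @best), "\n";   # 18:21 9:20 7:17
```

Because the comparison is strict, an entry with the same length as an existing one lands after it, so ties keep the order in which they were offered.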

Performance:

real    0m0.848s
user    0m0.835s
sys     0m0.012s

Purely for crude CPU comparison purposes, Laurent's solution in Re: Optimizing with Caching vs. Parallelizing (MCE::Map) runs in 1.57 sec on the same (virtual) machine.

use strict; use warnings; omitted for brevity.


Replies are listed 'Best First'.
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) (traps for the unwary) by vr (Curate) on Apr 20, 2020 at 10:00 UTC
Many thanks for providing this wonderful challenge! If all goes at the pace it's rolling now, the fun may well continue into the month of May! Therefore, a warning: some people are in danger of gaining (or losing) more than they bargained for! :)

"I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself"

Well said, dear rjt, well said. But it seems it was YOU who fell into the trap that you so cunningly crafted for poor innocent learners!

The top 20 out of a million Collatz sequences has an insidious property: there are six "445" lengths below 1e6, but only four in the top 20 (so all six are in the top 22). Which to extract? Aha!

Well, the challenge did not clearly state how to order numbers with the same Collatz length (CL). But I think it is reasonable to assume that, since numbers with longer CL are rated better/closer to the top, smaller numbers among those producing the same CL are to be valued more. Like: "Look! This brave little number commendably creates as long a CL as that huge number! And this undeservedly huge number has only managed to generate so puny a CL! Loser!" At least there must be some consistency in arranging results, don't you agree? In other words, I think that if the CL column descends, then the numbers column must ascend (for equal CLs). See the ordering for CL 450, too.

Well, dear Perl users, I happened to notice this because in another (unpublished) solution of mine I had to endure great pain in arranging the top 20 properly. Just how many of you, looking at the output of (almost all) scripts in this and the 2 related threads, have noticed that the top 20 are neatly ordered? And yet this "natural" result comes at no cost at all with Perl! Just appreciate what you so ungratefully consume! :)

To illustrate, rjt, here is, in parallel, the output of your script and of Laurent_R's with marioroy's fixes:

Collatz(837799) has sequence length of 525 steps | 837799: 525
Collatz(626331) has sequence length of 509 steps | 626331: 509
Collatz(939497) has sequence length of 507 steps | 939497: 507
Collatz(704623) has sequence length of 504 steps | 704623: 504
Collatz(910107) has sequence length of 476 steps | 910107: 476
Collatz(927003) has sequence length of 476 steps | 927003: 476
Collatz(511935) has sequence length of 470 steps | 511935: 470
Collatz(767903) has sequence length of 468 steps | 767903: 468
Collatz(796095) has sequence length of 468 steps | 796095: 468
Collatz(970599) has sequence length of 458 steps | 970599: 458
Collatz(546681) has sequence length of 452 steps | 546681: 452
Collatz(820022) has sequence length of 450 steps | 818943: 450
Collatz(818943) has sequence length of 450 steps | 820022: 450
Collatz(820023) has sequence length of 450 steps | 820023: 450
Collatz(410011) has sequence length of 449 steps | 410011: 449
Collatz(615017) has sequence length of 447 steps | 615017: 447
Collatz(922526) has sequence length of 445 steps | 886953: 445
Collatz(922526) has sequence length of 445 steps | 906175: 445
Collatz(886953) has sequence length of 445 steps | 922524: 445
Collatz(906175) has sequence length of 445 steps | 922525: 445

But wait... What's that??? The number 922526 is listed twice, on the left?? Is this... ~~can't believe... is this because of the infamous What Every Computer Scientist Should Know About Floating-Point Arithmetic? Because someone :) decided to cut corners and resort to FP?~~ Or is it for another reason? Didn't investigate yet. Just who would have thought that this would surface in so innocent a task as "calculate lengths of Collatz sequences". Wow! Great challenge!

Edit. No, of course it's not a floating-point issue. The algorithm is broken. CLs for odd numbers are cached, but for even numbers they are not. Some even numbers are never passed to &top (e.g. 922524), others are pumped into this subroutine several times. The 922526 hadn't been phased out from @top by odd numbers with longer CLs. With $top large enough, there are many even dupes.
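The ordering proposed in this reply (length descending, then starting number ascending for equal lengths) can be written as a two-key sort. The sketch below only illustrates that convention, which the challenge itself did not specify; the sample rows are taken from the outputs above and the variable names are arbitrary.

```perl
use strict;
use warnings;

# A few [ number, length ] rows from the outputs above, deliberately unordered.
my @rows = (
    [ 820022, 450 ], [ 818943, 450 ], [ 820023, 450 ],
    [ 410011, 449 ], [ 546681, 452 ],
);

# Primary key: length, descending. Tie-break: starting number, ascending.
my @sorted = sort { $b->[1] <=> $a->[1] or $a->[0] <=> $b->[0] } @rows;

printf "%d: %d\n", @$_ for @sorted;
```

With this comparator, 818943 comes before 820022 and 820023 within the 450 group, whatever order the rows arrive in.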
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (traps for the unwary) by rjt (Curate) on Apr 20, 2020 at 19:41 UTC
Ha! Good eye. I didn't even notice the doubled-up 922526. It's not a FP bug. Rather, it's a corner case thanks to how I combined the /2 optimization + memoization; certain even numbers get added to the priority queue twice. It can be fixed trivially by either removing the /2 optimization (simpler, ~5% penalty), or skipping seen numbers in the second call to `top()` (no measurable penalty, adds a variable).

As to your interpretation of the "top 20 arrangement," I like your discussion! We try to keep the task descriptions only quasi-formal, to keep the challenge accessible to beginners, which is why you don't usually see these sorts of details specified like a requirements document. Meaning, many "minor" details are left to the discretion of the participants. The upshot of that is, if you submit a really weird interpretation, you'd probably net yourself a mildly amusing comment in my next review, at least. :-)
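One possible reading of the second fix ("skipping seen numbers") is a thin wrapper that remembers which numbers have already been offered. The sketch below is an assumption about how that could look, not the fix as applied: top_once and %offered are hypothetical names, and top() here is a stub that only records its calls so the example runs on its own.

```perl
use strict;
use warnings;

my @calls;                       # stand-in: records what reaches top()
sub top { push @calls, [@_] }    # hypothetical stub, not the real top()

my %offered;                     # hypothetical: numbers already offered

# Forward to top() only the first time a given number shows up.
sub top_once {
    my ($n, $len) = @_;
    return if $offered{$n}++;
    top($n, $len);
}

top_once(922526, 445);
top_once(922526, 445);           # duplicate, silently dropped
top_once(886953, 445);

print scalar(@calls), " calls reached top()\n";
```

This removes duplicates only. It does not help with even numbers that are never offered at all, such as the 922524 case mentioned in the parent reply.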
Re^2: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 20, 2020 at 19:07 UTC
Hi rjt,

Thank you for this challenge. This consumed so much of my time, in a great way. Part of the reason was the question, "What if this could use many CPU cores?" But first I aimed for speed on a single core. Below are the 3 progressive solutions, each one running faster.

Update: Added results from two machines.

Laurent's demonstration plus updates:

#!/usr/bin/env perl

use strict;
use warnings;

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
# Laurent's demonstration + updates
# https://www.perlmonks.org/?node_id=11115520
# https://www.perlmonks.org/?node_id=11115540
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115544
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);

    for my $input (2..$size) {
        $n = $input, $steps = 0;

        while ($n != 1) {
            $steps += $cache[$n], last if defined $cache[$n];

            $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
                   : ( $steps += 1, $n = $n >> 1 );
        }

        $cache[$input] = $steps if $input < $size;

        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);

@seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

iM71's C++ demonstration converted to Perl plus updates:

#!/usr/bin/env perl

use strict;
use warnings;

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
# iM71's demonstration + applied T(x) notation and compression
# https://stackoverflow.com/a/55361008
# https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115780
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);

    for my $input (2..$size) {
        $n = $input, $steps = 0;

        $n % 2 ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
               : ( $steps += 1, $n = $n >> 1 )
            while $n != 1 && $n >= $input;

        $cache[$input] = $steps += $cache[$n];

        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);

@seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

Step counting using Inline C:

#!/usr/bin/env perl

use strict;
use warnings;

use Inline C => Config => CCFLAGSEX => '-O2 -fomit-frame-pointer';
use Inline C => <<'END_OF_C_CODE';

#include <stdint.h>

void num_steps_c( SV* _n, SV* _s )
{
    uint64_t n, input;
    int steps = 0;

    n = input = SvUV(_n);

    while ( n != 1 && n >= input ) {
        n % 2 ? ( steps += 2, n = (3 * n + 1) >> 1 )
              : ( steps += 1, n = n >> 1 );
    }

    sv_setuv(_n, n);
    sv_setiv(_s, steps);

    return;
}

END_OF_C_CODE

my $size = shift || 1e6;

$size = 1e6 if $size < 1e6;  # minimum
$size = 1e9 if $size > 1e9;  # maximum

##
# iM71's demonstration + applied T(x) notation and compression
# https://stackoverflow.com/a/55361008
# https://www.youtube.com/watch?v=t1I9uHF9X5Y (1 min into video)
#
# Parallel solution
# https://www.perlmonks.org/?node_id=11115780
##

my @cache = (0, 1, 2);
my @seqs;

sub collatz_seq {
    my $size = shift;
    my ($n, $steps);

    for my $input (2..$size) {
        num_steps_c($n = $input, $steps);

        $cache[$input] = $steps += $cache[$n];

        push @seqs, [ $input, $steps ] if $steps > 400;
    }
}

collatz_seq($size);

@seqs = ( sort { $b->[1] <=> $a->[1]} @seqs )[ 0..19 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @seqs;

Results from two machines:

64-bit VM:
  rjt                 0.903s
  Laurent + updates   0.696s
  iM71 + updates      0.602s
  Step counting in C  0.273s  (1st time involves compiling)

AMD 3970x:
  rjt                 0.635s
  Laurent + updates   0.516s
  iM71 + updates      0.467s
  Step counting in C  0.191s  (1st time involves compiling)

Collatz(837799) has sequence length of 525 steps
Collatz(626331) has sequence length of 509 steps
Collatz(939497) has sequence length of 507 steps
Collatz(704623) has sequence length of 504 steps
Collatz(910107) has sequence length of 476 steps
Collatz(927003) has sequence length of 476 steps
Collatz(511935) has sequence length of 470 steps
Collatz(767903) has sequence length of 468 steps
Collatz(796095) has sequence length of 468 steps
Collatz(970599) has sequence length of 458 steps
Collatz(546681) has sequence length of 452 steps
Collatz(818943) has sequence length of 450 steps
Collatz(820022) has sequence length of 450 steps
Collatz(820023) has sequence length of 450 steps
Collatz(410011) has sequence length of 449 steps
Collatz(615017) has sequence length of 447 steps
Collatz(886953) has sequence length of 445 steps
Collatz(906175) has sequence length of 445 steps
Collatz(922524) has sequence length of 445 steps
Collatz(922525) has sequence length of 445 steps

Regards, Mario
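The early exit used in the second and third scripts (stop walking as soon as the value drops below the starting number, then add the cached length) can be checked against a brute-force count on a small range. This is a verification sketch written for illustration, not part of the posted solutions; brute_len and cached_lengths are hypothetical names, and the limit of 10_000 is arbitrary.

```perl
use strict;
use warnings;

# Reference: count terms directly, with no cache and no shortcut.
sub brute_len {
    my ($n) = @_;
    my $len = 1;
    while ($n != 1) { $n = $n % 2 ? 3 * $n + 1 : $n >> 1; $len++ }
    return $len;
}

# Early exit: inputs are visited in increasing order, so every value
# below the current input already has its length in @cache.
sub cached_lengths {
    my ($limit) = @_;
    my @cache = (0, 1);              # $cache[1] = 1 term
    for my $input (2 .. $limit) {
        my ($n, $steps) = ($input, 0);
        while ($n >= $input) {
            if ($n % 2) { $n = (3 * $n + 1) >> 1; $steps += 2 }
            else        { $n >>= 1;               $steps += 1 }
        }
        $cache[$input] = $steps + $cache[$n];
    }
    return \@cache;
}

my $cache = cached_lengths(10_000);
for my $n (1 .. 10_000) {
    die "mismatch at $n" unless $cache->[$n] == brute_len($n);
}
print "all 10000 lengths agree\n";
```

For inputs of 2 and above, the test $n >= $input already implies $n != 1, which is why this sketch drops the explicit $n != 1 check used in the posted scripts.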
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) by rjt (Curate) on Apr 20, 2020 at 20:48 UTC
This is great, Mario (and everyone else in this thread, for that matter)! The multicore work is fantastic. I'm very impressed by the level of interest and dedication this "little" question generated. Hopefully we can come up with a few more like it. (And anyone can suggest challenges, by the way.)
Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) by marioroy (Prior) on Apr 20, 2020 at 23:49 UTC
Hi rjt and fellow Monks,

I updated the parallel demonstrations here and here to ensure orderly output plus cache miss update for parallel iM71. Then captured results for 1e8. Note that running parallel involves File::Map, pack, and unpack. Running Inline::C involves compiling C code on the first run.

Testing was done on a Windows 10 host inside a Docker container running Ubuntu 18.04.x and Perl 5.30.1. The hardware is an AMD 3970x box (32-cores with SMT disabled).

1e8 Output:

Collatz(63728127) has sequence length of 950 steps
Collatz(95592191) has sequence length of 948 steps
Collatz(96883183) has sequence length of 811 steps
Collatz(86010015) has sequence length of 798 steps
Collatz(98110761) has sequence length of 749 steps
Collatz(73583070) has sequence length of 746 steps
Collatz(73583071) has sequence length of 746 steps
Collatz(36791535) has sequence length of 745 steps
Collatz(55187303) has sequence length of 743 steps
Collatz(56924955) has sequence length of 743 steps
Collatz(82780955) has sequence length of 741 steps
Collatz(85387433) has sequence length of 741 steps
Collatz(63101607) has sequence length of 738 steps
Collatz(64040575) has sequence length of 738 steps
Collatz(93128574) has sequence length of 736 steps
Collatz(93128575) has sequence length of 736 steps
Collatz(94652411) has sequence length of 736 steps
Collatz(96060863) has sequence length of 736 steps
Collatz(46564287) has sequence length of 735 steps
Collatz(69846431) has sequence length of 733 steps

Performance:

1e8: parallel, 32 cores (File::Map, pack, unpack):
https://www.perlmonks.org/?node_id=11115544
https://www.perlmonks.org/?node_id=11115780
  Laurent + updates    3.474s
  iM71 + updates       2.701s
  Step counting in C   1.654s

1e8: parallel, 16 cores
  Laurent + updates    6.219s
  iM71 + updates       4.787s
  Step counting in C   2.793s

1e8: parallel, 8 cores
  Laurent + updates   12.061s
  iM71 + updates       9.200s
  Step counting in C   5.258s

1e8: parallel, 4 cores
  Laurent + updates   23.615s
  iM71 + updates      17.935s
  Step counting in C  10.056s

1e8: parallel, 2 cores
  Laurent + updates   46.146s
  iM71 + updates      34.342s
  Step counting in C  19.084s

1e8: non-parallel (Array):
https://www.perlmonks.org/?node_id=11115841
  Laurent + updates   53.961s
  iM71 + updates      48.673s
  Step counting in C  19.023s

Parallel now matches serial for sequences with equal number of steps (i.e. smallest sequence first).

Regards, Mario
