My dear friend Laurent_R recently blogged about his solution to the "extra credit" problem in the Perl Weekly Challenge #54. He showed a solution using memoization, or caching, to reduce the number of repeated calculations a program makes.
I wondered about the strategy. Obviously, calculating the sequences for numbers up to 1,000,000 without some optimization would be painfully, maybe unworkably, slow. But the task looks computation-intensive, so I wanted to see whether throwing more CPU cycles at it would be more beneficial than caching.
Here is the solution presented by Laurent:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use Data::Dumper;

use constant MAX => 300000;

my %cache;

sub collatz_seq {
    my $input = shift;
    my $n     = $input;
    my @result;
    while ($n != 1) {
        if (exists $cache{$n}) {
            push @result, @{$cache{$n}};
            last;
        }
        else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            push @result, $new_n;
            $cache{$n} = [$new_n, @{$cache{$new_n}}]
                if defined($cache{$new_n}) and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = [@result] if $n < MAX;
    return @result;
}

my @long_seqs;
for my $num (1 .. 1000000) {
    my @seq = ($num, collatz_seq $num);
    push @long_seqs, [$num, scalar @seq] if scalar @seq > 400;
}
@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0 .. 19];
# say "@{$cache{$long_seqs[0][0]}}";
This runs on my system pretty quickly:
real    0m22.596s
user    0m21.530s
sys     0m1.045s
Next I ran the following version using mce_map_s from MCE::Map. mce_map_s is an implementation of the parallelized map functionality provided by MCE::Map, optimized for sequences. Each worker is handed only the beginning and end of the chunk of the sequence it will process, and the workers communicate among themselves to keep track of the overall task. When using mce_map_s, you pass only the beginning and end of the sequence to process (plus, optionally, the step interval and format).
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
use MCE::Map;

my @output = mce_map_s {
    my $input  = $_;
    my $n      = $input;
    my @result = $input;
    while ( $n != 1 ) {
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        push @result, $n;
    }
    return [ $input, scalar @result ];
} 1, 1000000;

MCE::Map->finish;

@output = sort { $b->[1] <=> $a->[1] } @output;
say sprintf('%s : length %s', $_->[0], $_->[1]) for @output[0 .. 19];
This program, with no caching, runs on my system about five times faster (I have a total of 12 cores):
real    0m4.322s
user    0m27.992s
sys     0m0.170s
Notably, reducing the number of workers to just two still ran the program in less than half the time of Laurent's single-process memoized version. Even running with one worker, with no cache, was faster. This is no doubt due to the fact that MCE uses chunking by default: even with one worker, the list of one million numbers was split by MCE into chunks of 8,000.
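For reference, both knobs can be set explicitly before the first call; the values here are illustrative, matching the numbers mentioned above:

use MCE::Map;

# Illustrative values: pin the worker count, and use the same 8,000
# chunk size that MCE auto-computed for the one-million-item run.
MCE::Map->init( max_workers => 2, chunk_size => 8000 );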
Next, I implemented Laurent's cache strategy, but using MCE::Shared::Hash. I wasn't really surprised that the program then ran much slower than either previous version. The reason, of course, is that this task pretty much only exercises the CPU, so while throwing more cycles at it gives a huge boost, sharing data among the workers (precisely because the task is almost 100% CPU-bound) only slows them down. Modern CPUs are very fast at crunching numbers.
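That shared-hash attempt isn't shown above, so here is a minimal sketch of the shape it takes. This is an approximation, not the exact program: the 300000 cutoff mirrors Laurent's MAX constant, and the cached tail is stored as a flat string to keep the shared structure simple.

use strict;
use warnings;
use MCE::Map;
use MCE::Shared;

# One cache shared by all workers; every get/set is a round trip to
# MCE's shared-data process, which is exactly where the time goes.
my $cache = MCE::Shared->hash();

my @output = mce_map_s {
    my $n = my $input = $_;
    my @result = ($input);
    while ( $n != 1 ) {
        if ( defined( my $rest = $cache->get($n) ) ) {
            push @result, split ' ', $rest;
            last;
        }
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        push @result, $n;
    }
    $cache->set( $input, "@result[1 .. $#result]" ) if $input < 300000;
    return [ $input, scalar @result ];
} 1, 1000000;
MCE::Map->finish;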
I was curious about how busy the cache was in this case, so I wrapped the calls that assign to and read from the hash in Laurent's program in little subs so I could count them. The wrappers look like:
my %cache;
my $sets = my $gets = 0;

sub cache_has { $gets++; exists $cache{$_[0]} }
sub cache_set { $sets++; $cache{$_[0]} = $_[1] }
sub cache_get { $gets++; $cache{$_[0]} }
The result:
Sets: 659,948
Gets: 16,261,635

That's a lot of back and forth.
So the moral of the story is that while caching is often useful when you are going to make the same calculations over and over, sometimes the cost of the caching exceeds the cost of just making the calculations repeatedly.
Hope this is of interest!
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by marioroy (Prior) on Apr 06, 2020 at 05:55 UTC
Hi 1nickt,

One may simulate extra CPU time to better understand the benefit of caching. I converted Laurent_R's example to consume multiple CPU cores. Laurent's example incurs extra overhead from calling a subroutine inside the loop and returning the full list. I changed the number of iterations from 1 million to 100 thousand now that we are simulating extra CPU cycles.

Nick's example
Laurent's example - no caching
Laurent's example - with caching
Program output
Time to run with 12 cores using a Linux VM. Notice the extra 4 seconds of user time (likely subroutine-call overhead inside the loop), which is not felt in real (wall-clock) time.
Caching is helpful here (5x faster) because I added a sleep to simulate more CPU cycles. However, caching is not helpful when the overhead involved is greater than the computation itself, as demonstrated by 1nickt's comparison above. Regards, Mario
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by davido (Cardinal) on Apr 05, 2020 at 21:47 UTC
Hi Nick. I wonder if you could use an algorithm that could actually benefit both from parallel processing and from caching. If you take your data set and interleave it /8 or /12 (for your 12 cores), and then do the caching within each interleaf, you will have to regenerate the cache for each worker, but each worker maintains its own cache so that there's no per-iteration write-back.

         iteration 1  iteration 2  iteration 3  iteration 4
worker 1      1            9           17           25
worker 2      2           10           18           26
worker 3      3           11           19           27
worker 4      4           12           20           28
worker 5      5           13           21           29
worker 6      6           14           22           30
worker 7      7           15           23           31
worker 8      8           16           24           32

So worker 1 gets 1, 9, 17, 25, ... to work on. Worker 2 gets 2, 10, 18, 26. Worker 3 gets 3, 11, 19, 27. And so on. Within each worker, implement caching. If I were using Parallel::ForkManager I would have each worker return a structure identifying the worker's starting offset and the sequences. That way you don't have to compose them back into a master structure at the end for output; you can sequence them round-robin based on the starting offset of each. I've sat down a few times today with the intent to give it a try, but keep getting interrupted before I get very far into it. So it's probably not going to happen. Interleaving ought to allow each worker to build a cache pretty quickly. Dave
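For the record, a rough sketch of what davido describes might look like the following. It is hypothetical (his actual attempt was never posted): an interleaved slice per worker, a private length cache inside each child, and the per-worker results shipped back through Parallel::ForkManager's run_on_finish.

use strict;
use warnings;
use Parallel::ForkManager;

my $WORKERS = 8;
my $MAX     = 1_000_000;
my $pm      = Parallel::ForkManager->new($WORKERS);

my @results;    # per-worker arrays of [ number, length ] pairs
$pm->run_on_finish( sub {
    my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
    $results[$ident] = $data;
} );

for my $w ( 0 .. $WORKERS - 1 ) {
    $pm->start($w) and next;    # parent continues; child works below
    my ( %cache, @long );
    for ( my $num = $w + 1; $num <= $MAX; $num += $WORKERS ) {
        my ( $n, @path ) = ($num);
        until ( $n == 1 or exists $cache{$n} ) {
            push @path, $n;
            $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        }
        my $len = $n == 1 ? 1 : $cache{$n};
        for my $i ( 0 .. $#path ) {    # cache lengths along the path
            $cache{ $path[$i] } = $len + @path - $i if $path[$i] <= $MAX;
        }
        $len += @path;
        push @long, [ $num, $len ] if $len > 400;
    }
    $pm->finish( 0, \@long );    # ship this worker's results to the parent
}
$pm->wait_all_children;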
by 1nickt (Canon) on Apr 06, 2020 at 01:00 UTC
Hi davido, I did indeed try that approach, and while I did not interleave the sequence, as I mentioned, MCE chunks it up. (I don't think interleaving would much affect the cache hits, as the caching tested stores the sequence for each number found, and the algorithm produces sequences that themselves include numbers greater than n, so such a boundary would be difficult to enforce.) In testing, with each worker building and using its own cache for the chunk processed, the program ran almost three times slower than just letting the workers hammer the CPU. Again, I believe it's because in this case the overhead of caching outweighs its benefits.
The way forward always starts with a minimal test.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?)
by vr (Curate) on Apr 07, 2020 at 12:43 UTC
FWIW, here's a better single-threaded caching algorithm, which on my machine with 4 puny little cores outperforms the others I checked:
For comparison:
Apparently, the last one will beat my "better caching" if run on 6 or more cores, but, like I said, I have only 4 :)
by marioroy (Prior) on Apr 08, 2020 at 02:40 UTC
Hi vr, Wow! Curiosity moved me to try with MCE. But first, I optimized Nick's code. Also, workers now send only the relevant data and not the entire list; this decreases the time needed for the manager process to output results. Update 1: Slight optimization to Choroba's example (i.e. >> 1). Added collatz_compress. Demo: choroba, 1nickt, compress, and count, all in one.
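The "send only the relevant data" part can be sketched like this (an illustration, not Mario's actual code): because mce_map_s keeps map semantics, the block may return an empty list, so only the rare long sequences ever cross the IPC channel back to the manager.

use strict;
use warnings;
use MCE::Map;

my @long = mce_map_s {
    my $n = my $input = $_;
    my $len = 1;
    while ( $n != 1 ) {
        $n = $n % 2 ? 3 * $n + 1 : $n >> 1;
        $len++;
    }
    $len > 400 ? [ $input, $len ] : ();   # long sequences only
} 1, 1000000;
MCE::Map->finish;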
Demo caching by vr
Run time on a 32-core machine, SMT disabled. The times include launching Perl, loading modules, spawning workers, reaping workers, and output (~0.100 seconds). See the next post for the Inline C demonstration.
Interesting. For vr's demo, every worker starts with an empty cache, meaning that workers do not have cached results from prior chunks. This is the reason it does not scale versus the non-cache demonstrations. collatz_count (2 cores) reaches collatz_vr (1 core). Kind regards, Mario
by marioroy (Prior) on Apr 11, 2020 at 04:37 UTC
Hi, There was one thing left to try, and that is using Inline::C. I captured the top 20 for 1e9 and 1e10. Update 1: Set chunk_size to reduce IPC. Tried staying within the CPU cache by keeping @ret small. 1e10 now completes in 4 minutes (a 20% improvement). Update 2: Added collatz_count_c2 using GCC compiler intrinsics.
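As a minimal sketch of the kind of kernel an Inline::C solution builds on (illustrative, not Mario's collatz_count_c code):

use strict;
use warnings;
use Inline C => <<'END_C';
/* Count the terms in the Collatz sequence of n, including n and the
   final 1. Assumes a 64-bit unsigned long: intermediates overflow
   32 bits well before the 1e9 / 1e10 ranges discussed here. */
unsigned long collatz_count(unsigned long n) {
    unsigned long steps = 1;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        steps++;
    }
    return steps;
}
END_C

print collatz_count(27), "\n";   # prints 112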
Output
Look at C go! Regards, Mario
by vr (Curate) on Apr 08, 2020 at 21:40 UTC
Hi marioroy, I think I'm having an issue with MCE on Windows here, timing your code (collatz_vr):
The 1st measurement approximately matches your output, but 1.5 - 2 seconds per worker to shut down doesn't look OK to me.

"For vr's demo, every worker starts with an empty cache. Meaning that workers do not have cached results from prior chunks. This is the reason not scaling as well versus the non-cache demonstrations."

In other words, lots and lots of work is needlessly duplicated.
So, effectively, in a situation with e.g. 10 workers, the 4 junior workers, filling the cache for ranges up to 400_000, are free to slack off and not gather their results at all, and there's a large amount of overlap in the work of the 6 senior workers, too. For now, I have no solid idea how to parallelize this algorithm efficiently.
by marioroy (Prior) on Apr 09, 2020 at 08:30 UTC
by 1nickt (Canon) on Apr 07, 2020 at 15:54 UTC
Impressive. ++
The way forward always starts with a minimal test.
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by choroba (Cardinal) on Apr 06, 2020 at 09:32 UTC
Interestingly, the difference is about 15%; see my blog post about the same task.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
by marioroy (Prior) on Apr 06, 2020 at 15:52 UTC
Hi choroba, I tried using multiple cores for your demonstration. Perl has this capability. So little effort and practically transparent. :)
Before and after on a 6-core machine with hyper-threading (12 logical cores).
Regards, Mario
by choroba (Cardinal) on Apr 06, 2020 at 16:23 UTC
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
by marioroy (Prior) on Apr 06, 2020 at 18:19 UTC
by 1nickt (Canon) on Apr 06, 2020 at 18:49 UTC
by marioroy (Prior) on Apr 07, 2020 at 00:48 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by rjt (Curate) on Apr 19, 2020 at 23:31 UTC
Great discussions! I'm sorry I missed all this; I haven't had much time to visit here recently. I'm the one who contributed the task, so I have some interest in this beyond my own personal fascination with the Collatz sequence. My own solution uses memoization and a couple of other optimizations. The full code is at https://github.com/manwar/perlweeklychallenge-club/blob/master/challenge-054/ryan-thompson/perl/ch-2.pl, and my "review" of my solution is at https://perlweeklychallenge.org/blog/review-challenge-054/#ryan-thompson2. That page also contains links to, and reviews of, every solution submitted by the participants in the challenge that week. None of them used MCE::Map, which is a shame! Data structures and bookkeeping were key to good performance on this one, for me:
Here is the main loop:
I avoid function-call overhead by making sure everything is done iteratively. As another optimization, instead of simply doing 3n+1 for odd numbers, I do (3n+1)/2, and increment the sequence length by two instead of one. And finally, I'm able to skip even-numbered starts with a little creative arithmetic in my second call to top(). I decided to ask people to output the top 20 because that presents an interesting mini-challenge by itself. Maintaining it naively by calling sort on a million elements at the end takes longer than the above loop, and sorting a 20-item list repeatedly is even worse. Maintaining essentially a priority queue is much faster:
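(What follows is only a reconstruction of the shape rjt describes, not his actual code, which is at the GitHub link above; the even-start skipping is left out for simplicity, and top() is the O(n) insertion discussed next.)

use strict;
use warnings;

my @len = ( 0, 1 );  # $len[$n] = terms in the sequence starting at $n
my @top;             # [ number, length ] pairs, longest first
my $mintop = 0;      # length of the 20th entry: the admission threshold

for my $start ( 2 .. 1_000_000 ) {
    my ( $n, $steps ) = ( $start, 0 );
    while ( $n >= $start ) {    # dropping below $start means it's cached
        if ( $n % 2 ) { $n = ( 3 * $n + 1 ) / 2; $steps += 2 }  # two steps at once
        else          { $n >>= 1;                $steps += 1 }
    }
    my $total = $len[$start] = $steps + $len[$n];
    top( $start, $total ) if $total > $mintop;   # cheap reject for most numbers
}

# O(n) insertion into a sorted list that never exceeds 20 elements.
sub top {
    my ( $num, $length ) = @_;
    my $i = 0;
    $i++ while $i < @top and $top[$i][1] >= $length;
    splice @top, $i, 0, [ $num, $length ];
    pop @top if @top > 20;
    $mintop = $top[-1][1] if @top == 20;
}

printf "%d: %d\n", @$_ for @top;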
The above sub is O(n), so it's not as good as a heap implementation, but it's only called when there is definitely a new element to be inserted, thanks to a bit of bookkeeping in $mintop, so I opted to keep it simple. Performance:
Purely for crude CPU comparison purposes, Laurent's solution in Re: Optimizing with Caching vs. Parallelizing (MCE::Map) runs in 1.57 sec on the same (virtual) machine.
use strict; use warnings; omitted for brevity.
by vr (Curate) on Apr 20, 2020 at 10:00 UTC
Many thanks for providing this wonderful challenge! If all goes at the pace it's rolling now, the fun may well continue into the month of May! Therefore, a warning: some people are in danger of gaining (or losing) more than what they bargained for! :)

"I decided to ask people to output the top 20, because that presents an interesting mini-challenge by itself"

Well said, dear rjt, well said. But it seems it was YOU who fell into the trap that you so cunningly crafted for poor innocent learners! The top 20 out of a million Collatz sequences has an insidious property: there are six "445" lengths in 1e6, but only four in the top 20 (so, all six are in the top 22). Which to extract? Aha! Well, the challenge did not clearly state how to order numbers with the same Collatz lengths (CL). But I think it is reasonable to assume that, since numbers with longer CL are rated better/closer to the top, smaller numbers among those producing the same CL are to be valued more. Like: "Look! This brave little number commendably creates as long a CL as that huge number! And this undeservedly huge number has only managed to generate such a puny CL! Loser!" At least there must be some consistency in arranging results, don't you agree? In other words, I think that if the CL column descends, then the numbers column must ascend (for equal CLs). See the ordering for CL 450, too.

Well, dear Perl users, I happened to notice this because in another of my (unpublished) solutions I had to endure great pain in arranging the top 20 properly. Just how many of you, looking at the output of (almost all) scripts in this and the 2 related threads, have noticed that the top 20 are neatly ordered? And yet this "natural" result comes at no cost at all, with Perl! Just appreciate what you so ungratefully consume! :)

To illustrate, rjt, here's, in parallel, the output of your script and Laurent_R's with marioroy's fixes:
But wait... What's that??? The 922526 number is listed twice, on the left?? Is this...
by rjt (Curate) on Apr 20, 2020 at 19:41 UTC
Ha! Good eye. I didn't even notice the doubled-up 922526. It's not a FP bug. Rather, it's a corner case thanks to how I combined the /2 optimization + memoization; certain even numbers get added to the p.queue twice. It can be fixed trivially by either removing the /2 optimization (simpler, ~5% penalty), or skipping seen numbers in the second call to top() (no measurable penalty, adds a variable). As to your interpretation of the "top 20 arrangement," I like your discussion! We try to keep the task descriptions only quasi-formal, to keep the challenge accessible to beginners, which is why you don't usually see these sorts of details specified like a requirements document. Meaning, many "minor" details are left to the discretion of the participants. The upshot of that is, if you submit a really weird interpretation, you'd probably net yourself a mildly amusing comment in my next review, at least. :-)
use strict; use warnings; omitted for brevity.
by marioroy (Prior) on Apr 20, 2020 at 19:07 UTC
Hi rjt, Thank you for this challenge. This consumed so much of my time, in a great way. The reason is partly due to asking, "What is possible with many CPU cores?" But first I made attempts at a fast single-core version. Below are the 3 progressive solutions, each one running faster. Update: Added results from two machines. Laurent's demonstration plus updates:
iM71's C++ demonstration converted to Perl plus updates:
Step counting using Inline C:
Results from two machines:
Regards, Mario
by rjt (Curate) on Apr 20, 2020 at 20:48 UTC
This is great, Mario (and everyone else in this thread, for that matter)! The multicore work is fantastic. I'm very impressed by the level of interest and dedication this "little" question generated. Hopefully we can come up with a few more like it. (And anyone can suggest challenges, by the way.)
use strict; use warnings; omitted for brevity.
by marioroy (Prior) on Apr 20, 2020 at 23:49 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 13:53 UTC
As a follow-up to my previous post, I implemented my new caching strategy, which consists of storing in the cache the lengths of the sequences rather than the full sequences.

On the computer where I'm running my tests right now, my original program has the following timing:

My laptop is obviously much slower than Nick's computer (where this program took 22 seconds). This is now my first implementation with the new caching strategy:

This program produces the same outcome, but is nearly 3 times faster:

But we now end up with a cache having essentially one entry per input number in the 1..1_000_000 range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup). This is the code for this new implementation:

With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache):

I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it. So, it is true that throwing more CPU cores at the problem makes the solution faster (although to a limited extent with my computer, which has only 4 cores). But using a better algorithm can often be a better solution.
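(A compact sketch of the array-based length cache Laurent describes, illustrative rather than his exact code, assuming the same 1..1_000_000 range and 400-term cutoff used elsewhere in the thread:)

use strict;
use warnings;
use feature 'say';

my @cache = ( 0, 1 );    # $cache[$n] = length of the sequence starting at $n

sub collatz_length {
    my $n = shift;
    my @path;
    until ( defined $cache[$n] ) {   # walk down to a known length ...
        push @path, $n;
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
    }
    my $len = $cache[$n];
    for my $m ( reverse @path ) {    # ... then fill the cache on the way back
        ++$len;
        $cache[$m] = $len if $m < 1_000_000;   # keep the array bounded
    }
    return $len;
}

my @long_seqs;
for my $num ( 1 .. 1_000_000 ) {
    my $len = collatz_length($num);
    push @long_seqs, [ $num, $len ] if $len > 400;
}
@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[ 0 .. 19 ];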
by marioroy (Prior) on Apr 14, 2020 at 23:21 UTC
Hi Laurent_R and fellow monks, Nice speedup! I applied 3 optimizations in preparation for a follow-up post combining caching with parallelization. This was done because File::Map adds overhead (i.e. it will not be as fast as an array), so I asked myself whether anything could be done to further improve the serial implementation, and if yes, whether it would be helpful. I provide the run time for each stage at the end of this post. It turns out that improvements were possible, and I will be converting collatz3_d.pl to use File::Map for the parallel demonstration. Update: Further improvements; see Step 4.

1. Replaced division by 2 with a bit shift.
2. Removed one level of branching.
3. Then reduced the number of loop iterations. Credit for this goes to watching Notation and compressed dynamics, one minute in (i.e. the T(x) notation).
4. Finally, less caching.
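As an illustrative sketch (not Mario's actual code), optimizations 1 and 3 combine into a step counter like this: bit shifts replace division, and since for odd n the value 3n+1 is always even, the following halving is folded in and two steps are counted at once.

use strict;
use warnings;

sub collatz_steps {
    my $n     = shift;
    my $steps = 1;                   # count the starting term itself
    while ( $n != 1 ) {
        if ( $n & 1 ) { $n = ( 3 * $n + 1 ) >> 1; $steps += 2 }  # T(x) shortcut
        else          { $n >>= 1;                 $steps += 1 }  # shift, not /2
    }
    return $steps;
}

print collatz_steps(23), "\n";    # 16 terms: 23, 70, 35, ... 4, 2, 1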
The final code optionally takes an argument.
Results:
Regards, Mario
by Laurent_R (Canon) on Apr 16, 2020 at 12:57 UTC
Thanks a lot for your comments and suggestions. I did not expect such micro-optimizations to bring such a significant performance improvement; that's interesting. Especially replacing the division by 2 with a bit shift is the type of thing that I had stopped doing decades ago, when I figured out that the C compiler I was using at the time was doing this type of optimization at least as well as, and often better than, I could. Good to be reminded that the optimizations you suggested can also be quite useful. Thank you. Following this very interesting thread and your new thread on MCE::Flow + Caching via File::Map, I made a new blog post on the Collatz sequence, trying to summarize some of the findings: http://blogs.perl.org/users/laurent_r/2020/04/revisiting-the-collatz-sequence-pwc-54.html. I did not try to explain your demonstration using File::Map for caching with parallel processing (your other thread) because I'm not sure I fully understand everything and was afraid of misrepresenting your work. People can follow the link and read your own words. Thank you very much for your challenging ideas.
by marioroy (Prior) on Apr 17, 2020 at 00:00 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)
by vr (Curate) on Apr 21, 2020 at 11:14 UTC
("long read":)) Here's an attempt to solve this Collatz challenge in purely vectorized fashion. As it (slowly) progressed, it soon became apparent the emerging result is somewhat impractical, merely a curiosity; now it's finished (i.e. I can't find how to improve it) and ready to amaze/amuse/frighten the public. The "PDL" was mentioned in this (or 2 related) threads, but it served just as binary container, "set/at" having same role as "substr + pack/unpack", so it wasn't really "PDL", was it. This node sticks to original task as stated: calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences The naive straightforward implementation is simple, it should read without much comment (as "prose":)):
In the actual code, completed sequences are marked as "BAD", which is just an agreed-upon value, treated specially. (1) There are built-in methods to check for good/bad, which saves us comparisons while creating masks. (2) More important: BADs kind of taint whatever they interact with (cf. NaN in FP math), which is quite useful. The timing is sloooooow, yet decent among the non-caching solutions around here. Another issue stems from QuickSort being unstable (I recently ranted, in this thread, about neat ordering and "correct 20" extraction). To fix the latter, there's a qsortvec method (and its qsortveci companion) to sort a 2D array (as opposed to qsort for 1D), i.e. first on the 1st axis, then on the 2nd. But here's the dilemma: (1) build a full-height (2 x 1e6) array, qsortvec it, extract the top 20. Possible, but, for speed, I'd prefer (2): qsort the lengths, extract "enough" entries close to 20, build a small "2 x N" array, qsortvec it, extract the correct (and correctly arranged) top 20. For that, find the value at MAX - TOP (445), look left, find how much to extract (22). More fuss: the 1st column is to descend, the 2nd to ascend, thus temporarily negate (flip the bits of) one of them. So the huge and unpleasant-looking new "tail" after the main loop, in the script below, is there to fix the top-20 extraction. But in fact it adds almost nothing to the consumed time. To improve speed, there are a couple of tricks. (1) where has a neat ability to work in list assignment: the same mask for several client piddles. (2) Marking 1's as "BAD" on each iteration is redundant. A piddle can be told to treat any value as bad, automatically. Sadly, setting this "any" to "1" won't work here, because $seqs & 1 would then result in something like [BAD 0 BAD 0 BAD 0 ...], regardless of having BADs in $seqs already. Let's mark "2" as bad instead, so the stopper value in a sequence is now 2 instead of 1:
Other than that, it's the same non-caching approach, with original clarity and simplicity somewhat spoiled by fixes/optimizations:
So we have correct output and improved time. Good. Now to something more interesting: let's add caching/looking-up. Because we are to use $seqs as an index into $lengths, and indexing starts from 0, let's prepend a dummy 0th element. To kick-start indexing, and because the value "2" is occupied to mark "BAD" (i.e. the sequence stopper), we'll add one more seed element to the lengths. Further, the lengths will now all start as BAD, and are switched to computed values as we go:
(By the way, BAD in $lengths is still the default for the short type: -32768.) We'll also maintain a $current lengths helper piddle, incremented by 1 or 2 depending on the oddity mask of the current sequences' state. Observe, further, how where calls in list context are (over)-abused in the code below. (I'm quite aware this code is no longer "a prose to read". Set MAX to 10 and dump the primary piddles on each iteration to see what's going on. There are the same 3; other vars are masked views into them.)
And that (~2x faster), I'm afraid, is as good as it will get. As I understand it, cache hits are significantly more rare than with sequential element-after-element array processing. For comparison, with the same hardware, Laurent_R's final/polished caching solution runs at 1.63s here if I disable the use of the "magic number" 400 in there ("magic" constants to crank up performance aren't fair :)), and at 0.88s otherwise. For 1e7 numbers, the running time becomes ~65s, i.e. it gets impractical, like I said. Switching on parallel processing made no difference. Use of multiple cores can be observed for only ~the first second; then "viewports" into the piddles become increasingly fragmented and the work can't be split. Maybe I did something wrong, and certainly someone can improve it, even if a little bit, but I'm glad this mini-project is finally off my shoulders.
by marioroy (Prior) on Apr 21, 2020 at 21:02 UTC
Hi vr, Lots of PDL goodness. Wow! I did not know how this would perform without taking a moment or two to give it a try. So here is a parallel version based on your 2nd example.
Results: I was expecting mce_pdl2 using 1 core to be closer to vr_pdl2 than vr_pdl3. Maybe it benefits from the CPU L2/L3 cache. Chunking seems to be the reason why mce_pdl2 (non-caching) ran nearly as fast as vr_pdl3 (caching). Below, this includes 1-core testing for 1e7 with various chunk sizes.
Specifying 50000 for chunk size may run faster on 4/6/8 core machines.
Regards, Mario
by vr (Curate) on Apr 26, 2020 at 00:03 UTC
Sorry for the delay with the reply/update; first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :).

My "caching" PDL solution (I'd prefer "looking-up" to "caching" or "memoization" in this case) is slow because, if the vector is processed as a whole, cache hits happen less frequently, i.e. only after quite a few more useless steps. To illustrate with the sequence from the original challenge: 23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1. Number 70 reaches 35 in only one step, but 35 itself is still very far from completion. Simplifying: if the 1..70 range were divided into e.g. 2 chunks processed independently one after another, it would alleviate the problem.

This "chunking" is very easy to add: just wrap the main loop in one external loop (labelled "CHUNKS"), which updates the chunk boundaries and sets "views" into the piddles, while the internal loop (labelled "ITERATIONS") does exactly the same as in the previous version. The $current piddle can now be of chunk size. The very first chunk includes the dummy 0th element (size is CHUNK + 1); the final one can theoretically be as short as 1 single element. For 1e7 numbers, the computation time drops by a factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes of ~2e4 .. ~2e5.

Next, the idea was that as soon as a chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get the Collatz $lengths for the even numbers in range 22..40, also marking these $seqs elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, it then follows that we can simply set, in bulk and at the very beginning, every even $seqs element outside the 1st chunk to BAD, even if the CLs are unknown yet. The 2 fragments marked with a double '##' do exactly what this paragraph says. They can be freely disabled for experiments, though the speed-up is very modest, just ~0.5s.

Further "ideas"/considerations didn't result in palpable improvements (nor were they pursued systematically). If it's cheap to shift-left indexes and increment CLs for each chunk as we go, what if we also populate and keep a sparse (even-only) CL LUT in the range MAX..2*MAX? Well, this range starts being populated relatively late, and if there was a benefit from occasional additional cache hits, it seems to have been cancelled out. No gain, no loss. The MAXLEN was introduced to check this. I tried to add masks/slices/bads for $current, so that e.g. the line $current++; applies to a valid subset only, especially since now all elements at even positions are dead weight from the very start, in chunks 2+. Well, an unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling the "dead weight". I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up.
Edit (bug fix). :(( With the "dummy 0th element prepended", my intention for the chunks was e.g. 0-10, 11-20, 21-30, ..., i.e. the 1st is one longer. Then I fooled with "better presentation", from an "infinite" while loop, to a while loop with an explicit condition, to a for loop, with slightly different formulas for from, to, from_e, to_e, and messed up. Some even lengths aren't calculated, as seen with MAX and CHUNK of e.g. 10 and 4. The easiest fix now is to leave the chunks all equal (CHUNK+1); one LOC fixed (see "fixed") above. Sorry.
by marioroy (Prior) on Apr 26, 2020 at 07:46 UTC
by marioroy (Prior) on Apr 26, 2020 at 22:14 UTC
Re: Optimizing with Caching vs. Parallelizing (MCE::Map)
by Laurent_R (Canon) on Apr 14, 2020 at 11:16 UTC
For some reason, I somehow missed your very interesting post and the rest of this thread until just now. In fact, a couple of days after I submitted my solution to the Perl Weekly Challenge and posted the blog post you refer to above, I figured out that my caching strategy was in fact quite inefficient: the program doesn't need to store the full sequence in the cache; it would be enough to store just the number of its items. And that reduces the overhead of the cache considerably. I have not tried this change in Perl so far, but I did it with my Raku solution. Changing the caching strategy made the Raku program 6 times faster. It seems the changes I made are quite similar to what vr suggested. These results are described in another blog post that I haven't completed yet, but it will hopefully be published soon. Now that I see that you have brought this up, I feel compelled to implement this change in the Perl version. I don't have time right now, but I will come back with my updated version and the relevant timings soon. (Update: see below.) Thank you anyway for your very interesting post and for the just-as-interesting discussion it triggered.