good chemistry is complicated,
and a little bit messy -LW
Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)by vr (Curate)
|on Apr 26, 2020 at 00:03 UTC||Need Help??|
Sorry for delay with reply/update, first I got distracted with "let's try another optimization", then with other/unrelated things. I hope this topic is still warm :).
My "caching" PDL solution (I'd prefer "looking-up" to "caching" or "memoization" in this case) is slow because, if vector is processed as a whole, cache hits happen less frequently i.e. after quite a few more useless steps. To illustrate with sequence from original challenge:
23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
Number 70 reaches 35 in only one step but 35 itself is yet very far from completion. Simplifying, if 1..70 range was divided into e.g. 2 chunks processed independently one after another, then it would alleviate the problem.
This "chunking" is very easy to add, just wrap the main loop into one external loop (labelled "CHUNKS"), which updates chunk boundaries and sets "views" into piddles, while internal loop (labelled "ITERATIONS") does exactly the same as in previous version. The $current piddle can now be of chunk size. The very first chunk includes dummy 0th element (size is CHUNK + 1), the final one can theoretically be as short as 1 single element.
For 1e7 numbers, the computation time drops by factor of 3, from ~65 to ~21 seconds, and doesn't change much (variations stay at "noise" level) for chunk sizes ~2e4 .. ~2e5.
Next, the idea was, that as soon as chunk is completed, e.g. chunk 11..20 with MAX = 100 and CHUNK = 10, it should be extremely cheap/easy to immediately get Collatz $lengths for even numbers in range 22..40, also marking these $seqs elements as BAD/completed. Though it would be almost as easily achieved "in due time", we don't need to consider/apply any masks at this early/immediate stage. Further, then it follows, we can simply set, in bulk and at the very beginning, each even $seqs element outside 1st chunk as BAD, even if CLs are unknown yet.
2 fragments marked with double '##' do exactly what previous paragraph says. They can be freely disabled for experiments; though the speed-up is very modest, just ~0.5s.
Further "ideas"/considerations didn't result in palpable improvements (nor were pursued systematically).
If it's cheap to shift-left indexes and increment CLs for each chunk as we go, what if we also populate and keep sparse (even only) CL LUT in range MAX..2MAX? Well, this range starts being populated relatively late, and if there was benefit of occasional additional cache hits, it seems to be cancelled out. No gain, no loss. The MAXLEN was introduced to check this.
I tried to add masks/slices/bads for $current, so that e.g. the line $current ++; applies to valid subset only, especially since now all elements at even positions are dead weight from the very start, in chunks 2+. Well, unconditional "en masse" cheap operation appears to be faster, regardless of uselessly tackling "dead weight".
I also considered variable chunk sizes (such as "begin with very short", etc.), but gave up.
Edit (bug fix). :(( With "dummy 0th element prepended", my intention for chunks was e.g. "0-10,11-20,21-30, ..., i.e. 1st is one longer. Then I fooled with "better presentation", from "infinite" while loop, to while loop with explicit condition, to for loop, with slightly different formulas for from,to,from_e,to_e, and messed up. Some even lengths aren't calculated, as seen with MAX and CHUNK e.g. 10 and 4. Easiest fix now is to leave chunks all equal (CHUNK+1); one LOC fixed (see "fixed"), above. Sorry.