|The stupid question is the question not asked|
Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL fun)by vr (Curate)
|on Apr 21, 2020 at 11:14 UTC||Need Help??|
Here's an attempt to solve this Collatz challenge in purely vectorized fashion. As it (slowly) progressed, it soon became apparent the emerging result is somewhat impractical, merely a curiosity; now it's finished (i.e. I can't find how to improve it) and ready to amaze/amuse/frighten the public.
The "PDL" was mentioned in this (or 2 related) threads, but it served just as binary container, "set/at" having same role as "substr + pack/unpack", so it wasn't really "PDL", was it. This node sticks to original task as stated:
calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences
The naive straightforward implementation is simple, it should read without much comment (as "prose":)):
Above, completed sequences are marked as "BAD", which is just agreed-upon value, treated specially. (1) There are built-in methods to check for good/bad, it saves us comparisons while creating masks. (2) More important: BADs kind of taint whatever they interact with (cf. NaN in FP math), which is quite useful.
The timing is sloooooow, yet decent among non-caching solutions around here. Another issue stems from QuickSort being unstable (I recently ranted, in this thread, about neat ordering and "correct 20" extraction).
To fix the latter, there's qsortvec (and qsortveci companion) method to sort 2D array (as opposed to "qsort" for 1D, used above), i.e. 1st on 1st axis, then on 2nd. But here's dilemma: (1) build full-height (2 x 1e6) array, qsortvec, extract top-20. Possible, but, for speed, I'd prefer (2): qsort lengths (as above), extract "many enough" but close to 20, build small "2 x N" array, qsortvec, extract correct (and correctly arranged) top-20.
For that, find value at MAX - TOP (445), look left, find how much to extract (22). More fuss: 1st column is to descend, 2nd to ascend -- thus temporarily negate (flip bits) one of them. So, huge and unpleasantly looking new "tail" after main loop, in script below, is there to fix top-20 extraction. But in fact it adds almost nothing to consumed time.
To improve speed, there are couple of tricks. (1) Where has neat ability to work in list assignment -- same mask for several client piddles. (2) Marking 1's as "BAD" on each iteration is redundant. Piddle can be told to treat any value as bad, automatically. Sadly, setting this "any" to "1" won't work here, because $seqs & 1 would then result in something like [BAD 0 BAD 0 BAD 0 ...], regardless of having BADs in $seqs already. Let's mark "2" as bad, so stopper value in sequence would now be 2 instead of 1:
Other than that, it's the same non-caching approach, with original clarity and simplicity somewhat spoiled by fixes/optimizations:
So we have correct output and improved time. Good.
Now to something more interesting -- let's add caching/looking-up. Because we are to use $seqs as index into $lengths, and indexing starts from 0, let's prepend a dummy 0th element. To kick-start indexing, and because value "2" is occupied to mark "BAD" i.e. sequence stopper, we'll add one more seed element to lengths. Further, lengths will now all start as BAD, and switched to computed values as we go:
(by the way, BAD in $lengths is still the default, for short, -32768)
We'll also maintain $current lengths helper piddle, incremented by 1 or 2 depending on oddity mask of current sequences state. Observe, further, how where calls in list context are (over)-abused in code below. (I'm quite aware this code is no longer "a prose to read". Set MAX to 10 and dump primary piddles on each iteration to see what's going on. There are same 3. Other vars are masked views into them.)
And that's (~2x faster) I'm afraid is as good as it will go. As I understand, cache hits are significantly more rare than with consequential element after element array processing. For comparison, with the same hardware, Laurent_R's final/polished caching solution runs at 1.63s here, if I disable use of "magic number" 400 in there ("magic" constants to crank up performance aren't fair:)), and at 0.88s otherwise.
For 1e7 numbers, running time becomes ~65s, i.e. it gets impractical, like I said. Switching on parallel processing made no difference. Use of multiple cores can be observed for only ~first second, then "viewports" into piddles become increasingly fragmented, work can't be split.
Maybe I did something wrong, and certainly someone can improve even if a little bit, but I'm glad this mini-project is finally off my shoulders.