This 'never explicitly loop in a vectorized language' answer, unfortunately, hides the ugly truth that for very large data we can end up with huge R x M matrices of random indices and equally huge (and equally unnecessary) matrices of all the re-samplings, and thus die from 'Out of memory!'.
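One mitigation is to process the R re-samplings in fixed-size batches, so only a batch-sized index matrix ever exists at once. Here is a minimal sketch of that idea in NumPy (the discussion above is about PDL, but the memory argument carries over directly; the function name and parameters are hypothetical):

```python
import numpy as np

def bootstrap_means(data, n_resamples, batch=256, rng=None):
    """Bootstrap the mean of `data` in memory-capped batches.

    Instead of one huge (n_resamples, len(data)) index matrix, only a
    (batch, len(data)) matrix exists at any moment; it is freed when
    `idx` goes out of scope at the end of each loop iteration.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    out = np.empty(n_resamples)
    for start in range(0, n_resamples, batch):
        stop = min(start + batch, n_resamples)
        # Random indices for just this batch of re-samplings
        idx = rng.integers(0, n, size=(stop - start, n))
        out[start:stop] = data[idx].mean(axis=1)
    return out
```

The `batch` parameter trades peak memory against the per-batch overhead: each iteration is still fully vectorized, just over a bounded slab.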
This can be a real issue! Keeping such index-sets around for exactly the right amount of time will help; one way to do that is to write small functions that each perform an individual operation on a subset, with a captured index-set ndarray that goes out of scope when the function finishes. Another thing that will help is the forthcoming loop fusion, discussed at
https://github.com/PDLPorters/pdl/issues/349: non-slice operations will become lazy, and on evaluation will potentially be put together into a new, loop-fused operation.
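The memory win from loop fusion is that intermediate results of an expression never get materialised as full-size arrays. As a rough analogy (this is not PDL's actual mechanism, just an illustration in NumPy of what fusion saves), compare the naive evaluation of `(a + b) * c`, which allocates a full-size temporary for `a + b`, with a hand-"fused" version that reuses one preallocated output:

```python
import numpy as np

n = 1_000_000
a = np.ones(n)
b = np.full(n, 2.0)
c = np.full(n, 3.0)

# Unfused: `a + b` allocates a full n-element temporary
# before the multiply even starts.
unfused = (a + b) * c

# "Fused" by hand: one preallocated output, no temporaries --
# roughly the effect an automatic loop-fusion pass would have.
fused = np.empty_like(a)
np.add(a, b, out=fused)
np.multiply(fused, c, out=fused)

assert np.array_equal(unfused, fused)
```

A real fusion pass would go further and merge the two passes over memory into a single loop, improving cache behaviour as well as peak memory.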
Something else that would help here, as also discussed on #349, is more generalised first-class "index operations". Ideas and contributions here, on the GitHub issue, on the PDL mailing lists, or via any other means are most welcome!
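By "index operations" I mean operations that act directly through an index set, rather than first materialising a gathered copy of the data. NumPy already has some of these, which may help convey the idea (again an analogy, not a proposal for PDL's eventual API):

```python
import numpy as np

a = np.zeros(5)
idx = np.array([0, 1, 1, 3])

# A first-class "index operation": scatter-add through the index set,
# in place, with repeated indices accumulated -- no intermediate
# gathered array is ever built.
np.add.at(a, idx, 1.0)
# a is now [1., 2., 0., 1., 0.] -- index 1 received two increments

# The gather side: take elements through an index set in one call.
picked = np.take(a, idx)
```

Making such operations first-class (and lazy, per #349) would let an index set flow through a pipeline without the huge intermediate matrices described at the top of this answer.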
One other thought is that, for larger ndarrays (since POSIX threads have a startup cost that must be amortised), vectorised operations are the way to harness multiple cores for free (for operations that support this); a Perl for-loop cannot achieve that.