http://qs321.pair.com?node_id=1130272


in reply to The sieve of Xuedong Luo (Algorithm3) for generating prime numbers

Thanks for writing this up. We talked about this a year ago! Brings back memories. I haven't made any sieve changes since your testing pointed out some inefficiencies and I did a release (June 13 2014). I just reran my benchmarks since I had a slightly older version of your sieve (before you added prefill and a few other things).

Here is a table of counting times for some segmented sieves. Single thread on i4770k.

Segmented sieves 10^9 10^10 10^11 10^12 10^13
yafu 0.285 1.876 20.241 4m 27.8s 54m 51s
primesieve 5.4.2 0.119 1.661 23.480 4m 56.0s 58m 18s
primegen 0.97 0.258 2.750 33.843 9m 43.1s 207m 56s
MPU 0.41 0.320 3.529 42.971 9m 28.8s 119m 3s
TOeS v2 0.422 4.739 55.283 10m 41.7s 121m 3s
Roy algo3 v2 0.435 5.087 64.456 12m 28.0s 150m 10s
Flammenkamp 0.528 5.937 79.211 16m 24.1s
tverniquet 0.643 6.937 80.876 17m 27.7s
Roy algo3 v1 0.611 6.949 77.846
Walisch bit segmnt 0.909 8.226 91.691 19m 9.9s

Comments: These all easily beat monolithic sieves such as Pari, Mathisen 1998, trivial SoE, MPU's plain sieve, etc. because I'm starting out at 10^9. All things are not equal: the code for counting is non-negligible and differs for most implementations. primegen is extremely sensitive to the cache parameter, which will differ on each platform. I tuned to the fastest value for this machine. The TOeS code also has some parameters I tuned but they're not as sensitive. Primesieve is set up for multiple threads -- this only uses one for direct comparison. Performance for primes from 2 to n may be very different than performance for primes from A to B (e.g. 10^18 to 10^18+1e6). This is purely for timing the sieves -- there are far faster ways to count primes (under a half-second for 10^13).

I don't benchmark printing primes because it becomes a contest of who can do the fastest itoa + write. That's nice, but the prime generation time gets lost. primesieve is faster generating primes, but with a good output routine I can beat it in printing by over 2x.

Random thoughts on this... yafu is quite fast, but the code isn't very usable outside of yafu. primesieve is extraordinarily fast, and would be faster if I let it use 8 threads. primegen is Bernstein's Sieve of Atkin which is pretty fast, but has poor growth compared to SoE (!) so my code catches it sometime before 10^12. Oliveira e Silva's example sieve is ok at smaller sizes but the growth rate is very nice so it's doing well at the high range (which was exactly what it was designed for). Algorithm 3 is doing very well, beating Flammenkamp's 2.0c code which used to be touted as one of the fastest sieves. Walisch's bit segment code is just an example of showing the idea and isn't optimized.