Thanks for writing this up. We talked about this a year ago! Brings back memories. I haven't made any sieve changes since your testing pointed out some inefficiencies and I did a release (June 13 2014). I just reran my benchmarks since I had a slightly older version of your sieve (before you added prefill and a few other things).
Here is a table of counting times for some segmented sieves. Single thread on i4770k.
|yafu||0.285||1.876||20.241||4m 27.8s||54m 51s|
|primesieve 5.4.2||0.119||1.661||23.480||4m 56.0s||58m 18s|
|primegen 0.97||0.258||2.750||33.843||9m 43.1s||207m 56s|
|MPU 0.41||0.320||3.529||42.971||9m 28.8s||119m 3s|
|TOeS v2||0.422||4.739||55.283||10m 41.7s||121m 3s|
|Roy algo3 v2||0.435||5.087||64.456||12m 28.0s||150m 10s|
|Roy algo3 v1||0.611||6.949||77.846|
|Walisch bit segmnt||0.909||8.226||91.691||19m 9.9s|
Comments: These all easily beat monolithic sieves such as Pari, Mathisen 1998, trivial SoE, MPU's plain sieve, etc. because I'm starting out at 10^9. All things are not equal: the code for counting is non-negligible and differs for most implementations. primegen is extremely sensitive to the cache parameter, which will differ on each platform. I tuned to the fastest value for this machine. The TOeS code also has some parameters I tuned but they're not as sensitive. Primesieve is set up for multiple threads -- this only uses one for direct comparison. Performance for primes from 2 to n may be very different than performance for primes from A to B (e.g. 10^18 to 10^18+1e6). This is purely for timing the sieves -- there are far faster ways to count primes (under a half-second for 10^13).
I don't benchmark printing primes because it becomes a contest of who can do the fastest itoa + write. That's nice, but the prime generation time gets lost. primesieve is faster generating primes, but with a good output routine I can beat it in printing by over 2x.
Random thoughts on this... yafu is quite fast, but the code isn't very usable outside of yafu. primesieve is extraordinarily fast, and would be faster if I let it use 8 threads. primegen is Bernstein's Sieve of Atkin which is pretty fast, but has poor growth compared to SoE (!) so my code catches it sometime before 10^12. Oliveira e Silva's example sieve is ok at smaller sizes but the growth rate is very nice so it's doing well at the high range (which was exactly what it was designed for). Algorithm 3 is doing very well, beating Flammenkamp's 2.0c code which used to be touted as one of the fastest sieves. Walisch's bit segment code is just an example of showing the idea and isn't optimized.