You say that "a disk based solution isn't feasible,"
Using my hashing/vectors algorithm, my script processes 10,000 strings/second. To process 2**32 strings (the absolute limit of the hashes) would require 2**32 / 10000 / 3600 = 119 hours. I.e. 0.0001s/string.
Using any sort of disk-based mechanism -- disk-based hash, b-tree, etc. -- would require at least 1 read and 1 write per string. Using the fastest (conventional) disks available, that means adding at least 0.004 seconds for the read and 0.005 seconds for the write to the processing time for every string.
So, that's (0.0001 + 0.004 + 0.005) * 2**32 / 3600 = 10857 hrs or 452 days. A little under 1 1/4 years.
And the reality is that it will likely require at least 10 reads & 10 writes for every record. I'll leave that math to you.
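For anyone who wants to redo that math, here is a rough Python sketch of the estimate. It uses only the figures quoted above (10,000 strings/second, 0.004 s per read, 0.005 s per write, 2**32 strings); the function name and the assumption that every disk access costs the same flat latency are mine, purely for illustration.

```python
# Back-of-envelope runtime estimate. The throughput and per-access disk
# latencies are the figures quoted in the post; the function name is illustrative.
TOTAL_STRINGS = 2**32          # upper limit imposed by the hashes
STRINGS_PER_SECOND = 10_000    # measured in-memory throughput
READ_SECONDS = 0.004           # assumed flat cost of one disk read
WRITE_SECONDS = 0.005          # assumed flat cost of one disk write


def estimated_hours(reads_per_string=0, writes_per_string=0):
    """Hours to process TOTAL_STRINGS with the given disk accesses per string."""
    per_string = (1 / STRINGS_PER_SECOND
                  + reads_per_string * READ_SECONDS
                  + writes_per_string * WRITE_SECONDS)
    return per_string * TOTAL_STRINGS / 3600


for reads, writes in [(0, 0), (1, 1), (10, 10)]:
    hours = estimated_hours(reads, writes)
    print(f"{reads:>2} reads, {writes:>2} writes: "
          f"{hours:>9,.0f} hours = {hours / 24:>7,.1f} days")
```

With no disk access this comes out at about 119 hours; with one read and one write per string it is roughly 450 days; with ten of each it is on the order of 12 years.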
but a fair amount of that is going to happen regardless.
When I stopped the process this morning, it had been running just under 62.5 hours.
In that time it had processed 2,375,798,283 strings and detected just 52,286 possible dups.
The cost of writing those 52,000 strings/positions to disk during the 62.5 hours of runtime is negligible; unmeasurable. Lost in the noise of processor variability due to variations in mains voltage. Zilch.
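To put rough numbers on "negligible", the reported figures can be turned into rates. This is only a sketch of the arithmetic in Python; the variable names are illustrative, and 62.5 hours is taken as an upper bound on the runtime, so the real throughput is slightly higher than what it prints.

```python
# Figures reported for the interrupted run.
HOURS_RUN = 62.5               # "just under" this, so the real rate is a bit higher
STRINGS_DONE = 2_375_798_283
POSSIBLE_DUPS = 52_286

rate = STRINGS_DONE / (HOURS_RUN * 3600)                 # strings per second
dups_per_million = POSSIBLE_DUPS / STRINGS_DONE * 1_000_000
seconds_between_writes = HOURS_RUN * 3600 / POSSIBLE_DUPS

print(f"throughput: {rate:,.0f} strings/s")
print(f"possible dups: {dups_per_million:.1f} per million strings")
print(f"one candidate written every {seconds_between_writes:.1f} s on average")
```

That works out to a little over 10,500 strings/second, about 22 possible dups per million strings, and one write every 4 seconds or so. Assuming something like 0.005 seconds per write, that is around a tenth of one percent of the runtime.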
Running a second pass to verify those 52,000 possible dups as false or real will take less time, because a simple hash lookup can be used and there is no need to maintain the vectors.
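A minimal sketch of what such a second pass could look like, in Python rather than Perl. It assumes the candidates from the first pass fit comfortably in memory (52,000 strings do) and that the full input can be streamed again; `confirm_duplicates` and the sample data are made up for illustration and are not the script discussed here.

```python
from collections import Counter


def confirm_duplicates(strings, candidates):
    """Return the candidates that really occur more than once in `strings`.

    `strings` is any iterable over the full input (for example, a file read
    line by line); `candidates` is the small set flagged by the first pass.
    """
    counts = Counter()
    for s in strings:
        if s in candidates:      # plain hash lookup, no bit vectors needed
            counts[s] += 1
    return {s for s, n in counts.items() if n > 1}


# Tiny illustration with made-up data: "abc" is a real duplicate,
# "xyz" was a false positive from the first pass.
data = ["abc", "def", "xyz", "abc", "ghi"]
flagged = {"abc", "xyz"}
print(confirm_duplicates(data, flagged))
```

Every flagged string appears at least once in the input, so a count of exactly 1 marks a false positive and anything higher is a real dup.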
In any case, I'll take 125 hours of runtime over 1 1/4 years, every day of the week.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.