
You say that “a disk based solution isn’t feasible,”

Using my hashing/vectors algorithm, my script processes 10,000 strings/second. To process 2**32 strings (the absolute limit of the hashes) would require 2**32 / 10000 / 3600 = 119 hours. I.e. 0.0001s/string.

Using any sort of disk-based mechanism -- disk-based hash, b-tree, etc. -- would require at least 1 read and 1 write per string. Using the fastest (conventional) disks available, that means adding at least 0.004 seconds for the read and 0.005 seconds for the write to the processing time for every string.

So, that's 0.0001 + 0.004 + 0.005 = 0.0091 seconds per string, and 0.0091 * 2**32 / 3600 = 10857 hrs or 452 days. A little under 1 1/4 years.

And the reality is that it will likely require at least 10 reads & 10 writes for every record. I'll leave that math to you.
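For anyone who wants to redo that arithmetic with their own figures, here is a minimal sketch in Python. It takes the 10,000 strings/second rate and the 0.004s read / 0.005s write costs quoted above as given; the function name and constants are mine, purely for illustration, and it models nothing beyond simple multiplication (no caching, no seek clustering).

```python
# Back-of-envelope runtime estimate: CPU cost per string plus a fixed
# cost for every disk read and write. All figures come from the post;
# the names are illustrative only.
TOTAL_STRINGS = 2**32                  # absolute limit of the hashes
CPU_SECONDS_PER_STRING = 1 / 10_000    # 10,000 strings/second
READ_SECONDS = 0.004                   # assumed cost of one disk read
WRITE_SECONDS = 0.005                  # assumed cost of one disk write


def estimated_hours(reads_per_string=0, writes_per_string=0):
    """Hours needed to process TOTAL_STRINGS with the given I/O per string."""
    per_string = (CPU_SECONDS_PER_STRING
                  + reads_per_string * READ_SECONDS
                  + writes_per_string * WRITE_SECONDS)
    return TOTAL_STRINGS * per_string / 3600


scenarios = [
    ("in memory", 0, 0),
    ("1 read + 1 write", 1, 1),
    ("10 reads + 10 writes", 10, 10),
]
for label, reads, writes in scenarios:
    hours = estimated_hours(reads, writes)
    print(f"{label:>22}: {hours:>9,.0f} hours = {hours / 24:>7,.0f} days")
```

On those assumptions the 10 reads & 10 writes case comes out at roughly twelve years.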

but a fair amount of that is going to happen regardless.

When I stopped the process this morning, it had been running just under 62.5 hours.

In that time it had processed 2,375,798,283 strings and detected just 52,286 possible dups.
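Those two figures are enough to check the throughput claim. A quick sketch in Python, using only the numbers quoted above (the variable names are mine, and 62.5 hours is treated as exact even though the run was slightly shorter):

```python
# Observed throughput and possible-dup rate from the interrupted run.
ELAPSED_HOURS = 62.5
STRINGS_PROCESSED = 2_375_798_283
POSSIBLE_DUPS = 52_286

rate = STRINGS_PROCESSED / (ELAPSED_HOURS * 3600)   # strings per second
projected_hours = 2**32 / rate / 3600               # time for the full 2**32
one_in = STRINGS_PROCESSED / POSSIBLE_DUPS          # strings per possible dup

print(f"observed rate:   {rate:,.0f} strings/second")
print(f"projected total: {projected_hours:.0f} hours for 2**32 strings")
print(f"possible dups:   about 1 in {one_in:,.0f} strings")
```

The observed rate is a little over 10,000 strings/second, so the full 2**32 strings would take slightly less than the 119 hours estimated earlier.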

The cost of writing those 52,000 strings/positions to disk during the 62.5 hours of runtime is negligible; unmeasurable. Lost in the noise of processor variability due to variations in mains voltage. Zilch.

Running a second pass to verify those 52,000 possible dups as false or real will require less time, because a simple hash lookup can be used and there is no need to maintain the vectors.
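A minimal sketch of what such a second pass might look like, in Python. This is not my actual script: it assumes the first pass saved the possible dups as plain strings, and the function name and sample data are hypothetical. Only the possible dups are held in memory, so the lookup table stays tiny however large the input is.

```python
from collections import defaultdict


def verify_candidates(strings, candidates):
    """Return {string: [positions]} for candidates that occur more than once.

    `strings` is any iterable over the full input (e.g. an open file);
    `candidates` holds the possible dups recorded by the first pass.
    """
    wanted = set(candidates)            # simple hash lookup, no vectors
    positions = defaultdict(list)
    for pos, s in enumerate(strings):
        if s in wanted:
            positions[s].append(pos)
    # A candidate seen only once was a false positive from the first pass.
    return {s: where for s, where in positions.items() if len(where) > 1}


# Tiny illustrative run: "gamma" is a false positive, the others are real.
data = ["alpha", "beta", "gamma", "alpha", "delta", "beta", "alpha"]
possible_dups = ["alpha", "gamma", "beta"]
for s, where in sorted(verify_candidates(data, possible_dups).items()):
    print(s, where)
```

With real data, `strings` would be a generator over the input file rather than a list, so the second pass still reads the input once, sequentially.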

In any case, I'll take 125 hours of runtime over 1 1/4 years, every day of the week.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In reply to Re^2: [OT] The statistics of hashing. by BrowserUk
in thread [OT] The statistics of hashing. by BrowserUk
