Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
You say that ďa disk based solution isnít feasible,Ē

Using my hashing/vectors algorithm, my script processes 10,000 strings/second. To process 2**32 strings (the absolute limit of the hashes) would require 2**32 / 10000 / 3600 = 119 hours. IE. 0.000001s/string.

Using any sort of disk-based mechanism -- disk-based hash, b-tree, etc. -- would require at least 1 read and 1 write per string. Using the fastest (conventional) disks available that means adding at least 0.004 seconds for the read and 0.005 seconds for the write, to the processing time for every string.

So, that's 0.000001 + 0.004 + 0.005 = 0.009001 * 2**32 / (3600) = 10738 hrs or 447 days. A little under 1 1/4 years.

And the reality is that it will likely require at least 10 reads & 10 writes for every record. I'll leave that math to you.

but a fair amount of that is going to happen regardless.

When I stopped the process this morning, it had been running just under 62.5 hours.

In that time it had processed 2,375,798,283 strings and detected just 52,286 possible dups.

The cost of writing those 52,000 strings/positions to disk during the 62.5 hours of runtime is negligible; unmeasurable. Lost in the noise of processor variability due to variations in mains voltage. Ziltch.

Running a second pass to verify those 52,000 possible dups as false or real will require less time as a simple hash lookup can be used, so there is no need to maintain the vectors.

In any case, I'll trade 125 hours of runtime, for 1 1/4 years, every day of the week.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In reply to Re^2: [OT] The statistics of hashing. by BrowserUk
in thread [OT] The statistics of hashing. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others studying the Monastery: (5)
    As of 2020-11-27 19:21 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found