http://qs321.pair.com?node_id=1127942

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

If you have 150e6 64-bit values to store for efficient lookup, Perl's hashes (~14GB) and arrays (~5GB) are very expensive in memory, and a bitmap is out of the question.

The array is feasible, but lookups (binary search) require it to be sorted, and that's quite expensive for this size of array, even when using one of salva's in-place XS modules.

The values are generated at runtime, and discarded at program end, so DBs are pointless. Even an in-memory sqlite DB which stores numbers as strings is off the cards.

I'm going to have to drop into Inline::C for this for both space and performance reasons.

A straight C array of 64-bit ints is ~1.2GB which is fine; but again sorting it so I can do O(logN) lookups is expensive.
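For reference, the sort-once-then-binary-search baseline described above could look roughly like the following. This is only a sketch: the names cmp_u64, sort_values and contains are made up for illustration, and it uses the standard library qsort rather than any of the XS sort modules mentioned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t; compares instead of subtracting,
   because a 64-bit difference does not fit in an int. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the values in place, once, after they have all been generated. */
static void sort_values(uint64_t *vals, size_t n)
{
    assert(vals != NULL || n == 0);
    qsort(vals, n, sizeof vals[0], cmp_u64);
}

/* Return 1 if key is present in the sorted array, else 0. O(log n). */
static int contains(const uint64_t *vals, size_t n, uint64_t key)
{
    size_t lo = 0, hi = n;              /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (vals[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && vals[lo] == key;
}
```

At 8 bytes per value this keeps the ~1.2GB footprint; the cost is the one-off sort plus about 27 comparisons per lookup for 150e6 values (log2 of 150e6 is roughly 27.2).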

I keep thinking about heaps (or Beaps or B-heaps or other variations), structures that "order" the values as they are inserted; but once built, can any of them be used for efficient (O(logN) or better) lookup?

Wikipedia isn't giving me much on the subject of lookup/searches.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re: Heap structure for lookup?
by hdb (Monsignor) on May 27, 2015 at 08:24 UTC

    From the old days I remember the term red-black tree for such search operations, and a lookup on CPAN says there are modules for it. Not sure whether this suits your requirement.

      The problem is that Tree::RedBlack is implemented in pure Perl, using a blessed hash for each node.

      Without having tried it, I estimate that a tree to hold my 150e6 values would require ~40GB of ram.

      I haven't found an XS implementation, but even then it would require at least 2x64-bit pointers + a 64-bit pointer to an SvUV + 1-bit per node. If they stored the R/B in an unused bit in one of the pointers, then that's 56 bytes * 150e6 = ~8GB, which is too much.



        This means (binary) trees generally are out of the question, correct?

        Looking in a different direction, is there any structure in your values that might be leveraged? Or are they more or less random 64-bit integers?

Re: Heap structure for lookup?
by dave_the_m (Monsignor) on May 27, 2015 at 09:21 UTC
    Once the ints have been read in, how many lookups of them will be performed relative to the number of ints? And are the lookups just to test for existence, or to retrieve an associated value? Is there any structure to the ints - e.g. do they cluster together or are they random 64-bit values?

    Dave.

      "Once the ints have been read in"

      They're actually generated earlier in the algorithm and will be different from run to run.

      "how many lookups of them will be performed relative to the number of ints?"

      That depends upon how long the (DNA) string is. For the whole human genome, circa 3e9 lookups.

      "Is there any structure to the ints - e.g. do they cluster together or are they random 64-bit values?"

      Effectively random and distributed across the full 2^64 range.

      The other thought I'm having is to mod the values and use that directly as a hash for inserting into a pre-allocated array -- I can calculate how many values will be generated in advance.

      Then the problem becomes how to deal with collisions.

      My current thinking is to presize the array to the next large prime number and then use (hash+prime)%prime to step around the array until an empty slot is found.

      For any given starting point that should take me through every single slot before returning to the first, by which time I will have found an empty slot.
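A minimal sketch of that kind of prime-sized, open-addressed table follows. Several things here are assumptions rather than anything stated in the post: the u64set names are invented, 0 is reserved as the "empty" marker (with key 0 tracked by a flag, since the values span the full 2^64 range), and the stepping rule is double hashing with a step of 1 + key % (cap - 1). Because the capacity is prime, any step between 1 and cap - 1 visits every slot exactly once before returning to the start.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *slots;     /* 0 marks an empty slot (assumption of this sketch) */
    size_t    cap;       /* number of slots; the caller supplies a prime */
    size_t    used;      /* occupied slots, not counting the zero key */
    int       has_zero;  /* key 0 is tracked here because 0 means "empty" */
} u64set;

static int u64set_init(u64set *s, size_t prime_cap)
{
    assert(prime_cap >= 3);
    s->slots = calloc(prime_cap, sizeof s->slots[0]);
    s->cap = prime_cap;
    s->used = 0;
    s->has_zero = 0;
    return s->slots != NULL;
}

/* Return the slot holding key, or the empty slot where it would go.
   Returns cap if every slot was visited without finding either. */
static size_t u64set_probe(const u64set *s, uint64_t key)
{
    size_t i    = (size_t)(key % s->cap);            /* the value mod the table size */
    size_t step = 1 + (size_t)(key % (s->cap - 1));  /* 1 .. cap-1 */
    for (size_t n = 0; n < s->cap; n++) {
        if (s->slots[i] == 0 || s->slots[i] == key)
            return i;
        i += step;
        if (i >= s->cap)
            i -= s->cap;
    }
    return s->cap;
}

/* Returns 1 if key is now in the set, 0 if the table is full. */
static int u64set_insert(u64set *s, uint64_t key)
{
    if (key == 0) { s->has_zero = 1; return 1; }
    size_t i = u64set_probe(s, key);
    if (i == s->cap)
        return 0;
    if (s->slots[i] == 0) { s->slots[i] = key; s->used++; }
    return 1;
}

static int u64set_contains(const u64set *s, uint64_t key)
{
    if (key == 0)
        return s->has_zero;
    size_t i = u64set_probe(s, key);
    return i != s->cap && s->slots[i] == key;
}

static void u64set_free(u64set *s)
{
    free(s->slots);
    s->slots = NULL;
}
```

Using the raw value mod the table size as the hash only makes sense because the values are described as effectively random; clustered inputs would want a mixing step first.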

      What I don't have a feel for is how many collisions I'm going to get as I approach full? And how much bigger do I have to make the array to reduce that to a reasonable number? (I'm coding up a test now.)
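As a rough guide to that question, the classical textbook approximations for open addressing give the expected number of probes for an insert (or a failed lookup) as a function of the load factor. The sketch below encodes them; it assumes well-mixed keys, which matches the "effectively random" description, and the function names are invented for illustration. It is an estimate, not a substitute for the empirical test.

```c
#include <assert.h>
#include <stddef.h>

/* Expected probes for an insert or unsuccessful lookup with linear probing
   (step of 1) at the given load factor: (1 + 1/(1-load)^2) / 2. */
static double probes_linear(double load)
{
    assert(load >= 0.0 && load < 1.0);
    double d = 1.0 - load;
    return 0.5 * (1.0 + 1.0 / (d * d));
}

/* Same quantity when the step depends on the key (double hashing),
   using the uniform-hashing approximation: 1/(1-load). */
static double probes_double_hash(double load)
{
    assert(load >= 0.0 && load < 1.0);
    return 1.0 / (1.0 - load);
}

/* Slots needed to hold n keys at no more than the given load factor.
   The result is not rounded up to a prime. */
static size_t slots_for_load(size_t n, double load)
{
    assert(load > 0.0 && load < 1.0);
    size_t q = (size_t)((double)n / load);
    if ((double)q * load < (double)n)
        q++;
    return q;
}
```

For example, 150e6 keys at a load factor of 0.75 need 200e6 slots (~1.6GB at 8 bytes each); the estimates at that load are about 4 probes per insert with a key-dependent step and about 8.5 with a step of 1. Both climb steeply as the table approaches full: at a load of 0.9 they are roughly 10 and 50.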


Re: Heap structure for lookup?
by RichardK (Parson) on May 27, 2015 at 09:18 UTC

    I'd look at using a B-tree for this type of problem; lookups are efficient once the tree is created. Balancing the tree when adding or removing elements can be somewhat expensive, but B-trees don't need rebalancing as frequently as binary trees, so it's not too bad. Many databases use B-trees -- but you seem to be ruling those out for some reason ;)

      They certainly reduce the infrastructure overhead (number of pointers), but the last time I used a B-tree was ~20 years ago, and my memory of them is ***AAAAAAAAAAAAARGGGG!***.

      We were using a library purchased from a 3rd party company that went to the wall, and we needed to move from 16 to 32-bit compiles.

      We had the source code and thought it would be reasonably easy, till we looked. It was a nightmare. The algorithms are really quite involved.



        As you're only storing single values it shouldn't be too bad; only the balancing code gets a bit involved. When you split the current node, you insert into its parent node, which may cause it to split, and so on back up to the root node. It's really not that hard, once you get your head round it.

        Anyway you've had 20 more years of experience since then, so you shouldn't have any problems ;)

Re: Heap structure for lookup?
by ohcamacj (Beadle) on May 27, 2015 at 21:04 UTC
    Judy arrays sounded cool. It's a somewhat hideously complex tree data type that uses multiple types of branch nodes to achieve memory compression.

      Yes. I've played with those for a different project quite recently.

      What I discovered was that they work really well if they can keep their data structures in L1 cache.

      But using them from Perl code inevitably means using perl's hashes (all of Perl's namespaces (packages) are based around hashes), and they by their very nature exhibit very poor locality of reference and thus mess with the caches, with the result that the performance of Judy arrays drops away significantly.

      Also, they are, as you mention, horribly complicated beasts, which is fine until something goes wrong, and then you're up Sneak Creek without a paddle.

