http://qs321.pair.com?node_id=381848


in reply to size on disk of tied hashes

I'll second the recommendation for something like BerkeleyDB, but I'll also have to give you some sobering performance news. You're going to be pushing the limits of what is feasible. To get where you want in any reasonable time, you'll have to do a lot more work than you planned on. Let me get you started.

There are two mechanisms in BerkeleyDB for fast name lookup like you want. One is a hash, and the other is a BTree.

A hash works by mapping each key to an apparently random bucket and then looking in that bucket. If it has done a good job of randomizing, that bucket will hold only a handful of items, and it is easy to find your data. If every piece of your data is equally likely to be accessed, hashes have the fastest average performance.

BTrees are a special kind of tree, carefully designed so that you can insert data in any order you want and still avoid having long "chains" develop. If your data has substantial locality of reference, then BTrees can be substantially faster than hashes, because they let caches work to their utmost: the goal is to fetch data out of RAM without going to disk at all. By contrast, a hash (by design) accesses data in an apparently random fashion, which minimizes how much caching helps you.

If you can arrange to access your data in a somewhat ordered fashion, then BTrees should be substantially faster than hashes. If you can't, then with the amount of data that you have, hashing will be faster. With your real data it may take careful benchmarking to figure out which one handles it best.
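To make that concrete, here is a minimal sketch of tying a hash to each access method with the BerkeleyDB Perl module. The filenames and the key/value shapes are just placeholders, not anything from your problem, and you'll want to check the option names against the version of the module you have:

    use strict;
    use warnings;
    use BerkeleyDB;

    # Random-access lookups: let BerkeleyDB hash each key into a bucket.
    my %by_hash;
    tie %by_hash, 'BerkeleyDB::Hash',
        -Filename => 'names_hash.db',     # placeholder filename
        -Flags    => DB_CREATE
      or die "Cannot tie hash file: $BerkeleyDB::Error";

    # Ordered, locality-friendly access: keep the keys sorted in a BTree.
    my %by_btree;
    tie %by_btree, 'BerkeleyDB::Btree',
        -Filename => 'names_btree.db',    # placeholder filename
        -Flags    => DB_CREATE
      or die "Cannot tie btree file: $BerkeleyDB::Error";

    $by_hash{'some name'}  = 'some value';
    $by_btree{'some name'} = 'some value';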

How bad is hitting disk going to be? Well, looking at your numbers, it looks to me like you have about a billion data points. What does it take to access disk a billion random times? Disks today take about 0.01 seconds per seek. A billion of those takes 10,000,000 seconds, which is about four months. And that is assuming that you only need to access each location once. In fact, while building a hash the average location on disk gets written once, read so it can be moved, and then written again, so that is 3 accesses per location (a move takes 2 accesses: one read and one write). So close to a year just to load the data, and another four months to read through it once. Call it fifteen months to load and run through it. And that isn't even counting CPU time or whatever time your logic wants to take!
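If you want to redo that arithmetic with your own numbers (and you should), it is only a few lines of Perl. The seek time and access counts below are the assumptions from the paragraph above, not measurements:

    my $seek_time = 0.01;              # seconds per random seek (assumption)
    my $locations = 1_000_000_000;     # roughly a billion data points (assumption)
    my $accesses  = 3 + 1;             # write, read-to-move, rewrite, plus one reading pass

    my $seconds = $seek_time * $locations * $accesses;
    printf "%.0f seconds, about %.0f days\n", $seconds, $seconds / 86_400;
    # prints: 40000000 seconds, about 463 days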

Hopefully that scares you. And this is a bad case that you could easily do a lot worse than. Getting around those facts of life is going to take some work, but knowing the issue, you can improve. For instance, if you can tell the database when you create it how big it will eventually be, then the eight months of work spent recopying data as the hash grows goes away. Woo-hoo! We're down to about eight months! (Maybe.)
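In the Perl BerkeleyDB module that hint is (as I understand it) the -Nelem option to BerkeleyDB::Hash, with -Ffactor as a rough keys-per-bucket guess and -Cachesize to throw RAM at the problem. The numbers here are placeholders, so check them against your data and your version of the module:

    use strict;
    use warnings;
    use BerkeleyDB;

    my %counts;
    tie %counts, 'BerkeleyDB::Hash',
        -Filename  => 'counts.db',          # placeholder filename
        -Flags     => DB_CREATE,
        -Nelem     => 1_000_000_000,        # expected final number of keys (guess high)
        -Ffactor   => 32,                   # rough estimate of keys per bucket
        -Cachesize => 256 * 1024 * 1024     # as much cache as you can afford
      or die "Cannot tie: $BerkeleyDB::Error";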

Now let me give you a best case. Suppose that your data can be presented in non-overlapping chunks of pretty much sorted data, and it is OK to access it in the same way. So you get to use a BTree. And you have done some parallelization logic, and found that you can run 4 copies at once (each taking a different part of the range). Well, your 40 GB of data takes, say, 100 GB of disk, or about 50 million pages. Each page you access directly twice (once in the writing pass, once in the reading pass), and the rest of the time you find it in cache. So that's about 100 million disk accesses, of which 4 are going concurrently, or 25 million disk accesses per process. Which takes about 3 days. But it gets better than that. Because when you access data sequentially, the disk does not need to do a seek per page; instead it does some intelligent read-ahead. I don't know how big that is, but if we estimate a factor of 10, then you're actually done in about 7 hours. Plus computation overhead, so call it a day. Plus development time. (You know, I really hope that I'm not messing up these numbers...)
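For what it's worth, here is a rough sketch of that kind of range-partitioned, parallel load using fork and one BTree file per worker. The four ranges, the file names, and the load_sorted_chunk() stub are all invented for illustration; the real loader is the part you'd have to write:

    use strict;
    use warnings;
    use BerkeleyDB;

    # Invented partitioning of the key space; choose boundaries that fit your data.
    my @ranges = ( ['a', 'f'], ['g', 'm'], ['n', 's'], ['t', 'z'] );

    my @pids;
    for my $i (0 .. $#ranges) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid) { push @pids, $pid; next; }   # parent: note the child, start the next one

        my %chunk;
        tie %chunk, 'BerkeleyDB::Btree',
            -Filename => "chunk_$i.db",         # one file per worker
            -Flags    => DB_CREATE
          or die "Cannot tie chunk $i: $BerkeleyDB::Error";

        load_sorted_chunk(\%chunk, @{ $ranges[$i] });
        untie %chunk;
        exit 0;
    }
    waitpid($_, 0) for @pids;                   # wait for all four loaders to finish

    sub load_sorted_chunk {
        my ($db, $lo, $hi) = @_;
        # Placeholder: in real code, stream the presorted records for [$lo .. $hi] into %$db.
        $db->{$lo} = "first record in range $lo-$hi";
    }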

Before you feel cheered, this is an ideal case that you're unlikely to get anywhere near.

Furthermore note that the figures that I've given you have little to do with how BerkeleyDB works, and a lot to do with how hardware works. The issues that I am raising are intrinsic to what you want to do, and are sufficiently complex that you won't get it automatically optimized any time soon.

To understand more about how BerkeleyDB does (and does not) work, I highly recommend this guide. Hopefully you can figure out the application issues, do everything in your power to make sure the work can be parallelized, and take any opportunities to improve the process. If there is any possibility at all of partitioning the problem so that different subsets can be analyzed separately on different computers, I highly recommend that you do so. I don't know where you'd go to understand the disk issues better; a good sysadmin or DBA might help. They can talk to you about things like how RAID 1 can reduce average read seek time, how to stripe data across different disks, and so on. Or where cache sizes are going to kick in and make your test runs look better than the real thing will. Or a ton of other stuff that I don't know, because I'd need to learn about it myself to give better advice.

And above all else, when you deal with this much data, you have to learn to constantly do back-of-the-envelope estimates like the ones above. Your intuition about what will and will not be significant is going to be wrong, and it takes experience to figure out what matters. Furthermore, when your back-of-the-envelope estimate turns out to be seriously wrong, you'll learn about factors that you hadn't thought of. (Like disk seek time!)

UPDATE: I should mention that my estimate of disk seek time was pulled from http://www.pcguide.com/ref/hdd/perf/perf/spec/posSeek-c.html. For a sanity check, see the specs for Seagate. (Other manufacturers are similar.) And remember that everyone does everything that they can to avoid having to seek - with great success. It is rare in normal practice for requests for data to cause a random seek to disk.