Re: size on disk of tied hashes

I'll second the recommendation for something like BerkeleyDB, but I'll also have to give you some sobering performance news. You're going to be pushing the limits of what is feasible. To get where you want in any reasonable time, you'll have to do a lot more work than you planned on. Let me get you started.

There are two mechanisms in BerkeleyDB for fast name lookup like you want. One is a hash, and the other is a BTree.

A hash works by mapping each piece of data to an apparently random bucket, and then goes and looks in that bucket. If it has done a good job of randomizing the data, then that bucket will have a very small number of items in it, and it is easy to find your data. If access speed is equivalent for all of your data, hashes have the fastest average performance.

BTrees are a special kind of tree that is carefully designed to allow them to be inserted to in any way you want and still avoid having long "chains" develop. If your data has substantial locality of reference, then BTrees can be substantially faster than hashes because they enable caches to work to their utmost. The goal is to try to fetch data out of RAM without going to disk. By contrast a hash (by design) accesses data in an apparently random fashion, which minimizes how much caching helps you.

If you can arrange to access data in a somewhat ordered fashion, then BTrees should be substantially faster than hashes. If you can't, with the amount of data that you have, hashing will be faster. With your real data it may take careful benchmarking to figure out how to best handle it.

How bad is hitting disk going to be? Well looking at your numbers, it looks to me like you have about a billion data points. What does it take to access disk a billion random times? Well disks today take about 0.01 seconds per seek. A billion of those takes 10,000,000 seconds, which is about half a year. And that is assuming that you only need to access the data once per location. In fact while writing a hash the average location on disk gets written once, read to be moved once, and then written to again, so that is 3 accesses per location (a move takes 2 accesses - one read and one write). So a year and a half to load the data, and a half-year to read it once. 2 years to load and run through it. And that isn't even counting CPU time or whatever time your logic wants to take!

Hopefully that scares you. This is a bad case that you could easily do a lot worse than. Getting around those facts of life is going to take some work. But knowing the issue, you can improve. For instance if you can tell the database when you create it how big it will eventually be, then the year of work spent recopying data when the hash has to grow goes away. Woo-hoo! We're down to a year! (Maybe.)

Now let me give you a best case. Suppose that your data can be presented in non-overlapping chunks of pretty much sorted data. And it is OK to access it in the same way. So you get to use a BTree. And you have done some parallelization logic, and found that you can run 4 copies at once (each taking a different part of the range). Well your 40 GB of data takes, say, 100 GB of disk. Or about 50 million pages. Each page of which you access directly twice (once in the writing pass, once in the reading pass) and the rest of the time you find in cache. So that's about 100 million disk accesses, of which 4 are going concurrently, or 25 million disk accesses per process. Which takes about 3 days. But it gets better than that. Because when you access data sequentially, the disk does not need to do a seek per page, instead it does some intelligent read-ahead. I don't know how big that is, but we we estimate a factor of 10, then you're actually done in 7 hours. Plus computation overhead, so call it a day. Plus development time. (You know, I really hope that I'm not messing up these numbers...)

Before you feel cheered, this is an ideal case that you're unlikely to get anywhere near.

Furthermore note that the figures that I've given you have little to do with how BerkeleyDB works, and a lot to do with how hardware works. The issues that I am raising are intrinsic to what you want to do, and are sufficiently complex that you won't get it automatically optimized any time soon.

To understand more about how BerkeleyDB does (and does not work), I highly recommend this guide. Hopefully you can figure out the application issues, and will do everything in your power to make sure that it is something which can be parallelized, and that any opportunities to improve the process are taken. If you have any possibility at all of partitioning the problem so that different subsets can be analyzed separately on different computers, I highly recommend that you do so. I don't know where you'd go to understand the disk issues better. A good sysadmin or DBA might help. They can talk to you about things like how RAID 1 can reduce average read seek time, how to stripe data on different disks, etc. Or where cache sizes are going to kick in and cause your test runs to look better than the real thing will. Or a ton of other stuff that I don't know because I'd need to learn about it myself to give better advice.

And above all else, when you deal with this much data, you have to learn to constantly do back of the envelope estimates like I've done above. Because your intuition about what will and will not be significant is going to be wrong, and it takes experience to figure out what matters. Furthermore when your back of the envelope estimate turns out to be seriously wrong, then you'll learn about factors that you hadn't thought of. (Like disk seek time!)

UPDATE: I should mention that my estimate of disk seek time was pulled from http://www.pcguide.com/ref/hdd/perf/perf/spec/posSeek-c.html. For a sanity check, see the specs for Seagate. (Other manufacturers are similar.) And remember that everyone does everything that they can to avoid having to seek - with great success. It is rare in normal practice for requests for data to cause a random seek to disk.

Comment on Re: size on disk of tied hashes

Replies are listed 'Best First'.
Re^2: size on disk of tied hashes by danderson (Beadle) on Aug 11, 2004 at 21:21 UTC
Thanks for your input, everybody. Wow, tilly, that was a heck of a post. I know how I'm going to do this now, so I might as well explain it (so you don't all think I'm a pointless waste of time!) I have worked out the data to be under 50 million datapoints. For the initial sorting I'm going to borrow two gigs of ram to minimize hard drive hits (too bad I can't add space to the L1 and L2 caches... yes, it's going to run on x86, yes, I know that's suboptimal). I'm also going to borrow a second, large harddrive, and throw in a small one I have handy. The sorting algorithm, roughly, will be: -Save the data to the borrowed HD. Yeah, I know, it might take a while - that's OK, I've got no choice on this one. Save it to flat files of an appropriate size (to be empirically tested first) -Split the KVPs. That is, read each key/value pair out, append the key to a keylist (with 'pointer' data) and the data to a data file. The keys file will be on the small HD, the data will be split among several files (obviously <2GB) on the purchased HD. This splitting will proceed at the slowest of the HD's read/write accesses, but since they'll all be sequential there will be very good (write or read)/page access and page access/cache ratios, effectively approaching the maximum read/write speed of the slowest HD. This requires trivial CPU and RAM, so they aren't considerations. -sort. Not the data, obviously, I'm going to leave that well enough alone for now. The keys and their new (much smaller) 'pointers' (64 bits/8 bytes will do it no problem, split into a file part and an offset part) will be sorted in a B-tree. The tree unfortunately will have a bit of a performance problem, taking only slightly better than knlog(n) time to complete (assuming berkely does early splitting, which is probable). Since these accesses will be effectively random, that's potentially quite a bit over 50 million HD cache misses. At 8.5ms/miss (the small one actually is a barracuda, so thanks for that link, tilly!) plus an average rotational latency of ~5.5ms, that's 14ms/miss. That's three years. Gah. Thankfully, there's a workaround - I've done some quick calculations on sample sets of data, and the keys (yes, there will be duplicates) are almost all under 16 characters, and I haven't seen any over 32. Note that I said characters! Since I don't mind getting a few false positiveS, these can (reasonably quickly) be converted into a binary representation (throwing everything that's not a char out, and lowercasing all chars, then subbing from a to get a number that's expressable in 5 bits). With 64 bit keys, in 32 bytes I can express keys of up to 38 'characters' - perfect! B-trees shouldn't need more than 8 bytes per node (prev, next, lt, gt, each with 64bits), so each element will be expressable in 40 bytes. With 2G of RAM, very little of the data will have to be on the disk (I'm going to have to look into semi-memory-mapped files, unfortunately). Huzzah! Now my sort time is large, certainly, but easily under a day, especially considering the good cache hit ratio for the 'top' of the tree. Once done, writing to the HD at half it's max sustained write speed (I figure this is a fair metric) is 2G/20Mbytes/s = under two minutes! Excellent. So by this point the data will be on the correct hard drive, the keys will be on the smaller one, and the memory and spare HD can be returned. Cool. Access time? No problem. With a width of four (less than what Berkely acutally uses, I believe) this will take under 8 hard drive hits, given a balanced tree. That's a tenth of a second, which admittedly could be better, but ah well. Once you consider that the top of the tree is probably going to stay in the hard drive's cache, this will probably be around two thirds to half of that on average. Then there's the hit to the data drive, which will be around or under 20ms. Sure, this is slow, but (in the beginning - oh god, don't ask me how I'm going to scale it if/when the time comes) usage probably won't be over 1/s, so we're good. And... the kicker. Updates! Updates will be painful - and at the rate of around 30,000/day, give or take, that's one every three seconds! Fortunately the insertion will only be one HD hit slower than searches, so it'll be OK - the hard part, which I won't go into here, is finding the correct 30k to update (long, long explanation). So: it's doable. A hash would be great for it's constant-time access, but building and updating it would be painful (I think - I've dealt with lots of weird tree variations, but most of the hashing algorithms I've seen have been the trivial case, so it's entirely possible there's a good hash solution out there).	[reply]
Re^3: size on disk of tied hashes by BrowserUk (Patriarch) on Aug 12, 2004 at 12:26 UTC
If you find Berkeley too slow, or too memory expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are: Indexing 100_000_000, 160 byte records in a data file. The index requires 2.23 GB for 100 Million records. Data record size does not affect the index size. It is a fixed 24-bytes/record. Building and sorting the index: 3 hours (worst case; under 1 hour best); Accessing 10_000 randomly chosen records: Under 3 minutes. That's locating in the index entry and reading the record combined. Worse timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial Update: A 100,000 thrials just completed: 100000 trials of binsearch ( 1,643s total), 16.437ms/trial Insert/delete* a record: Currently 1.2 seconds. This can be improved, I believe substantially. Insertion appends the new record to the end of the data file, and inserts the appropriate index entry. * Deletion consists of removing the index entry and adding it to a "deleted.idx" file. The actual record remains in-place until a compaction process is run. The latter is not part of the timing above. The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering , but they need testing. The test system was XP/NTFS on a 2.2 Ghz processor with a WDC WD800BB-75CAA0 75GB HD. The datafile is compressed, the index not. For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours and 57% complete. Hence, I don't have figures for access/insertions or deletions. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^3: size on disk of tied hashes by tilly (Archbishop) on Aug 11, 2004 at 21:49 UTC
Big suggestion. Your first step should be to take your data and do a mergesort to get it into order. Mergesorts sequentially process data in order, and therefore represent a best case for standard pre-fetch/caching algorithms used in filesystems and on disk. Inserting sorted data into a BTree again causes filesystem caching to do you a world of good. After that I'd expect your access speed to be better than you're planning on. Smart BTree implementations don't have a constant branch factor - instead they put as many branches into a page as they can. Even with the amount of data that you're facing, I think that a BTree should require no more than 5 page accesses to pull your data. And better yet, the root pages are usually going to be in memory so you'll probably average under 2 hits to disk for data. (Plus if there is some locality of reference in how it is used, then it could be less.) As for scaling, you can get a lot out of a bit of parallelism. The fact that one request needs to wait disk is no reason that another request cannot be being served as well. One good architecture for that looks like a dedicated server with several threads or processes - you connect to one of them and it serves you. Unbeknownst to you, others are doing the same. (I'd strongly recommend against using BerkeleyDB in a CGI environment because of problems with server start/stop. In a mod_perl environment is fine, but in a CGI environment there are potential issues that you don't want to hit.) Of course if you're going to talk about that, then you might as well upgrade to the invented wheel of a decent RDBMS with row-level locking. Modifying this recipe to an RDBMS, create your table, and create an index on it that will internally be a BTree. (How you do that varies by database - see a DBA and/or documentation.) Then take your dataset, do the mergesort on your own, and proceed to insert it into that table in order. If you've lined things up right, the load should proceed at reasonable speed, and when you're done the database has already figured out how to coordinate having multiple processes making requests of it at once.	[reply]


laziness, impatience, and hubris
	PerlMonks