Re^3: size on disk of tied hashes

by tilly (Archbishop)
on Aug 11, 2004 at 21:49 UTC


in reply to Re^2: size on disk of tied hashes
in thread size on disk of tied hashes

Big suggestion. Your first step should be to take your data and do a mergesort to get it into order. Mergesorts process data sequentially, and therefore represent a best case for the standard pre-fetch/caching algorithms used by filesystems and disks. Inserting the sorted data into a BTree in key order again lets filesystem caching do you a world of good, since consecutive records land on the same pages.
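
To make that concrete, here is a minimal sketch of the sort-then-load step, assuming tab-separated key/value lines in a text file (data.txt and data.db are made-up names) and the stock DB_File module. The external sort(1) does the mergesort even for data that won't fit in memory.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Step 1: mergesort the raw data on its key (assumed to be the first,
    # tab-separated field).  sort(1) handles files larger than memory.
    system("sort data.txt > data.sorted") == 0
        or die "sort failed: $?";

    # Step 2: load the sorted records into a BTree.  Because keys arrive
    # in order, page writes stay mostly sequential and cache-friendly.
    tie my %db, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot open data.db: $!";

    open my $in, '<', 'data.sorted' or die "Cannot read data.sorted: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $db{$key} = $value;
    }
    close $in;
    untie %db;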

After that I'd expect your access speed to be better than you're planning on. Smart BTree implementations don't have a constant branch factor - instead they put as many branches into a page as they can. Even with the amount of data that you're facing, I think that a BTree should require no more than 5 page accesses to pull your data. And better yet, the root pages are usually going to be in memory so you'll probably average under 2 hits to disk for data. (Plus if there is some locality of reference in how it is used, then it could be less.)
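
For a back-of-the-envelope check on that page count (the figures here are illustrative assumptions, not numbers from the thread): with pages holding on the order of 200 keys each, tree depth grows only with the logarithm of the record count.

    # Illustration only: 200 keys/page and 10**9 records are assumptions.
    use POSIX qw(ceil);
    my $records  = 1_000_000_000;
    my $per_page = 200;
    my $depth    = ceil(log($records) / log($per_page));
    print "About $depth page accesses per lookup\n";   # prints 4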

As for scaling, you can get a lot out of a bit of parallelism. The fact that one request is waiting on disk is no reason another request cannot be served at the same time. One good architecture for that is a dedicated server with several threads or processes - you connect to one of them and it serves you while, unbeknownst to you, others are doing the same. (I'd strongly recommend against using BerkeleyDB in a CGI environment because of problems with server start/stop. A mod_perl environment is fine, but in a CGI environment there are potential issues that you don't want to hit.)
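
As a rough illustration of that dedicated-server idea (not anything from the original thread): a forking TCP server where each child opens its own read-only handle on the BTree, so one request blocking on disk doesn't stall the others. The port, the one-key-per-line protocol, and the file name are all assumptions.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Socket::INET;
    use Fcntl;
    use DB_File;

    $SIG{CHLD} = 'IGNORE';              # reap finished children automatically

    my $listener = IO::Socket::INET->new(
        LocalPort => 9000,
        Proto     => 'tcp',
        Listen    => 10,
        ReuseAddr => 1,
    ) or die "Cannot listen: $!";

    while (my $client = $listener->accept) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid) {                     # parent: go back to accepting
            close $client;
            next;
        }
        # Child: its own read-only handle, so lookups run in parallel.
        tie my %db, 'DB_File', 'data.db', O_RDONLY, 0644, $DB_BTREE
            or die "Cannot open data.db: $!";
        while (my $key = <$client>) {
            chomp $key;
            my $answer = exists $db{$key} ? $db{$key} : '';
            print {$client} "$answer\n";
        }
        untie %db;
        close $client;
        exit 0;
    }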

Of course if you're going to talk about that, then you might as well upgrade to the already-invented wheel of a decent RDBMS with row-level locking. To adapt this recipe to an RDBMS, create your table and create an index on it, which will internally be a BTree. (How you do that varies by database - see a DBA and/or the documentation.) Then take your dataset, do the mergesort on your own, and insert it into that table in order.

If you've lined things up right, the load should proceed at reasonable speed, and when you're done the database will already have figured out how to coordinate multiple processes making requests of it at once.
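
A sketch of that recipe through DBI, with the DSN, credentials, and two-column schema as placeholders (the primary key is what gives you the BTree index; data.sorted is the mergesorted file from the earlier step):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydata', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });

    # The primary key (or an explicit CREATE INDEX) becomes the BTree.
    $dbh->do('CREATE TABLE lookup (k TEXT PRIMARY KEY, v TEXT)');

    my $insert = $dbh->prepare('INSERT INTO lookup (k, v) VALUES (?, ?)');

    open my $in, '<', 'data.sorted' or die "Cannot read data.sorted: $!";
    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $insert->execute($key, $value);
        $dbh->commit unless ++$count % 10_000;   # commit in batches
    }
    close $in;
    $dbh->commit;
    $dbh->disconnect;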
