http://qs321.pair.com?node_id=53961


in reply to Sorting data that don't fit in memory

For sorting more data than will fit in memory, I recommend tying a hash through DB_File to a BTree. That lets you just put items into the hash and then pull them back out in sorted order.

There is some overhead to the structure, so without large file support this will usually only suffice for a little over a GB of data. With large file support and sufficient disk, you can handle many terabytes efficiently.
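
Here is a rough sketch of what that looks like; the file name and the choice of storing each input line as a key are just illustrative assumptions:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk BTree. Iterating the tied hash walks the
    # tree in key order, so the keys come back sorted (string order by
    # default; set $DB_BTREE->{compare} for a different ordering).
    my %sorted;
    tie %sorted, 'DB_File', 'sort.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie sort.db: $!";

    # Load the data. Duplicate lines collapse into one key unless you
    # enable duplicates (R_DUP) on the BTree.
    while (my $line = <STDIN>) {
        chomp $line;
        $sorted{$line} = 1;
    }

    # Pull the items back out in sorted order.
    while (my ($key, $value) = each %sorted) {
        print "$key\n";
    }

    untie %sorted;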


Re: Re (tilly) 1: Sorting data that don't fit in memory
by jeroenes (Priest) on Jan 24, 2001 at 23:11 UTC
    ++ for every answer in this thread. I'm very pleased with your responses. But there can only be one... and this time it's tilly. I've been busy all afternoon reading the replies, reading docs and installing Berkeley-related stuff, but now, after a pretty short time writing, I've run the first few tests. I like BerkeleyDB so much that I've decided to write most of my logic in Perl. Smile!

    The script runs pretty fast for small amounts of data, up to 100k items was no problem. Now I'm waiting for the 1M test as I write. Will keep you posted here.

    Thanks a lot,

    Jeroen
    "We are not alone"(FZ)

Re: Re (tilly) 1: Sorting data that don't fit in memory
by Anonymous Monk on Oct 06, 2007 at 01:42 UTC
    That technique looks very interesting. Where can I find more information about how it works (independent of Perl, just the algorithm)?
      Please note that when I wrote that I was more conversant with theory than practice. Since then I've gained some practical experience. (Including sorting a dataset that was too big to store on the disk in uncompressed form!)

      The suggestion will work. It is just the wrong solution to use with a large dataset.

      Let me explain the problem. Suppose we have, say, 100 million lines to sort. Lines average 70 bytes each. We're doing this on a hard drive which turns at 6000 rpm and which can sustain 100 MB/second. How long does the btree solution take?

      Well, what is the disk seek time? 6000 rpm means 100 revolutions per second. When you choose to seek, the disk could be anywhere, so on average you wait half a revolution. That makes the average seek about 1/200th of a second. We have (ballpark) 100 million writes to make, so 100 million seeks take about 500,000 seconds, which is about 5.8 days. (I've left out a number of complicating factors, but it will still take a matter of days.)
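
      If you want to check the arithmetic, here it is spelled out (all of the numbers are the assumed figures above, not measurements):

          use strict;
          use warnings;

          my $rpm        = 6000;
          my $revs_per_s = $rpm / 60;            # 100 revolutions/second
          my $avg_seek_s = 0.5 / $revs_per_s;    # half a revolution: 0.005 s
          my $writes     = 100_000_000;          # roughly one seek per insert

          my $total_s = $writes * $avg_seek_s;   # 500,000 seconds
          printf "%.0f seconds is about %.1f days\n", $total_s, $total_s / 86_400;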

      Now let's consider sorting on disk using a merge sort. Merge sort is the process of repeatedly taking sorted lists and merging them into longer sorted lists. If you're clever about it (any good merge sort implementation will be), you can arrange to always be merging lists of about the same size, and you can arrange to not have very many lists lying around at once.
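
      Here is a minimal sketch of that merge step, assuming newline-terminated records compared as strings:

          use strict;
          use warnings;

          # Merge two already-sorted files into one sorted output file,
          # reading each input sequentially so the disk never seeks around.
          sub merge_sorted_files {
              my ($file_a, $file_b, $out_file) = @_;
              open my $in_a, '<', $file_a   or die "open $file_a: $!";
              open my $in_b, '<', $file_b   or die "open $file_b: $!";
              open my $out,  '>', $out_file or die "open $out_file: $!";

              my $line_a = <$in_a>;
              my $line_b = <$in_b>;
              while (defined $line_a and defined $line_b) {
                  if ($line_a le $line_b) {
                      print $out $line_a;
                      $line_a = <$in_a>;
                  }
                  else {
                      print $out $line_b;
                      $line_b = <$in_b>;
                  }
              }
              # Copy whatever is left in the file that has not run dry.
              print $out $line_a, <$in_a> if defined $line_a;
              print $out $line_b, <$in_b> if defined $line_b;
              close $_ for $in_a, $in_b, $out;
          }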

      Our dataset is 7 GB (100 million lines times 70 bytes per line). With 100 million elements, a merge sort will finish in 27 passes. (There is a power-of-2 pattern: each pass doubles the length of the sorted runs, and 2**27 is just over 100 million.) On each pass you have to read and write the data. So that is 7 GB * 27 * 2 = 378 GB. At 100 MB/s that takes 3780 seconds, or about an hour. (Again I've massively oversimplified, but that is about right.)
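
      The same estimate, spelled out with the assumed figures:

          use strict;
          use warnings;
          use POSIX qw(ceil);

          my $lines      = 100_000_000;
          my $bytes_line = 70;
          my $mb_per_s   = 100;

          my $dataset_gb = $lines * $bytes_line / 1e9;   # about 7 GB
          my $passes     = ceil(log($lines) / log(2));   # 27: runs double each pass
          my $io_gb      = $dataset_gb * $passes * 2;    # read + write each pass: ~378 GB
          my $seconds    = $io_gb * 1000 / $mb_per_s;    # ~3780 s

          printf "%d passes, %.0f GB of I/O, about %.1f hours\n",
              $passes, $io_gb, $seconds / 3600;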

      Sure, both solutions finish, but which solution would you prefer to run?

      This is not to say that btrees don't have their place. They do. Particularly if you need to keep a changing dataset in sorted order. But if you want to sort a large dataset one time, they are not optimal. And, in fact, if you want to put a large dataset into a btree, you should first sort it with a mergesort then put it in the btree.
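
      For example, something along these lines (the file names are made up, and sort(1) stands in for any external merge sort):

          use strict;
          use warnings;
          use DB_File;
          use Fcntl;

          # Let the system sort, an external merge sort, do the heavy lifting.
          # (Run it under LC_ALL=C if you need its ordering to match the
          # BTree's default byte-wise comparison.)
          system('sort', '-o', 'data.sorted', 'data.txt') == 0
              or die "sort failed: $?";

          my %tree;
          tie %tree, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
              or die "Cannot tie data.db: $!";

          # Inserting keys in order keeps the BTree build nicely sequential.
          open my $in, '<', 'data.sorted' or die "open data.sorted: $!";
          while (my $line = <$in>) {
              chomp $line;
              $tree{$line} = 1;
          }
          close $in;
          untie %tree;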

      The underlying data structure is a B-tree.