in reply to Re: Re (tilly) 1: Sorting data that don't fit in memory
in thread Sorting data that don't fit in memory

Please note that when I wrote that I was more conversant with theory than practice. Since then I've gained some practical experience. (Including sorting a dataset that was too big to store on the disk in uncompressed form!)

The suggestion will work. It is just the wrong solution to use with a large dataset.

Let me explain the problem. Suppose we have, say, 100 million lines to sort. Lines average 70 bytes each. We're doing this on a hard drive which turns at 6000 rpm and which can sustain 100 MB/second. How long does the btree solution take?

Well, what is disk seek time? 6000 rpm means 100 revolutions per second. When you choose to seek, the platter could be anywhere in its rotation, so on average you wait half a revolution. So average seek time is 1/200th of a second. (Strictly speaking that is rotational latency; moving the head only adds to it.) We have (ballpark) 100 million writes to make, and if each one costs a seek, 100 million seeks will take us about 500,000 seconds, which is about 5.8 days. (I've left out a number of complicating factors, but it will still take a matter of days.)
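For anyone who wants to check the arithmetic, here is a rough Python sketch of the estimate above. It assumes exactly one seek per inserted line and nothing else (no caching, no head travel), and the function and constant names are made up for this illustration.

```python
# Back-of-the-envelope estimate for the btree approach.
# Assumption: every one of the 100 million writes costs one average seek.
RPM = 6000
LINES = 100_000_000


def btree_seek_estimate(rpm: int, writes: int) -> float:
    """Return the estimated total seconds spent waiting on the disk."""
    revolutions_per_second = rpm / 60
    # On average the platter has to turn half a revolution before
    # the spot we want comes around.
    average_wait = 0.5 / revolutions_per_second
    return writes * average_wait


seconds = btree_seek_estimate(RPM, LINES)
days = seconds / 86_400  # seconds in a day
print(f"{seconds:,.0f} seconds, about {days:.1f} days")
```

Plug in a faster drive if you like; the answer stays in the range of days because it is driven by the number of seeks, not by the amount of data.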

Now let's consider sorting on disk using a merge sort. Merge sort is the process of repeatedly taking sorted lists and merging them into longer sorted lists. If you're clever about it (any good merge sort implementation will be), you can arrange to always be merging lists of about the same size, and you can arrange to not have very many lists lying around at once.
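To make that concrete, here is a minimal sketch of an on-disk merge sort in Python. It is only an illustration under some assumptions: every line in the input ends with a newline, lines are compared as plain strings, and `chunk_lines` lines fit comfortably in memory. The names `write_run`, `merge_two` and `external_sort` are invented for this example. It starts from sorted chunks rather than single lines, then merges neighbouring runs pairwise, so the runs stay about the same size and their number halves on every pass.

```python
import heapq
import os
import shutil
import tempfile
from itertools import islice


def write_run(lines, tmpdir):
    """Write an already sorted list of lines to a new temp file; return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    return path


def merge_two(path_a, path_b, tmpdir):
    """Merge two sorted run files into one longer sorted run.

    Both inputs are read sequentially, line by line, so memory use stays
    small and the disk never has to seek around inside a file.
    """
    fd, out_path = tempfile.mkstemp(dir=tmpdir, text=True)
    with open(path_a) as a, open(path_b) as b, os.fdopen(fd, "w") as out:
        out.writelines(heapq.merge(a, b))
    os.remove(path_a)
    os.remove(path_b)
    return out_path


def external_sort(in_path, out_path, chunk_lines=100_000):
    """Sort the lines of in_path into out_path using temporary run files."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Initial runs: sort chunks that fit in memory and write them out.
        runs = []
        with open(in_path) as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                runs.append(write_run(chunk, tmpdir))

        # Each pass merges neighbouring runs pairwise, halving the run count.
        while len(runs) > 1:
            merged = []
            for i in range(0, len(runs) - 1, 2):
                merged.append(merge_two(runs[i], runs[i + 1], tmpdir))
            if len(runs) % 2:
                merged.append(runs[-1])  # odd run out waits for the next pass
            runs = merged

        # Copy the single remaining run (if any) to the destination.
        with open(out_path, "w") as out:
            if runs:
                with open(runs[0]) as final:
                    shutil.copyfileobj(final, out)
```

Calling `external_sort("big.txt", "big.sorted.txt")` (hypothetical file names) would do the whole job. A production sort would merge many runs at once and tune its buffer sizes, but the access pattern is the point: everything is read and written sequentially.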

Our dataset is 7 GB (100 million lines times 70 bytes per line). With 100 million elements, a merge sort will finish in 27 passes. (There is a power of 2 pattern since with every pass the length of the lists doubles, and 2^27, about 134 million, is the first power of 2 above 100 million.) On each pass you have to read and write the data. So that is 7 GB * 27 * 2 = 378 GB. At 100 MB/s that takes 3780 seconds, or about an hour. (Again I've massively oversimplified, but that is about right.)
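The same estimate as a small Python sketch, under the same simplifications: one full read and one full write of the data per pass, sustained sequential throughput, and decimal units (1 GB = 10^9 bytes). The names are made up for the illustration.

```python
import math

LINES = 100_000_000
BYTES_PER_LINE = 70
THROUGHPUT = 100_000_000  # bytes per second, i.e. 100 MB/s


def merge_sort_estimate(lines, bytes_per_line, throughput):
    """Return (passes, bytes_moved, seconds) for a two-way merge sort."""
    dataset_bytes = lines * bytes_per_line
    # Run length doubles every pass, so we need ceil(log2(lines)) passes.
    passes = math.ceil(math.log2(lines))
    # Every pass reads the whole dataset once and writes it once.
    bytes_moved = dataset_bytes * passes * 2
    return passes, bytes_moved, bytes_moved / throughput


passes, bytes_moved, seconds = merge_sort_estimate(LINES, BYTES_PER_LINE, THROUGHPUT)
print(f"{passes} passes, {bytes_moved / 1e9:.0f} GB moved, "
      f"{seconds:.0f} seconds ({seconds / 60:.0f} minutes)")
```

Starting from sorted in-memory chunks instead of single lines cuts the number of passes considerably, so in practice the real figure should come in under this one.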

Sure, both solutions finish, but which solution would you prefer to run?

This is not to say that btrees don't have their place. They do, particularly if you need to keep a changing dataset in sorted order. But if you want to sort a large dataset one time, they are not optimal. And, in fact, if you want to put a large dataset into a btree, you should first sort it with a merge sort, then load the sorted data into the btree.