++ for every answer in this thread. I'm very pleased
with your responses. But there can only be one....
and this time it's tilly. I've been busy all afternoon
reading the replies, reading docs and installing Berkeley-related
stuff, but now I've run the first few tests after a pretty
short time writing. I like BerkeleyDB so much
that I've decided to write most of my logic in Perl. Smile!
The script runs pretty fast for small amounts of data,
up to 100k items was no problem. Now I'm waiting for the
1M test as I write. Will keep you posted here.
Thanks a lot,
Jeroen
"We are not alone"(FZ)
That technique looks very interesting. Where can I find more information about how it works (independent of Perl, just the algorithm)?
Please note that when I wrote that I was more conversant with theory than practice. Since then I've gained some practical experience. (Including sorting a dataset that was too big to store on the disk in uncompressed form!)
The suggestion will work. It is just the wrong solution to use with a large dataset.
Let me explain the problem. Suppose we have, say, 100 million lines to sort. Lines average 70 bytes each. We're doing this on a hard drive which turns at 6000 rpm and which can sustain 100 MB/second. How long does the btree solution take?
Well, what is disk seek time? 6000 rpm means 100 revolutions per second. When you choose to seek, the disk could be anywhere, so on average you wait half a revolution. So average seek time is 1/200th of a second. We have (ballpark) 100 million writes to make, so 100 million seeks will take us about 500,000 seconds, which is about 5.8 days. (I've left out a number of complicating factors, but it will still take a matter of days.)
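For anyone who wants to check the arithmetic, here is a rough sketch of the same estimate in Python. It only restates the figures above; the variable names are mine, and it keeps the simplification of one half-revolution wait per write.

```python
# Back-of-envelope estimate for the btree approach, using the figures above.
RPM = 6000
LINES = 100_000_000  # one seek per line written (simplifying assumption)

revolutions_per_second = RPM / 60                 # 100 revolutions per second
avg_seek_seconds = 0.5 / revolutions_per_second   # half a revolution on average
total_seconds = LINES * avg_seek_seconds
total_days = total_seconds / 86_400               # seconds in a day

print(f"{total_seconds:,.0f} seconds, about {total_days:.1f} days")
```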
Now let's consider sorting on disk using a merge sort. Merge sort is the process of repeatedly taking sorted lists and merging them into longer sorted lists. If you're clever about it (any good merge sort implementation will be), you can arrange to always be merging lists of about the same size, and you can arrange to not have very many lists lying around at once.
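Here is a minimal sketch of that idea in Python rather than Perl, since the question was about the algorithm itself. It is a plain bottom-up merge sort, not a tuned implementation: in-memory lists stand in for the sorted files you would have on disk, and `merge_passes` is just a name I made up. Each pass reads every run front to back and writes longer runs, which is what makes the on-disk version sequential.

```python
from heapq import merge  # merges already-sorted iterables


def merge_passes(items):
    """Bottom-up merge sort; returns (sorted_list, number_of_passes).

    Each element starts as its own sorted run. Every pass merges
    neighbouring runs pairwise, so the run length doubles per pass.
    """
    runs = [[x] for x in items]
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]              # one or two neighbouring runs
            merged.append(list(merge(*pair)))
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


data = [5, 3, 8, 1, 9, 2, 7]
result, passes = merge_passes(data)
print(result, passes)
```

A real on-disk sort would normally start from runs that are as big as memory allows, sorted in RAM, instead of single elements, which cuts the number of passes considerably.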
Our dataset is 7 GB (100 million lines times 70 bytes per line). With 100 million elements, a merge sort will finish in 27 passes. (There is a power of 2 pattern, since with every pass the length of the lists doubles.) On each pass you have to read and write the data. So that is 7 GB * 27 * 2 = 378 GB. At 100 MB/s that takes 3780 seconds, or about an hour. (Again I've massively oversimplified, but that is about right.)
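The same numbers as a quick Python check, assuming 1 MB = 10^6 bytes and runs that start at a single line (the variable names are mine):

```python
import math

LINES = 100_000_000
BYTES_PER_LINE = 70
THROUGHPUT = 100 * 10**6   # 100 MB/s, taking 1 MB = 10^6 bytes

dataset_bytes = LINES * BYTES_PER_LINE        # 7 GB
passes = math.ceil(math.log2(LINES))          # run length doubles each pass
bytes_moved = dataset_bytes * passes * 2      # read and write on every pass
seconds = bytes_moved / THROUGHPUT

print(passes, bytes_moved // 10**9, seconds)
```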
Sure, both solutions finish, but which solution would you prefer to run?
This is not to say that btrees don't have their place. They do. Particularly if you need to keep a changing dataset in sorted order. But if you want to sort a large dataset one time, they are not optimal. And, in fact, if you want to put a large dataset into a btree, you should first sort it with a merge sort, then put it in the btree.
The underlying data structure is a B-tree.