in reply to Re: mathematical proof

in thread mathematical proof

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of *them* were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is

thanks to the well-known geometric series.n + n/2 + n/4 + ... = (1 + 1/2 + 1/4 + ...) * n = 2n = O(n)

With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.

The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.

**But** (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.

**But** (very small but) hashes still scale better than arrays. So for *very* big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do *that*, you will find that merge sort parallelizes better than hash lookups so arrays still win.

Thereby demonstrating that the theoretical big-O is not always the end of the story.

A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)

**Update**: Lots of minor edits to improve wording.