donkeykong has asked for the wisdom of the Perl Monks concerning the following question:
|Replies are listed 'Best First'.|
Re: mathematical proof
by ikegami (Patriarch) on Feb 03, 2009 at 01:53 UTC
Most hash insertions are O(1). When the hash has to be expanded, the insertion becomes O(N). Since Perl grows hashes exponentially, inserting N elements is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). )
Most array appends are O(1). When the array has to be expanded, the append becomes O(N). Since Perl grows arrays exponentially, inserting N elements (one at a time) is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). ) Inserting all N elements at once is just O(N), though. Once the array is loaded, you need to sort it, for a total of O(N log N).
So you end up with O(N log N) either way. The question that remains is what's factor, and how close to the bound (N log N) is the actual performance for your actual data.
The percentage of duplicates is going to matter. The more duplicates, the better the hash approach will be.
For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of them were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is
thanks to the well-known geometric series.
With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.
The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.
But (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.
But (very small but) hashes still scale better than arrays. So for very big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do that, you will find that merge sort parallelizes better than hash lookups so arrays still win.
Thereby demonstrating that the theoretical big-O is not always the end of the story.
A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)
Update: Lots of minor edits to improve wording.
For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice.That assumes a hash split only after Ω(n) new inserts. But then in the worst case scenario where all elements hash to the same bucket, searching for a key has to walk a list, making the search linear instead of O(1).
I know hashes have changed in the past so that the linked list hanging of each bucket cannot grow unbounded. But that means that worst case, you either have hash splits in o(n) (lowercase o), or search times that aren't O(1).
Any "mathematical proof" will have to look at the C code and consider how often hash splits occur worst case, and how long the linked lists can be.
Ah yes, I see.
While there are other data structures available to him (such as disk-based structures and tries), I chose to only speak about the ones he mentioned (hashes and arrays) due to time constraints.
You'll reduce the size of large passes you never have to do with a hash.
You'll speed up the sort you don't have to do with a hash.
With a hash, you never have to deal with more than $num_duplicates items. With an array, you'll deal with at least $num_duplicates items. I don't understand how come you say the array benefits more.
Re: mathematical proof
by gone2015 (Deacon) on Feb 03, 2009 at 14:43 UTC
So it's understood that:
So hash: ~ n + 2m; array: ~ 2n + n log n.
Not all work is the same, however. Scanning the array certainly requires ~ n comparisions, but this is in a Perl loop -- which may dominate the running time. Further, this does not account for the cost of reorganising the array to remove duplicates, if that's what you want to do...
However, that may not be the complete story. What do you intend to do with the information ?
Suppose you want to print out the unique keys in key order:
so the totals now are hash: ~ n + 4m + m log m; and the array ~ 3n + n log n + m. So, where there are a lot of duplicates the hash approach is doing less work; where there are few duplicates there's apparently not a lot in it.
Suppose you want to print out the unique keys in original order (duplicates sorted by first instance):
in this case the hash is a clear winner, not only is there no n log n component, but the work in the array sorting and scanning has clearly increased.
So what's the point of all this ? Well:
Or, in short, before embarking on big-O, you do have to consider at least the major components of the whole problem, and beware counting apples and oranges.
FWIW, the pseudo-code for keys in original file order is:... and with Perl, if it's less code it's generally faster !
Re: mathematical proof
by kyle (Abbot) on Feb 03, 2009 at 13:15 UTC
If you're trying to prove this for every file, I think you're doomed. In the special case where the file's keys are all small integers, the array will work faster than the hash. Operations on the array are generally faster, and its storage will be more compact to boot.
It's a rare special case, to be sure. Still, my point is that the input to some algorithm can have a significant effect on how efficient it is.
Consider this case I was testing one day:
Which is faster? It depends on how much of @big_array gets eliminated by the filter(). If most of the array is filtered out, it's faster to do it with grep. If little of the array is filtered out, it's faster to skip the extra loop (grep).
All that having been said, I'd use a hash for the task that you describe. I just wouldn't try to prove that it will always be the fastest way to do it.
Re: mathematical proof
by jdporter (Chancellor) on Feb 03, 2009 at 15:29 UTC
I came to the conclusion that the hash is O(1) and the array is O(n), but how can I mathematically prove it?
"I'm not good at math, but my theory is conceptually right, so all I need is for someone to express it in terms of equations."