Re: A short meditation about hash search performance

Replies are listed 'Best First'.
Re: Re: A short meditation about hash search performance by pg (Canon) on Nov 16, 2003 at 03:19 UTC
"And a billion differs from 1 only by a constant - so that's O(1)"

You obviously don't understand what O(1) means. Say we have an array of 1 billion elements. Let's look at two different search algorithms:

Search from beginning to end, going through each element one by one, until you hit what you are searching for. In the worst case (the element is at the end of the array), you have to hit 1 billion elements, but according to you, that's O(1). I say it is O(n). We never put a restriction saying that an array can contain at most 1 billion elements (so the size of an array in general is not a constant, although it is a constant for a given array at one given observation point).

Do a binary search. In the worst case, you have to hit log2(1 billion) ~ 30 elements. I call this O(log2(n)); according to you it is also O(1).

As everyone knows, the performance of those two approaches is very different, but according to your theory, they are both O(1)! The math here is way off!

Well... I certainly don't mind if you insist on your idea, but please don't confuse the general public.

What you said would be right if we put a restriction saying that a hash can contain at most 1 billion elements, as O(1 billion) has the same complexity as O(1), even though 1 billion is much bigger than 1. However, O(n) is more complex than O(1 billion); even compared with O(1 billion ** 1 billion), O(n) is still more complex. Why? Because n is a variable, which can grow without limit. 1 billion ** 1 billion is huge, but n is unbounded, and eventually it will pass 1 billion ** 1 billion.

In our context, please remember that the size of a hash is a variable (one that potentially grows without limit), and your analysis has to reflect this fact. Don't confuse it with the size of a given hash at a given time.
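To make the comparison of the two searches concrete, here is a minimal Python sketch that counts how many elements each one inspects in the worst case. The function names are made up for illustration, and a sorted `range` stands in for the array; the sizes are kept smaller than 1 billion so the linear scan finishes quickly.

```python
def linear_search_probes(items, target):
    """Return how many elements a front-to-back scan inspects to find target."""
    probes = 0
    for value in items:
        probes += 1
        if value == target:
            break
    return probes


def binary_search_probes(items, target):
    """Return how many elements a binary search inspects (items must be sorted)."""
    lo, hi = 0, len(items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes


for n in (1_000, 1_000_000):
    data = range(n)   # sorted and indexable, without building a list
    last = n - 1      # the last element is the worst case for the linear scan
    print(n, linear_search_probes(data, last), binary_search_probes(data, last))
```

The linear count grows in step with n, while the binary count grows by only a handful of probes when n grows a thousandfold.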
Re: A short meditation about hash search performance by Abigail-II (Bishop) on Nov 16, 2003 at 23:03 UTC
"You obviously don't understand what O(1) means."

Let's see. The definition of big O is:

`f(n) = O (g (n)) iff there are an M > 0 and a c > 0 such that for all m > M, 0 <= f(m) <= c * g (m). [1] [2] [3]`

I don't have any problem understanding it. In layman's terms, it means that a function `f` of `n` is in the order of `g` of `n` if, and only if, there's a constant such that, once `n` gets large enough, the value of `f` is at most the value of `g` times said constant.

"Search from beginning to end, going through each element one by one, until you hit what you are searching for. In the worst case (the element is at the end of the array), you have to hit 1 billion elements, but according to you, that's O(1). I say it is O(n). We never put a restriction saying that an array can contain at most 1 billion elements (so the size of an array in general is not a constant, although it is a constant for a given array at one given observation point)."

Hello? We never put a restriction on the size? Come again. What do you call:

"And still O(1) is not reachable, unless each element resolve a unique key ;-)"

That's a restriction of 1. You started out by putting restrictions on it, claiming that the search algorithm is O (1) only if the size is restricted to 1. I, on the other hand, pointed out that as long as there is a restriction on the length of the chain, it doesn't matter what the restriction is: 1, 14 (for 5.8.2), or a billion. If there's a restriction on the size, then even with a linear search it's O (1). Here's a proof:

Suppose the chain is limited to length K, where K is a constant, independent of the number of keys in the hash. Searching for a key is a two-step process: first we need to find the bucket the key hashes to, then we need to find the key in the associated chain. Finding the right bucket takes constant time. Traversing the chain takes at most K * e time, for some constant e.
So, searching for the element takes at most:

       e * K + O (1), e >= 0               {definition of O()}
    <= e * K + d * 1, e >= 0, d >= 0       {arithmetic}
    == (e * K + d) * 1, e >= 0, d >= 0     {c == e * K + d}
    == c * 1                               {c > 0}
    == O (1).

q.e.d.

I won't deny the performance will be rather lousy, but it's still O (1). Which proves that big-Oh doesn't say everything.

`[1]` Cormen, Leiserson, and Rivest: Introduction to Algorithms. MIT Press, 1990. p. 26.
`[2]` Knuth: The Art of Computer Programming, Volume 1: Fundamental Algorithms. Third Edition. Addison-Wesley, 1997. p. 107.
`[3]` Sedgewick and Flajolet: Analysis of Algorithms. Addison-Wesley, 1996. p. 4.

Abigail
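The bounded-chain argument can be tried out with a toy table. The following Python sketch is only an illustration under stated assumptions: `BoundedChainTable` and `worst_probes` are hypothetical names, K = 14 is borrowed from the figure quoted in the post, the rule "double the bucket count whenever a chain would exceed K" is an assumption rather than a description of Perl's implementation, and the keys are small integers that spread evenly over the buckets.

```python
class BoundedChainTable:
    """Toy chained hash table whose chains are never longer than max_chain."""

    def __init__(self, max_chain=14):
        self.max_chain = max_chain                 # K in the proof
        self.buckets = [[] for _ in range(8)]

    def _chain_for(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        chain = self._chain_for(key)
        if key not in chain:
            chain.append(key)
        # Assumption: keys hash well enough that doubling shortens the chain.
        # Keys that all share one hash value would make this loop run forever.
        while len(self._chain_for(key)) > self.max_chain:
            self._double()

    def _double(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for k in chain:
                self._chain_for(k).append(k)

    def probes(self, key):
        """Number of chain entries compared while looking up key."""
        chain = self._chain_for(key)
        for count, k in enumerate(chain, start=1):
            if k == key:
                return count
        return len(chain)


def worst_probes(n, max_chain=14):
    """Insert keys 0..n-1 and return the largest probe count of any lookup."""
    table = BoundedChainTable(max_chain)
    for key in range(n):
        table.insert(key)
    return max(table.probes(key) for key in range(n))


for n in (1_000, 50_000):
    print(n, worst_probes(n))
```

Whatever n is, the worst lookup inspects at most K chain entries, which is the e * K term of the proof. The price is paid in `_double`, on the insertion side.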
Re: Re: A short meditation about hash search performance by demerphq (Chancellor) on Nov 17, 2003 at 08:58 UTC
Abigail, I have two minor questions for you.

First off, you speak of finding the correct bucket as occurring in constant time. Given that the time to calculate the bucket value is dependent on the length of the key, I don't quite see how this is correct. Or does this factor disappear because it averages to a constant time in normal use?

I have a similar concern about the doubling of the buckets during insertion. My by now hazy recollection of big O() says that this behaviour is significant and should be included in the O() of hash insertion. Is this wrong? If it's not wrong, how would it be calculated? I haven't the foggiest how you would calculate the effect of a factor that comes into play so rarely. Or is it again that it averages to 0 and so can be left out of the equation?

--- demerphq
First they ignore you, then they laugh at you, then they fight you, then you win. -- Gandhi
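The dependence on key length can be seen in a small Python sketch. `toy_hash` is a made-up multiplicative string hash used purely for illustration; it is not Perl's hash function. It touches every character once, so its cost is proportional to the length of the key and does not depend on how many keys the table holds.

```python
def toy_hash(key, buckets=8):
    """Return (bucket index, number of characters processed) for a string key."""
    h = 0
    steps = 0
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF   # keep the value within 32 bits
        steps += 1
    return h % buckets, steps


for key in ("a", "a" * 10, "a" * 1000):
    bucket, steps = toy_hash(key)
    print(len(key), bucket, steps)
```

If key lengths are bounded by some constant, this step is bounded by a constant as well; otherwise it contributes a term in the key length, separate from the number of keys n.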
Re: A short meditation about hash search performance by Abigail-II (Bishop) on Nov 17, 2003 at 09:34 UTC
Re: Re: Re: A short meditation about hash search performance by Boots111 (Hermit) on Nov 16, 2003 at 20:40 UTC
All~

I am just referring to the two posts immediately above this, but I must point out that pg is correct, despite what the points on either node may say...

The size of a hashtable is a variable (usually n), and the pathological case of inserting everything into the same bucket gives O(n) access for a simple hashtable. The only way in which Abigail would be correct is if there were a guarantee that the overflow chain would NEVER exceed one billion entries.

It is possible that rehashing will prevent overflow chains from growing too large, but then one must consider the cost of rehashing the table. While that cost is not paid every time, it is likely a very large cost, and thus must be amortized across all calls to insert.

In general, one could get O(1) access to a hash by ensuring that the overflow chains reach at most a constant length, but this will require rehashing when chains get too long. This would cause hash insertions to be greater than O(1). At heart it is a question of trading one cost for another...

Boots
---
Computer science is merely the post-Turing decline of formal systems theory. --???
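One way to see how a rarely paid rehashing cost is spread over all insertions is to count the work directly. The Python sketch below assumes a simpler trigger than the chain-length rule discussed above: the table doubles whenever the number of stored keys exceeds the number of buckets, and every stored key is re-placed at that moment. The function name and the starting size of 8 are hypothetical.

```python
def total_rehash_moves(n, initial_buckets=8):
    """Count how many keys are re-placed by doublings during n insertions."""
    buckets = initial_buckets
    moves = 0
    for stored in range(1, n + 1):    # stored = number of keys after this insert
        if stored > buckets:          # over-full: double and re-place every key
            moves += stored
            buckets *= 2
    return moves


for n in (10, 1_000, 1_000_000):
    moves = total_rehash_moves(n)
    print(n, moves, moves / n)        # moves per insertion stays below 3
```

Each doubling moves about as many keys as all earlier doublings combined, so the total is a geometric sum that stays below a small constant times n. Under this assumption a single insertion can cost O(n), but the cost averaged over all n insertions is bounded by a constant. A table that rehashes on chain length, with poorly distributed keys, need not behave this way.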


	PerlMonks