Re^3: find closest element of array without going over

by Tanktalus (Canon)
on Jun 26, 2008 at 19:32 UTC


in reply to Re^2: find closest element of array without going over
in thread find closest element of array without going over

It's all a matter of perspective. In the OP's query, he had 18 items in his list. Taking your code, changing $search_for to $limit/2 to better approximate actual random values, and using an input of 17 (since you seem to add one), I get:

$ perl ./x.pl 17
Done generating list
9 9
           Rate brutal binary
brutal 449367/s     --   -17%
binary 544651/s    21%     --
So the difference is ~20%. But what are we really talking about here? On my system, that difference means dropping from the brute-force search's 2.23 µs per call to the binary search's 1.84 µs. A savings of 0.39 µs. Seriously, we're quibbling here? (Binary search should do REALLY well here, too, finding the item almost immediately, I think.)
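
For reference, here's a minimal sketch of the kind of comparison under discussion. This is not moritz's actual x.pl: the list generation, the half-way $search_for, and the brutal/binary sub names are assumptions reconstructed from the output above.

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a sorted list of ($ARGV[0] + 1) random values.
my $limit = shift || 17;
my @a = sort { $a <=> $b } map { int rand 1_000_000 } 0 .. $limit;
print "Done generating list\n";

# Target roughly half-way into the list.
my $search_for = $a[ int( $limit / 2 ) ];

# Brute force: scan until the next element would go over.
sub brutal {
    my ($target) = @_;
    my $best;
    for my $x (@a) {
        last if $x > $target;
        $best = $x;
    }
    return $best;
}

# Binary search for the largest element <= $target.
sub binary {
    my ($target) = @_;
    my ( $lo, $hi ) = ( 0, $#a );
    my $best;
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $a[$mid] <= $target ) {
            $best = $a[$mid];
            $lo   = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return $best;
}

print brutal($search_for), " ", binary($search_for), "\n";   # sanity check: both should agree
cmpthese( -2, {
    brutal => sub { brutal($search_for) },
    binary => sub { binary($search_for) },
} );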

In your other cases, I see the 100-item run (101, really, but again, let's not quibble about small details) dropping from 5.4 µs to 4.5 µs, a savings of 0.88 µs. At 1000, it's 48.6 µs vs 6.41 µs, or a 42.2 µs savings. Much bigger numbers. But, seriously, is it a concern?

At 10,000, it's 461 µs vs 7.75 µs, or a savings of 454 µs. We're still under a thousandth of a second here, folks. Even at 100,000 items in the (pre-)sorted list, we're comparing 5.92 ms vs 10.9 µs. Sure, that's a savings of nearly the whole thing, but that's still only saving 5.91 ms. Really, do we care?
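
For anyone reproducing the arithmetic, the per-call times quoted here and above are just the reciprocals of the Benchmark rates; a quick check in Perl, using the rates from the 17-item run:

printf "%.2f us\n", 1e6 / 449_367;                    # brute force: ~2.23 us per call
printf "%.2f us\n", 1e6 / 544_651;                    # binary search: ~1.84 us per call
printf "%.2f us\n", 1e6 / 449_367 - 1e6 / 544_651;    # savings: ~0.39 us per call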

Now, granted, you ran these tests skewed (looking for something 33% in instead of half-way, otherwise binary search wouldn't even have to do anything), but the reality is (and I think starbolin's point is) that doing this search really doesn't take much time. Worrying about this before you've profiled anything is simply premature optimisation, and you'd get bigger bang for your buck (that being programmer's time) if you'd spend it on something else. The reward for all the time spent coding, testing, and fixing a binary search just does not pay for itself, at least not if you're merely running it once, for a small number (i.e., less than a million, it seems).

Changing the target to my $search_for = $a[int($limit * (1/2 - 1/9))]; and rerunning at 1_000_000 entries shows a savings of about 66.7 ms on my machine. It's simply not worth it. Running it one more time at 10 million, the savings is about 775 ms (brute force's 775 ms minus binary search's 14.1 µs). Now it's starting to be worth it. If it's being run interactively, that is. If it's a long-running program that users expect to walk away from and come back to, then even that's no big deal.

Of course, this may all be moot if it's code for perlmonks or slashdot or some other active site, but you still should profile before worrying about such small improvements. Chances are, your bottlenecks still are elsewhere.

Update: I should point out that I realise that benchmarking a static index for an O(n) vs O(lg n) comparison is inherently unfair, and an equal distribution across the entire problem space would need to be concocted for truly accurate numbers. However, even if that resulted in numbers two or three times what I provide above (which would not be the case, since the upper limit is on the brute-force method, and that would not increase by much), it would not change the conclusion. 18 ms is not much more time than 6 ms (and the true number would probably still be under 8 ms at 100,000 on this machine).
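
A fairer harness along the lines of that update would pick a fresh target inside the timed code, spreading the searches over the whole problem space. A minimal sketch, reusing the hypothetical brutal/binary subs from the snippet above:

# Equal distribution across the problem space: choose a random target
# on every call instead of benchmarking one fixed index. The rand
# overhead is identical for both entries, so the comparison stays fair.
cmpthese( -2, {
    brutal => sub { brutal( $a[ int rand @a ] ) },
    binary => sub { binary( $a[ int rand @a ] ) },
} );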

Replies are listed 'Best First'.
Re^4: find closest element of array without going over
by moritz (Cardinal) on Jun 26, 2008 at 20:36 UTC
    If sadmanface hadn't worried about speed, he wouldn't have asked that question, I assume.

    Since I couldn't give an algorithm that is faster in the case of small numbers, I gave one that is very much faster for big numbers. He can choose whatever is best for his application.

    As to the difference in absolute times, I can just say that it scales with the number of times the algorithm is run. Again, we can just try to credit sadfaceman with some intelligence and assume he wouldn't ask his question about this particular piece of code if it wasn't of importance. After all, it cost him some time to ask as well.

    Yes, it's a good idea to benchmark, but I still don't think that's a reason to recommend an algorithm that scales badly in the general case, unless you state its limits. And I don't think it's bad to recommend an algorithm that scales well in the general case, even if it might be a bit slower for small data structures.

      If sadmanface hadn't worried about speed, he wouldn't have asked that question, I assume.

      You must be new here. (j/k)

      Seriously, just look for all the conversations we've had on PM about "premature optimisation" (or, for the 'mericans, "premature optimization"). As in, one of the roots of all evil. It's actually fairly common for new monks to ask about speeding things up as if they had to optimise every single cycle, and then to focus on an area of no importance. For example, we don't know if sadmanface's program runs on a heavily-loaded web server dealing with data from MySQL that is likely in a cache and thus pre-loaded into memory, or if it's reading its data from a series of files across NFS and/or Samba, where even at 10,000,000 items the I/O would so completely overwhelm the timing as to make any other modification completely moot (chopping seconds, ENTIRE WHOLE SECONDS, off a 95-minute execution).

      So, you're right, we don't know sadmanface's predicament. But experience shows that assuming that someone who is worried about speed is actually correct to worry about it is a mighty optimistic assumption. This is not an intelligence test here - it's an experience one. Before I joined PM, I probably would have asked similar questions about micro-optimisation. Having been exposed to the Wirebrush of Enlightenment on this issue, though, I now try to focus on bigger picture items, and only worry about small items when it becomes apparent that the small item in question is being magnified by being called too often (where I usually just go for caching instead, though probably not in this type of case), such as in a tight loop. I don't think my intelligence has grown over that time (heck, I have a kid now, so, if anything, it has dropped), but my experience has.

      As to an algorithm that scales bad in the general case: again, if you're never going to need to scale, the time you spend on developing the advanced algorithm is wasted. If you can save yourself a half-day by developing the slow, poorly-scaled algorithm which you can then use to implement some other feature your user(s) actually cares about, you'll be far ahead. And that may also be something that sadfaceman has overlooked.

      Now, all that said, sadfaceman may actually have thought about all these issues and done the due diligence to prove (or at least provide sufficiently strong evidence toward) this as his bottleneck. In which case, he has received ample helpful advice. My experience on PM doesn't preclude this possibility, but the statistics aren't in favour of this possibility.
