PerlMonks  

Re^4: performance issues with grepping large sorted arrays

by gam3 (Curate)
on Jul 24, 2007 at 14:25 UTC


in reply to Re^3: performance issues with grepping large sorted arrays
in thread performance issues with grepping large sorted arrays

The point is: grep is O(n), the short-circuited for loop is O(n/2), and O(n) == O(n/2).
-- gam3
A picture is worth a thousand words, but takes 200K.
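The difference gam3 is pointing at can be made concrete by counting comparisons. A minimal sketch (in Python for illustration, not the thread's Perl; function names are hypothetical): a grep-style full scan always does n comparisons, while a loop that exits on the first match averages about n/2 for targets spread uniformly through the list. Same O(n), different constant factor.

```python
def full_scan(items, target):
    """grep-style: examines every element, even after a match."""
    comparisons = 0
    found = False
    for x in items:
        comparisons += 1
        if x == target:
            found = True  # no early exit
    return found, comparisons

def short_circuit(items, target):
    """for-loop with an early exit (Perl's 'last')."""
    comparisons = 0
    for x in items:
        comparisons += 1
        if x == target:
            return True, comparisons
    return False, comparisons

items = list(range(100))
_, full = full_scan(items, 25)       # always 100 comparisons
_, short = short_circuit(items, 25)  # 26 comparisons here
```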

Replies are listed 'Best First'.
Re^5: performance issues with grepping large sorted arrays
by mr_mischief (Monsignor) on Jul 24, 2007 at 16:43 UTC
    I'd like to note that gam3 is describing the average-case behavior of the short-circuited linear search here, not the worst case.

    O(n/2) is as good as you can expect short-circuited linear search to get over time. Absolute best case is, of course, O(1) if you happen to find what you want at the beginning of the list. Worst case is O(n). All this simplifies to terms of O(n).

    Binary search is much more efficient (save for very small lists, where the setup overhead dominates). It is O(log2(n)). For 800k elements, you're talking about linear search being 800k comparisons, short-circuited linear about 400k on average (worst case 800k), and binary search being about 20 comparisons. There's not much to argue here.
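The 20-comparison figure follows from halving the search range each step; a hand-rolled binary search that counts its comparisons shows it (a Python sketch, not code from the thread):

```python
def binary_search(sorted_items, target):
    """Classic binary search over a sorted list, counting comparisons."""
    lo, hi, comparisons = 0, len(sorted_items), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_items[mid] < target:
            lo = mid + 1      # target is in the upper half
        else:
            hi = mid          # target is in the lower half (or at mid)
    found = lo < len(sorted_items) and sorted_items[lo] == target
    return found, comparisons

items = list(range(800_000))
found, comps = binary_search(items, 123_456)
# comps is at most ceil(log2(800_000)) = 20
```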

    In this particular thread, the sort-then-1000-searches approach is still going to be blown away by inverting the loop and not doing any searches at all. Such is the power of the hash lookup.
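The inverted loop looks something like this (a Python sketch with made-up data; a set stands in for the Perl hash): build a hash of the 1000 wanted values, then make a single pass over the 800k-element array, where each membership test is an O(1) lookup rather than a search.

```python
# Hypothetical data standing in for the thread's arrays:
wanted = {i * 800 for i in range(1000)}   # the 1000 search terms, as a set
big_array = list(range(800_000))          # the large array

# One pass over the big array: 800,000 constant-time lookups, no searching.
hits = [x for x in big_array if x in wanted]
```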
Re^5: performance issues with grepping large sorted arrays
by almut (Canon) on Jul 24, 2007 at 17:35 UTC
    The point is: ...

    If you think that this is The point, could you elaborate on its practical relevance with respect to the statement I originally (implicitly) made? That statement is just that skipping unnecessary iterations/comparisons is faster (in the typical case) than doing full iterations every time (like grep does). This holds even though the upper bound expressed in both algorithms' Big O classification is the same. I never made any statement as to the method being O(n) or not. Also, no doubt a binary search on a sorted list would be faster...

      Indeed, if there's no other reason to have the array sorted than to do a binary search instead of a short-circuited linear search, there may not be much overall difference between the n/2 vs. the n * log2(n) + log2(n).

      In this case, we're doing m searches over n elements, so it's O(m*n/2) vs. O( n * log2(n) + m * log2(n) ): (1000 * 400k) for short-circuited linear with no sort, or (800k * 20 + 1000 * 20) for sort and binary search. That's 400,000,000 vs. 16,020,000 typical, although the worst case could be 800,000,000 vs. 640,000,020,000 (quicksort's O(n^2) worst case).
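The arithmetic is easy to check directly (Python, with log2(800k) rounded up to 20 as in the thread):

```python
n, m = 800_000, 1_000    # array size and number of searches
log2n = 20               # ceil(log2(800_000))

typical_linear = m * (n // 2)            # 1000 * 400k  = 400,000,000
typical_sorted = n * log2n + m * log2n   # 16,000,000 + 20,000 = 16,020,000

worst_linear = m * n                     # every search scans the whole array
worst_sorted = n * n + m * log2n         # quicksort worst case dominates
```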

      So between the two, there's better typical time on the quicksort and binary search, but more time stability on the short-circuited linear search with no sort. Pathological cases would hopefully be rare.

      All of which is still, in this case, much ado about nothing. There should be 800,000 hash lookups for comparison and no searching at all. The method that does 0.002 times as many comparisons as one alternative, and 0.05 times as many as the other, is the clear winner.
