Re^2: performance issues with grepping large sorted arrays

by Trizor (Pilgrim)
on Jul 23, 2007 at 23:34 UTC


in reply to Re: performance issues with grepping large sorted arrays
in thread performance issues with grepping large sorted arrays

Arrays don't require a linear search; see my post about binary search. It does require a sorted array, but if you've already got one...
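
For concreteness, a minimal sketch of such a binary search in Perl (illustrative names; this assumes string keys sorted with the default sort, so swap lt/gt for numeric comparisons on numeric data):

    # Binary search over a sorted array.
    # Returns the index of $target, or -1 if it is not present.
    sub binary_search {
        my ($aref, $target) = @_;
        my ($lo, $hi) = (0, $#$aref);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            if    ($aref->[$mid] lt $target) { $lo = $mid + 1 }
            elsif ($aref->[$mid] gt $target) { $hi = $mid - 1 }
            else                             { return $mid }
        }
        return -1;    # not found
    }

    my $idx = binary_search(\@sorted, $key);    # ~lg(n) comparisons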

In terms of algorithmic complexity, short-circuiting has no impact on efficiency: the worst case is still O(n), and the average case is O(n/2), which simplifies to O(n), which is still greater than O(lg n).
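
To make the contrast concrete (stand-in data; the names are illustrative): grep always tests every element, whereas List::Util's first, or a plain loop with last, stops at the first hit. Both are O(n), but the short-circuiting version does roughly half the tests on an average successful lookup:

    use strict;
    use warnings;
    use List::Util qw(first);

    my @array = qw(apple banana cherry);    # stand-in data
    my $key   = 'banana';

    # grep keeps scanning after a match: always n tests.
    my $hit_grep = grep { $_ eq $key } @array;

    # first() stops at the first match: ~n/2 tests on an average hit.
    my $hit_first = defined( first { $_ eq $key } @array );

    # The same short circuit with an explicit loop:
    my $hit = 0;
    for my $elem (@array) {
        if ($elem eq $key) { $hit = 1; last }
    }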


Re^3: performance issues with grepping large sorted arrays
by almut (Canon) on Jul 24, 2007 at 00:16 UTC

    I absolutely agree that a binary search would be faster if you have a sorted array (though sorting the array in the first place would also take quite some time, and I was under the impression that the OP was only doing that in an attempt to speed things up).

    As to "the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n)": I think in most real-life situations you're very unlikely to encounter the worst-case scenario (in the OP's case, it would mean that every $key being looked up would match the last element of the array, or would not be found at all). Sure, ultimately it's a statistical issue, but in the typical case there will be some improvement when short-circuiting the search. That's a moot point anyway, as there are even better alternatives... (like hash tables, with constant-time lookup on average).
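
    A sketch of that hash-table alternative (illustrative names; the hash slice builds the index in a single pass, after which each lookup is O(1) on average):

        # Build the lookup index once: O(n).
        my %index;
        @index{@array} = ();    # hash slice; only the keys matter

        # Each subsequent lookup is constant time on average.
        print "found\n" if exists $index{$key};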

      The point is: grep is O(n), the short-circuit for loop is O(n/2), and O(n) == O(n/2).
      -- gam3
      A picture is worth a thousand words, but takes 200K.
        I'd like to note that gam3 is describing the average-case behavior of the short-circuited linear search here, not the worst case.

        O(n/2) is as good as you can expect a short-circuited linear search to get over time. The absolute best case is, of course, O(1), if you happen to find what you want at the beginning of the list. The worst case is O(n). All of this simplifies to O(n).

        Binary search is much more efficient (save for very small lists, where the setup overhead dominates). It is O(log2(n)). For 800k elements, a full linear search is about 800k comparisons, a short-circuited linear search about 400k on average (800k worst case), and a binary search about 20 comparisons (log2(800000) ≈ 19.6). There's not much to argue here.

        In this particular thread, the sort plus 1000 searches is still going to be blown away by inverting the loop and not doing any searches at all, as sketched below. Such is the power of the hash lookup.
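
        A sketch of that inverted loop (hypothetical names @keys and @big_array): hash the 1000 keys once, then make a single pass over the 800k-element array:

            # Hash the keys being searched for: O(k).
            my %wanted = map { $_ => 1 } @keys;

            # One pass over the big array: O(n) total,
            # instead of O(k * n) for k repeated linear scans.
            my @matches;
            for my $elem (@big_array) {
                push @matches, $elem if $wanted{$elem};
            }
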
        The point is: ...

        If you think that this is The point, could you elaborate on its practical relevance with respect to the statement I originally (implicitly) made? That statement was simply that skipping unnecessary iterations/comparisons is faster (in the typical case) than doing a full iteration every time, as grep does. This holds even though the upper bound expressed by both algorithms' Big O classification is the same. I never made any claim about the method being O(n) or not. Also, no doubt a binary search on a sorted list would be faster...
