in reply to Hash lookups, Database lookups, and Scalability

Assuming a unique index (i.e. one row returned for any matched key) the database should be highly scalable. The speed is almost entirely dependant on the number of physical IO operations (reads) that are required to find and return the required data. For example, in a quick test with Sybase, on a 4.5 million row table a query that looks up a row based on a unique index finds that row with only 2 physical disk reads. On a 15 million row table a similar query takes 3 physical IOs. The server finds this row in 10 milliseconds if the data isn't cached, and requires no measurable time (based on its internal measuring tools) to fetch the data if it is cached.

Obviously you have lots of overhead (network IO, etc) over fetching data from an in-memory hash, but a modern RDBMS server will scale to hundreds of millions of rows for simple one-to-one lookups with no problems at all (BTW - these timings were done a dual Xeon linux server, hardly heavy-duty hardware as far as database servers are concerned...)


  • Comment on Re: Hash lookups, Database lookups, and Scalability

Replies are listed 'Best First'.
Re^2: Hash lookups, Database lookups, and Scalability
by jplindstrom (Monsignor) on Oct 31, 2004 at 15:11 UTC
    Also, if the table has a clustered index (Sybase/SQL Server) or the table is index organized (Oracle) or somesuch, there is no extra disk IO to read the row data.

    (From your description with 2 physical reads, it sounds like you used a clustered PK index.)

      There are various issues that enter into calculating the number of IOs. If the index covers the query (i.e. all the columns that are required for the output are part of the index), then only the index page/row needs to be read.

      If a clustered index is used, then with Sybase 11.9.2 and later if the table uses "all pages" locking then the leaf index page and the data page are one and the same, but if the table uses "data only" locking (commonly referred to as DOL) then the index leaf pages and the data pages are separate.

      Which all means that estimating the number of IOs for a particular query can be a lot of fun, and is the reason why cost-based optimizers (such as the one Sybase uses) are pretty complicated beasts...


        I see that a properly constructed index is going to be vital for optimum performance. Given the criteria of the dual crossreferenced lookups, how might I better construct the indices in the code I posted? I'm curious to see if the DB solution can be better optimized.