http://qs321.pair.com?node_id=404105

A day or two ago there was a question posed to Seekers of Perl Wisdom regarding a situation where the OP needed to retain the functionality of tandem lookup hashes, but in an application where the dataset was large enough that the tandem lookup hashes were too big to fit in memory.

In this case, when I refer to tandem lookup hashes, what I mean is that the values of one hash are the same as the keys of the other hash, and vice versa. In other words, if I wanted to be able to look up a person by name to determine his occupation, or by occupation to determine his name, I might use two hashes with mirror (opposite) images of each other.
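
A minimal sketch of what I mean (the names here are made up purely for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Forward hash: look up an occupation by name.
    my %occupation_by_name = (
        Alice => 'plumber',
        Bob   => 'carpenter',
    );

    # Backward hash: the mirror image, looking up a name by occupation.
    # (This only works cleanly while both names and occupations are unique.)
    my %name_by_occupation = reverse %occupation_by_name;

    print $occupation_by_name{Alice},     "\n";   # plumber
    print $name_by_occupation{carpenter}, "\n";   # Bob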

Like a dummy I didn't, at first, see the part of the OP's question where he stated that for one reason or another, a database wasn't a plausible solution. Thus, I forged ahead with the suggestion that he use a database (like I said, ...dummy).

My logic at the time was that a two-column database, with both columns uniquely indexed, would be just about the quickest solution for cross lookups that would still scale well when the dataset grew to a size that couldn't be held in a pair of hashes in memory.
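
Concretely, the schema I had in mind (and the one the benchmark script below creates) is just a two-column table with a unique index on each column, so a lookup in either direction is a single indexed SELECT. The reverse query isn't actually exercised by the benchmark, but this is the idea:

    CREATE TABLE wordtable ( Left varchar, Right varchar );
    CREATE UNIQUE INDEX Index_Left  ON wordtable ( Left  );
    CREATE UNIQUE INDEX Index_Right ON wordtable ( Right );

    -- Forward lookup:
    SELECT Right FROM wordtable WHERE Left  = ?;
    -- Reverse lookup:
    SELECT Left  FROM wordtable WHERE Right = ?;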

Obviously the database solution is not going to be as efficient as a pair of cross-referencing tandem hashes. But I started wondering just how bad things could get. ...so I set out to create a decent benchmark.

My benchmark script starts by creating 999 word pairs based on unique combinations of the top 999 most commonly used words in the English language. This happened to be a word list that I had sitting on my hard drive, and I find it kind of handy for this sort of tinkering. I've placed the word list at http://davido.perlmonk.org/monastery/t1000f.txt. The script uses LWP::Simple to grab this document and process it.

The script processes the word list by turning it into a bunch of word pairs. The word pairs are then inserted into a database (on the first run only; if the DB already exists, this step is skipped). The database has both of its columns indexed for best lookup performance. It is a simple SQLite database.

Next, the script generates a forward and reverse lookup pair of hashes. It reports the size in bytes of these hashes of 999 word/word pairs, just for a frame of reference.

Then, the benchmark script tests how fast it can choose at random a key from the %forward hash, and grab its value (repeat this for a few seconds). It then tests how fast it can do a database lookup based on a random key (repeat this test for a few seconds too).

And finally, the comparisons are reported. Here is the output:

    Testing for existing word-pair database...
    Database found.
    Number of word pairs: 999.
    Combined size of %forward and %backward lookup hashes: 121566 bytes.
    Benchmarking...
                  Rate Database     Hash
    Database    1323/s       --     -98%
    Hash       61897/s    4579%       --

As you can see, the hash lookup is 4579% faster than the database lookup. (That's significant.) But next we should test scalability. That's a lot more interesting than comparing a hash to a database.

For that test, I just cut the word list in half. The half-size version is at http://davido.perlmonk.org/monastery/t500f.txt. Commenting out a couple of lines and uncommenting two others is all it takes to switch the benchmark to use the shorter word list. And here are the results:

    <snip>
    Number of word pairs: 499.
    Combined size of %forward and %backward lookup hashes: 59934 bytes.
    Benchmarking...
                  Rate Database     Hash
    Database    1389/s       --     -98%
    Hash       65558/s    4621%       --

We already know that hash lookups are essentially O(1) (no growth as the dataset grows). By comparing the two database benchmarks to each other, however, we can learn a little about the order of growth of indexed database lookups as the dataset grows. As we cut the dataset from 999 elements to 499 elements, we see about a 5% increase in lookup speed (1323/s vs. 1389/s). That seems pretty small, but I don't know enough about the SQLite database engine to know what the big-O order of growth of lookup times is as datasets grow. ...perhaps someone could fill in the blank there.
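
As a back-of-the-envelope check (my own arithmetic here, not something the script measures): if the indexed lookup were O(log n), going from 499 to 999 rows should cost roughly log(999)/log(499) more per lookup, i.e. about 11%, while the benchmark shows only about 5%; the gap is plausibly explained by the fixed per-query overhead (statement execution, fetching, DBI dispatch) that doesn't grow with the table.

    perl -e 'printf "%.3f\n", log(999)/log(499)'    # prints 1.112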

Here is the code used for the benchmarking. Feel free to run it. On first run you'll see a warning as it tests for database existence. Don't worry about that; the warning is just a noisy part of table detection. The script requires DBD::SQLite, DBI, Devel::Size, LWP::Simple, and Benchmark. Enjoy!

#!perl
use strict;
use warnings;

use Devel::Size qw( total_size );
use LWP::Simple;
use Benchmark qw( cmpthese );
use DBI;

# Comment out either the first declaration of $database and $url,
# or the second declaration, depending on which test you wish
# to run.  This is to test scalability.

#my $database = 'allwords.db';
#my $url      = 'http://davido.perlmonk.org/monastery/t1000f.txt';

my $database = 'halfwords.db';
my $url      = 'http://davido.perlmonk.org/monastery/t500f.txt';

# --------------

my( %forward, %backward );

my $dbh = DBI->connect(
    "DBI:SQLite:$database", '', '',
    { RaiseError => 1, AutoCommit => 0 }
);

{
    print "Testing for existing word-pair database...\n";
    eval {
        my $sth = $dbh->prepare( "SELECT * FROM wordtable WHERE 0=1" );
        $sth->execute;
        $sth->finish;
    };

    if( $@ ) {
        print "Database not found.  It will be created.\n";
        {
            print "Pulling words list from server.\n";
            my @words = split /\s+/, get( $url );
            @forward{  @words         } = reverse @words;
            @backward{ reverse @words } = @words;
        }
        print "Creating word-pair database...\n";
        $dbh->do( "CREATE TABLE wordtable ( Left varchar, Right varchar )" );
        $dbh->do( "CREATE UNIQUE INDEX Index_Left  ON wordtable ( Left  )" );
        $dbh->do( "CREATE UNIQUE INDEX Index_Right ON wordtable ( Right )" );

        my $sth = $dbh->prepare(
            "INSERT INTO wordtable ( Left, Right ) VALUES ( ?, ? )"
        );
        while ( my( $left, $right ) = each %forward ) {
            $sth->execute( $left, $right );
        }
        $sth->finish;
        $dbh->commit;
        print "Database created.\n";
    }
    else {
        print "Database found.\n";
        my $sth = $dbh->prepare( "SELECT * FROM wordtable" );
        $sth->execute();
        while ( my( @pair ) = $sth->fetchrow_array() ) {
            $forward{  $pair[0] } = $pair[1];
            $backward{ $pair[1] } = $pair[0];
        }
        $sth->finish();
    }
}

my @find = keys %forward;
my $sth  = $dbh->prepare( "SELECT Right FROM wordtable WHERE Left = ?" );

print "Number of word pairs: ", scalar @find, ".\n";
print "Combined size of \%forward and \%backward lookup hashes: ",
      total_size( \%forward ) + total_size( \%backward ), " bytes.\n";
print "Benchmarking...\n";

cmpthese( -5, {
    Hash => sub {
        my $value = $forward{ $find[ rand( @find ) ] };
    },
    Database => sub {
        $sth->execute( $find[ rand( @find ) ] );
        my( $value ) = $sth->fetchrow_array();
    },
} );

$sth->finish();
$dbh->disconnect();

In retrospect I wish I had written this in a way that made it easier to compare the small-DB lookups to the large-DB lookups, but I didn't at first think about doing that comparison. Note: The script is set to test the 499-element word set. For the 999-word set, just change which declaration lines are commented out at the top of the script. This is a pretty quick-n-dirty script. It's far from elegant. ;) Have fun.


Dave

Re: Hash lookups, Database lookups, and Scalability
by tilly (Archbishop) on Oct 31, 2004 at 09:07 UTC
    I'd expect most database lookups to scale like O(log(n)) for a lookup. For scalability I'd also mainly be interested in the case where you had enough data that it doesn't all get stored in RAM, which would require a far larger dataset than you have.
      "I'd expect most database lookups to scale like O(log(n)) for a lookup"

      Why? Don't oversimplify the situation. A database search can range from an indexed search to a full table scan. As for indexed searches, in many RDBMS systems the lookup is in fact a hash lookup, and the index is how you tell the database what hashes to create. The numbers the OP gave are not purely for the lookup; they are a mixture of everything, including IO and network communication, so they can't be used as-is to measure the performance of the database's search algorithm.

      As for scalability, I want to stress that there is another kind of SCALABILITY here: to do a tandem lookup with hashes you have to write extra code and maintain a special data structure, but with a database you don't. What if you need lookups by 5 different forms of key? With a database, that is simply a different query and at most some new indexes; with hash lookups, you have to SCALE UP your coding effort and complicate your code.
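
      To make that concrete, here is a minimal hash-side sketch (my own illustration; the records and field names are hypothetical and not from the OP's script). Every extra lookup key means another hash to build and keep in sync, whereas on the database side it is just another CREATE INDEX and a different WHERE clause.

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Hypothetical records, purely for illustration.
          my @people = (
              { name => 'Alice', occupation => 'plumber',   city => 'Reno'  },
              { name => 'Bob',   occupation => 'carpenter', city => 'Boise' },
          );

          # One hash per lookup key, all of which must be rebuilt or updated
          # together whenever a record is added, changed, or removed.
          my( %by_name, %by_occupation, %by_city );
          for my $p ( @people ) {
              $by_name{       $p->{name}       } = $p;
              $by_occupation{ $p->{occupation} } = $p;
              $by_city{       $p->{city}       } = $p;
          }
          print $by_occupation{carpenter}{name}, "\n";   # Bob

          # The database equivalent of adding a lookup key is just:
          #   CREATE INDEX Index_City ON people ( City );
          #   SELECT Name FROM people WHERE City = ?;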

        "I'd expect most database lookups to scale like O(log(n)) for a lookup"
        Why? Don't oversimplify the situation. A database search can range from an indexed search to a full table scan. As for indexed searches, in many RDBMS systems the lookup is in fact a hash lookup, and the index is how you tell the database what hashes to create. The numbers the OP gave are not purely for the lookup; they are a mixture of everything, including IO and network communication, so they can't be used as-is to measure the performance of the database's search algorithm.
        My statement was entirely based on theory.

        The description given was selecting a single value from a table with indexes on both columns. That means that the lookup is happening on an index. For the big-O estimate you have to look at what happens as the dataset gets large. Network communication is a constant factor and I/O is part of the search time.

        My statement about O(log(n)) therefore presumes that you are using an index with a large dataset. The question then becomes what kind of index. There are many kinds of indexes out there. Yes, you can use a hash and get O(1). But default indexes tend to be a hierarchical data structure that is O(log(n)), because they cooperate better with caching to avoid I/O, leading to a much better constant. (That is why the search algorithm should be chosen with I/O in mind.)
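
        (A toy illustration of the O(log(n)) behaviour, my own sketch rather than anything from the databases under discussion: the depth of a balanced search structure, approximated here by a plain binary search, grows by only about one comparison each time the dataset doubles.)

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Count comparisons in a binary search, as a stand-in for the depth
            # of a tree-style index.
            sub bsearch_steps {
                my( $target, $sorted ) = @_;            # $sorted is an array ref
                my( $lo, $hi, $steps ) = ( 0, $#$sorted, 0 );
                while ( $lo <= $hi ) {
                    $steps++;
                    my $mid = int( ( $lo + $hi ) / 2 );
                    if    ( $sorted->[$mid] < $target ) { $lo = $mid + 1 }
                    elsif ( $sorted->[$mid] > $target ) { $hi = $mid - 1 }
                    else                                { return $steps  }
                }
                return $steps;
            }

            for my $n ( 499, 999, 1_000_000 ) {
                my @data = ( 1 .. $n );
                print "n = $n: ", bsearch_steps( $n, \@data ), " comparisons\n";
            }
            # Roughly 9, 10, and 20 comparisons -- classic O(log n) growth.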

        About your coding comments, I have no disagreement with that and have said similar things on many occasions myself.

Re: Hash lookups, Database lookups, and Scalability
by mpeppler (Vicar) on Oct 31, 2004 at 13:13 UTC
    Assuming a unique index (i.e. one row returned for any matched key) the database should be highly scalable. The speed is almost entirely dependent on the number of physical IO operations (reads) that are required to find and return the required data. For example, in a quick test with Sybase, on a 4.5 million row table a query that looks up a row based on a unique index finds that row with only 2 physical disk reads. On a 15 million row table a similar query takes 3 physical IOs. The server finds this row in 10 milliseconds if the data isn't cached, and requires no measurable time (based on its internal measuring tools) to fetch the data if it is cached.

    Obviously you have lots of overhead (network IO, etc.) compared with fetching data from an in-memory hash, but a modern RDBMS server will scale to hundreds of millions of rows for simple one-to-one lookups with no problems at all. (BTW - these timings were done on a dual Xeon Linux server, hardly heavy-duty hardware as far as database servers are concerned...)

    Michael

      Also, if the table has a clustered index (Sybase/SQL Server) or the table is index organized (Oracle) or somesuch, there is no extra disk IO to read the row data.

      (From your description with 2 physical reads, it sounds like you used a clustered PK index.)

        There are various issues that enter into calculating the number of IOs. If the index covers the query (i.e. all the columns that are required for the output are part of the index), then only the index page/row needs to be read.

        If a clustered index is used, it depends on the locking scheme (with Sybase 11.9.2 and later): if the table uses "all pages" locking, then the leaf index page and the data page are one and the same, but if the table uses "data only" locking (commonly referred to as DOL), then the index leaf pages and the data pages are separate.

        Which all means that estimating the number of IOs for a particular query can be a lot of fun, and is the reason why cost-based optimizers (such as the one Sybase uses) are pretty complicated beasts...

        Michael

Re: Hash lookups, Database lookups, and Scalability
by pg (Canon) on Oct 31, 2004 at 17:40 UTC

    First, DBD::SQLite is not a sound database anyway. Also, there is a big performance boost when you wrap your queries in a DBD::SQLite transaction. I didn't see you use one, and it would be interesting to see another benchmark with that enabled.

    In general, database searches are less efficient than in-memory ones. First, you have more freedom to tailor your algorithm and data structure for an in-memory search; second, for a database search the results are transferred over the network, a two-way communication for each query, which obviously costs extra time. How good the database's caching is will also affect performance.

    However, this does not mean that database search is not a viable solution. A database gives you lots of important benefits beyond just speed, which is OT here.

      I'm curious what you mean by "DBD::SQLite is not a sound database". I know it's a single-file, lightweight, minimal-SQL database, but to me something that isn't sound is buggy. Is that what you're saying? Just curious. (By the way, I just discovered a GPF-producing bug if I switch my code to use selectrow_array() with a pre-prepared statement on Win32, so maybe you've got a point. ;)

      And yes, I fully expected the database solution to be much slower than the in-memory hash solution. I mentioned that in my post. I was attempting to quantify how much slower, and how sensitive the database version is to dataset-size.

      Wrapping in transactions definitely helps. Take the example of my building the database initially. With transactions, it takes the blink of an eye, but without, it takes about 45 seconds to build the database on my system. But on the query side, I'm specifically trying to engage in single queries, and didn't want to see transactions influencing the speed. However, I agree it would be interesting to rewrite the test in such a way that it could take advantage of transactions, and I may do that just to see what comes of it.
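
      For anyone curious, here is a minimal sketch of the difference (an illustration of the idea, not the benchmark script; the demo database name is made up): with AutoCommit on, every INSERT is its own transaction and gets synced to disk separately, which is where the 45 seconds comes from, while turning AutoCommit off and committing once batches all the inserts into a single transaction, which is essentially what the benchmark script above does when it first builds its database.

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;

          # Hypothetical demo database, not the benchmark's allwords.db/halfwords.db.
          my $dbh = DBI->connect( "DBI:SQLite:txn_demo.db", '', '',
                                  { RaiseError => 1, AutoCommit => 0 } );
          $dbh->do( "CREATE TABLE wordtable ( Left varchar, Right varchar )" );

          # With AutoCommit => 0 these inserts all belong to one transaction;
          # with AutoCommit => 1 each execute() would be committed (and synced
          # to disk) separately, which is dramatically slower.
          my $sth = $dbh->prepare(
              "INSERT INTO wordtable ( Left, Right ) VALUES ( ?, ? )" );
          $sth->execute( $_, scalar reverse $_ ) for 'aa' .. 'zz';
          $sth->finish;
          $dbh->commit;          # one commit for all 676 rows
          $dbh->disconnect;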

      Anyway, it's kind of fun to tinker with this stuff. ...hope you agree, even if the database discussion is diverging a little from core-Perl topics.


      Dave

        "Anyway, it's kind of fun to tinker with this stuff. ...hope you agree, even it the database discussion is diverging a little from core-Perl topics. "

        I agree 100 percent. I did notice that recently we are having more and more database-related discussions here, which is very positive, at least by my standards. This indicates to me that more and more people are using Perl for a wider range of application areas, and most likely the average size of the applications is going upwards. This is definitely positive news.
