comment on

A day or two ago there was a question posed to Seekers of Perl Wisdom regarding a situation where the OP needed to be able to retain the functionality of tandom lookup hashes, but in an application where the dataset was sufficiently large that tandem lookup hashes were too big to fit in memory.

In this case, when I refer to tandem lookup hashes, what I mean is that the values of one hash are the same as the keys of the other hash, and vice versa. In other words, if I wanted to be able to look up a person by name to determine his occupation, or by occupation to determine his name, I might use two hashes with mirror (opposite) images of each other.

Like a dummy I didn't, at first, see the part of the OP's question where he stated that for one reason or another, a database wasn't a plausable solution. Thus, I forged ahead with the suggestion that he use a database (like I said, ...dummy).

My logic at the time was that a two-column database, with both columns uniquely indexed, would be just about the quickest solution for cross lookups that would still scale well when the dataset grew to a size that couldn't be held in a pair of hashes in memory.

Obviously the database solution is not going to be as efficient as a pair of crossreferencing tandem hashes. But I started wondering just how bad things could get. ...so I set out to create a decent benchmark.

My benchmark script starts by creating 999 word pairs based on unique combinations of the top 999 most commonly used words in the English language. This happened to be a word list that I had sitting on my hard drive, and I find it kind of handy for this sort of tinkering. I've placed the word list at http://davido.perlmonk.org/monastery/t1000f.txt. The script uses LWP::Simple to grab this document and process it.

The script processes the word list by turning it into a bunch of word pairs. The word pairs are then inserted into a database (first-run only. If the DB already exists, this step is skipped). The database has both of its columns indexed for best lookup performance. It is a simple SQLite database.

Next, the script generates a forward and reverse lookup pair of hashes. It reports the size in bytes of these hashes of 999 word/word pairs, just for frame of reference.

Then, the benchmark script tests how fast it can choose at random a key from the %forward hash, and grab its value (repeat this for a few seconds). It then tests how fast it can do a database lookup based on a random key (repeat this test for a few seconds too).

And finally, the comparisons are reported. Here is the output:

Testing for existing word-pair database...
Database found.
Number of word pairs: 999.
Combined size of %forward and %backward lookup hashes: 121566 bytes.
Benchmarking...
            Rate Database     Hash
Database  1323/s       --     -98%
Hash     61897/s    4579%       --
[download]

As you can see, the hash lookup is 4579% faster than the database lookup. (That's significant.) But next we should test scalability. That's a lot more interesting than comparing a hash to a database.

For that test, I just cut the word list in half. The half-size version is at http://davido.perlmonk.org/monastery/t500f.txt. Commenting out a couple of lines, and uncommenting two others is all it takes to switch the benchmark to use the shorter word list. And here are the results:

<snip>
Number of word pairs: 499.
Combined size of %forward and %backward lookup hashes: 59934 bytes.
Benchmarking...
            Rate Database     Hash
Database  1389/s       --     -98%
Hash     65558/s    4621%       --
[download]

We already know that hash lookups are essentially O(1) (no growth as dataset grows). By comparing the two database benchmarks to each other, however, we can learn a little about the order of growth of indexed database lookups as the dataset grows. As we cut the dataset from 999 elements to 499 elements, we see a 5.9% increase in lookup speed. That seems pretty small, but I don't know enough about the SQLite database engine to know what the big-O order of growth of lookup times is as datasets grow. ...perhaps someone could fill in the blank there.

Here is the code used for the benchmarking. Feel free to run it. On first run you'll see a warning as it tests for database existance. Don't worry about that, the warning is just a noisy part of table detection. The script requires DBD::SQLite, DBI, Devel::Size, LWP::Simple, and Benchmark. Enjoy!

#!perl

use strict;
use warnings;

use Devel::Size qw( total_size );
use LWP::Simple;
use Benchmark qw( cmpthese );
use DBI;


# Comment out either the first declaration of $database and $url,
# or the second declaration, depending on which test you wish
# to run.  This is to test scalability.

#my $database = 'allwords.db';
#my $url = 'http://davido.perlmonk.org/monastery/t1000f.txt';

my $database = 'halfwords.db';
my $url = 'http://davido.perlmonk.org/monastery/t500f.txt';

# --------------

my( %forward, %backward );

my $dbh = DBI->connect( "DBI:SQLite:$database", 
                        '', '', 
                        { RaiseError => 1, 
                          AutoCommit => 0 } 
);

{
    print "Testing for existing word-pair database...\n";
    eval {  my $sth = $dbh->prepare( 
                "SELECT * FROM wordtable WHERE 0=1" 
            );
            $sth->execute;
            $sth->finish;
    };
    if( $@ ) {
        print "Database not found.  It will be created.\n";
        {
            print "Pulling words list from server.\n";
            my @words = split /\s+/, get( $url);
            @forward{ @words } = reverse @words;
            @backward{ reverse @words } = @words;
        }
        print "Creating word-pair database...\n";
        $dbh->do(
            "CREATE TABLE wordtable
                ( Left  varchar,
                  Right varchar
             ) " 
        );
        $dbh->do(
            "CREATE UNIQUE INDEX Index_Left
             ON wordtable ( Left )" 
        );
        $dbh->do(
            "CREATE UNIQUE INDEX Index_Right
             ON wordtable ( Right )"
        );
        my $sth = $dbh->prepare( 
            "INSERT INTO wordtable
             ( Left, Right ) VALUES ( ?, ? )" 
        );
        while ( my( $left, $right ) = each %forward ) {
            $sth->execute( $left, $right );
        }
        $sth->finish;
        $dbh->commit;
        print "Database created.\n";
    } else {
        print "Database found.\n";
        my $sth = $dbh->prepare( "SELECT * FROM wordtable" );
        $sth->execute();
        while ( my( @pair ) = $sth->fetchrow_array() ) {
            $forward{ $pair[0] } = $pair[1];
            $backward{ $pair[1] } = $pair[0];
        }
        $sth->finish();
    }
}

my @find = keys %forward;

my $sth = $dbh->prepare(
    "SELECT Right FROM wordtable WHERE Left = ?" );

print "Number of word pairs: ",
      scalar @find,
      ".\n";
print "Combined size of \%forward and \%backward lookup hashes: ", 
      total_size( \%forward ) + total_size( \%backward ),
      " bytes.\n";

print "Benchmarking...\n";

cmpthese( -5, {
    Hash => sub { 
        my $value = $forward{ $find[ rand( @find ) ] }; 
    },
    Database => sub {
        $sth->execute( $find[ rand( @find ) ] );
        my($value) = ( $sth->fetchrow_array() );
    },
} );

$sth->finish();

$dbh->disconnect();
[download]

In retrospect I wish I had written this in a way that made it easier to compare the small DB lookups to the large-DB lookups, but I didn't at first think about doing that comparison. Note: The script is set to test the 499-element word set. For the 999-word set, just change which declaration lines are commented out at the top of the script. This is a pretty quick-n-dirty script. It's far from elegant. ;) Have fun.

Dave

In reply to Hash lookups, Database lookups, and Scalability by davido

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Perl: the Markov chain saw
	PerlMonks