comment on

Okay. I have my first pass solution completed, and here (as promised) is the code (in the form of an Inline::C script for RAD and testing), and a brief description.

Description:

The code presents a class: BiMap; which supports the following methods:

my $bm = BiMap::new( (U32)$initSize, (double)$factor );
initSize is the initial size the tables are allocated to.
(Fill) factor is a double representing the proportion of the table that will fill before it is resized to double its current size.
$bm->add( (u64)addr, (char*)$name );
Add a paired integer and string to the BiMap.
(Currently) returns the number of (additional) probes required to add the pair to both tables. (For debugging and tuning purposes.)
my $addr = $bm->findByStr( (char*)$name );
Lookup an integer from its paired string.
my $name = $bm->findByInt( U64 i );
Lookup a string from its paired integer.
Some utility/debug methods that may or may not persist into the final version:
- my $size = $bm->size;
  (Getter only!)
- my $used = $bm->used;
  (Getter only!)
- my $factor = $bm->factor;
  (Getter only!)
- $bm->dump( (BOOL)$flag );
  Dumps the BiMap to stdout. (Should probably be to stderr, but attempting to use stderr from I::C causes segfaults.)
  If flag is false only dumps the header rather than the full table.

The basic structure of the BiMap contains pointers to two parallel arrays: byInt[] of PAIR structures; byStr[] of U32 integers;

typedef struct {
    U64 addr;
    char *name;
} PAIR;

typedef struct {
    PAIR **byInt;
    U32   *byStr;
    U32    size;
    U32    used;
    double factor;
} BIMAP;
[download]

The PAIRs are inserted into byInt[] hashed by the (U64)addr in the PAIR.

The byStr[] holds indexes (+1 to allow 0 to represent empty), of the PAIRs, hashed by the (char*)name in the PAIR. Using indexes into byInt[] rather than address of the pairs themselves saves half the space at the cost of an extra indirection.

As a large majority of the names will be less than 8 characters; when that is the case the char* name component of the PAIR is used to hold the string directly, rather than a pointer to it, thus saving another 8-bytes per compliant PAIR. The high bit of the 8th byte is set to distinguish between directly stored strings and addresses of strings.

I tested several combinations of: a) various hash functions; and b) various probing algorithms; and in my (simplistic) testing; no other combination beat the performance of the (Jenkins) one-at-a-time hash function erstwhile used by Perl; and the very simple, +1 probing mechanism.

Results:

Using a single, combined Perl hash to look up 11 million strings from their associated 64-bit integer address, and vice versa, takes a total of 15.9 seconds and requires 4.17 GB:

C:\test>1112165-hash
Start: mem: 7,920 K
23762752
Hash built: mem: 4,382,828 K
Fetch by Name took 7.539046 seconds
Fetch by Addr took 8.349479 seconds
Hash queried: check mem: 4,382,852 K
[download]

Using BiMap on the same dataset using a table presized to accomodate 16 million pairs, and a 0.71 fill factor. takes 22.3 seconds and requires just over 1/2GB:

C:\test>BiMap -S=5 -F=0.71 -I=12000000 -D=0
bless(do{\(my $o = 52848976)}, "BiMap")

Fetch by Addr & Fetch by Name took 22.257803 seconds for 11881376 item
+s in a 16777216 sized BiMap
Mem: 580,552 K
[download]

By moving to a fill factor of 0.70 and pre-sizing the arrays to accommodate 33 million pairs, takes 19.8 seconds and requires just over 3/4GB:

C:\test>BiMap -S=5 -F=0.70 -I=20000000 -D=0
bless(do{\(my $o = 52848976)}, "BiMap")

Fetch by Addr & Fetch by Name took 19.896479 seconds for 11881376 item
+s in a 33554432 sized BiMap
Mem: 777,532 K
[download]

Conclusion:

Using a BiMap in preference to a Perl hash will allow 8 times as much data to be processed per run; thus both reducing the number of runs required whilst increasing the statistical validity of the results by well over 8 times.

Size for size, the total runtime of the tests will increase by approximately 40%. A fair penalty for the substantial increase in statistical validity.

Other options:

It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken.
I'm also considering trying a minimal radix tree/Trie implementation to see how it compares. It won't achieve the space reduction of a Judy Array, but as it automatically 'compresses' any prefix overlap in the key sets, it might be competitive and should be fast as it avoid both hashing and some indirection.
Perfect hashing/Minimal perfect hashing seems like it might both reduce the space requirement and speed up the lookups.
However, generating such functions is non-trivial if you start from scratch, but life's too short to try extract useful information from the SourceForge website for the CMPH project.
The examples suggest that it will only take the inputs from disk; which is a pain as for this case, the PAIRs are generated entirely in memory and writing them out to disk, only to have them read back in, would negate any gains.
Of course, it could be that there is an api for supplying the keys directly from memory, but as there doesn't appear to be any API documentation; nor a way to view the individual files via SF, who knows.
In addition, I work in the Windows world and experience tells me that it would probably take longer to work out how to build the library on Windows than to re-write it from scratch. All of these OSS projects seem to have interminable dependencies upon autoconf or cmake or any of a dozen other variations of build tools that either don't work on windows, or if they do, themselves have half a dozen more dependencies some or all of which that don't.
Other/"better" hashing functions and/or probing algorithms.
Having tried several of each, there seems to be no particular benefit for my particular application data, without moving to separate chaining, which incurs the dual penalties of extra complexity and extra memory; taking things back in the direction of the well-tuned generality of the Perl implementation.

The code:

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to Re: Bidirectional lookup algorithm? (Solution.) by BrowserUk
in thread Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Come for the quick hacks, stay for the epiphanies.
	PerlMonks