http://qs321.pair.com?node_id=1112131

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have an application where I need to be able to map some strings to some integers and then be able to lookup the integers from their strings or vice versa. (NOTE: both strings and integers are unique!)

In Perl this can be (and currently is) achieved by using two hashes:

my %intByStr = populate(); my %strByInts; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr; ...

But the penalty for this is that every key and value is stored twice. Which means for a fairly modest dataset of just 450,000 key/value pairs, the total storage requirements are 95MB:

$i=1; $intByStr{ $_ } = $i++ for 'aaaa'..'zzzz';; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr;; print total_size( $_ ) for \(%intByStr, %strByInt);; 50805928 44297159

As my dataset can be somewhat larger, I'd like to minimise the space requirement. A modest improvement can be achieved by storing both mappings in a single hash:

undef %AbyB; $i=1; $AbyB{ $_ } = $i, $AbyB{ $i++ } = $_ for 'aaaa'..' +zzzz';; print total_size( \%AbyB );; 80479783

The potential space requirement for my dataset, which might grow to 10x (or 20x or more) the size of the examples above, is such that I'm looking for an alternative, even if it means coding the solution in (Inline::)C, but simply using Perl's hashes from C won't gain me anything.

I could use a more space efficient hash implementation, which I found a few of on the net; that usually come with a penalty of less flexibility which I could live with for this purpose.

Somewhere in the back of my mind I seem to recall a bidirectional lookup mechanism/algorithm that didn't require the duplicate storage, but for the life of me I cannot remember anything about how it worked; and my attempts to search for "bidirectional search algorithm" all turn up variations on this which is a shortest path between two node of a graph algorithm and thus not useful for me.

To the question:

Does anyone out there know/remember/have any pointers to "bidirection lookup algorithm" that only stores the values once?

Update per /msg request for further info:

The strings are usually fairly short, 1 to 8 characters, but sometimes can extend further upto 32 or occasionally more. (Think: typical variable names!)

The integers are (unique) 64-bit values. Think memory addresses for 64-bit processors.

The data is loaded once. The two lookup mechanisms (hashes/arrays/indexes/whatever) would be built once as, or after, the data is read; so build/insert speed is not a great concern.

The vast majority of the uses will be lookups, many millions, so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient.

The data is discarded at the end of each run. There are no deletes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.