Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I have an application where I need to be able to map some strings to some integers and then be able to lookup the integers from their strings or vice versa. (NOTE: both strings and integers are unique!)

In Perl this can be (and currently is) achieved by using two hashes:

my %intByStr = populate(); my %strByInts; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr; ...

But the penalty for this is that every key and value is stored twice. Which means for a fairly modest dataset of just 450,000 key/value pairs, the total storage requirements are 95MB:

$i=1; $intByStr{ $_ } = $i++ for 'aaaa'..'zzzz';; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr;; print total_size( $_ ) for \(%intByStr, %strByInt);; 50805928 44297159

As my dataset can be somewhat larger, I'd like to minimise the space requirement. A modest improvement can be achieved by storing both mappings in a single hash:

undef %AbyB; $i=1; $AbyB{ $_ } = $i, $AbyB{ $i++ } = $_ for 'aaaa'..' +zzzz';; print total_size( \%AbyB );; 80479783

The potential space requirement for my dataset, which might grow to 10x (or 20x or more) the size of the examples above, is such that I'm looking for an alternative, even if it means coding the solution in (Inline::)C, but simply using Perl's hashes from C won't gain me anything.

I could use a more space efficient hash implementation, which I found a few of on the net; that usually come with a penalty of less flexibility which I could live with for this purpose.

Somewhere in the back of my mind I seem to recall a bidirectional lookup mechanism/algorithm that didn't require the duplicate storage, but for the life of me I cannot remember anything about how it worked; and my attempts to search for "bidirectional search algorithm" all turn up variations on this which is a shortest path between two node of a graph algorithm and thus not useful for me.

To the question:

Does anyone out there know/remember/have any pointers to "bidirection lookup algorithm" that only stores the values once?

Update per /msg request for further info:

The strings are usually fairly short, 1 to 8 characters, but sometimes can extend further upto 32 or occasionally more. (Think: typical variable names!)

The integers are (unique) 64-bit values. Think memory addresses for 64-bit processors.

The data is loaded once. The two lookup mechanisms (hashes/arrays/indexes/whatever) would be built once as, or after, the data is read; so build/insert speed is not a great concern.

The vast majority of the uses will be lookups, many millions, so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient.

The data is discarded at the end of each run. There are no deletes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (7)
As of 2024-04-23 16:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found