Re: Small Hash a Gateway to Large Hash?

by LanX (Saint)
on Feb 18, 2014 at 00:32 UTC


in reply to Small Hash a Gateway to Large Hash?

You shouldn't need to care about collisions if the keys are random and not pathologically crafted to force collisions.

A hash normally just grows (rehashes into more buckets) when there are too many entries, so access stays roughly O(1).
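As a minimal sketch of that growth behavior (assuming Perl 5 and a roughly known final size; the hash name and the 1_000_000 figure are placeholders, not from the original post), the hash can even be pre-sized so the buckets are allocated once instead of through repeated rehashing:

    use strict;
    use warnings;

    # Optional pre-sizing sketch: lookups stay ~O(1) whether the hash
    # grows incrementally or gets its buckets up front.
    my %big;
    keys(%big) = 1_000_000;   # lvalue keys() pre-extends the bucket array (perldoc -f keys)

    $big{"key_$_"} = $_ for 1 .. 1_000_000;
    print exists $big{'key_42'} ? "found\n" : "missing\n";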

IMHO your problem is not really speed but size. I once had the problem of a giant hash that was constantly swapping, so each access was limited by the speed of my hard disk (an ugly bottleneck).

I was able to solve that by transforming it into a HoH and splitting the old key into two halves.

i.e.  $new{long}{_key} = $old{long_key}

This worked because I was able to make the algorithm run through the keys in sorted order, i.e. each "lower" (inner) hash needed to be loaded into RAM only once, while the corresponding upper key ("long") was being processed.

This way Perl only needed to keep two hashes in memory at a time. This is quite scalable... the only remaining limitation is the size of your hard disk.

So if you can manage your accesses (reads and writes) in a way that "long_keys" with the same "start" are bundled together, that is my recommendation (within the limited information you gave).

If order doesn't matter, this should be easy!
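For illustration, a minimal sketch of that HoH split (the prefix length, the bucket directory and the helper names split_key, build_buckets and process_sorted are placeholders of mine, not part of the original setup; Storable is just one convenient way to park each inner hash on disk):

    use strict;
    use warnings;
    use Storable qw(store retrieve);   # core module, serializes hashes to disk

    my $prefix_len = 4;                # assumption: fixed-length key prefix

    # Cut "long_key" into an outer part ("long") and an inner part ("_key").
    sub split_key {
        my ($long_key) = @_;
        return ( substr($long_key, 0, $prefix_len),
                 substr($long_key, $prefix_len) );
    }

    # Build phase: bucket records by prefix, then write each inner hash once.
    sub build_buckets {
        my ($records, $dir) = @_;      # $records: hashref of long_key => value
        my %hoh;
        while ( my ($long_key, $value) = each %$records ) {
            my ($prefix, $suffix) = split_key($long_key);
            $hoh{$prefix}{$suffix} = $value;
        }
        store( $hoh{$_}, "$dir/$_.bucket" ) for keys %hoh;
    }

    # Read phase: walk the keys in sorted order, so each inner hash is
    # pulled into RAM exactly once.
    sub process_sorted {
        my ($long_keys, $dir, $callback) = @_;
        my ($current_prefix, $inner) = ('', undef);
        for my $long_key ( sort @$long_keys ) {
            my ($prefix, $suffix) = split_key($long_key);
            if ( $prefix ne $current_prefix ) {
                $inner          = retrieve("$dir/$prefix.bucket");
                $current_prefix = $prefix;
            }
            $callback->( $long_key, $inner->{$suffix} );
        }
    }

With reads grouped by prefix like that, only the outer hash and the currently active inner hash have to fit in RAM.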

HTH! :)

Cheers Rolf

( addicted to the Perl Programming Language)

Re^2: Small Hash a Gateway to Large Hash?
by lsherwood (Beadle) on Feb 18, 2014 at 21:49 UTC
    Rolf-

    Your solution is intriguing. We don't see evidence of swapping to disk, but it is certainly something to keep in mind.

      • No swapping means your hash is kept in RAM.

      • Noticeable collisions are very unlikely, so access to this hash can't be made meaningfully faster.

      • You said the hash is only built once and start-up time is no problem, so writing and rehashing doesn't bite you.

      • You said you are reading and processing many files... that's the most likely place for optimization.

      But without profiling this is only shooting in the dark.

      Could it be that you are parsing the files and checking if entries match against the hash?

      Then trie-optimized regexes could be very performant... (you might need several runs with separate regexes to stay within the regex engine's pattern-size limits, but 10000 short strings should fit).
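      A hedged sketch of that idea (since 5.10 Perl compiles a plain alternation of literal strings into a trie internally; the key list, chunk size and word-boundary anchors below are assumptions, not taken from the thread):

          use strict;
          use warnings;

          my %lookup     = map { $_ => 1 } qw(foo bar baz quux);   # stand-in for the real 10000-key hash
          my $chunk_size = 5000;                                   # keys per regex, to stay under pattern-size limits

          my @keys = sort keys %lookup;
          my @matchers;
          while ( my @chunk = splice @keys, 0, $chunk_size ) {
              my $alt = join '|', map { quotemeta } @chunk;
              push @matchers, qr/\b(?:$alt)\b/;                    # one trie-compiled alternation per chunk
          }

          while ( my $line = <STDIN> ) {
              for my $re (@matchers) {
                  if ( $line =~ $re ) {
                      print "hit: $line";
                      last;
                  }
              }
          }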

      But w/o much info that's again shooting in the dark.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        trie-optimized regexes could be very performant

        No, they couldn't. Follow your own advice and profile. The results will correct your misimpression.
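        For anyone who wants to measure rather than argue, a rough sketch with the core Benchmark module could look like this (the key set and test lines are made up; only real data and a real profile settle the question):

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            my @keys  = map { sprintf 'key%05d', $_ } 1 .. 10_000;
            my %seen  = map { $_ => 1 } @keys;
            my $alt   = join '|', map { quotemeta } @keys;
            my $trie  = qr/^(?:$alt)$/;                  # 5.10+ compiles this alternation into a trie
            my @tests = ( @keys[ 0 .. 99 ], ('no_such_key') x 100 );

            cmpthese( -2, {
                hash_exists => sub { my $n = 0; $n++ for grep { exists $seen{$_} } @tests },
                trie_regex  => sub { my $n = 0; $n++ for grep { /$trie/ }          @tests },
            });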
