Re: A (memory) poor man's hash
by tilly (Archbishop) on Nov 21, 2003 at 19:35 UTC
First, an important correction. The amount of memory taken to store a hash varies widely depending on how you build it. On 5.8, initializing a large hash with undef $foo{$_} for 1..1_000_000; takes about 69 MB. Initializing with undef @foo{1..1_000_000}; runs faster but takes about 130 MB. (This was run with Perl 5.8.1 on Linux; YMMV.) I'm guessing that the difference is down to how much intermediate buffering is going on.
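For anyone who wants to reproduce the comparison, a minimal sketch of the two initialisation styles (absolute memory figures are machine- and build-dependent; a smaller key count here keeps the run quick):

```perl
# Two ways to pre-populate a hash with undef values.
my (%a, %b);

# Style 1: iterate -- 'for' treats the integer range as an iterator,
# so no intermediate list is built.
undef $a{$_} for 1 .. 100_000;

# Style 2: hash slice -- the list 1 .. 100_000 is materialised first,
# which costs extra memory while the slice is assigned.
undef @b{ 1 .. 100_000 };

# Both end up with the same keys; only the allocation pattern differs.
print scalar( keys %a ), " ", scalar( keys %b ), "\n";
```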
The second tip: if you are going to use dbm with a lot of data, always, always, always use a BTREE. On a machine about ten times faster than yours, my insert time was 12 seconds, so call that 2 minutes for you. See the docs for a better overview, and if you need to tune it further, see here in addition.
Now I'll grant you that walking the keys close to in order is pretty much a best-case scenario, but very few real-world workloads lack considerable locality of reference, so BTREE accesses tend to win big on datasets of up to a few hundred million records.
And this leads to a general note. Hashes are probably the fastest general-purpose access method when you have a flat memory model, and they serve Perl very well. But the real world is moving more and more towards multiple tiers of data storage with very different speeds between them. (In decreasing order of speed: on the CPU we have level 1, 2, and 3 caches, followed by main RAM; NUMA adds a few layers here; then we have local hard drives, attached storage devices, network drives, then stuff remote to the LAN.) Given the differentials in Moore's law, this trend is likely to keep growing in importance as the ratios between the speeds of these layers become more and more extreme, making alternatives to hashing more and more attractive.
First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it....
The difference in the memory consumed by the two methods you show is that with undef @foo{ 1 .. 1_000_000 };, the extra memory is consumed by generating the list 1 .. 1_000_000 before the hash slice is assigned.
When you use ...for 1 .. 1_000_000;, no list is generated, as for has an optimisation that allows integer ranges to act as iterators rather than list generators.
The method I used was
$hash{ $_ } = undef for 1 .. 1_000_000;
but on my machine,
undef $foo{ $_ } for 1 .. 1_000_000;
results in identical memory consumption. Before the hash is allocated, the perl process has 3332k allocated to it; after the allocation, it has 98184k. Hence my assertion that the memory required is 95 MB. I can't explain the difference between the 95 MB I am seeing and the 69 MB you reported.
I tried using Devel::Size to verify that the memory allocation the OS is reporting is actually being used by the hash, rather than some portion of it being intermediate working storage that would be available to the process for other purposes afterwards. However, calling either size( \%hash ) or total_size( \%hash ) doubled the memory consumed by the process, pushing it beyond both my physical and virtual memory capacity before segfaulting.
Using DB_BTREE rather than DB_HASH does speed up the insertion considerably.
use DB_File;
tie %hash, 'DB_File', 'tree1M.db',
    O_RDWR|O_CREAT, 0666, $DB_BTREE or die $!;
print time; $hash{ $_ } = undef for 1 .. 1_000_000; print time;
1069482082
1069482433 # 351 secs (5.85 mins)
$n=0;
print time; exists $hash{ $_ } and $n++ for 1 .. 1_000_000; print time, $/, $n;
1069482523
1069482799 # 276 secs (4.6 mins)
1000000
It reduces the insertion time from an hour to 5 minutes and the iteration time from 15 minutes to 4 at the cost of increasing the size of the database file from roughly twice the flat file size to 4x.
If you are only using the hash for its fast "does it exist" property, then the custom hash mechanism I described compares favourably, with the insertion taking 16 seconds and the iteration 44.
print time;
$hash{ substr $_, 0, 4 } .= $/ . $_ for 1 .. 1_000_000;
$hash{ $_ } .= $/ for keys %hash;
print time;
1069484550
1069484566 # 16 seconds.
$n=0;
print time;
for ( 1 .. 1_000_000 ) {
next unless exists $hash{ substr $_, 0, 4 }
and 1+index( $hash{ substr $_, 0, 4 }, $/ . $_ . $/ );
$n++;
}
print time, $/, $n;
1069484612
1069484656 # 44 secs.
1000000
With respect to caching: I was recently playing with prime number generators/testers and came across some discussion of techniques for optimising code (C and assembler) to make best use of L1 and L2 caching. I'll admit that much of the discussion was over my head, but my conclusion was that compilers will have to become much cleverer than they currently are before general code will fully benefit from these techniques.
The current trend toward larger and larger caches will mitigate this somewhat, but the depth of understanding required to fully utilise these caches, and to tailor algorithms to them, is so great, and so processor-specific, that I doubt it will ever be more than a niche exercise.
Whether processors will ever get to the point where the microcode is capable of utilising out-of-order execution, re-ordering the pipelines, or re-writing the instruction caches to optimise cache hit rates at anything beyond the keyhole scale, I'm not sure. Given that the disposable processor you get in musical birthday cards has more processing power in its 4 mm² of silicon than ENIAC did in its 30 tons consuming 200 kW, who knows what a "computer" will be capable of in another 60 years :)
It's just a shame I won't be around to see it.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
I thought of the overhead of building the list, but the difference was more than I would expect that to account for. I did not measure it, though. However, I just used Devel::Size and it reported 32083260 as the size of the hash and 44083260 as the total_size (regardless of how I built the hash). My guess is that the rest is extra buffering on the part of various memory requests, part of which will be the OS buffering Perl's mallocs -- which is likely to be highly OS-dependent, and possibly order-of-request-dependent as well.
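The measurement described above can be sketched as follows (Devel::Size is assumed to be installed; a smaller key count keeps the run quick, and the figures vary by perl build):

```perl
use Devel::Size qw(size total_size);

my %hash;
undef $hash{$_} for 1 .. 100_000;

# size() reports the hash structure and its keys;
# total_size() additionally walks the stored values.
my $s = size( \%hash );
my $t = total_size( \%hash );
print "size: $s, total_size: $t\n";
```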
On the implementation, note that the performance can be improved both by moving from tied data structures to OO access and again (probably more) by using documented tuning parameters. Additionally with a dataset this small, it may be acceptable to switch back to a hash stored in memory (since BerkeleyDB's hashing takes less memory than Perl's). Furthermore the fact that it is general-purpose makes me more willing to take that approach rather than constructing a special-purpose solution that is possibly fragile.
And on caching: I agree heartily that optimizing for perfection is only going to be a niche exercise. I disagree that optimizing for better use of multi-level memory is unimportant, though. I predict that over time you will see programmers being pushed towards data structures that are readily tuned to local memory configurations, dynamically if possible. You could view the popularity of databases as a step in this direction. A far bigger step is Judy arrays. Sure, it is very complicated to implement, but you only need to implement it once. (Well, Perl would need to reimplement it for licensing reasons...) And from the end programmer's perspective, it is just a question of replacing the black box of a hash with the black box of another associative array implementation, and voila! Things run faster!
A few more years of increasing discrepancies between access times, and the difference will become bigger still. Eventually it will be big enough that common scripting languages will make the switch. Most programmers will never need to understand what actually happened, any more than we really understand most of the ongoing changes that make Moore's law tick in the first place.
Re: A (memory) poor man's hash
by PetaMem (Priest) on Nov 21, 2003 at 19:00 UTC
Hi,
Congratulations, you just invented a kind of Trie.
Besides that, you speak of a very serious problem (at least for us). And we're not that poor on memory. But Perl forces us to migrate some applications to 32 GB RAM machines. Alas, they're 64-bit, and Perl is much more of a memory hog there than it is on 32-bit. *sigh*
A less memory-hoggy Perl is on my wishlist for 5.10 and compare also this node of mine.
This problem has only one solution: a better implementation of hashes within perl -- or at least some option where one can choose whether to trade memory for speed.
Again - I hope to see it in 5.10.
Yeah, it appears to be a hybrid Trie implementation, which I personally find extremely amusing. The primary drawback of Tries is that they are typically terribly wasteful of memory (except when the data set is extremely dense), which is why most of the time Tries are implemented in a hybrid form, or are designed to work on a limited data set, or both. OTOH, one advantage of a pure Trie lookup is that a C implementation of the algorithm performs as well as or better than a hash lookup, and does prefix matching as well.
---
demerphq
First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi
Re: A (memory) poor man's hash
by merlyn (Sage) on Nov 21, 2003 at 17:52 UTC
However, loading the same 1 million integers into a hash as keys, with undef as the value requires 95 MB!
I stopped reading there. I don't see your point. Besides storing all the data, you now have a meta-data structure that can tell you rather rapidly if $x is a member of this set you've created, as well as associate another scalar with each of those million keys!
You've got a lot more information than what you started with. You're not merely storing the keys.
If your complaint is that you want to be able to just store the keys, then yes, a hash was a bad choice, as you go on to point out.
But don't fault Perl's hash structure. It's very efficient for the task at hand.
I've changed the title of the post to more accurately reflect the intent of the post.
I thought my opening paragraph(s) made it very clear that I have no "complaint", nor was I "fault[ing] Perl's hash structure" for the vast majority of applications.
The only application I was suggesting this would be useful for is "...perhaps the most oft-used use of hashes is as fast lookup tables...", as exemplified by the referenced post Memory Management Problem.
Re: A (memory) poor man's <strike>hash</strike> lookup table.
by cees (Curate) on Nov 22, 2003 at 03:10 UTC
There appears to be a bug in your code which is masked by the data you chose and the test you made. Your search only matches the start of the key, when it should match the end as well. For example with your data, a search for the key 12345 will look in bin 1234 for a string matching $/.'12345', but it could find a string like 1234567 and still return a match.
Also, for interest's sake, you can reduce the memory footprint further by stripping the bucket index out of each of the keys. Using the above example again, if we wanted to store 1234567, we can just store 567 in the bucket keyed by 1234. The trivial case of storing 1234 will be stored as an empty string, which will be found correctly by searching for $/$/.
The following code reduced the footprint by another couple of megs. This reduction should also speed up the searching somewhat.
#!/usr/bin/perl
$hash{ substr $_, 0, 4 } .= $/ . substr($_, 4) for 1 .. 1000000;
$hash{$_} .= $/ foreach keys %hash;
$n=0;
print time, $/;
for (1 .. 1000000) {
next unless exists $hash{ substr $_, 0, 4 }
and 1+index( $hash{ substr $_, 0, 4 }, $/.substr($_, 4).$/ );
$n++;
}
print time, $/, $n;
Other than that, it is an interesting concept that I can't see myself ever using :)
- Cees
Indeed you are correct; my original code is flawed as you describe. Thanks for pointing out my error. Cees++
I've re-run a subset of the tests using the following slightly modified version, which corrects this error. The correction slows the insertion by approximately 1 second for the million keys; the iteration time comes out exactly the same.
print time;
$hash{ substr $_, 0, 4 } .= $/. $_ for 1 .. 1_000_000;
$hash{$_} .= $/ for keys %hash; # Append a final separator
print time;
1069484550
1069484566 # 16 seconds.
$n=0;
print time;
for ( 1 .. 1_000_000 ) {
# Ensure no partial matches by prepending and appending
# separators to the search key.
next unless exists $hash{ substr $_, 0, 4 }
and 1+index( $hash{ substr $_, 0, 4 }, $/.$_.$/ );
$n++;
}
print time, $/, $n;
1069484612
1069484656 # 44 secs.
1000000
Re: A (memory) poor man's <strike>hash</strike> lookup table.
by TimToady (Parson) on Nov 22, 2003 at 00:07 UTC
For another approach, look at Tie::SubstrHash. You'd have to recast your problem as keys of constant length, though. Still, it'd be interesting to see a timing comparison.
use Tie::SubstrHash;
tie %hash, 'Tie::SubstrHash', 7, 1, 1_000_000;
print time;
$hash{ pack 'A7', $_ } = ' ' for 1 .. 1_000_000;
print time;
1069489311
1069489911
$n = 0;
print time;
$n += defined $hash{ pack 'A7', $_ } for 1 .. 1_000_000;
print time, $/, $n;
1069490448
1069491023
1000000
I think that it may be possible to improve this a little by updating the module. It uses local in several places where my would probably be quicker. I suspect, though, that the main performance hit is the cost of tie rather than the implementation. Part of my reason for liking the mechanism I described was that its implementation is simple enough to be coded directly, avoiding the need for tie.
I also think that the user interface could be improved by having the module perform any required padding, rather than the caller.
Re: A (memory) poor man's hash
by Anonymous Monk on Nov 21, 2003 at 17:55 UTC
Maybe I'm being brain-dead here (wouldn't be the first time), but you start off with a hash of a million integer keys, and you end up with something much less useful than a hash: no associated values, and no guarantee of a million unique "keys" unless you use the two-pronged test during insertion. It may or may not function as a memory-efficient "set", but it isn't a hash (not even a poor man's hash).
Re: A (memory) poor man's hash
by Paulster2 (Priest) on Nov 21, 2003 at 18:20 UTC
Two things:
First: They are simple to use. Being the newbie that I am, I still find hashes a little daunting. I understand what they are and basically how they are used, but implementing them is another story. I have successfully modified others' writs and made some simple ones of my own. I guess getting the info in is a lot easier than getting it out. So while others may feel that they are easy to use, I haven't gotten there yet. In other words, I guess all of this comes down to opinion. I know that hashes are the life blood of Perl, so I guess I'd better start figuring them out.
Second: While hashes may have a tendency to soak up a lot of memory, they don't use it for that long. I haven't done the math, but I would bet that if you put a dollar amount on the time wasted using other methods, versus the bog you may inflict on other programs running on the same system, you would find you have saved tons of money using hashes. How many clock cycles do you waste waiting on inefficient languages to do their thing?
Bottom line to my statement is by looking at the big picture, you may be surprised!
Paulster2
PS: I ++ you anyway, even though I don't agree. Mainly because I found it a stimulating writ.
I still find hashes a little daunting.
BrowserUK (if I may presume) was speaking in relative terms. Hashes are quite easy compared to (for example) Perl's advanced data structures, object system, or closures.
While hashes may have a tendancy to soak up a lot of memory, it's not using it for that long.
The problem is that perl usually doesn't free memory back to the OS until the process exits (though it will reuse that memory for other things). This is particularly a problem for mod_perl, where the process will stick around as long as Apache is up (potentially years, if your sysadmin is negligent about security patches ;).
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
: () { :|:& };:
Note: All code is untested, unless otherwise stated
...(though it will reuse that memory for other things). This is particularly a problem for mod_perl, where the process...
One approach I always use when a request is going to be resource-intensive is to let the child process die after the request is served. You can do this with $r->child_terminate. The extra CPU cost of forking a new child process is a lot less than the CPU you would lose once the server starts swapping.
Liz
Re: A (memory) poor man's hash
by Anonymous Monk on Nov 21, 2003 at 18:43 UTC
Where is your benchmark code?
Re: A (memory) poor man's <strike>hash</strike> lookup table.
by demerphq (Chancellor) on Dec 01, 2003 at 13:20 UTC
Hi.
Over the weekend I did a little work on this. I loaded your million numbers into a digit trie (a Patricia Trie restricted to the digits 0-9) optimized for reduced memory consumption. The records loaded in around 100 seconds on a 2.2 GHz machine (if temporary space is not an issue, this time drops dramatically), and ultimately required less than 5 MB to store. I calculated the ratio of bytes used to store the data to the sum of the lengths of the keys: it was 0.62, meaning the data structure requires 0.62 bytes to store one byte of key. Lookup time in the structure is comparable to (or better than) a raw hash (I will admit the lookup is implemented in Inline::C/XS).
Note that this Trie implementation may not be optimal. Because it stores the keys as digit strings instead of byte-encoding the integers and then doing a byte-trie, it's probably less efficient than a trie implemented the latter way. You could probably calculate the storage requirements, but I have to admit I can't be bothered; however, I will hazard a guesstimate of around 2.5 MB. Also, this structure would be even faster to do lookups on (the issue of packing the integers before lookup aside).
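For illustration, a minimal pure-Perl digit trie might look like the following sketch. It shows only the lookup structure, not the compact C representation described above -- nested Perl hashes are far more memory-hungry than the real thing:

```perl
# A toy digit trie: each node is a hash keyed by digit characters,
# with an 'end' flag marking where a complete key terminates.
my %trie;

sub trie_insert {
    my $node = \%trie;
    $node = $node->{$_} ||= {} for split //, shift;   # walk/create digit path
    $node->{end} = 1;                                 # mark a complete key
}

sub trie_exists {
    my $node = \%trie;
    for my $d ( split //, shift ) {
        $node = $node->{$d} or return 0;              # dead end: not stored
    }
    return $node->{end} ? 1 : 0;                      # reject bare prefixes
}

trie_insert($_) for 123, 1234, 999;
print trie_exists(1234) ? "yes" : "no", "\n";   # yes
print trie_exists(12)   ? "yes" : "no", "\n";   # no (a prefix only)
```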
Also, these calculations are _purely_ about storing the keys, not about storing a key along with a value. Assuming that was required and the values to be stored were long ints, you could add 4 MB to both numbers (storing them as packed longs). This would mean the digit trie required 8.5 MB total and the byte trie would come in at 6.5 MB. (Actually, I'm not bothering to calculate the size of an SV here. Presumably you would need a small amount of RAM, say a very generous 1k, for the object that holds the two char arrays.)
Which actually brings up a different point. Instead of storing your million numbers in an array, you could store them in a packed string. This would mean a total memory requirement of 4 MB, and it would still facilitate binsearching. Finger points into the data might speed that search up dramatically.
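The packed-string idea can be sketched like so (packed_search is a hypothetical helper; the real million-number case would pack a sorted list the same way, giving 4 bytes per integer):

```perl
# Store sorted 32-bit integers in one packed scalar and binary-search
# over 4-byte records instead of holding a million Perl scalars.
my @sorted = ( 3, 17, 42, 100, 65536 );    # must already be sorted
my $packed = pack 'N*', @sorted;           # big-endian unsigned longs

sub packed_search {
    my ( $packed, $want ) = @_;
    my ( $lo, $hi ) = ( 0, length($packed) / 4 - 1 );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $val = unpack 'N', substr $packed, $mid * 4, 4;
        return 1 if $val == $want;
        ( $val < $want ) ? ( $lo = $mid + 1 ) : ( $hi = $mid - 1 );
    }
    return 0;
}

print packed_search( $packed, 42 ) ? "found" : "missing", "\n";   # found
print packed_search( $packed, 43 ) ? "found" : "missing", "\n";   # missing
```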
The discussion of Judy arrays led me to re-review the subject. Judy arrays are going to be slightly less efficient than a pure byte-trie in terms of storage. I think it's safe to say that for data packed this densely, the minimum bound for memory utilization (without bringing compression into the picture) is the numbers you get from a Trie. However, it's worth noting that this scenario is relatively unrealistic: for lower-density data sets, my understanding from what I've read on Judy arrays suggests they would be much more efficient.
I wanted to say that I found this thread thought-provoking and interesting, and I hope you post more along its lines. The subject of optimisation techniques is IMO useful and worth discussing. All the old caveats about optimizing are correct, but once you have identified an optimization target, having discussion and documentation of ways to handle it can be very useful.
PS: It's worth noting that, conceptually, a Trie is essentially a non-capturing DFA regex engine anchored at the beginning of the string. It's not difficult to extend a trie to handle Kleene closure (*) and one-or-more (+), and only slightly more difficult to make it handle unanchored searches as well. The advantage is that it's more efficient for matching many constant strings than the NFA regex engine that perl uses (although of course with serious deficiencies). I've long wished that perl handled this internally, allowing a modifier to cause a regex pattern to be treated as a DFA and not as an NFA, which would essentially mean putting a trie engine into perl.
It's weird that these things are related (who would have guessed there is a direct family-tree relationship between regular expressions and Judy arrays!). In fact, you could take the point further and argue that there is a relationship between linked lists (degenerate binary trees) and regular expressions! Woo-hoo! Talk about six degrees of separation. ;-)
Cheers,
demerphq
Update: Bart suggested breaking the paragraphs up, and pointed out some typos. Thanks Bart.
Hi demerphq.
Sorry for taking so long to get back to you but I had to find out what Tries and Patricia Tries were before I could understand your post and appreciate the effort that you had expended in researching it. Thanks for taking the idea seriously.
It turns out that I had actually implemented something similar a few years ago for a random testcase generator -- though I didn't know they were called Tries. The API set being tested consisted of around 1000 functions, grouped into several sub-APIs, with each function having a 3-character prefix (Prf..., Win..., Gpi..., Dev... etc.*) designating its group. The verbs and adjectives that made up the rest of the names formed a fairly small set, so by substituting a numeric value for each of these and the prefixes, and using these numeric values to build a tree, the storage requirements and the search/traversal times were hugely reduced. This was back in the days when 8 MB machines were big, 16 MB machines almost unheard of, and 20 MHz CPUs were top of the range. We needed to conserve memory, and use smart algorithms rather than brute force, in order that things would run in a reasonable time.
*They may be familiar to anyone who programmed OS/2.
I'll make no claims to be fully conversant with Tries (of any flavour) until I've coded one myself from scratch at least 3 or 4 times, but I think that I've understood enough to draw the following conclusions.
In order for Tries to be space-efficient, the domain of the keys must be fairly fully utilised. My example of asciized integers lends itself to this rather nicely, resulting in a uniform distribution of keys across the entire domain. However, if the characters making up the keys used the full 7- or 8-bit range of ASCII characters, then one million keys would likely result in a sparsely populated tree, and the amount of unused space in the tree would probably outweigh the used space by a fairly large factor. If the keys are Unicode, the problem gets much worse very quickly.
For example: if the keys consist of the 96 non-control characters in the 7-bit ASCII set, and we have 1 million 4-character keys, the domain space is potentially 84,934,656 keys, giving a utilisation of about 1/84th of the total domain space. And that's the problem with any data structure that uses fixed-size tables.
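The domain-space arithmetic can be checked directly:

```perl
# One million 4-character keys drawn from a 96-character alphabet.
my $domain = 96 ** 4;                        # 84,934,656 possible keys
my $keys   = 1_000_000;
printf "%d possible keys, roughly 1 in %d occupied\n",
       $domain, int( $domain / $keys );      # roughly 1 in 84
```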
I realise that most Trie implementations, in common with Judy arrays, use various more-or-less complicated compression mechanisms to avoid such wastage, but these impose their own limitations and burdens. Judy arrays use a very complicated 6-level compression scheme. The bit-twiddling, branching and special-casing required by these schemes means that you couldn't efficiently implement them at the Perl level. If I've understood them correctly, writing an implementation of Tries generalised enough to handle this, or configurable at runtime by passing a parameter, would be extremely difficult.
The nice thing about the mechanism I outlined is that it is quite easily tailored to different domains, by choosing how many (and which) characters of the keys are used for the initial indexing. In the version I currently have coded, this is an unpack format that is used to break the supplied key into 2 or 3 parts. Using cees's insight at Re: A (memory) poor man's <strike>hash</strike> lookup table. of removing the index from the key prior to storage reduces memory consumption further from my original figures. It also speeds up the linear-search part, as he predicted. For the example million integers, the tie statement looks like this:
tie my %hash, 'Tie::CompactHash', 'A0 A4 A*';
The 'A0' part indicates that there is no common prefix that can be factored out. The 'A4' part indicates that 4 characters (in this case digits) are used as the primary index, spreading the 1 million keys over 10,000 scalars (in a real hash). The final 'A*' indicates that the rest of the key is stored within the strings for linear searching. Also stored in the same string is an index into a second string indicating where the value for this key is stored. The values are stored using the 'n/A*' pack format, recording the length and the value. This allows both the keys and the values to be of variable length, as is the overall size of the hash. A special index value is used to indicate that a key has no value.
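The splitting such a mask performs can be seen with unpack alone (using the 'A0 A4 A*' mask from the example above; Tie::CompactHash itself is the poster's own module):

```perl
# 'A0' takes an empty common prefix, 'A4' the 4-character index used
# as the bucket key, and 'A*' the remainder stored inside the bucket.
my ( $pre, $index, $rest ) = unpack 'A0 A4 A*', '1234567';
print "pre='$pre' index='$index' rest='$rest'\n";
# pre='' index='1234' rest='567'
```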
For 1 million 4-character, 7-bit-ASCII keys, it would look like this:
tie my %hash, 'Tie::CompactHash', 'A0 A2 A*';
Here, again, there is no common prefix; the index uses 2 characters, mapping the million keys over 9216 (96²) strings, with the other two characters stored in the string along with the index into the values string.
And a final example based on the post I cited in my original post. The problem involved hashing log filenames, which I speculated are often named along the lines of XYZ20031201.log.
tie my %logs, 'Tie::CompactHash', 'A3 xx A4 A2 A*';
In this case,
- 'A3' is a prefix common to all the log files. This is extracted from the first key passed and stored internally. For subsequent keys, this value is verified against the stored value, but is otherwise discarded.
- 'xx' skips over and discards the century. For most applications, ignoring this will do no harm.
- 'A4' are the low 2 digits of the year, plus the month. For 10 years of logs, this maps to 120 strings.
- 'A2' is the day of month ensuring a maximum of 31 subkeys in each string.
- 'A*' is a common postfix which is discarded.
This assumes that the filenames matched will be preselected (using glob or readdir etc.) so that the ignored parts are not significant; this is done to show the flexibility of the mechanism. Internally, the format passed to tie is stored and then used to break up the key on input. This is efficient and flexible. For example, the exists function is implemented as
sub EXISTS {
my( $pre, $key, $post ) = unpack $mask{ $_[SELF] }, $_[KEY];
die "Prefix mismatch $pre ne $pre{ $_[SELF] }"
unless $pre eq $pre{ $_[SELF] };
return
defined $_[SELF]{ $key }
and $_[SELF]{ $key }[ 0 ] =~ m[(...\c@)\Q$post\E\xff]s;
}
This use of perl's standard hash mechanism to handle the sparse-array problem, and of regexes for the final linear search, means that once invoked they run at the full speed of a C solution, as both are built-in and heavily optimised.
I might try coding it in C or possibly assembler and see how it compares for performance, but the major limitation would still be the tie interface. I've also tried coding my perl version with an OO interface to see what, if any, performance difference that made -- it was much of a muchness -- but the syntax of the interface was much less convenient than the tied version. I wish the tie mechanism performed better; it lends itself to so many useful extensions of perl's built-in types that it's a shame the costs are so high.