constructing large hashes

duelafn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: constructing large hashes by jsprat (Curate) on Sep 30, 2002 at 21:18 UTC
You may (or may not) find enlightenment here. Perl's hashing algorithm may be coming up with the same hash for each key. After the hash is constructed, try `print scalar %arrangements;` to see how many buckets are used/allocated for the hash. Update: Take a look at this: `#!/usr/bin/perl use strict; use warnings; use Algorithm::FastPermute; #uncomment the following line and comment out the next to see the diff #my @array = (0 .. 7); my @array = (100_000 .. 100_007); my ($key, %perm); permute {$key = join '',@array;$perm{$key} = 1} @array; print scalar %perm;` [download] Using 0..7, the code runs very quickly and uses 3704 out of 8192 buckets for 40_320 values. However, using 100_000..100_007 it uses 4 out of 8 buckets (for the same number of values) and runs very slowly - effectively turning perl's fast hashing algorithm into an expensive linear search.	[reply] [d/l] [select]
Re: Re: constructing large hashes by LEFant (Scribe) on Oct 01, 2002 at 01:36 UTC
I agree that determining that the set of keys is not perverse is good first step in understanding poor hash performance. We assume that you asked the more important question...is the hash is the most effective way to the solve your problem. After that, some degree of optimization can be obtained by pre-sizing the hash. Perl may be reorganizing the hash at one or more points in populating it. If you can calculate the number of keys Perl provides a way for you to reveal the number of keys you will insert. Use keys(%hash_name) = number to provide that estimate. You may want to use a number smaller or larger than the number of keys depending on how Perl uses this hint. (Read the source or experiment?)	[reply]
Re: Re: Re: constructing large hashes by jsprat (Curate) on Oct 01, 2002 at 01:50 UTC
++LEFant Cool! At first I didn't think pre-sizing the hash would make any difference in the number of buckets or how they were used - but it sure does.	[reply]
Re: Re: constructing large hashes by blssu (Pilgrim) on Sep 30, 2002 at 21:55 UTC
That's an interesting thought -- even though all the hash keys are different, there may be many collisions because the number of different characters is small. Does anybody know if perl's hash function is sensitive to collisions on permutations? For example, will things like "123", "213" and "312" hash far apart? If the character position is completely ignored, those will collide. I'm certain perl is not that bad, but 8 character permutations put a lot of stress on the hash function.	[reply]
Re: Re: Re: constructing large hashes by BazB (Priest) on Sep 30, 2002 at 22:03 UTC
I've used a hash with around 26 million key/value pairs in the past. The keys were incrementing digits, so there would have been many cases similar to your 123, 213, 312 scenario, but it didn't appear to cause any problems. That was Perl-5.6.1 on Solaris.	[reply]
Re: Re: constructing large hashes by duelafn (Parson) on Oct 01, 2002 at 00:20 UTC
Ah Ha! Of course. I guess that I should have thought of that. Anyway, after trying several different separators (',', ' ', '_', '-', '.', '0') and even trying padding with zeroes (01, 02, ..., 08), the only solution that works well is no separator which makes it more difficult when we move to permutations of 10 or more things. Although from blssu's results below it looks as though some of this was fixed in perl 5.8 (I'm still limping along at 5.6.1). Thanks! Dean If we didn't reinvent the wheel, we wouldn't have rollerblades.	[reply]
Re: Re: Re: constructing large hashes by jsprat (Curate) on Oct 01, 2002 at 01:08 UTC
I had another thought on the way home from work. You don't need the temp array, just build the hash in the code block for permute! `permute { $temp = join(',',@n); $arrangement{$temp} = undef;} @n; # create the hash here --^^^^` [download] That should save some memory - and on my system, about a minute. As far as the separators, maybe pack the permutation into the key? ~~`permute { $arrangements{pack "C*", @n} =1 } @n;`~~ ~~is almost instantaneous.~~ Won't help with 10 or more...	[reply] [d/l] [select]
Re: Re: Re: constructing large hashes by thor (Priest) on Oct 02, 2002 at 11:23 UTC
Ah Ha! Of course. I guess that I should have thought of that. Anyway, after trying several different separators (',', ' ', '_', '-', '.', '0') and even trying padding with zeroes (01, 02, ..., 08), the only solution that works well is no separator which makes it more difficult when we move to permutations of 10 or more things. Although from blssu's results below it looks as though some of this was fixed in perl 5.8 (I'm still limping along at 5.6.1). As I'm now (attempting to) learn C, maybe this is obvious to me because it's a common idiom in C: use letters instead. You can use the `ord` function to get a mapping between the letters and the numbers, and you can even normalize the set `[a-z] -> [0-25]` by subtracting `ord("a")`. Also, you wouldn't have to change much else (from what I can see, anyways). For instance: `$n = 8; @n = (1..$n);` [download] becomes `$n = ord("a")+8; @n = ("a"..$n);` [download] Of course, this all assumes that you don't do many explicit aritmetic calculations on your set of permutations. If you do, the repeated calls to `ord` and `char` my be prohibitive. thor	[reply] [d/l] [select]
Re: constructing large hashes by blssu (Pilgrim) on Sep 30, 2002 at 21:41 UTC
Curious. Are you sure your benchmark and real code are running under identical conditions? The code you posted should have similar behavior. If you are under severe memory or cpu pressure from other processes, that would explain the very long real times. (The actual cpu time elapsed can be much shorter than real time.) If this is a piece of some other program (daemon), have you checked the nice level? The only other thing I can think of is your optimization to pre-expand @temp. (The call to _fact is just an optimization, yes?) If you do that incorrectly your @temp array may be much larger than it is supposed to be. Have you checked the size of @temp? I've made some assumptions by looking at your code, and here is a complete example: `use strict; use warnings; use Algorithm::Permute qw(permute); my ($n,$i,@n,@temp,%arrangements); $n = 8; @n = (1..$n); $i = 0; $temp[&_fact($n)-1] = 1; # _fact := n! $\|++; permute { $temp[$i++] = join(',',@n) } @n; print STDERR scalar(localtime),$/; @arrangements{@temp} = undef; print STDERR scalar(localtime),$/; sub _fact { my($n) = @_; my $t = 1; while ($n > 1) { $t *= $n; --$n; } $t }` [download] Running on an Athlon 700, 256MB RAM with Linux 2.2, I see this output: `red.vulpes.com:~% perl test.pl Mon Sep 30 16:49:39 2002 Mon Sep 30 16:49:40 2002 red.vulpes.com:~% perl -v This is perl, v5.8.0 built for i686-linux-thread-multi ...` [download]	[reply] [d/l] [select]
Re: constructing large hashes by Aristotle (Chancellor) on Sep 30, 2002 at 23:20 UTC
The 70mb it slurps is due to the `@x` array alone; I added a `getc()` to be able to check. I also added a third method - ~~with interesting results~~. #!/usr/bin/perl -w use strict; use Benchmark qw(cmpthese); my @x = map join(',', split(//, rand(10000))), 0..80000; my %y; print "Check memory now.", $/; getc(); cmpthese(20, { map => sub { undef %y; %y = map(($_ => undef), @x) } for => sub { undef %y; @y{$_} = undef for @x; }, slice => sub { undef %y; @y{@x} = undef }, }); print scalar %y,$/; __END__ Check memory now. Benchmark: timing 20 iterations of for, map, slice... for: 16 wallclock secs (15.47 usr + 0.27 sys = 15.74 CPU) @ 1 +.27/s (n=20) map: 28 wallclock secs (26.67 usr + 0.30 sys = 26.97 CPU) @ 0 +.74/s (n=20) slice: 14 wallclock secs (13.54 usr + 0.24 sys = 13.78 CPU) @ 1 +.45/s (n=20) Rate map for slice map 0.742/s -- -42% -49% for 1.27/s 71% -- -12% slice 1.45/s 96% 14% -- 59853 [download] Update: had mixed up the `for` and `slice` labels. `for` doesn't win. Makeshifts last the longest.	[reply] [d/l]
Re: Re: constructing large hashes by duelafn (Parson) on Oct 01, 2002 at 00:26 UTC
Hmmm, after hearing the responses above, I bet that the memory differences come from both a) floats are larger than short strings and b) there is more structure involved in this hash, which is what is causing the faster lookup times. Also, you might want to correct the labels in your benchmarks, lest we be led astray! Thanks! Dean If we didn't reinvent the wheel, we wouldn't have rollerblades.	[reply]
Re^3: constructing large hashes by Aristotle (Chancellor) on Oct 01, 2002 at 00:35 UTC
What the.. I must have been half asleep already. Sorry. Fixed it, and now I'm off to bed. Thanks for the wrist slap. Makeshifts last the longest.	[reply]
Big Picture by PhiRatE (Monk) on Sep 30, 2002 at 23:12 UTC
I have a simple question, what on earth are you doing that could require a 4 million entry hash? the collisions on that in a 32bit space are not going to be pretty, and the build time will be silly. Unless you're doing a shitload of lookups its not going to be more efficient than calculating the entries on the fly, especially if your system is low on ram. More details on the actual problem would help :)	[reply]
Re: Big Picture by duelafn (Parson) on Oct 01, 2002 at 00:39 UTC
Why, I'm classifying codimension two hyperplane arrangements in C². What else would I be doing? Basically, I need to create all of the permutations of 1 to n, then for each one, apply a set of three rules which will give me a list of different permutations which shall be deemed "equivalent" to the current one. I give each of these a common tag and then move on. I do perform quite a few lookups to determine whether I have already applied the set of rules to a particular permutation through one of its equivalent representations. This is important since each of the rules can give me many equivalent permutations. At the end I perform one final clean up since sets of equivalent permutations may have been labeled with different tags. I make note of these occurrances during the first pass through though, so the final cleanup is trivial. Thanks for asking! (but I bet you're sorry you did) Dean Update: Sorry, we're in C² not C⁴ If we didn't reinvent the wheel, we wouldn't have rollerblades.	[reply]
Re: Re: Big Picture by PhiRatE (Monk) on Oct 01, 2002 at 23:11 UTC
Thanks for asking! (but I bet you're sorry you did) You're not wrong there :) :) Its situations like this where MathML or equivalent would be really nice, since you could just give me the forumlae :) In essence, what I understood was the following: For every permutation 1->n check every permutation 1->n to see if it is equivalent according to the rules. If it is, mark it as such. Results: a set of groups of permutations who are deemed equivalent. Are the rules transitive? My head generated psuedocode something like this (no efficiency or anything, just a general algorithm): `foreach $p_1 (permutation(1..n)) { if ($tags{$p_1}) { next; } $tags{$p_1} = $p_1; foreach $p_2 (permutation(1..n)) { if (rule1($p_1, $p_2) && rule2($p_1,$p_2) && rule3($p_1, $p_2)) + { $tags{$p_2} = $p_1; } } }` [download] But that only works if the rules are both transitive and symmetric (you hint that they are but I may be reading it wrong) I would be interested in the contents of the rules 1->3 if they're not going to make my head ache :)	[reply] [d/l]
Re: Re: Re: Big Picture by duelafn (Parson) on Oct 02, 2002 at 13:18 UTC
Re: Re: Re: Re: Big Picture by petral (Curate) on Oct 03, 2002 at 20:21 UTC


Syntactic Confectionery Delight
	PerlMonks