Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

constructing large hashes

by duelafn (Parson)
on Sep 30, 2002 at 19:49 UTC ( [id://201813]=perlquestion: print w/replies, xml ) Need Help??

duelafn has asked for the wisdom of the Perl Monks concerning the following question:

Good Day, I'm trying to construct a large hash and have run into a surprising discrepancy between testing/benchmarking results and actual runtime results. I'm creating a hash of 40000 elements (I would like to do more (up to 4000000) if feasible) and wanted to do so as quickly as possible, so I ran a test using Benckmark.
use Benchmark 'cmpthese'; use strict; my @x = map join(',', split(//, rand(10000))), 0..80000; my %y; cmpthese(-2, {'@y{@x}' => sub { undef %y; @y{@x} = undef }, '%y = map' => sub { undef %y; %y = map(($_ => undef), @x) } } ); print 0+%y,$/;
which handily told me that @y{@x}=undef; is the way to go! (notice the s/iter scores.) -- the print statement is to verify that there are at least 40000 elements in %y (not guaranteed since we use rand).
s/iter %y = map @y{@x} %y = map 2.45 -- -49% @y{@x} 1.25 97% --
When I try to use this in my real code, it takes much longer. (I'm still using strict and warnings)
# ...snip my ($n,$i,@n,@temp,%arrangements); # get size and create all permutations (comma separated) $n = 8; @n = (1..$n); $i = 0; $temp[&_fact($n)-1] = 1; # _fact := n! $|++; permute { $temp[$i++] = join(',',@n) } @n; print STDERR scalar(localtime),$/; @arrangements{@temp} = undef; print STDERR scalar(localtime),$/;
So, from my tests I expected to see this snippet run rather quickly since the permute function is very fast for a list of 8 elements (< 1 sec). At the very least I expected the output from the two localtimes to be close together, but instead I got:
computing 8-arrangements... Mon Sep 30 15:30:31 2002 Mon Sep 30 15:33:35 2002
When I watch the processes in "top", the test code quickly slurps up 70M of memory, but my production code only uses 6M (and does actually work correctly). I'm not sure what the problem is and have run out of ideas for how to isolate the problem. Any suggestions would be greatly appreciated!

Thanks,
    Dean

P.S. I am actually fairly certain that Perl is the proper solution fo this problem since the rest of the program runs very quickly (tested using DProf) and would be a nightmare to write in any other language.


If we didn't reinvent the wheel, we wouldn't have rollerblades.

Replies are listed 'Best First'.
Re: constructing large hashes
by jsprat (Curate) on Sep 30, 2002 at 21:18 UTC
    You may (or may not) find enlightenment here.

    Perl's hashing algorithm may be coming up with the same hash for each key. After the hash is constructed, try print scalar %arrangements; to see how many buckets are used/allocated for the hash.

    Update: Take a look at this:

    #!/usr/bin/perl use strict; use warnings; use Algorithm::FastPermute; #uncomment the following line and comment out the next to see the diff #my @array = (0 .. 7); my @array = (100_000 .. 100_007); my ($key, %perm); permute {$key = join '',@array;$perm{$key} = 1} @array; print scalar %perm;

    Using 0..7, the code runs very quickly and uses 3704 out of 8192 buckets for 40_320 values. However, using 100_000..100_007 it uses 4 out of 8 buckets (for the same number of values) and runs very slowly - effectively turning perl's fast hashing algorithm into an expensive linear search.

      I agree that determining that the set of keys is not perverse is good first step in understanding poor hash performance. We assume that you asked the more important question...is the hash is the most effective way to the solve your problem. After that, some degree of optimization can be obtained by pre-sizing the hash. Perl may be reorganizing the hash at one or more points in populating it.

      If you can calculate the number of keys Perl provides a way for you to reveal the number of keys you will insert. Use keys(%hash_name) = number to provide that estimate. You may want to use a number smaller or larger than the number of keys depending on how Perl uses this hint. (Read the source or experiment?)

        ++LEFant Cool! At first I didn't think pre-sizing the hash would make any difference in the number of buckets or how they were used - but it sure does.

      That's an interesting thought -- even though all the hash keys are different, there may be many collisions because the number of different characters is small. Does anybody know if perl's hash function is sensitive to collisions on permutations?

      For example, will things like "123", "213" and "312" hash far apart? If the character position is completely ignored, those will collide. I'm certain perl is not that bad, but 8 character permutations put a lot of stress on the hash function.

        I've used a hash with around 26 million key/value pairs in the past.
        The keys were incrementing digits, so there would have been many cases similar to your 123, 213, 312 scenario, but it didn't appear to cause any problems.

        That was Perl-5.6.1 on Solaris.

      Ah Ha! Of course. I guess that I should have thought of that. Anyway, after trying several different separators (',', ' ', '_', '-', '.', '0') and even trying padding with zeroes (01, 02, ..., 08), the only solution that works well is no separator which makes it more difficult when we move to permutations of 10 or more things. Although from blssu's results below it looks as though some of this was fixed in perl 5.8 (I'm still limping along at 5.6.1).

      Thanks!
          Dean


      If we didn't reinvent the wheel, we wouldn't have rollerblades.

        I had another thought on the way home from work. You don't need the temp array, just build the hash in the code block for permute!

        permute { $temp = join(',',@n); $arrangement{$temp} = undef;} @n; # create the hash here --^^^^

        That should save some memory - and on my system, about a minute. As far as the separators, maybe pack the permutation into the key?

        permute {  $arrangements{pack "C*", @n} =1 } @n;

        is almost instantaneous. Won't help with 10 or more...

        Ah Ha! Of course. I guess that I should have thought of that. Anyway, after trying several different separators (',', ' ', '_', '-', '.', '0') and even trying padding with zeroes (01, 02, ..., 08), the only solution that works well is no separator which makes it more difficult when we move to permutations of 10 or more things. Although from blssu's results below it looks as though some of this was fixed in perl 5.8 (I'm still limping along at 5.6.1).
        As I'm now (attempting to) learn C, maybe this is obvious to me because it's a common idiom in C: use letters instead. You can use the ord function to get a mapping between the letters and the numbers, and you can even normalize the set [a-z] -> [0-25] by subtracting ord("a"). Also, you wouldn't have to change much else (from what I can see, anyways). For instance:
        $n = 8; @n = (1..$n);
        becomes
        $n = ord("a")+8; @n = ("a"..$n);
        Of course, this all assumes that you don't do many explicit aritmetic calculations on your set of permutations. If you do, the repeated calls to ord and char my be prohibitive.

        thor

Re: constructing large hashes
by blssu (Pilgrim) on Sep 30, 2002 at 21:41 UTC

    Curious. Are you sure your benchmark and real code are running under identical conditions? The code you posted should have similar behavior. If you are under severe memory or cpu pressure from other processes, that would explain the very long real times. (The actual cpu time elapsed can be much shorter than real time.) If this is a piece of some other program (daemon), have you checked the nice level?

    The only other thing I can think of is your optimization to pre-expand @temp. (The call to _fact is just an optimization, yes?) If you do that incorrectly your @temp array may be much larger than it is supposed to be. Have you checked the size of @temp?

    I've made some assumptions by looking at your code, and here is a complete example:

    use strict; use warnings; use Algorithm::Permute qw(permute); my ($n,$i,@n,@temp,%arrangements); $n = 8; @n = (1..$n); $i = 0; $temp[&_fact($n)-1] = 1; # _fact := n! $|++; permute { $temp[$i++] = join(',',@n) } @n; print STDERR scalar(localtime),$/; @arrangements{@temp} = undef; print STDERR scalar(localtime),$/; sub _fact { my($n) = @_; my $t = 1; while ($n > 1) { $t *= $n; --$n; } $t }

    Running on an Athlon 700, 256MB RAM with Linux 2.2, I see this output:

    red.vulpes.com:~% perl test.pl Mon Sep 30 16:49:39 2002 Mon Sep 30 16:49:40 2002 red.vulpes.com:~% perl -v This is perl, v5.8.0 built for i686-linux-thread-multi ...

Re: constructing large hashes
by Aristotle (Chancellor) on Sep 30, 2002 at 23:20 UTC
    The 70mb it slurps is due to the @x array alone; I added a getc() to be able to check. I also added a third method - with interesting results.
    #!/usr/bin/perl -w use strict; use Benchmark qw(cmpthese); my @x = map join(',', split(//, rand(10000))), 0..80000; my %y; print "Check memory now.", $/; getc(); cmpthese(20, { map => sub { undef %y; %y = map(($_ => undef), @x) } for => sub { undef %y; @y{$_} = undef for @x; }, slice => sub { undef %y; @y{@x} = undef }, }); print scalar %y,$/; __END__ Check memory now. Benchmark: timing 20 iterations of for, map, slice... for: 16 wallclock secs (15.47 usr + 0.27 sys = 15.74 CPU) @ 1 +.27/s (n=20) map: 28 wallclock secs (26.67 usr + 0.30 sys = 26.97 CPU) @ 0 +.74/s (n=20) slice: 14 wallclock secs (13.54 usr + 0.24 sys = 13.78 CPU) @ 1 +.45/s (n=20) Rate map for slice map 0.742/s -- -42% -49% for 1.27/s 71% -- -12% slice 1.45/s 96% 14% -- 59853
    Update: had mixed up the for and slice labels. for doesn't win.

    Makeshifts last the longest.

      Hmmm, after hearing the responses above, I bet that the memory differences come from both a) floats are larger than short strings and b) there is more structure involved in this hash, which is what is causing the faster lookup times.

      Also, you might want to correct the labels in your benchmarks, lest we be led astray!

      Thanks!
          Dean


      If we didn't reinvent the wheel, we wouldn't have rollerblades.

        What the.. I must have been half asleep already. Sorry. Fixed it, and now I'm off to bed. Thanks for the wrist slap.

        Makeshifts last the longest.

Big Picture
by PhiRatE (Monk) on Sep 30, 2002 at 23:12 UTC
    I have a simple question, what on earth are you doing that could require a 4 million entry hash? the collisions on that in a 32bit space are not going to be pretty, and the build time will be silly. Unless you're doing a shitload of lookups its not going to be more efficient than calculating the entries on the fly, especially if your system is low on ram.

    More details on the actual problem would help :)

      Why, I'm classifying codimension two hyperplane arrangements in C2. What else would I be doing?

      Basically, I need to create all of the permutations of 1 to n, then for each one, apply a set of three rules which will give me a list of different permutations which shall be deemed "equivalent" to the current one. I give each of these a common tag and then move on.

      I do perform quite a few lookups to determine whether I have already applied the set of rules to a particular permutation through one of its equivalent representations. This is important since each of the rules can give me many equivalent permutations.

      At the end I perform one final clean up since sets of equivalent permutations may have been labeled with different tags. I make note of these occurrances during the first pass through though, so the final cleanup is trivial.

      Thanks for asking! (but I bet you're sorry you did)
          Dean

      Update: Sorry, we're in C2 not C4


      If we didn't reinvent the wheel, we wouldn't have rollerblades.

        Thanks for asking! (but I bet you're sorry you did)

        You're not wrong there :) :)

        Its situations like this where MathML or equivalent would be really nice, since you could just give me the forumlae :)

        In essence, what I understood was the following:

        For every permutation 1->n check every permutation 1->n to see if it is equivalent according to the rules. If it is, mark it as such.

        Results: a set of groups of permutations who are deemed equivalent.

        Are the rules transitive?

        My head generated psuedocode something like this (no efficiency or anything, just a general algorithm):

        foreach $p_1 (permutation(1..n)) { if ($tags{$p_1}) { next; } $tags{$p_1} = $p_1; foreach $p_2 (permutation(1..n)) { if (rule1($p_1, $p_2) && rule2($p_1,$p_2) && rule3($p_1, $p_2)) + { $tags{$p_2} = $p_1; } } }

        But that only works if the rules are both transitive and symmetric (you hint that they are but I may be reading it wrong)

        I would be interested in the contents of the rules 1->3 if they're not going to make my head ache :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://201813]
Approved by Aristotle
Front-paged by Aristotle
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-04-16 22:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found