Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


I have a basic conceptual question about Perl's implementation of hashes. First let me give the problem background.

I have two tables in my database, a{id, acc} and b{build, acc}. I almost always need {id, build} tuples, and because my tables are fairly large (roughly 5 million rows each) I would like to create a c{id, build} table. Note that an acc is not necessarily found in both a and b -- each table contains some not found in the other.

A straight SQL join statement is running horribly slow on my DB and our DBA can offer no suggestions for tuning.

As a work-around, I was thinking of retrieving all of table b into a hash: $data{acc} = build. The underlying hash relationship of one build per acc is guaranteed to be valid.

So, the question is, how can I determine how much memory this will take? Naively I would figure that it would require bytes(acc) + bytes(build) per record, which in my case is 60. At that point I could just do a (large) number of hash lookups, which should be relatively quick. I would write the results to a file, then bulk-load the data and build indexes.
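For what it's worth, the plan described above is essentially a client-side hash join. Here is a minimal sketch of the lookup step, with small in-memory rows standing in for the two DBI fetch loops (column names are taken from the question; the sample data is made up):

```perl
use strict;
use warnings;

# Hash join: build a lookup table from b's rows, then probe it with a's rows.
# Rows are [id, acc] for table a and [acc, build] for table b.
sub hash_join {
    my ($a_rows, $b_rows) = @_;
    my %build_for = map { $_->[0] => $_->[1] } @$b_rows;    # acc => build
    return [ map  { [ $_->[0], $build_for{ $_->[1] } ] }   # emit [id, build]
             grep { exists $build_for{ $_->[1] } }         # drop unmatched accs
             @$a_rows ];
}

my $c = hash_join(
    [ [ 1, 'accX' ], [ 2, 'accY' ], [ 3, 'accZ' ] ],   # table a: [id, acc]
    [ [ 'accX', 'b34' ], [ 'accY', 'b33' ] ],          # table b: [acc, build]
);
printf "%d\t%s\n", @$_ for @$c;
```

In the real script the two array refs would be replaced by fetchrow_array loops over `SELECT id, acc FROM a` and `SELECT acc, build FROM b`, and the print would go to the bulk-load file.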

Only... 60 bytes for 5 million records comes to about 300 MB. That's easily doable, but it doesn't take into account overhead.

Aside from the caveat that doing this in the DB properly ought to be faster, is my approach reasonable and will memory usage be anywhere close to 300 MB (I have about 2 GB available for this)?

Replies are listed 'Best First'.
Re: Space Efficiency of Hashes
by Stevie-O (Friar) on Mar 16, 2005 at 00:52 UTC
    Well, nothing beats a good TIAS with Devel::Size in the mix:
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Devel::Size qw(size total_size);

    my @printables = map chr, 33 .. 126;

    # random string 30 characters long
    sub randstr { join '', map { $printables[rand(@printables)] } 1..30 }

    sub commas {
        my $str = '' . reverse shift;
        return scalar reverse join ',', grep length, split /(.{3})/, $str;
    }

    my %hash = ( randstr() => randstr() );
    for (0..5) {
        my $count = scalar keys %hash;
        print commas($count), " elements: ", commas(total_size(\%hash)), " bytes\n";
        for (1 .. 9 * $count) {
            my $s = randstr;
            $hash{$s} = randstr;
        }
    }
    my $count = scalar keys %hash;
    print commas($count), " elements: ", commas(total_size(\%hash)), " bytes\n";
    These are the results it printed out on my machine, for v5.8.4:
    1 elements: 176 bytes
    10 elements: 1,171 bytes
    100 elements: 11,249 bytes
    1,000 elements: 111,133 bytes
    10,000 elements: 1,135,573 bytes
    100,000 elements: 11,224,325 bytes
    1,000,000 elements: 111,194,341 bytes
    (That last one took several minutes to complete on my system...)

    Taking my last result and multiplying by five, it seems that 5 million {acc, build} pairs will take only slightly more than half a gig of RAM to store all those hash entries.

    So, if your machine has 2GB of RAM -- you should be good to go.

    $"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc

      Thanks (and to Joost as well): this indicates that I'll max out under 2 GB of RAM, leaving plenty free for other stuff running. I didn't think of just going ahead and testing overhead like that, but it seems a really obvious way to approach it in retrospect! Thanks for the enlightenment.

        You can also use Tie::SubstrHash, and the maximum memory required for your program will be around 300 MB.
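Tie::SubstrHash (a core module) packs fixed-length keys and values into one flat string, which is why it can get close to the bare 60 bytes/record estimate -- but every key and value must be exactly the declared length. A minimal sketch, with made-up accession/build widths:

```perl
use strict;
use warnings;
use Tie::SubstrHash;

# Illustrative fixed widths: 20-byte keys, 10-byte values, room for 11 entries
# (for the real data the table size would be 5_000_000).
my ($KLEN, $VLEN, $MAX) = (20, 10, 11);
tie my %data, 'Tie::SubstrHash', $KLEN, $VLEN, $MAX;

# Keys and values MUST be exactly $KLEN and $VLEN bytes, so pad with spaces.
my $acc   = sprintf "%-${KLEN}s", 'NM_000546';
my $build = sprintf "%-${VLEN}s", 'hg17';
$data{$acc} = $build;

(my $got = $data{$acc}) =~ s/\s+\z//;   # strip the padding back off
print "$got\n";
```

The padding step is the price of the compact layout: the tie croaks on any key or value of the wrong length.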
Re: Space Efficiency of Hashes
by Joost (Canon) on Mar 16, 2005 at 00:32 UTC
    This all depends on how much you are exactly storing in your hash tables, but this code should give you a hint, just run it and watch top or some other resource monitor:
    #!perl -w
    use strict;
    $|=1;
    my (%a,%b,%c);
    for (0 .. 500_000) { # set higher if you dare
        $a{$_} = "x" x 60; # 60 character string
        $b{$_} = "x" x 60;
        $c{$_} = "x" x 60;
        print "\rrow $_" unless $_ % 10000;
    }

    Note: this goes up to about 190 MB for 500_000 iterations on my machine. YMMV.

Re: Space Efficiency of Hashes
by naChoZ (Curate) on Mar 16, 2005 at 04:21 UTC

    There have been some good answers already, but this sounds like something DBM::Deep could handle very well.
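For reference, DBM::Deep (a CPAN module) exposes a disk-backed file through an ordinary hash interface, so lookups cost disk reads instead of RAM. A minimal sketch, assuming the module is installed (the file name is made up):

```perl
use strict;
use warnings;
use DBM::Deep;   # CPAN module: a pure-Perl, disk-backed hash

# The file name is illustrative; DBM::Deep creates it on first use.
my $db = DBM::Deep->new('acc2build.db');

$db->{NM_000546} = 'hg17';      # written to disk, not held in RAM
print $db->{NM_000546}, "\n";   # read back through the same hash interface
```

The trade-off versus an in-memory hash is lookup speed for memory: 5 million entries would fit easily, but each probe touches the file.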

    "This alcoholism thing, I think it's just clever propaganda produced by people who want you to buy more bottled water." -- pedestrianwolf

      Wow - that is a cool looking Perl module. I'll have to try it out!
Re: Space Efficiency of Hashes
by jhourcle (Prior) on Mar 16, 2005 at 03:26 UTC
    A straight SQL join statement is running horribly slow on my DB and our DBA can offer no suggestions for tuning.

    Five million rows isn't that large, if the system is properly sized for it, and the database is tuned correctly. (which may be big 'if's).

    I know nothing about your data, or what database you're using, so it's really hard to give specific help. (For instance, you can override Oracle's join behavior by giving hints.) Even across databases, exactly how you write your SQL can result in a sort-merge join vs. a correlated join... In some situations indexes can hurt you, and you'll want to make sure the tables are properly analyzed by the database so it knows when those times are. EXPLAIN PLAN in Oracle and EXPLAIN in MySQL are your friends.

    (I'm not going to say that your DBA hasn't already done this ... but I've worked with a few DBAs who weren't much more than an operator on a mainframe. Your database might not be tuned for this sort of operation, but that's no reason that you shouldn't be able to perform this in SQL, with a little prep work.)

      Oh I totally agree. This code is being ported from an Oracle server running on a dual-processor PC to DB2 running on a 20+ processor, 16 GB machine and yet performance has degraded incredibly. The table definitions and indices are identical, but I don't have any access privileges on DB2 to tune things so... this is a definite second choice until they get their act together.

        Let me guess ... and it's the same (Oracle) DBA who is helping tune your DB2 system. Different RDBMS, different magnitude of hardware ... same approach? If I were talking to the DBA, I would suggest asking IBM for help in tuning the migration. If I were the DBA, I would have tried to get that tuning in before plunking down the money. However, as neither case applies, the best you likely could do is suggest to the DBA that s/he talks to IBM, and suggest it in such a way that it sounds like the DBA's idea.

Re: Space Efficiency of Hashes
by tlm (Prior) on Mar 16, 2005 at 00:19 UTC

    I don't have a ready answer, but if I were confronting the same question, I'd run a smaller mock-up (with 1000 keys, say), and figure out a multiplier based on that.

    Actually, more truthfully/realistically, I'd probably just go with the naive approach, and if I discovered that I'm in deep doodoo (in the form of a script that slows down to a crawl), I'd think harder about it.

    With something in the range you're dealing with, I've often found myself needing to use tied hashes, just to keep Perl from spending 95% of the time swapping.
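A tied hash of the sort mentioned here can be as simple as the core AnyDBM_File module, which selects whichever DBM library Perl was built with and keeps the hash on disk. A sketch (the file name is illustrative):

```perl
use strict;
use warnings;
use Fcntl;
use AnyDBM_File;   # core module; picks the first available DBM backend

# Tie the hash to a disk file so the data doesn't have to fit in RAM.
tie my %data, 'AnyDBM_File', 'acc2build', O_RDWR | O_CREAT, 0644
    or die "Cannot tie acc2build: $!";

$data{NM_000546} = 'hg17';      # stored on disk by the DBM layer
print $data{NM_000546}, "\n";   # each lookup goes through the tie

untie %data;
```

Every access is a disk operation, so it's slower per lookup than a plain hash, but the process's memory footprint stays small regardless of table size.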

    the lowliest monk