Optimizing Iterating over a giant hash

by oron (Novice)
on Dec 25, 2009 at 22:24 UTC ( [id://814383]=perlquestion )

oron has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a VERY big hash I've built with the following structure:
$my_hash{$entry}{$timeStamp}{$uid} = number;
Basically, each (entry, timeStamp) pair should contain a list of users and a number for each user.
I've built this to (naturally) gather information that was not grouped the way I wanted.
The problem is that I later need to print the entire hash so that for each (entry, timestamp) pair I print the list of user x number in the same row, and this takes me a LONG time (hours - the computer is still working).
I should mention that I can't use the "keys" keyword for iterating (keys %my_hash) because that leads to an "out of memory!" message (big hash...).
What I've done so far is this (it works but takes a long time):

my $temp;
while (($entry, $temp) = each %my_hash) {
    foreach $timeStamp (sort keys %$temp) {
        print $Hentry "$entry\t$timeStamp\t", scalar(keys %{$my_hash{$entry}{$timeStamp}});
        foreach my $id (sort keys %{$temp->{$timeStamp}}) {
            print $Hentry "\t$id\t", $my_hash{$entry}{$timeStamp}{$id};
        }
        print $Hentry "\n";
    }
}

That's the only thing that finally let me iterate without using too much memory.

So what I'm looking for is either a better way to store that data or (preferably) a good way to iterate through the hash.
Any suggestions?
Thanks :)

PS
I've tried using DBM::Deep to save on memory, but it just takes too long to fill the hash this way (I have millions of entries...).

Replies are listed 'Best First'.
Re: Optimizing Iterating over a giant hash
by jethro (Monsignor) on Dec 25, 2009 at 23:47 UTC

    You might use a divide and conquer method.

    Store the data sequentially into files, one file for each different $entry value. If that number is too big (bigger than the number of files you can have open at any one time), you might group the entry values with a suitable scheme.

    The difference from DBM::Deep is that you write to these files sequentially; it should be much faster because of buffering.

    After that you can reread and work through the files one by one. Hopefully a single file will now fit into memory without having to resort to swapping (which is probably what makes your application slow at the moment).
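    A minimal sketch of that idea, assuming the raw input arrives as one tab-separated "entry, timestamp, uid, number" record per line on STDIN (the file names and input format here are illustrative, not something from the original post):

    # Sketch only: bucket the raw records into one file per $entry while
    # reading, then post-process one bucket at a time.
    use strict;
    use warnings;

    my %bucket_fh;    # one open append handle per entry value

    while (my $line = <STDIN>) {
        chomp $line;
        my ($entry, $timeStamp, $uid, $number) = split /\t/, $line;
        my $fh = $bucket_fh{$entry} ||= do {
            open my $h, '>>', "bucket_$entry.tmp" or die "open bucket_$entry.tmp: $!";
            $h;
        };
        print {$fh} "$timeStamp\t$uid\t$number\n";    # sequential, buffered write
    }
    close $_ for values %bucket_fh;

    # Second pass: each bucket should fit in memory on its own.
    for my $file (glob 'bucket_*.tmp') {
        my %per_entry;
        open my $in, '<', $file or die "open $file: $!";
        while (<$in>) {
            chomp;
            my ($timeStamp, $uid, $number) = split /\t/;
            $per_entry{$timeStamp}{$uid} = $number;
        }
        close $in;
        # ... print this entry's rows exactly as in the original loop ...
    }

    If there are more distinct entry values than the process can keep open at once, hash the entry into a fixed number of buckets instead of one file per entry, as suggested above.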

Re: Optimizing Iterating over a giant hash
by kyle (Abbot) on Dec 26, 2009 at 05:27 UTC

    This might be a little better.

    my $temp;
    while (($entry, $temp) = each %my_hash) {
        foreach $timeStamp (sort keys %$temp) {
            my @tskeys = sort keys %{$temp->{$timeStamp}};
            print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
            print $Hentry "\t$_\t", $my_hash{$entry}{$timeStamp}{$_} for @tskeys;
            print $Hentry "\n";
        }
    }

    What I've done here is factor out the repeated call to keys on %{$temp->{$timeStamp}} and turn the foreach my $id loop into the statement modifier form. The lack of a block might be faster.

      Fully agree with eliminating the multiple calls to keys.

      I'm also a big believer in reducing nested hash references. For the small price of assigning another variable, you can eliminate the cost of several steps of hash key lookups, and improve readability as a bonus.

      my $temp;
      while (($entry, $temp) = each %my_hash) {
          foreach $timeStamp (sort keys %$temp) {
              my $hash_ref = $temp->{$timeStamp};
              my @tskeys   = sort keys %$hash_ref;
              print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
              print $Hentry "\t$_\t", $hash_ref->{$_} for @tskeys;
              print $Hentry "\n";
          }
      }
Re: Optimizing Iterating over a giant hash
by biohisham (Priest) on Dec 26, 2009 at 01:45 UTC
    Also check Suffering from Buffering; you might want to set $| to a nonzero value, which might alleviate the 'out of memory!', and, just like jethro suggested, consider divide and conquer...


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
      you might want to set $| to a nonzero, this can alleviate the 'out of memory!'

      How does that work?

        When a program's output is directed to the terminal, the filehandle associated with the output is in line-buffered mode, but it is block-buffered otherwise (when the output is directed to a filehandle other than the terminal STDOUT)... Buffering cannot be switched off entirely, but setting the output-autoflush variable to a non-zero value makes each print flush the buffer immediately instead of waiting for the buffer to fill before writing to the file associated with the filehandle.

        My understanding is that 'out of memory!' would appear when the buffer is not autoflushed in such cases and there is a large amount of data. Now, if I am mistaken or cloudy on this one (I believe I am, somehow), you're welcome to go ahead and correct/clarify, for I am basically from a non-IT background...
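
        For reference, a minimal way to turn autoflush on for an output filehandle looks like this (the $Hentry name is borrowed from the original code; the file name is made up):

        use strict;
        use warnings;
        use IO::Handle;    # provides the autoflush() method

        open my $Hentry, '>', 'entries.out' or die "open entries.out: $!";
        $Hentry->autoflush(1);    # flush after every print on this handle

        # Equivalent older idiom using the special variable directly:
        # select((select($Hentry), $| = 1)[0]);

        Whether that helps with memory is the open question here; autoflush itself only affects when buffered output is written, not how much memory the data structures use.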


        Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: Optimizing Iterating over a giant hash
by GrandFather (Saint) on Dec 28, 2009 at 21:21 UTC
Re: Optimizing Iterating over a giant hash
by oron (Novice) on Dec 30, 2009 at 16:58 UTC
    OK, problem solved :)
    First of all, thanks for all the answers.

    What I eventually did was some version of divide and conquer. When reading the initial data, instead of filling the hash, I wrote it out in a more suitable form to a new file, which was then sorted (shamefully - with the Linux sort utility), and then I could process the lines that started out the same (same entry + timestamp) and output them. This also gave me a sorted result.
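    In outline, the second pass over the sorted file can look something like this sketch (the file names and exact record layout are illustrative; it assumes one tab-separated "entry, timestamp, uid, number" record per line):

    use strict;
    use warnings;

    # Pass 1 (while gathering the data): write one flat record per observation,
    # e.g.  print $raw "$entry\t$timeStamp\t$uid\t$number\n";
    # then sort it outside Perl:  sort raw.tsv > sorted.tsv

    open my $in,  '<', 'sorted.tsv' or die "open sorted.tsv: $!";
    open my $out, '>', 'report.tsv' or die "open report.tsv: $!";

    my ($prev_key, @pairs);
    while (<$in>) {
        chomp;
        my ($entry, $ts, $uid, $n) = split /\t/;
        my $key = "$entry\t$ts";
        if (defined $prev_key and $key ne $prev_key) {
            flush_group($out, $prev_key, \@pairs);
            @pairs = ();
        }
        $prev_key = $key;
        push @pairs, [$uid, $n];
    }
    flush_group($out, $prev_key, \@pairs) if defined $prev_key;

    sub flush_group {
        my ($fh, $key, $pairs) = @_;
        print {$fh} $key, "\t", scalar @$pairs,
                    (map { "\t$_->[0]\t$_->[1]" } @$pairs), "\n";
    }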

    I liked the idea of separating the internal hash into a list - this actually might decrease lookups and not run out of memory for those lists, which are relatively short.

    I did not use a database because I was under the impression that I need an SQL server (for example) to be running, and I don't have one. Am I wrong? This could be useful...

      I strongly recommend you have a play with SQLite (DBD::SQLite) to dip your toe into the database waters. It is completely stand-alone, even to the extent that the DBD 'driver' includes the database engine. It is ideal for the sort of application this thread discusses (although I'm not recommending you re-engineer your current solution). Having database tools in your toolbox is pretty likely to be useful to you, to the extent that having a bit of a play now is likely to pay dividends in the longer term.
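
      A bare-bones illustration of what that can look like (the table, column, and file names here are invented for the example):

      use strict;
      use warnings;
      use DBI;

      # SQLite lives in a single file; no server process is needed.
      my $dbh = DBI->connect('dbi:SQLite:dbname=counts.db', '', '',
                             { RaiseError => 1, AutoCommit => 0 });

      $dbh->do(q{
          CREATE TABLE IF NOT EXISTS counts (
              entry     TEXT,
              timestamp TEXT,
              uid       TEXT,
              n         INTEGER
          )
      });

      my $ins = $dbh->prepare(
          'INSERT INTO counts (entry, timestamp, uid, n) VALUES (?, ?, ?, ?)');
      $ins->execute('some_entry', '2009-12-25 22:24', 'user1', 42);   # sample row
      $dbh->commit;    # batching commits keeps bulk inserts fast

      # Pull the rows back already grouped and ordered for the report:
      my $rows = $dbh->selectall_arrayref(
          'SELECT entry, timestamp, uid, n FROM counts ORDER BY entry, timestamp, uid');
      print join("\t", @$_), "\n" for @$rows;

      $dbh->disconnect;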

      True laziness is hard work
      There is, on the one hand, a core module called DB_File, which enables you to, simply put, keep a hash on disk, but every access to the hash is then a filesystem I/O operation (which means it is slow). There is another module on CPAN, called BerkeleyDB, but its documentation has many TODOs in it.
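
      For illustration, tying a hash to a disk file with DB_File looks roughly like this (the file name is made up; note that only a flat key => value mapping is stored, so the three-level keys would have to be flattened or the values serialized):

      use strict;
      use warnings;
      use DB_File;
      use Fcntl qw(O_RDWR O_CREAT);

      my %on_disk;
      tie %on_disk, 'DB_File', 'counts.dbm', O_RDWR | O_CREAT, 0666, $DB_HASH
          or die "Cannot tie counts.dbm: $!";

      # Flatten the three-level key into one string:
      $on_disk{"some_entry\t2009-12-25 22:24\tuser1"} = 42;

      untie %on_disk;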
