Re: Optimizing Iterating over a giant hash
by jethro (Monsignor) on Dec 25, 2009 at 23:47 UTC
You might use a divide and conquer method.
Store the data sequentially into files, one file for each distinct $entry value. If that number is too big (bigger than the number of files you can have open at any one time), you might group the entry values with a suitable scheme.
The difference from DBM::Deep is that you write to these files sequentially; that should be much faster because of buffering.
After that you can reread and work through the files one by one. With luck a single file will now fit into memory without the process having to resort to swapping (which is probably what makes your application slow at the moment).
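A minimal sketch of this bucketing idea (the record layout and bucket names are just illustrations; real entry values might need sanitizing before use in a filename):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a scratch directory so the sketch cleans up after itself.
my $dir = tempdir( CLEANUP => 1 );

# Hypothetical input: tab-separated records whose first field is $entry.
my @records = ( "alpha\t10\tx\n", "beta\t20\ty\n", "alpha\t30\tz\n" );

# One append-mode handle per entry, opened lazily and cached.
my %fh_for;
for my $line (@records) {
    my ($entry) = split /\t/, $line, 2;
    $fh_for{$entry} //= do {
        open my $fh, '>>', "$dir/bucket_$entry.txt"
            or die "open bucket for $entry: $!";
        $fh;
    };
    print { $fh_for{$entry} } $line;    # sequential, buffered write
}
close $_ for values %fh_for;

# Second pass: each bucket is now small enough to process on its own.
open my $in, '<', "$dir/bucket_alpha.txt" or die $!;
my @alpha = <$in>;
close $in;
print scalar(@alpha), " records for alpha\n";    # prints "2 records for alpha"
```

If there are more distinct entries than available file descriptors, the bucket name can be derived from a grouping function over $entry instead of $entry itself.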
Re: Optimizing Iterating over a giant hash
by kyle (Abbot) on Dec 26, 2009 at 05:27 UTC
my $temp;
while ( ($entry, $temp) = each %my_hash ) {
    foreach $timeStamp ( sort keys %$temp ) {
        my @tskeys = sort keys %{ $temp->{$timeStamp} };
        print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
        print $Hentry "\t$_\t", $my_hash{$entry}{$timeStamp}{$_} for @tskeys;
        print $Hentry "\n";
    }
}
What I've done here is factor out the repeated call to keys on %{$temp->{$timeStamp}} and turned the foreach my $id loop into the statement modifier form. The lack of a block might be faster.
Fully agree with eliminating the repeated calls to keys.
I'm also a big believer in reducing nested hash references. For the small price of assigning another variable, you can eliminate the cost of several steps of hash key lookups, and improve readability as a bonus.
my $temp;
while ( ($entry, $temp) = each %my_hash ) {
    foreach $timeStamp ( sort keys %$temp ) {
        my $hash_ref = $temp->{$timeStamp};
        my @tskeys   = sort keys %$hash_ref;
        print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
        print $Hentry "\t$_\t", $hash_ref->{$_} for @tskeys;
        print $Hentry "\n";
    }
}
Re: Optimizing Iterating over a giant hash
by biohisham (Priest) on Dec 26, 2009 at 01:45 UTC
you might want to set $| to a nonzero value; this can alleviate the 'out of memory!'
How does that work?
When the program output is directed to the terminal, the associated filehandle is in line-buffered mode; otherwise (when output goes to a file rather than the terminal) it is block-buffered. Buffering cannot be switched off entirely, but setting the output-autoflush variable $| to a non-zero value makes each print flush the buffer immediately instead of waiting for it to fill.
My understanding is that 'out of memory!' can appear when the buffer is not autoflushed and there is a large amount of data. If I am mistaken or cloudy on this one (I believe I am, somehow), you're welcome to go ahead and correct/clarify, for I am basically from a non-IT background...
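For reference, both spellings of autoflush look like this (a minimal sketch; the temporary file is only an illustration):

```perl
use strict;
use warnings;
use IO::Handle;                  # provides the autoflush method on handles
use File::Temp qw(tempfile);

# Old-style: $| applies to the currently selected handle (STDOUT by default).
$| = 1;

# Per-handle form via IO::Handle.
my ( $fh, $filename ) = tempfile();
$fh->autoflush(1);               # every print is flushed immediately
print $fh "flushed immediately\n";

# Because of autoflush, the data is on disk before the handle is closed.
open my $check, '<', $filename or die "reopen: $!";
my $line = <$check>;
print $line;                     # prints "flushed immediately"
close $check;
close $fh;
```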
Excellence is an Endeavor of Persistence.
Chance Favors a Prepared Mind.
Re: Optimizing Iterating over a giant hash
by GrandFather (Saint) on Dec 28, 2009 at 21:21 UTC
Re: Optimizing Iterating over a giant hash
by oron (Novice) on Dec 30, 2009 at 16:58 UTC
OK, problem solved :)
First of all, thanks for all the answers.
What I eventually did was a version of divide and conquer.
When reading the initial data, instead of filling the hash I wrote it out in a form more suitable for a new file, which was then sorted (shamefully, with the Linux sort utility). I could then process the lines that started out the same (same entry + timestamp) together and output them. This also gave me a sorted result.
I liked the idea of separating the internal hash into a list - this might actually decrease lookups and avoid running out of memory, since those lists are relatively short.
I did not use a database because I was under the impression that I need an SQL server (for example) to be running, and I don't have one. Am I wrong? This could be useful...
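The "sort, then scan runs of equal keys" step described above can be sketched like this (the flattened line format entry\ttimestamp\tid\tvalue is an assumption; the inline data stands in for the output of the external sort):

```perl
use strict;
use warnings;

# Hypothetical pre-sorted lines, as the Linux sort utility would emit them.
my @sorted_lines = sort (
    "a\t1\tid1\tv1\n",
    "a\t1\tid2\tv2\n",
    "a\t2\tid3\tv3\n",
    "b\t1\tid4\tv4\n",
);

my @summary;                               # one "entry\tts\tcount" per run
my ( $prev_key, @run );
for my $line ( @sorted_lines, undef ) {    # undef sentinel flushes last run
    my $key;
    if ( defined $line ) {
        my ( $entry, $ts ) = split /\t/, $line;
        $key = "$entry\t$ts";
    }
    if ( @run && ( !defined $key || $key ne $prev_key ) ) {
        push @summary, "$prev_key\t" . scalar @run;   # process finished run
        @run = ();
    }
    last unless defined $line;
    $prev_key = $key;
    push @run, $line;
}
print "$_\n" for @summary;
```

Only one run is ever in memory at a time, which is the whole point of sorting first.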
|
I strongly recommend you have a play with SQLite (DBD::SQLite) to dip your toe into the database waters. It is completely stand-alone, even to the extent that the DBD 'driver' includes the database engine. It is ideal for the sort of application this thread discusses (although I'm not recommending you re-engineer your current solution). Having database tools in your toolbox is pretty likely to be useful to you, to the extent that having a bit of a play now is likely to pay dividends in the longer term.
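A small taste of what that looks like (assumes DBI and DBD::SQLite are installed from CPAN; no server process is involved, the database is a plain file or, here, in-memory; table and column names are invented for the example):

```perl
use strict;
use warnings;
use DBI;

# Connect to an in-memory SQLite database; use dbname=some_file.db to persist.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE events (entry TEXT, ts INTEGER, id TEXT, val TEXT)');

my $ins = $dbh->prepare('INSERT INTO events VALUES (?, ?, ?, ?)');
$ins->execute(@$_) for ( [ 'a', 1, 'id1', 'v1' ], [ 'a', 1, 'id2', 'v2' ] );

# The database does the grouping that the hand-rolled hash walk did.
my ($count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM events WHERE entry = ? AND ts = ?',
    undef, 'a', 1 );
print "$count rows for (a, 1)\n";    # prints "2 rows for (a, 1)"
$dbh->disconnect;
```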
True laziness is hard work
There is, on the one hand, a core module called DB_File, which enables you to, simply speaking, put a hash on disk; but every access to the hash is then a filesystem I/O operation (which means it is slow).
There is another module on CPAN, called BerkeleyDB, but its documentation has many TODOs in it.