Hi,
Improvements are always welcome! Respect if you can read through that huge file of messy code! :)
I did make a new version with some comments in the code; should I upload that one? It might be a tiny bit clearer.
The test datasets also come with config files that are ready to use (feel free to ask if you have any additional questions).
Those test datasets are very small, so they run fast, but most users will have very large datasets (and therefore large hashes).
Loading all the data (which can be around 600 GB of raw data) into the hashes is relatively slow, but I am not sure much improvement is possible there.
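For illustration only, here is a minimal sketch of one trick that sometimes helps with huge Perl hashes: packing numeric fields into a compact binary string instead of storing an array ref per key. The file name, tab-separated layout, and field names below are made up, not the tool's real input format:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Hypothetical sketch, not the real loader: store a compact packed
    # string per key instead of an array ref, which cuts per-entry
    # memory and can speed up loading huge files.
    my %data;
    open my $fh, '<', 'input.tsv' or die "input.tsv: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $id, $pos, $count ) = split /\t/, $line;
        # 'NN' = two unsigned 32-bit ints in 8 bytes; an array ref
        # [ $pos, $count ] would cost considerably more per entry.
        $data{$id} = pack 'NN', $pos, $count;
    }
    close $fh;

    # Unpack on demand when a record is needed:
    for my $id ( keys %data ) {
        my ( $pos, $count ) = unpack 'NN', $data{$id};
        printf "%s: pos=%u count=%u\n", $id, $pos, $count;
        last;    # just demonstrating one lookup
    }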
A huge improvement would be to parallelise the code after the hashes are loaded. I tried a few approaches, but they either slowed the process down or were impossible because they would duplicate the hash (a similar problem as before).
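For what it's worth, here is a minimal sketch of the fork-after-load pattern (assuming Parallel::ForkManager from CPAN and toy stand-in data, not the tool's real code). On Linux, forked children share the parent's pages copy-on-write, so the hash is not duplicated at fork time, although Perl's reference counting writes to a value's memory when it is read, so long-lived children still accumulate private copies slowly:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Parallel::ForkManager;    # CPAN module, not core

    # Toy stand-in data; the real hash would be loaded from disk first.
    my %big_hash;
    $big_hash{"key$_"} = $_ * 10 for 1 .. 1_000;

    my @keys    = sort keys %big_hash;
    my $workers = 4;
    my $pm      = Parallel::ForkManager->new($workers);

    for my $w ( 0 .. $workers - 1 ) {
        $pm->start and next;    # parent spawns a child, continues the loop
        # Child: process an interleaved slice of the keys, read-only.
        for ( my $i = $w ; $i < @keys ; $i += $workers ) {
            my $v = $big_hash{ $keys[$i] };
            # ... per-key work would go here ...
        }
        $pm->finish;            # child exits here
    }
    $pm->wait_all_children;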
I don't know anybody else who knows Perl, so I am the only one who has looked at the code; a second pair of eyes is always welcome, and if you see something that would greatly improve the speed or memory efficiency, I can add you to the next paper. To improve what the tool actually does, I think you need a genetics background.
Greets