Hi,
Improvements are always welcome! Respect if you can read through that huge file of messy code! :)
I did make a new version with some comments in the code; should I upload that one? It might be a tiny bit clearer.
The test datasets also come with config files that are ready to use (feel free to ask if you have any additional questions).
Those test datasets are very small, so they run fast, but most users will have very large datasets (and therefore large hashes).
Loading all the data (which can be around 600 GB of raw data) into the hashes is relatively slow, but I am not sure much improvement is possible there.
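For illustration only, here is a minimal sketch of one trick that sometimes helps with huge Perl hashes: packing numeric fields into a compact binary string instead of storing an array ref per key. The file name, tab-separated layout, and field names below are made up, not the tool's real input format:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Hypothetical sketch, not the real loader: store a compact packed
    # string per key instead of an array ref, which cuts per-entry
    # memory and can speed up loading huge files.
    my %data;
    open my $fh, '<', 'input.tsv' or die "input.tsv: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $id, $pos, $count ) = split /\t/, $line;
        # 'NN' = two unsigned 32-bit ints in 8 bytes; an array ref
        # [ $pos, $count ] would cost considerably more per entry.
        $data{$id} = pack 'NN', $pos, $count;
    }
    close $fh;

    # Unpack on demand when a record is needed:
    for my $id ( keys %data ) {
        my ( $pos, $count ) = unpack 'NN', $data{$id};
        printf "%s: pos=%u count=%u\n", $id, $pos, $count;
        last;    # just demonstrating one lookup
    }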
A huge improvement would be to parallelise the code after the hashes are loaded. I tried a few approaches, but they either slowed the process down or were impossible because they would duplicate the hash (a similar problem as before).
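For what it's worth, here is a minimal sketch of the fork-after-load pattern (assuming Parallel::ForkManager from CPAN and toy stand-in data, not the tool's real code). On Linux, forked children share the parent's pages copy-on-write, so the hash is not duplicated at fork time, although Perl's reference counting writes to a value's memory when it is read, so long-lived children still accumulate private copies slowly:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Parallel::ForkManager;    # CPAN module, not core

    # Toy stand-in data; the real hash would be loaded from disk first.
    my %big_hash;
    $big_hash{"key$_"} = $_ * 10 for 1 .. 1_000;

    my @keys    = sort keys %big_hash;
    my $workers = 4;
    my $pm      = Parallel::ForkManager->new($workers);

    for my $w ( 0 .. $workers - 1 ) {
        $pm->start and next;    # parent spawns a child, continues the loop
        # Child: process an interleaved slice of the keys, read-only.
        for ( my $i = $w ; $i < @keys ; $i += $workers ) {
            my $v = $big_hash{ $keys[$i] };
            # ... per-key work would go here ...
        }
        $pm->finish;            # child exits here
    }
    $pm->wait_all_children;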
I don't know anybody else who knows Perl, so I am the only one who has looked at the code; a second pair of eyes is always welcome, and if you see something that would greatly improve the speed or memory efficiency, I can add you to the next paper. To improve what the tool actually does, I think you need a genetics background.
Greets