jreades has asked for the wisdom of the Perl Monks concerning the following question:
I have a large file (30 million records) containing a two-field key that is supposed to be unique. Unfortunately, it isn't necessarily that way... let me try to explain:
- The file comes to me with these fields (amongst others):
- Account Number
- Account Open Date
- Account Close Date
- Account numbers can be reused, but only if the account has been closed. Unless you have data issues. Which I do. :(
- So I need a quick way to check while processing (amongst up to 30 million unique keys) whether I have seen this key before (i.e. whether it's a dupe that needs checking).
- The normal hash lookup method $cache{$account} works (a minimal version is sketched after this list), but with ever-decreasing performance and an ever-increasing memory profile (it got up to 1.4GB of memory usage).
- So a Perl monger suggested looking at Bloom filters. More on Bloom filters is available here: Perl.com
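For reference, the hash-based check looks roughly like the sketch below. The pipe-delimited layout, the first-field position of the account number, and the file name are placeholders for illustration only, not the real record format:

use strict;
use warnings;

# Roughly what the plain-hash check does: one hash entry per distinct key,
# so memory grows with the number of unique account numbers.
my $path = 'accounts.dat';                    # placeholder file name
open my $fh, '<', $path or die "Cannot open $path: $!";

my %cache;
while ( my $line = <$fh> ) {
    chomp $line;
    my ($account) = split /\|/, $line;        # assumes the account number is the first field
    if ( $cache{$account}++ ) {
        # seen before: flag as a possible dupe for checking
    }
    else {
        # first occurrence of this account number
    }
}
close $fh;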
I've found the Bloom::Filter module on CPAN but can't get it to work, and I'm also worried about the level of false positives I'm facing.
My current code is:
my $bloom_filter = Bloom::Filter->new(
    error_rate => 0.001,
    capacity   => 30000000,
);

if ( $bloom_filter->check($account_number) ) {
    # ... do deduping ...
}
else {
    $bloom_filter->add($account_number);
    # ... do something ...
}
I'm looking for wisdom on two fronts:
- My current code just warns about a lack of salts for the filter, but I can't work out for the life of me what this module expects as a salt: the documentation is... minimal... and everything I've tried passing in makes my script malfunction in spectacular ways.
- The false-positive rate is inversely related to the memory footprint of the filter, but I'm having trouble with the calculations that would let me find the right trade-off (a back-of-the-envelope sizing is sketched below).
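For the sizing side, here is a back-of-the-envelope calculation using the standard textbook Bloom filter formulas (m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions); this is not read out of Bloom::Filter's internals, just the general theory:

use strict;
use warnings;
use POSIX qw(ceil);

# Standard Bloom filter sizing:
#   bits            m = -n * ln(p) / (ln 2)^2
#   hash functions  k = (m / n) * ln 2
my $n = 30_000_000;   # expected number of keys
my $p = 0.001;        # acceptable false-positive rate

my $m = ceil( -$n * log($p) / ( log(2) ** 2 ) );
my $k = ceil( ( $m / $n ) * log(2) );

printf "%d bits (about %.0f MB), %d hash functions\n",
    $m, $m / 8 / 1024 / 1024, $k;
# For n = 30 million and p = 0.001 this works out to roughly 431 million
# bits, i.e. around 51 MB of filter, with about 10 hash functions.

If those numbers are right, the filter would fit comfortably in memory compared to the 1.4GB hash, which is presumably the point of the suggestion.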
Thanks