Re: Bloom::Filter Usage

You might consider attacking the problem from the other end. That is, develop a list of possible duplicate accounts and then checking the complete list against it. I assume that you need to perform some processing on the record before inserting it even if it is not duplicate and probably some other processing if is might be a duplicate. Thus, most of the time on this program will be in processing the records. Using some additional parsing tools might represent a small cost in time to simplify the program considerably.

For example, you could create a new file, possibledups.txt, by parsing the original file using awk to get the account number, open and close dates. Pipe this result to sort and unique to get a (much smaller) list of possible duplicate accounts. Something like ...

gawk [some parse code] | sort | uniq -d > possibledups.txt

The processing script then, can read this duplicate file into a hash first. Then, as the script reads each record from the master file, it can compare those results against the hash of possible dups. That way, your code is spending most of its processing effort working on known unique records (which is probably simpler and faster than working on dups). In my experience, this approach can often simplify coding since the exception condition (duplicates) can be moved off to another subroutine.

ps. As a test, I generated a random file containing 30 million records and processed it using this pipe set in about 9 minutes (milage may vary).

PJ

Comment on Re: Bloom::Filter Usage Download Code

Replies are listed 'Best First'.
Re: Re: Bloom::Filter Usage by jreades (Friar) on Apr 22, 2004 at 14:44 UTC
This is exactly the approach that I'm going to have to take -- and it fits rather well with the logic of the process (I'm dealing with an ETL (Extract, Transform, Load) system and there are multiple stages to each job). FamousLongAgo very kindly sent me the updated v 0.2 code to try out and it definitely works, but unfortunately the way that I'm trying to use it doesn't because the number of duplicates is very low and the population very large. This helps me to end up with with a massive bitmap (431329181 bits) and ten hashing functions. If I knew more about hashing functions I might have been able to come up with a way to accelerate the hashing function by (as was suggested by others in this thread) optimising it for numeric keys of a preset size (12 bytes). As it stood, however, the level of overhead reduced the filter to the point where it took five to ten seconds per key! I really like the approach of using `uniq -d` and can only wish that it had occurred to me a couple of days ago since I would have managed to skip the banging of hand to head that just happened. There's enough memory and swap space to support sorting on 30 million records (this machine even has SyncSort installed). Thank you everyone for your helpful tips and suggestions.	[reply] [d/l]
Re: Re: Re: Bloom::Filter Usage by BrowserUk (Patriarch) on Apr 22, 2004 at 16:51 UTC
Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn. If it would help, here is a badly named and almost undocumented module that may see light of day sometime. It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of ram. The performance isn't too bad at an average of 40 usecs per lookup in my test application below. Read more... (5 kB) My testcase. Read more... (1042 Bytes) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]


Don't ask to ask, just ask
	PerlMonks