PerlMonks
Re: Bloom::Filter Usage
by periapt (Hermit) on Apr 21, 2004 at 17:15 UTC ( [id://347067] )
You might consider attacking the problem from the other end: develop a list of possible duplicate accounts first, then check the complete file against it. I assume you need to perform some processing on each record before inserting it even if it is not a duplicate, and probably some additional processing if it might be a duplicate, so most of this program's time will be spent processing records. Using a few extra parsing tools can cost a small amount of time while simplifying the program considerably.

For example, you could create a new file, possibledups.txt, by parsing the original file with awk to extract the account number and the open and close dates. Pipe that result through sort and uniq to get a (much smaller) list of possible duplicate accounts. Something like

    gawk [some parse code] | sort | uniq -d > possibledups.txt

The processing script can then read this duplicate file into a hash first. As it reads each record from the master file, it compares the record against the hash of possible dups. That way, your code spends most of its processing effort on known unique records (which is probably simpler and faster than working on dups). In my experience, this approach can often simplify the coding, since the exception condition (duplicates) can be moved off into its own subroutine.

ps. As a test, I generated a random file containing 30 million records and processed it through this pipe set in about 9 minutes (mileage may vary).

PJ
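As a concrete sketch of the pipeline, here is one possible shape for the elided parse step. The file name and field layout (pipe-delimited, account number in field 1) are assumptions for illustration only; the original post does not specify them, and portable awk is used in place of gawk:

```shell
# Demo input under an assumed layout: account|open-date|status
printf 'A100|2004-01-01|open\nA101|2004-01-02|open\nA100|2004-02-01|closed\n' > accounts.txt

# Extract the account number, sort, and keep only values that occur
# more than once (uniq -d prints duplicated lines only).
awk -F'|' '{ print $1 }' accounts.txt | sort | uniq -d > possibledups.txt

cat possibledups.txt   # A100
```

Note that `uniq -d` requires sorted input, which is why the sort stage sits between awk and uniq.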
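The hash-lookup pass described above might look like the following in Perl. This is a sketch under stated assumptions: the file names, the pipe-delimited layout with the account number in the first field, and the demo data are all hypothetical, and the real script would call its own per-record subroutines where the counters are incremented here:

```perl
use strict;
use warnings;

# Demo data (hypothetical layout): master file is pipe-delimited with the
# account number in field 1; possibledups.txt holds one account per line.
open my $fh, '>', 'accounts.txt' or die $!;
print $fh "A100|2004-01-01\nA101|2004-01-02\nA100|2004-02-01\n";
close $fh;
open $fh, '>', 'possibledups.txt' or die $!;
print $fh "A100\n";
close $fh;

# Load the (much smaller) duplicate list into a hash first.
open my $dups, '<', 'possibledups.txt' or die $!;
chomp( my @ids = <$dups> );
close $dups;
my %maybe_dup;
@maybe_dup{@ids} = (1) x @ids;

# Stream the master file; route each record by hash membership, so the
# common case (known unique) stays on the fast, simple path.
my ( $n_unique, $n_dup ) = ( 0, 0 );
open my $master, '<', 'accounts.txt' or die $!;
while ( my $line = <$master> ) {
    chomp $line;
    my ($acct) = split /\|/, $line;
    $maybe_dup{$acct} ? $n_dup++ : $n_unique++;
}
close $master;
print "unique=$n_unique possible_dups=$n_dup\n";   # unique=1 possible_dups=2
```

In a real run the two branches would dispatch to separate subroutines, keeping the duplicate-handling exception path out of the main loop.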
In Section: Seekers of Perl Wisdom