You might consider attacking the problem from the other end. That is, develop a list of possible duplicate accounts and then check the complete list against it. I assume that you need to perform some processing on each record before inserting it even if it is not a duplicate, and probably some other processing if it might be a duplicate. Thus, most of this program's time will go to processing the records. A few additional parsing tools cost a little time up front but could simplify the program considerably.

For example, you could create a new file, possibledups.txt, by parsing the original file with awk to extract the account number and the open and close dates. Pipe this result to sort and uniq -d to get a (much smaller) list of possible duplicate accounts. Something like ...

gawk [some parse code] | sort | uniq -d > possibledups.txt
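
To make that concrete, here is a minimal sketch of the pipeline. It assumes a pipe-delimited master file whose first three fields are the account number, open date and close date; the file name master.txt, the field layout and the sample records are all made up for illustration, and plain awk is used where the line above says gawk.

```shell
# Hypothetical sample data: account|open_date|close_date|payload
cat > master.txt <<'EOF'
1001|2004-01-05|2004-06-30|first
1002|2004-02-11||only
1001|2004-01-05|2004-06-30|second
1003|2004-03-01|2004-03-15|a
1003|2004-03-01|2004-04-15|b
EOF

# Pull out the three key fields, sort so identical keys end up adjacent,
# then keep one copy of every key that occurs more than once.
awk -F'|' '{ print $1 "|" $2 "|" $3 }' master.txt \
  | sort \
  | uniq -d > possibledups.txt

cat possibledups.txt
# prints: 1001|2004-01-05|2004-06-30
```

Account 1003 is not reported here because its two records differ in close date; if the account number alone should define a duplicate, print only the first field.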

The processing script can then read this duplicate file into a hash first. Then, as the script reads each record from the master file, it can check that record's key against the hash of possible dups. That way, your code spends most of its processing effort on known unique records (which is probably simpler and faster than working on dups). In my experience, this approach often simplifies coding, since the exception condition (duplicates) can be moved off to another subroutine.
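
A rough sketch of that lookup step follows, with an awk associative array standing in for the hash so the example stays in the same tool as the pipeline; a Perl script would do the same thing with a hash. The file names, the field layout and the two output files (unique.txt, dups.txt) are assumptions for illustration, not anything from the original data.

```shell
# Hypothetical sample data: account|open_date|close_date|payload
cat > master.txt <<'EOF'
1001|2004-01-05|2004-06-30|first
1002|2004-02-11||only
1001|2004-01-05|2004-06-30|second
1003|2004-03-01|2004-03-15|a
1003|2004-03-01|2004-04-15|b
EOF

# Keys that the sort | uniq -d step would report for the data above.
cat > possibledups.txt <<'EOF'
1001|2004-01-05|2004-06-30
EOF

# Start with empty output files so both exist even if one gets no records.
: > unique.txt
: > dups.txt

# First file: load every possible-dup key into the array "dup".
# Second file: rebuild each record's key and route the record by lookup.
awk -F'|' '
  FILENAME == ARGV[1] { dup[$0] = 1; next }
  {
    key = $1 "|" $2 "|" $3
    if (key in dup)
      print > "dups.txt"      # exception path: needs the dup handling
    else
      print > "unique.txt"    # common path: known unique record
  }
' possibledups.txt master.txt

wc -l unique.txt dups.txt
```

Testing FILENAME against ARGV[1], rather than the usual NR == FNR idiom, keeps the split correct even when possibledups.txt turns out to be empty.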

P.S. As a test, I generated a random file containing 30 million records and processed it using this pipe set in about 9 minutes (mileage may vary).

PJ

In reply to Re: Bloom::Filter Usage by periapt
in thread Bloom::Filter Usage by jreades
