
A few tips that haven't been mentioned yet (and this list shouldn't be considered complete either).

First, you didn't say whether your data is ordered. If it happens to be ordered by either field, then the dup check on that field takes hardly any effort. Just keep the last item found; that is all you need to compare against to see whether the next item is a duplicate.
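As a rough sketch of that idea, assuming the records are already in memory as array refs and sorted by the field being checked (the name `dedup_sorted` and the record layout are made up for illustration, not taken from the original question):

```perl
use strict;
use warnings;

# Drop duplicates from records that are already sorted by one field.
# Assumes each record is an array ref and the key is a string in column $col.
sub dedup_sorted {
    my ($records, $col) = @_;
    my @kept;
    my $last;    # key of the last record we kept
    for my $rec (@$records) {
        # Sorted input means a duplicate can only match the previous key.
        next if defined $last && $rec->[$col] eq $last;
        push @kept, $rec;
        $last = $rec->[$col];
    }
    return \@kept;
}
```

Only one scalar is remembered between records, so memory use stays flat no matter how large the input is. When reading from a file you would apply the same comparison line by line instead of building `@kept`.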

Of course, if the data is not ordered, this will not be as good a solution. You should still consider it, though. For instance, if the files are semi-ordered, that is, about 5-10% of the records are out of place but the rest are in the right order, then you can still use the same routine. Instead of just remembering the last field, you treat it as a sort of water mark: if you come across a value that is lower, it gets set to ++$water_mark.
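A minimal sketch of the water mark, assuming numeric serials and assuming that a value equal to the mark should also be renumbered (the post only says "lower"; treating "equal" as a dup is my reading, and `apply_water_mark` is a hypothetical name):

```perl
use strict;
use warnings;

# Return the serials with any out-of-order or repeated value replaced
# by the next number above the current water mark.
sub apply_water_mark {
    my @serials = @_;
    my $water_mark;
    my @out;
    for my $serial (@serials) {
        if (!defined $water_mark || $serial > $water_mark) {
            $water_mark = $serial;      # in order: keep it and raise the mark
        }
        else {
            $serial = ++$water_mark;    # too low or repeated: renumber it
        }
        push @out, $serial;
    }
    return @out;
}
```

The output is strictly increasing, so it can never contain a dup. The price is that the mis-ordered records get new serials, and a later in-order record can be renumbered too if a made-up serial has already taken its value.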

Also, if the files are completely unordered, you may want to simply sort them beforehand. That up-front cost can easily be worth it compared to the memory cost of your hash.

Another, much simpler, method is to get rid of the serial numbers in the file completely and just start at 0 or 1 for the first record and count up. This only works if you don't care about your serial numbers changing each time you run the program. It's good because you can then use the previously mentioned idea of sorting by phone number to do your dup check on phone numbers, and avoid the dup check on serials by simply making up your own.
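Something along these lines, assuming each record is `[ $serial, $phone ]` and small enough to sort in memory (for really big files you would sort on disk first; the sub name is invented for the example):

```perl
use strict;
use warnings;

# Sort by phone number, drop repeated phones, and hand out fresh serials.
# The serial stored in the input record is ignored on purpose.
sub renumber_unique_phones {
    my @records = @_;
    my @out;
    my $last_phone;
    my $next_serial = 1;
    for my $rec (sort { $a->[1] cmp $b->[1] } @records) {
        next if defined $last_phone && $rec->[1] eq $last_phone;
        $last_phone = $rec->[1];
        push @out, [ $next_serial++, $rec->[1] ];
    }
    return @out;
}
```

Because the serials come from a counter, they are unique by construction and need no check at all.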

Like I said, many caveats. But depending on what you are doing, these can really speed things up.

Oh, one other thing: if you have multiple files and you sort beforehand, you can keep your files separate, but you'll need to open all of the files at once so that you can always read from the one with the current lowest value.
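Here is one way that could look, assuming every handle is already open and its lines are sorted as plain strings (so the whole line is the key; with real records you would compare just the field you sorted on). `next_line` and `merge_sorted` are names made up for this sketch:

```perl
use strict;
use warnings;

# Read one line from a handle, without its newline; undef at end of file.
sub next_line {
    my ($fh) = @_;
    my $line = readline($fh);
    chomp $line if defined $line;
    return $line;
}

# Merge several sorted inputs into one sorted list, skipping duplicates.
sub merge_sorted {
    my @fhs  = @_;
    my @head = map { next_line($_) } @fhs;    # current line of each handle
    my (@out, $last);
    while (1) {
        # Pick the handle whose current line is the lowest.
        my $min;
        for my $i (0 .. $#head) {
            next unless defined $head[$i];    # this handle is exhausted
            $min = $i if !defined $min || $head[$i] lt $head[$min];
        }
        last unless defined $min;             # all handles are exhausted
        my $line = $head[$min];
        $head[$min] = next_line($fhs[$min]);  # advance only that handle
        next if defined $last && $line eq $last;
        push @out, $line;
        $last = $line;
    }
    return @out;
}
```

Only one line per file is held at a time, which is the point of keeping the files separate. With a handful of files the linear scan for the lowest line is fine; with many files a heap would be the usual replacement.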

Ciao,
Gryn

In reply to Somethings not mentioned yet by gryng
in thread Efficiency and Large Arrays by Kozz
