PerlMonks  

End of the Time Out Tireny

by Anonymous Monk
on Feb 05, 2004 at 17:51 UTC ( [id://326823]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: End of the Time Out Tireny
by matthewb (Curate) on Feb 05, 2004 at 18:16 UTC

    Not wishing to rain on your parade, but what happens if one of the duplicate-pairs you are trying to eliminate is split over two of the files you created?

    I think the advice you were given the first time you brought this up was probably closer to the right track.

    Best of luck with it anyway. Do you mean `tyranny'?

    MB
Re: End of the Time Out Tireny
by iburrell (Chaplain) on Feb 05, 2004 at 23:45 UTC
    How are you processing the list so that 100 addresses take 1 second to find duplicates? If you use a hash, Perl should find duplicates in 100 lines in milliseconds. With the whole file, you might have to worry about memory usage. As long as you aren't swapping, the processing should be very fast.
        my %seen;
        while (my $line = <$fh>) {
            chomp($line);
            print "$line\n" unless $seen{$line}++;
        }
      Actually it does not take 1 second: I give the refresh a one-second pause between each page. I should have been clearer. I am definitely not writing a SPAM program; I'm not a fan of SPAMMERs. Thanks for the enlightenment. TIURIC
Re: End of the Time Out Tireny
by flyingmoose (Priest) on Feb 05, 2004 at 20:34 UTC
    You wouldn't be writing a More Agreeable SPAMBot would you? No, that's silly...they don't care about duplicate SPAM messages :)

    I have a bad habit of doing this (that is, dodging the question), but for you this is the time to use a relational (SQL) database and force the email address field to be UNIQUE, or else do a "SELECT DISTINCT * FROM ..." or something like that.

    Your flatfile technique is bound to be slower: 1200 sec / 60 (sec/min) = 20 minutes. A database could fly through this sort of thing, and then you would have a system that would be a better base for future tools.

    Postgres and MySQL are both great free DB servers available for most platforms. I tend to prefer Postgres because of historical reservations about MySQL (most no longer valid); both are rapidly improving.
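    To make the suggestion above concrete, here is a minimal sketch of the UNIQUE-constraint approach using an in-memory SQLite database via DBI. This assumes DBD::SQLite is installed; the table and column names are hypothetical, and the same idea works with Postgres or MySQL (where the duplicate-skipping syntax differs, e.g. ON CONFLICT DO NOTHING).

        use strict;
        use warnings;
        use DBI;

        # In-memory SQLite handle; a real app would point at a file or server.
        my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                               { RaiseError => 1, AutoCommit => 1 });

        # The UNIQUE constraint makes the database itself reject duplicates.
        $dbh->do(q{
            CREATE TABLE addresses (
                email TEXT NOT NULL UNIQUE
            )
        });

        # INSERT OR IGNORE (SQLite syntax) silently skips any row that
        # would violate the UNIQUE constraint, so no dedup pass is needed.
        my $sth = $dbh->prepare(
            "INSERT OR IGNORE INTO addresses (email) VALUES (?)");
        $sth->execute($_) for qw(a@example.com b@example.com a@example.com);

        my ($count) = $dbh->selectrow_array(
            "SELECT COUNT(*) FROM addresses");
        print "$count\n";   # prints 2: the duplicate was never stored

    The point of pushing uniqueness into the schema is that every future tool that writes to the table gets deduplication for free, instead of each script re-implementing the hash trick.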

Node Type: perlquestion [id://326823]
Approved by neuroball