PerlMonks  

End of the Time Out Tireny

by Anonymous Monk
on Feb 05, 2004 at 17:51 UTC ( [id://326823]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: End of the Time Out Tireny
by matthewb (Curate) on Feb 05, 2004 at 18:16 UTC

    Not wishing to rain on your parade, but what happens if one of the duplicate-pairs you are trying to eliminate is split over two of the files you created?

    I think the advice you were given the first time you brought this up was probably closer to the right track.

    Best of luck with it anyway. Do you mean `tyranny'?

    MB
Re: End of the Time Out Tireny
by iburrell (Chaplain) on Feb 05, 2004 at 23:45 UTC
    How are you processing the list so that 100 addresses take 1 second to find duplicates? If you use a hash, Perl should find duplicates in 100 lines in milliseconds. With the whole file, you might have to worry about memory usage. As long as you aren't swapping, the processing should be very fast.
        my %seen;
        while (my $line = <$fh>) {
            chomp($line);
            print "$line\n" unless $seen{$line}++;
        }
      Actually it does not take 1 second: I give the refresh a one-second pause between each page. I should have been clearer. I am definitely not writing a SPAM program; I'm not a fan of SPAMMERs. Thanks for the enlightenment. TIURIC
Re: End of the Time Out Tireny
by flyingmoose (Priest) on Feb 05, 2004 at 20:34 UTC
    You wouldn't be writing a More Agreeable SPAMBot would you? No, that's silly...they don't care about duplicate SPAM messages :)

    I have a bad habit of doing this (that is, dodging the question), but for you this is the time to use a relational (SQL) database and force the email address field to be UNIQUE, or else do a "SELECT DISTINCT * FROM ..." or something like that.

    Your flatfile technique is bound to be slower: 1200 sec / 60 (sec/min) = 20 minutes. A database could fly through this sort of thing, and then you would have a system that would be a better base for future tools.

    Postgres and MySQL are both great free DB servers available for most platforms. I tend to prefer Postgres because of historical reservations about MySQL (most no longer valid); both are rapidly improving.
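    To make the suggestion above concrete, here is a minimal sketch of the UNIQUE-constraint approach using an in-memory SQLite database via DBI. This assumes DBD::SQLite is installed; the table and column names are hypothetical, and the same idea works with Postgres or MySQL (where the duplicate-skipping syntax differs, e.g. ON CONFLICT DO NOTHING).

        use strict;
        use warnings;
        use DBI;

        # In-memory SQLite handle; a real app would point at a file or server.
        my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                               { RaiseError => 1, AutoCommit => 1 });

        # The UNIQUE constraint makes the database itself reject duplicates.
        $dbh->do(q{
            CREATE TABLE addresses (
                email TEXT NOT NULL UNIQUE
            )
        });

        # INSERT OR IGNORE (SQLite syntax) silently skips any row that
        # would violate the UNIQUE constraint, so no dedup pass is needed.
        my $sth = $dbh->prepare(
            "INSERT OR IGNORE INTO addresses (email) VALUES (?)");
        $sth->execute($_) for qw(a@example.com b@example.com a@example.com);

        my ($count) = $dbh->selectrow_array(
            "SELECT COUNT(*) FROM addresses");
        print "$count\n";   # prints 2: the duplicate was never stored

    The point of pushing uniqueness into the schema is that every future tool that writes to the table gets deduplication for free, instead of each script re-implementing the hash trick.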

Node Type: perlquestion [id://326823]
Approved by neuroball