PerlMonks  

How to remove duplicates from a large set of keys

by nite_man (Deacon)
on Feb 10, 2005 at 08:06 UTC ( [id://429615] )

nite_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!

I have a problem with finding duplicate values in a big set of keys. Let's imagine I have a set of values, about 1 million of them. In real time I need to check whether a new value already exists in my set and, if not, add it.

One way is to use a hash, but it takes a lot of time as the number of values increases.

Another way is to store the values in a database with a unique index on them. Then, when a new value comes in, I can try to insert it into the table and catch the exception if it is a duplicate.
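
For instance, something along these lines (a rough sketch using DBI, assuming DBD::SQLite is available; the table and column names are made up):

use DBI;

# assuming a table created once with:  CREATE TABLE seen (value TEXT PRIMARY KEY)
my $dbh = DBI->connect('dbi:SQLite:dbname=values.db', '', '',
                       { RaiseError => 1, PrintError => 0 });
my $sth = $dbh->prepare('INSERT INTO seen (value) VALUES (?)');

sub add_if_new {
    my ($value) = @_;
    return 1 if eval { $sth->execute($value); 1 };   # inserted: the value was new
    return 0;                                        # unique-key violation: it's a duplicate
}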

I'd like to hear your opinions and suggestions about it.

TIA

---
Michael Stepanov aka nite_man

It's only my opinion and it doesn't have pretensions of absoluteness!

Retitled by davido from 'Finding duplicated values'.


Replies are listed 'Best First'.
Re: How to remove duplicates from a large set of keys
by Corion (Patriarch) on Feb 10, 2005 at 08:16 UTC

    In the end, you will still need to have all keys in memory, or at least accessible. That means you will need some kind of hash: either a plain (in-memory) hash, or a tied hash that you tie to a DBM file, for example.

    You can possibly save on the side of the keys by generating checksums for the keys yourself, using Digest::MD5 or something comparable, but that will only help you as long as your keys are on average longer than the MD5 digest (16 bytes). You can also consider building a trie of your keys, linking together keys with a common start, either a letter or a string. This increases the number of lookups you need to make, but can reduce the amount of memory you need, if your keys are long enough and have enough common prefixes. Still, a million keys shouldn't eat too much memory - about 1 million*32 bytes for the hash entries, plus the length of the keys in bytes.
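
    A minimal sketch of the digest idea (the is_new helper is made up; it only pays off if your keys average more than 16 bytes):

    use Digest::MD5 qw(md5);

    my %seen;
    sub is_new {
        my ($key) = @_;
        my $digest = md5($key);       # fixed 16-byte binary digest, whatever the key length
        return 0 if $seen{$digest};   # digest already stored: treat as a duplicate
        $seen{$digest} = 1;
        return 1;                     # first time we have seen this key
    }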

      Thanks for your reply, Corion.

      In the end, you will still need to have all keys in memory, or at least accessible
      Why? In the case of a database I can just try to insert a new value. If that value already exists in the table, I'll get an exception like 'Cannot insert a duplicate value bla-bla-bla'. Otherwise the new value will be inserted into the table.

      a million keys shouldn't eat too much memory
      The most important criterion for me is the speed of processing new values. I haven't tried the database approach yet, but with a hash, processing one value takes about 40 seconds with 1 million hash keys. And as the number of keys grows, the time grows too.

      ---
      Michael Stepanov aka nite_man

      It's only my opinion and it doesn't have pretensions of absoluteness!

        Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different - as explained, looking up in a hash is O(1): you take the key, perform a calculation on it (which is dependent on the length of the key, not the size of the hash), and go to that entry in the (associative) array. Looking up in a database cannot be any faster than O(1). It can be as bad as O(log N) (I can't imagine any database doing an index lookup any slower than a binary search), which is dependent on the number of data points you're comparing against.

        The only way that a database could be faster is if it's a big honkin' box with lots of RAM, and that's a different box from your perl client.

        This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)
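
        A quick way to convince yourself of that (a rough sketch using the core Benchmark module; the sizes are arbitrary):

        use Benchmark qw(cmpthese);

        my %small = map { $_ => 1 } 1 .. 1_000;
        my %big   = map { $_ => 1 } 1 .. 1_000_000;

        # lookup cost should be essentially the same for both hashes
        cmpthese(-2, {
            small => sub { exists $small{500} },
            big   => sub { exists $big{500000} },
        });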

Re: How to remove duplicates from a large set of keys
by demerphq (Chancellor) on Feb 10, 2005 at 08:46 UTC

    Put the keys in a file and sort them by key at the file level (there are any number of tools to do this). Scan the file; any duplicates will be adjacent, so just skip them as you scan. You may even find that the sort program will remove the dupes for you.
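
    A minimal sketch of the scan, assuming the sorted keys live one per line in keys_sorted.txt (a made-up file name):

    open my $in, '<', 'keys_sorted.txt' or die "open: $!";
    my $prev;
    while (my $key = <$in>) {
        chomp $key;
        next if defined $prev && $key eq $prev;   # adjacent duplicate: skip it
        print "$key\n";                           # first occurrence: keep it
        $prev = $key;
    }
    close $in;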

    ---
    demerphq

      Please read the entire posting before shooting off a reply. Had you bothered to read the entire first paragraph (I know, I know, it's three lines, waaaaaaay too long), you could have read:
      In real time I need to check whether a new value already exists in my set and, if not, add it.
      Resorting a million-record file each time a record is added isn't very efficient.
Re: How to remove duplicates from a large set of keys
by BrowserUk (Patriarch) on Feb 10, 2005 at 11:33 UTC

    What is the overall range, ie. highest and lowest values of your numbers?

    If the range is something reasonable, you can store 1 bit per number in the overall range and use vec to test/set those bits. If you need the check to be persistent, save the bitstring in a file.

    For a range of 0 to 1,000,000, you need only about 125 KB (one bit per number) and it will be very fast.

    Even if your numbers range up to a billion or so, this still only requires about 120 MB in memory or on disk, and is still extremely fast.

    If the range is greater than you can comfortably fit in memory, but within acceptable limits for a disk file (unsigned 32-bit ints require 0.5 GB), then doing the math and reading/checking/setting/re-writing the appropriate bit and byte is still pretty quick.

    The following code reads and writes an individual byte for every bit and can check a million numbers (setting 600,000 of them) in just over 10 seconds. If they are already set, that time drops to 6.5 seconds.

    You could try adding some buffering if your numbers tend to run in batches rather than being completely random, but the buffering logic tends to slow things down in many cases. You could try ':bytes' and read/seek/write instead of ':raw' and the sys* functions, and see if PerlIO buffering buys you anything.

    Some crude code & timings:
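
    A rough sketch of the bit-per-number idea (not the original benchmark code; it assumes non-negative integer values and a pre-created bitmap file named seen.bits):

    use strict;
    use warnings;
    use Fcntl qw(:seek);

    open my $fh, '+<:raw', 'seen.bits' or die "open: $!";

    # returns true if the number was already seen, otherwise marks it as seen
    sub seen_before {
        my ($n) = @_;
        my $byte_pos = int($n / 8);
        sysseek $fh, $byte_pos, SEEK_SET or die "seek: $!";
        sysread $fh, my $byte, 1;
        $byte = "\0" unless defined $byte and length $byte;   # beyond EOF: bit is unset
        return 1 if vec($byte, $n % 8, 1);                    # bit already set: duplicate
        vec($byte, $n % 8, 1) = 1;
        sysseek $fh, $byte_pos, SEEK_SET or die "seek: $!";
        syswrite $fh, $byte, 1 or die "write: $!";
        return 0;                                             # new value, now recorded
    }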


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by jbrugger (Parson) on Feb 10, 2005 at 08:34 UTC
    I'd create a lookup table:
    # pseudo-code:
    my %lookupTable;
    if (!$lookupTable{$yourkeyname}) {
        $lookupTable{$yourkeyname} = 1;
    }

    A million keys should run fine on a decent machine.

      Unfortunately, lookup in a hash with more than a million keys takes a lot of time. I suspect that there is a way to do it more efficiently.

      ---
      Michael Stepanov aka nite_man

      It's only my opinion and it doesn't have pretensions of absoluteness!

        Lookup in a hash of 1 million keys is roughly the same as lookup in a hash of 10 keys. :-) (Assuming you are still inside physical memory.)

        It's creating the hash that's the problem. It takes a long time, especially if you don't know how many records you are storing up front.

        ---
        demerphq

        Is it?
        This script took 6 seconds to run, with 2 million keys, on a Celeron 2.4 GHz with 512 MB, running X, Mozilla, Apache, john, etc., and used at most 27% of the memory:
        #!/usr/bin/perl -w
        use strict;

        my %lookupTable;
        for (my $i = 0; $i <= 2000000; $i++) {
            if (!$lookupTable{$i}) {
                $lookupTable{$i} = 1;
            }
        }
Re: How to remove duplicates from a large set of keys
by blazar (Canon) on Feb 10, 2005 at 08:59 UTC
    One way is to use a hash, but it takes a lot of time as the number of values increases.
    I'm not sure if this is a good answer to your problem, but it may indeed depend on how you store the key-value pairs in your hash. You may be interested in the use of keys as an lvalue to pre-size the hash.
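
    For instance, pre-sizing the hash before filling it (a small sketch; the count is whatever you expect to store):

    my %seen;
    keys(%seen) = 1_000_000;   # pre-allocate buckets so the hash never has to grow while you fill it
    # ... then fill %seen as usual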
Re: How to remove duplicates from a large set of keys
by cog (Parson) on Feb 10, 2005 at 09:25 UTC
    Another idea would be to do it in several passes... like, in the first pass you would only treat keys starting with an "A", in the second pass those starting with a "B"...

    Readjust to fit your particular keys :-)
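
    A rough sketch of that, assuming the keys sit one per line in keys.txt (a made-up file name) and start with an uppercase letter:

    for my $prefix ('A' .. 'Z') {
        my %seen;                                   # only this letter's keys are in memory
        open my $in, '<', 'keys.txt' or die "open: $!";
        while (my $key = <$in>) {
            chomp $key;
            next unless substr($key, 0, 1) eq $prefix;
            print "$key\n" unless $seen{$key}++;    # print each key once
        }
        close $in;
    }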

Re: How to remove duplicates from a large set of keys
by Thilosophy (Curate) on Feb 10, 2005 at 11:37 UTC
    With a million keys, you should go for the database.

    If you only need hash lookups (as opposed to all the query stuff you can do with a relational database), you could give BerkeleyDB a try, which even ships with Perl (as the DB_File module). Instead of using a query language to manipulate data, it pretends to be a Perl hash.
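
    A minimal sketch of the DB_File route (the file name is made up; the hash contents persist on disk between runs):

    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    my $key = 'some new value';
    if ($seen{$key}) {
        # duplicate: already stored
    } else {
        $seen{$key} = 1;   # new: remember it
    }

    untie %seen;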

      With a million keys, you should go for the database.

      Why? The OP was concerned with speed.

      I see this as another (merlyn-style) "bad meme". There are plenty of very good reasons for using a database, but *speed* is not one of them!

      Using DB_File takes over 5 minutes to do what this code does in under 10 seconds. And that is once you've worked out how; the sample code from DB_File does not even compile as printed.

      It may be possible to improve that 5 minutes, if you hunt the internet to locate, read, and understand the Berkeley DB optimisation and configuration advice, but you'll never get near direct file access for performance in this application.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by Anonymous Monk on Feb 10, 2005 at 10:09 UTC
    Both a database and a hash will do the trick. Which one is better for your application depends on a lot of things. Up to a certain number of keys, a hash will be faster. But if the number of keys grows, Perl will use more and more memory, and at a certain point it will start thrashing, and eventually run out of memory. Somewhere along the way, it becomes more efficient to use a database, especially if the database is on another box. But where that point is depends on the number of keys, the number of inserts, the number of queries, and the amount of memory your program has available.

    It's something only you can test.
