Re: How to remove duplicates from a large set of keys
by Corion (Patriarch) on Feb 10, 2005 at 08:16 UTC
In the end, you will still need to have all keys in memory, or at least accessible. That means you will need some kind of hash: either a plain in-memory hash, or a tied hash that you tie to a DBM file, for example.
You can possibly save space on the side of the keys by generating checksums for the keys yourself, using Digest::MD5 or something comparable, but that will only help you as long as your keys are, on average, longer than their MD5 digest. You can also consider building a trie of your keys, that is, a linked structure of keys with a common start, either a letter or a string. This increases the number of lookups you need to make, but can reduce the amount of memory you need, if your keys are long enough and have enough common prefixes. Still, a million keys shouldn't eat too much memory: about 1 million * 32 bytes for the hash entries, plus the length of the keys in bytes.
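A minimal sketch of the digest idea (the sample keys here are made up): since an MD5 digest is a fixed 16 bytes, this only pays off when the average key is longer than that.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical long keys; in practice these would come from your data.
my @keys = (
    'some/very/long/key/name/alpha',
    'some/very/long/key/name/beta',
    'some/very/long/key/name/alpha',   # duplicate
);

my %seen;
for my $key (@keys) {
    my $digest = md5($key);        # 16-byte binary digest, not the full key
    next if $seen{$digest}++;      # already seen: skip the duplicate
    print "new key: $key\n";
}
```

Note that this trades a (vanishingly small) chance of an MD5 collision for the memory saving.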
Thanks for your reply, Corion.
In the end, you will still need to have all keys in memory, or at least accessible
Why? In the case of a database I can just try to insert a new value. If that value already exists in the table I'll get an exception, 'Cannot insert a duplicated value bla-bla-bla'. Otherwise the new value will be inserted into the table.
a million keys shouldn't eat too much memory
The most important criterion for me is the speed of processing new values. I haven't tried the database approach yet, but using a hash, processing one value takes about 40 seconds once there are 1 million hash keys. And as the number of keys increases, the time increases too.
---
Michael Stepanov aka nite_man
It's only my opinion and it doesn't have pretensions of absoluteness!
Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different: as explained, looking up in a hash is O(1). You take the key, perform a calculation on it (which is dependent on the length of the key, not the size of the hash), and go straight to that entry in the (associative) array. Looking up in a database cannot be any faster than O(1), and it can be as bad as O(log N) (I can't imagine any database doing an index lookup any slower than a binary search), which is dependent on the number of data points you're comparing against.
The only way that a database could be faster is if it's a big honkin' box with lots of RAM, and that's a different box from your perl client.
This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)
Re: How to remove duplicates from a large set of keys
by demerphq (Chancellor) on Feb 10, 2005 at 08:46 UTC
Put the keys in a file and sort them at the file level (there are any number of tools to do this). Then scan the file: any duplicates will be adjacent, so just skip them as you scan. You may even find that the sort program will remove the dupes for you.
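The scan step can be sketched like this, assuming the keys have already been sorted into a file named keys.sorted (the file name is arbitrary; with GNU sort, `sort -u` would even remove the duplicates during the sort itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan a sorted file; duplicates are adjacent, so comparing each line
# to the previous one is enough to detect and skip them.
open my $in, '<', 'keys.sorted' or die "Cannot open keys.sorted: $!";
my $prev;
while ( my $line = <$in> ) {
    chomp $line;
    next if defined $prev && $line eq $prev;   # duplicate of previous line
    print "$line\n";                           # unique key
    $prev = $line;
}
close $in;
```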
Please read the entire posting before shooting off a reply. Had you bothered to read the entire first paragraph (I know, I know, it's three lines, waaaaaaay too long), you would have read:
In real time I should check whether a new value exists in my set and, if not, add it.
Re-sorting a million-record file each time a record is added isn't very efficient.
Re: How to remove duplicates from a large set of keys
by BrowserUk (Patriarch) on Feb 10, 2005 at 11:33 UTC
What is the overall range, ie. highest and lowest values of your numbers?
If the range is something reasonable, you can store one bit per number in the overall range and use vec to test/set those bits. If you need the check to be persistent, save the bitstring in a file.
For a range of 0 to 1,000,000, you need only about 1/8 MB, and it will be very fast.
Even if your numbers run as high as a billion, this still only requires about 120 MB in memory or on disk, and is still extremely fast.
If the range is greater than you can comfortably fit in memory, but within acceptable limits for a disk file (the full unsigned 32-bit range requires 0.5 GB), then doing the math and reading/checking/setting/re-writing the appropriate bit and byte is still pretty quick.
The code reads and writes an individual byte for every bit, and can check a million numbers (setting 600,000 of them) in just over 10 seconds. If the bits are already set, that time drops to 6.5 seconds.
You could try adding some buffering if your numbers tend to run in batches rather than being completely random, but the buffering logic tends to slow things in many cases. You could try ':byte' and read/seek/write instead of ':raw' and sys* and see if PerlIO buffering buys you anything.
Some crude code & timings:
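(BrowserUk's actual code and timings were attached as a download and are not reproduced here. As a rough illustration only, a minimal in-memory sketch of the vec technique, my reconstruction rather than the original, might look like this:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One bit per possible number; for the range 0 .. 999_999 this bitstring
# grows to roughly 125 KB. To make the check persistent, write $bits to
# a file and read it back on the next run.
my $bits = '';

sub seen_before {
    my $n   = shift;
    my $old = vec( $bits, $n, 1 );   # test the bit for number $n
    vec( $bits, $n, 1 ) = 1;         # set it so future checks see it
    return $old;
}

for my $n ( 42, 7, 42 ) {
    print seen_before($n) ? "dup: $n\n" : "new: $n\n";
}
```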
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by jbrugger (Parson) on Feb 10, 2005 at 08:34 UTC
I'd create a lookup table:
#pseudo:
my %lookupTable;
if (!$lookupTable{$yourkeyname}) {
    $lookupTable{$yourkeyname} = 1;
}
A million keys should run fine on a decent machine.
Lookup in a hash of 1 million keys is roughly the same as lookup in a hash of 10 keys. :-) (Assuming you are still inside physical memory.)
It's creating the hash that's the problem. It takes a long time, especially if you don't know how many records you are storing up front.
Is it?
This script took 6 seconds to run, with 2 million keys, on a Celeron 2.4 GHz with 512 MB, running X, Mozilla, Apache, john, etc., and used at most 27% of the memory:
#!/usr/bin/perl -w
use strict;

my %lookupTable;
for my $i (0 .. 2_000_000) {
    if (!$lookupTable{$i}) {
        $lookupTable{$i} = 1;
    }
}
Re: How to remove duplicates from a large set of keys
by blazar (Canon) on Feb 10, 2005 at 08:59 UTC
One way is to use a hash. But it takes a lot of time when number of values is increased.
I'm not sure if this is a good answer to your problem, but it may indeed depend on how you store the key-value pairs in your hash. You may be interested in the use of keys as an lvalue.
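For example, keys can be assigned to as an lvalue to pre-size a hash before filling it, so Perl allocates the buckets up front instead of repeatedly growing the hash (a small sketch; the 1,000,000 figure matches the problem size discussed above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
keys(%seen) = 1_000_000;   # pre-allocate buckets for ~1 million keys

# Fill the hash as usual; no repeated internal rehashing as it grows.
$seen{$_} = 1 for 1 .. 1000;
print scalar( keys %seen ), " keys stored\n";
```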
Re: How to remove duplicates from a large set of keys
by cog (Parson) on Feb 10, 2005 at 09:25 UTC
Re: How to remove duplicates from a large set of keys
by Thilosophy (Curate) on Feb 10, 2005 at 11:37 UTC
With a million keys, you should go for the database.
If you only need hash lookups (as opposed to all the query stuff you can do with a relational database), you could give BerkeleyDB a try, which even ships with Perl (as the DB_File module). Instead of using a query language to manipulate data, it pretends to be a Perl hash.
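A minimal sketch of the tied-hash approach (the file name seen.db and the key are arbitrary examples):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;      # for O_CREAT, O_RDWR
use DB_File;    # exports $DB_HASH

# Tie %seen to a Berkeley DB hash file; lookups and stores go to disk,
# and the data persists between runs.
tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie seen.db: $!";

my $key = 'example-key';
if ( exists $seen{$key} ) {
    print "duplicate: $key\n";
} else {
    $seen{$key} = 1;
    print "new: $key\n";
}

untie %seen;
```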
With a million keys, you should go for the database.
Why? The OP was concerned with speed.
I see this as another (merlyn-style) "bad meme". There are plenty of very good reasons for using a database, but *speed* is not one of them!
Using DB_File takes over 5 minutes to do what this code does in under 10 seconds. And that is once you've worked out how. The sample code from DB_File does not even compile as printed.
It may be possible to improve that 5 minutes, if you hunt the internet to locate, read, and understand the Berkeley DB optimisation and configuration advice, but you'll never get near direct file access for performance in this application.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by Anonymous Monk on Feb 10, 2005 at 10:09 UTC