PerlMonks  

How to remove duplicates from a large set of keys

by nite_man (Deacon)
on Feb 10, 2005 at 08:06 UTC ( [id://429615] )

nite_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!

I have a problem with finding duplicate values in a big set of keys. Let's imagine I have a set of values, about 1 million of them. In real time I need to check whether a new value already exists in my set and, if not, add it.

One way is to use a hash, but it takes a lot of time as the number of values increases.

Another way is to store the values in a database with a unique index on them. Then, when a new value comes in, I can try to insert it into the table and catch the exception if it is a duplicate.
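
For instance, something along these lines (a rough sketch using DBI, assuming DBD::SQLite is available; the table and column names are made up):

use DBI;

# assuming a table created once with:  CREATE TABLE seen (value TEXT PRIMARY KEY)
my $dbh = DBI->connect('dbi:SQLite:dbname=values.db', '', '',
                       { RaiseError => 1, PrintError => 0 });
my $sth = $dbh->prepare('INSERT INTO seen (value) VALUES (?)');

sub add_if_new {
    my ($value) = @_;
    return 1 if eval { $sth->execute($value); 1 };   # inserted: the value was new
    return 0;                                        # unique-key violation: it's a duplicate
}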

I'd like to hear your opinions and suggestions about it.

TIA

---
Michael Stepanov aka nite_man

It's only my opinion and it doesn't have pretensions of absoluteness!

Retitled by davido from 'Finding duplicated values'.


Replies are listed 'Best First'.
Re: How to remove duplicates from a large set of keys
by Corion (Patriarch) on Feb 10, 2005 at 08:16 UTC

    In the end, you will still need to have all keys in memory, or at least accessible. That means you will need some kind of hash: either a plain (in-memory) hash, or a tied hash that you tie to a DBM file, for example.

    You can possibly save on the side of the keys by generating checksums for the keys yourself, using Digest::MD5 or something comparable, but that will only help you as long as your keys are on average longer than the MD5 digest (16 bytes). You can also consider building a trie of your keys, linking together keys with a common start, either a letter or a string. This increases the number of lookups you need to make, but can reduce the amount of memory you need, if your keys are long enough and have enough common prefixes. Still, a million keys shouldn't eat too much memory - about 1 million*32 bytes for the hash entries, plus the length of the keys in bytes.
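
    A minimal sketch of the digest idea (the is_new helper is made up; it only pays off if your keys average more than 16 bytes):

    use Digest::MD5 qw(md5);

    my %seen;
    sub is_new {
        my ($key) = @_;
        my $digest = md5($key);       # fixed 16-byte binary digest, whatever the key length
        return 0 if $seen{$digest};   # digest already stored: treat as a duplicate
        $seen{$digest} = 1;
        return 1;                     # first time we have seen this key
    }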

      Thanks for your reply, Corion.

      In the end, you will still need to have all keys in memory, or at least accessible
      Why? In the case of a database I can just try to insert a new value. If that value already exists in the table, I'll get an exception like 'Cannot insert a duplicate value bla-bla-bla'. Otherwise the new value will be inserted into the table.

      a million keys shouldn't eat too much memory
      The most important criterion for me is the speed of processing new values. I haven't tried the database approach yet, but with a hash, processing one value takes about 40 seconds with 1 million hash keys. And as the number of keys grows, the time grows too.

      ---
      Michael Stepanov aka nite_man

      It's only my opinion and it doesn't have pretensions of absoluteness!

        Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different - as explained, looking up in a hash is O(1): you take the key, perform a calculation on it (which is dependent on the length of the key, not the size of the hash), and go to that entry in the (associative) array. Looking up in a database cannot be any faster than O(1). It can be as bad as O(log N) (I can't imagine any database doing an index lookup any slower than a binary search), which is dependent on the number of data points you're comparing against.

        The only way that a database could be faster is if it's a big honkin' box with lots of RAM, and that's a different box from your perl client.

        This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)
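
        A quick way to convince yourself of that (a rough sketch using the core Benchmark module; the sizes are arbitrary):

        use Benchmark qw(cmpthese);

        my %small = map { $_ => 1 } 1 .. 1_000;
        my %big   = map { $_ => 1 } 1 .. 1_000_000;

        # lookup cost should be essentially the same for both hashes
        cmpthese(-2, {
            small => sub { exists $small{500} },
            big   => sub { exists $big{500000} },
        });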

Re: How to remove duplicates from a large set of keys
by demerphq (Chancellor) on Feb 10, 2005 at 08:46 UTC

    Put the keys in a file and sort them by key at the file level (there are any number of tools to do this). Scan the file; any duplicates will be adjacent, so just skip them as you scan. You may even find that the sort program will remove the dupes for you.
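
    A minimal sketch of the scan, assuming the sorted keys live one per line in keys_sorted.txt (a made-up file name):

    open my $in, '<', 'keys_sorted.txt' or die "open: $!";
    my $prev;
    while (my $key = <$in>) {
        chomp $key;
        next if defined $prev && $key eq $prev;   # adjacent duplicate: skip it
        print "$key\n";                           # first occurrence: keep it
        $prev = $key;
    }
    close $in;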

    ---
    demerphq

      Please read the entire posting before shooting off a reply. Had you bothered to read the entire first paragraph (I know, I know, it's three lines, waaaaaaay too long), you could have read:
      In real time I need to check whether a new value already exists in my set and, if not, add it.
      Resorting a million-record file each time a record is added isn't very efficient.
Re: How to remove duplicates from a large set of keys
by BrowserUk (Patriarch) on Feb 10, 2005 at 11:33 UTC

    What is the overall range, ie. highest and lowest values of your numbers?

    If the range is something reasonable, you can store 1 bit per number in the overall range and use vec to test/set those bits. If you need the check to be persistent, save the bitstring in a file.

    For a range of 0 to 1,000,000, you need only about 125 KB (one bit per number) and it will be very fast.

    Even if your numbers range up to a billion or so, this still only requires about 120 MB in memory or on disk, and is still extremely fast.

    If the range is greater than you can comfortably fit in memory, but within acceptable limits for a disk file (unsigned 32-bit ints require 0.5 GB), then doing the math and reading/checking/setting/re-writing the appropriate bit and byte is still pretty quick.

    The following code reads and writes an individual byte for every bit and can check a million numbers (setting 600,000 of them) in just over 10 seconds. If they are already set, that time drops to 6.5 seconds.

    You could try adding some buffering if your numbers tend to run in batches rather than being completely random, but the buffering logic tends to slow things down in many cases. You could try ':bytes' and read/seek/write instead of ':raw' and the sys* functions, and see if PerlIO buffering buys you anything.

    Some crude code & timings:
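
    A rough sketch of the bit-per-number idea (not the original benchmark code; it assumes non-negative integer values and a pre-created bitmap file named seen.bits):

    use strict;
    use warnings;
    use Fcntl qw(:seek);

    open my $fh, '+<:raw', 'seen.bits' or die "open: $!";

    # returns true if the number was already seen, otherwise marks it as seen
    sub seen_before {
        my ($n) = @_;
        my $byte_pos = int($n / 8);
        sysseek $fh, $byte_pos, SEEK_SET or die "seek: $!";
        sysread $fh, my $byte, 1;
        $byte = "\0" unless defined $byte and length $byte;   # beyond EOF: bit is unset
        return 1 if vec($byte, $n % 8, 1);                    # bit already set: duplicate
        vec($byte, $n % 8, 1) = 1;
        sysseek $fh, $byte_pos, SEEK_SET or die "seek: $!";
        syswrite $fh, $byte, 1 or die "write: $!";
        return 0;                                             # new value, now recorded
    }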


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by jbrugger (Parson) on Feb 10, 2005 at 08:34 UTC
    I'd create a lookup table:
    # pseudo-code:
    my %lookupTable;
    if (!$lookupTable{$yourkeyname}) {
        $lookupTable{$yourkeyname} = 1;
    }

    A million keys should run fine on a decent machine.

      Unfortunately, lookup in a hash with more than a million keys takes a lot of time. I suspect that there is a way to do it more efficiently.

      ---
      Michael Stepanov aka nite_man

      It's only my opinion and it doesn't have pretensions of absoluteness!

        Lookup in a hash of 1 million keys is roughly the same as lookup in a hash of 10 keys. :-) (Assuming you are still inside physical memory.)

        It's creating the hash that's the problem. It takes a long time, especially if you don't know how many records you are storing up front.

        ---
        demerphq

        Is it?
        This script took 6 seconds to run, with 2 million keys, on a Celeron 2.4 GHz with 512 MB, running X, Mozilla, Apache, john, etc., and used at most 27% of the memory:
        #!/usr/bin/perl -w
        use strict;

        my %lookupTable;
        for (my $i = 0; $i <= 2000000; $i++) {
            if (!$lookupTable{$i}) {
                $lookupTable{$i} = 1;
            }
        }
Re: How to remove duplicates from a large set of keys
by blazar (Canon) on Feb 10, 2005 at 08:59 UTC
    One way is to use a hash, but it takes a lot of time as the number of values increases.
    I'm not sure if this is a good answer to your problem, but it may indeed depend on how you store the key-value pairs in your hash. You may be interested in the use of keys as an lvalue to pre-size the hash.
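
    For instance, pre-sizing the hash before filling it (a small sketch; the count is whatever you expect to store):

    my %seen;
    keys(%seen) = 1_000_000;   # pre-allocate buckets so the hash never has to grow while you fill it
    # ... then fill %seen as usual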
Re: How to remove duplicates from a large set of keys
by cog (Parson) on Feb 10, 2005 at 09:25 UTC
    Another idea would be to do it in several passes... like, in the first pass you would only treat keys starting with an "A", in the second pass those starting with a "B"...

    Readjust to fit your particular keys :-)
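
    A rough sketch of that, assuming the keys sit one per line in keys.txt (a made-up file name) and start with an uppercase letter:

    for my $prefix ('A' .. 'Z') {
        my %seen;                                   # only this letter's keys are in memory
        open my $in, '<', 'keys.txt' or die "open: $!";
        while (my $key = <$in>) {
            chomp $key;
            next unless substr($key, 0, 1) eq $prefix;
            print "$key\n" unless $seen{$key}++;    # print each key once
        }
        close $in;
    }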

Re: How to remove duplicates from a large set of keys
by Thilosophy (Curate) on Feb 10, 2005 at 11:37 UTC
    With a million keys, you should go for the database.

    If you only need hash lookups (as opposed to all the query stuff you can do with a relational database), you could give BerkeleyDB a try, which even ships with Perl (as the DB_File module). Instead of using a query language to manipulate data, it pretends to be a Perl hash.
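
    A minimal sketch of the DB_File route (the file name is made up; the hash contents persist on disk between runs):

    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    my $key = 'some new value';
    if ($seen{$key}) {
        # duplicate: already stored
    } else {
        $seen{$key} = 1;   # new: remember it
    }

    untie %seen;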

      With a million keys, you should go for the database.

      Why? The OP was concerned with speed.

      I see this as another (merlyn-style) "bad meme". There are plenty of very good reasons for using a database, but *speed* is not one of them!

      Using DB_File takes over 5 minutes to do what this code does in under 10 seconds. And that is once you've worked out how; the sample code from DB_File does not even compile as printed.

      It may be possible to improve that 5 minutes, if you hunt the internet to locate, read, and understand the Berkeley DB optimisation and configuration advice, but you'll never get near direct file access for performance in this application.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
Re: How to remove duplicates from a large set of keys
by Anonymous Monk on Feb 10, 2005 at 10:09 UTC
    Both a database and a hash will do the trick. Which one is better for your application depends on a lot of things. Up to a certain number of keys, a hash will be faster. But if the number of keys grows, Perl will use more and more memory, and at a certain point it will start thrashing, and eventually run out of memory. Somewhere along the way, it becomes more efficient to use a database, especially if the database is on another box. But where that point is depends on the number of keys, the number of inserts, the number of queries, and the amount of memory your program has available.

    It's something only you can test.
