PerlMonks  

How to make a fingerprint from an Object

by jeanluca (Deacon)
on May 07, 2007 at 13:44 UTC ( [id://613937]=perlquestion )

jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have a new problem, which is challenging of course, but when implementing the solution it would be nice to know that I've chosen the correct one!
The tool I'm writing has to tell, out of many objects (all instances of the same package), which objects are duplicates. In a loop I can compare the data from two objects at a time and determine which objects I have to remove. However, this will be too slow, because those objects contain a lot of ASCII data (performance is very important).

The best solution (I think) would be to create a fingerprint once per object which is used in the comparison!
But what is the fastest way to make fingerprints? MD5?

Thnx
LuCa

Replies are listed 'Best First'.
Re: How to make a fingerprint from an Object
by nothingmuch (Priest) on May 07, 2007 at 14:01 UTC
    If the objects are "simple" then concatenate all the fields and run that through Digest. Then cache the result. Be sure to concatenate hash values sorted by key if you intend to use the digests between instances of perl, since the hashing order changes per invocation. If they have deeply nested structures, look at Data::Structure::Util, or investigate Object::Signature, which uses Storable under the hood and returns a digest of the serialized data. Likewise, cache the result in an additional attribute of the object.
    -nuffin
    zz zZ Z Z #!perl
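The "concatenate sorted fields, digest, cache" idea might look roughly like the sketch below. The class name My::Record, the _fingerprint cache slot and the "\034" delimiter are all assumptions made for illustration, and it assumes a flat hash of defined ASCII strings that never contain the delimiter.

```perl
use strict;
use warnings;
use Digest::MD5 ();

# Hypothetical flat object: a blessed hash whose values are plain strings.
package My::Record;

sub new {
    my ($class, %fields) = @_;
    return bless { %fields }, $class;
}

# Compute the digest once and cache it in an extra attribute.
sub fingerprint {
    my ($self) = @_;
    return $self->{_fingerprint} //= do {
        # Sort the keys so the result does not depend on hash ordering,
        # and skip the cache slot itself.
        my @keys = sort grep { $_ ne '_fingerprint' } keys %$self;
        # Join key/value pairs with a non-printing delimiter.
        Digest::MD5::md5_hex(join "\034", map { ($_, $self->{$_}) } @keys);
    };
}

package main;

my $first  = My::Record->new(name => 'x', body => 'lots of ASCII data');
my $second = My::Record->new(body => 'lots of ASCII data', name => 'x');
my $third  = My::Record->new(name => 'x', body => 'other data');

print "first and second match\n" if $first->fingerprint eq $second->fingerprint;
print "first and third differ\n" if $first->fingerprint ne $third->fingerprint;
```

If the object's data can change after creation, the cached slot would have to be cleared in every setter, which this sketch does not do.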
Re: How to make a fingerprint from an Object
by Moron (Curate) on May 07, 2007 at 17:44 UTC
    You could use Digest::MD5, but that only operates on one string. In general, an object is a blessed reference to a compound data structure, which doesn't automatically fit. So the real challenge is to find an unambiguous way to convert the data structure to a single string that can be fed to MD5. Data::Dumper doesn't guarantee a unique key ordering by default (but see the reply below from tinita), so unless you turn sorting on you'd need to sort the structure by hash key before delimiting, although arrays should be used in their existing order ($;, which defaults to "\034" (ASCII 28), non-printing, is also a useful delimiter for printable data).

    The only thing that occurs to me though is that fingerprinting the whole object shouldn't be necessary - it should be sufficient to fingerprint an ordered, delimited concatenation of selected instance fields. It is more usual (though not necessarily mandatory) for this to be the primary key in the logical data model rather than some bulk data field.
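A minimal sketch of fingerprinting only selected fields, assuming objects are blessed hashes. The field names in @KEY_FIELDS and the Doc class are hypothetical, and the values are assumed to be defined and free of the $; delimiter.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical list of the fields that define identity for this class.
my @KEY_FIELDS = qw(author title);

sub key_fingerprint {
    my ($obj) = @_;
    # $; defaults to "\034", which should not occur in printable data.
    # The field order is fixed by @KEY_FIELDS, not by hash ordering.
    return md5_hex(join($;, map { $obj->{$_} } @KEY_FIELDS));
}

my $x = bless { author => 'someone', title => 'Report', body => 'first draft'  }, 'Doc';
my $y = bless { author => 'someone', title => 'Report', body => 'second draft' }, 'Doc';
my $z = bless { author => 'someone', title => 'Memo',   body => 'first draft'  }, 'Doc';

print key_fingerprint($x) eq key_fingerprint($y) ? "duplicate\n" : "distinct\n";
```

Here $x and $y count as duplicates even though their body fields differ, which is only the right answer if the chosen fields really are the logical key.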

    Update: You could also consider storing the MD5 in the database, setting a UNIQUE constraint on some selection of fields and letting the database deal with the problem.

    __________________________________________________________________________________

    Free your mind!

      Data::Dumper won't guarantee a unique key ordering
      oh, it does =)
      just set $Data::Dumper::Sortkeys to 1
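For nested structures, the Sortkeys suggestion could be combined with Digest::MD5 along these lines. This is only a sketch: the helper name dumper_fingerprint and the Thing class are made up, and a value stored as a number may be dumped differently from the same value stored as a string, so it assumes the data is built consistently.

```perl
use strict;
use warnings;
use Data::Dumper;
use Digest::MD5 qw(md5_hex);

sub dumper_fingerprint {
    my ($obj) = @_;
    local $Data::Dumper::Sortkeys = 1;  # stable hash key order
    local $Data::Dumper::Indent   = 0;  # compact output, no layout whitespace
    local $Data::Dumper::Terse    = 1;  # no "$VAR1 =" prefix
    return md5_hex(Dumper($obj));
}

# Same data, keys written in a different order.
my $p = bless { b => [1, 2], a => { y => 'z', x => 'w' } }, 'Thing';
my $q = bless { a => { x => 'w', y => 'z' }, b => [1, 2] }, 'Thing';
my $r = bless { a => { x => 'w', y => 'z' }, b => [2, 1] }, 'Thing';

print "p and q have the same fingerprint\n"
    if dumper_fingerprint($p) eq dumper_fingerprint($q);
print "p and r differ (array order matters)\n"
    if dumper_fingerprint($p) ne dumper_fingerprint($r);
```

Because the class name is part of the dump, two objects of different classes with equal data get different fingerprints.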
Re: How to make a fingerprint from an Object
by doom (Deacon) on May 08, 2007 at 00:03 UTC
    You don't make it entirely clear what you mean by a "duplicate" object. If you're interested in finding objects with identical data, then yes, an MD5 fingerprint (recomputed whenever the data changes) would be a decent solution. If you're looking for copies of the same object, i.e. several references to one instance, then you can just compare scalar $object, since a reference stringifies to a value that includes its address.
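To illustrate the "same object" case, here is a small sketch that uses refaddr from Scalar::Util instead of plain stringification. That is a swapped-in variant rather than what the post literally suggests; it is chosen because it still works if the class overloads stringification. The Thing class is hypothetical.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my $obj   = bless { data => 'abc' }, 'Thing';
my $alias = $obj;                      # second reference to the same object
my $clone = bless { %$obj }, 'Thing';  # separate object holding equal data

print "alias is the same object\n"    if refaddr($obj) == refaddr($alias);
print "clone is a different object\n" if refaddr($obj) != refaddr($clone);
```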

      Of course, you might want to do a bit more data analysis before you just jump into using a hash, since the process of creating the hash will, by definition, take longer than it would to simply examine the two structures once. So whether or not the hash will speed up your code depends on what you want to do with it.

      That said, MD5 is a good hashing algorithm. The number of false collisions is small, the algorithm itself is fairly simple (and fast) and if your structure is significantly bigger than the hash key size, the process of comparing many hash keys might be a lot faster than that of comparing many structures. We use it where I work to shortcut the parsing of some multi-megabyte structures, if the user asks to include them many times.
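As a rough sketch of why the digest helps with many objects: bucket them by digest, so each object is digested once and full comparisons only happen inside a bucket. The Doc class and its text field are assumptions for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical objects: each holds its bulk ASCII data in a 'text' field.
my @objects = map { bless { text => $_ }, 'Doc' }
    ('alpha', 'beta', 'alpha', 'gamma', 'beta');

my (%seen, @unique, @duplicates);
for my $obj (@objects) {
    my $digest = md5_hex($obj->{text});   # computed once per object
    my $bucket = $seen{$digest} ||= [];
    # Full comparison only within a bucket, to rule out a digest collision.
    if (grep { $_->{text} eq $obj->{text} } @$bucket) {
        push @duplicates, $obj;
    }
    else {
        push @$bucket, $obj;
        push @unique,  $obj;
    }
}

printf "%d unique, %d duplicates\n", scalar @unique, scalar @duplicates;
# prints "3 unique, 2 duplicates"
```

With N objects this does N digest computations instead of up to N*(N-1)/2 pairwise comparisons of the full data.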

Re: How to make a fingerprint from an Object
by rblasch (Monk) on May 08, 2007 at 16:48 UTC
    Have a look at the source of File::Find::Duplicates. Maybe you can adapt the way it finds duplicate files to your needs.
      It seems using Digest::MD5 is the way to go; however, I see a big difference between taking an MD5 of an object and taking one of a file.
      For example, you might not want to use all the data from the objects for the fingerprint!

      LuCa
Re: How to make a fingerprint from an Object
by valdez (Monsignor) on May 22, 2007 at 12:52 UTC

    I would use Data::UUID to generate a universally unique identifier during object creation; you would store that object identifier and use it later for comparison. In fact, there is no reason to compute an object fingerprint later.

    package Class;
    use strict;
    use warnings;
    use Data::UUID;

    sub new {
        my $class = shift;
        my $uuid  = Data::UUID->new;
        return bless {
            some             => 'data',
            object_signature => $uuid->create_str(),
        }, $class;
    }

    sub object_signature { shift->{object_signature}; }

    1;

    HTH, Valerio
