Re: Comparing duplicate pictures in different directories

by polettix (Vicar)
on Jun 19, 2005 at 12:38 UTC ( [id://468105]=note )


in reply to Comparing duplicate pictures in different directories

I'd use

diff -r . Duplicates
for this kind of stuff.

If you want to go the Perl route, I wonder why SHA1 checksums would make the whole process any better than comparing the file contents directly. Since these are photos, you can probably afford to slurp both files and use the eq operator. Things would change if you saved the checksums for the reference directory in a file that you load at the start of the program: subsequent invocations would then avoid reading the original files at all, which is a real benefit.
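A minimal sketch of the slurp-and-eq comparison described above. The helper names slurp and same_content are made up for illustration, and it assumes both files fit comfortably in memory:

```perl
use strict;
use warnings;

# Read a whole file into memory as raw bytes.
sub slurp {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # undef record separator: read it all
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# True if both files hold exactly the same bytes.
sub same_content {
    my ($file_a, $file_b) = @_;
    return slurp($file_a) eq slurp($file_b);
}

# Usage: perl same.pl photo1.jpg photo2.jpg
if (@ARGV == 2) {
    print same_content(@ARGV) ? "identical\n" : "different\n";
}
```

The ':raw' layer matters for images: without it, platforms that translate line endings could alter the bytes being compared.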

Also note that you should be able to find modules that compute SHA1, so you don't need to call a subprocess for this.
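As one possible example of such a module, here is a sketch using Digest::SHA, which ships with current Perl distributions (it was not necessarily what was available when this thread was written). The sub name sha1_of_file is hypothetical:

```perl
use strict;
use warnings;
use Digest::SHA;

# Hex SHA-1 digest of a file, computed in-process (no external command).
sub sha1_of_file {
    my ($path) = @_;
    my $sha = Digest::SHA->new(1);   # 1 selects the SHA-1 algorithm
    $sha->addfile($path, 'b');       # 'b' reads the file in binary mode
    return $sha->hexdigest;
}

# Usage: perl sha1.pl *.jpg
print sha1_of_file($_), "  $_\n" for @ARGV;
```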

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Don't fool yourself.

Re^2: Comparing duplicate pictures in different directories
by spurperl (Priest) on Jun 19, 2005 at 18:44 UTC
    I would guess that the issue here is runtime?

    Perhaps 990 images aren't too many, but this method has to be applicable to larger numbers as well. Comparing thousands of images that weigh several MB each may take some time. Hashing seems like a sensible solution...

    You're absolutely right about using a module instead of a process - especially since that process is run for each file - eeek!!

      Hashing demands that you read the entire file. Both diff and the File::Compare module read both files block by block, aborting if a difference is detected. I think you'll find that method most efficient, and clearly more efficient than computing a checksum of the entire file.
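For reference, a small sketch of the File::Compare approach. The wrapper name files_identical is invented here; compare itself is the function the module exports:

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# compare() returns 0 if the files are equal, 1 if they differ, -1 on error.
sub files_identical {
    my ($file_a, $file_b) = @_;
    my $result = compare($file_a, $file_b);
    die "Cannot compare $file_a and $file_b: $!" if $result < 0;
    return $result == 0;
}

# Usage: perl cmp.pl photo1.jpg photo2.jpg
if (@ARGV == 2) {
    print files_identical(@ARGV) ? "identical\n" : "different\n";
}
```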

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        On the other hand, if the task is to identify sets of files having duplicate content, it would seem most economical to read the full content of each file exactly once in order to compute its checksum, and then identify the file sets that have identical checksums.

        If possible checksum collisions -- i.e. false-alarm matches -- are a concern, I think the potential for this is much reduced by supplementing the checksum with the file size (the likelihood of two files having the same size and same checksum, despite having different content, is comfortably small). With that, the old MD5 approach should suffice. So (untested):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_base64);

        # assume @files contains paths to all files (masters and possible dups)
        my @files = @ARGV;
        my %cksum;
        for my $file ( @files ) {
            my $size = -s $file;
            open( my $fh, '<', $file ) or die "Cannot open $file: $!";
            binmode $fh;    # image data must be read as raw bytes
            my $md5 = do { local $/; md5_base64( <$fh> ) };
            close $fh;
            # files match only if both checksum and size agree
            push @{$cksum{"$md5 $size"}}, $file;
        }
        for ( keys %cksum ) {
            print "dups: @{$cksum{$_}}\n" if ( @{$cksum{$_}} > 1 );
        }

        Granted, if there are relatively few duplications among M masters and N files to test, then applying diff or File::Compare M*N times could be pretty quick. But if there are lots of masters that each have multiple duplications, then diff or File::Compare would have to do a lot of repeated full reads of files to find them all.
Re^2: Comparing duplicate pictures in different directories
by elwarren (Priest) on Jun 21, 2005 at 18:06 UTC
    If you compute the hash, you can save it and reuse it later when more files are added. If you go the diff or eq route, you have to read the entire file again for each comparison.

    He doesn't seem to take advantage of hash caching here, but it would also speed up subsequent comparisons of the file against the next potential duplicate.
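One way the hash caching could look, as a rough sketch only. The cache file name, the sub names, and the choice of Storable plus a size/mtime check to detect changed files are all assumptions made for this example, not something from the thread:

```perl
use strict;
use warnings;
use Digest::MD5;
use Storable qw(store retrieve);

# Hypothetical cache location; any writable path would do.
my $cache_file = 'checksums.cache';

# Load the cache left by a previous run, or start with an empty one.
sub load_cache {
    my ($path) = @_;
    return retrieve($path) if -e $path;
    return {};
}

# Return the MD5 of $file, re-reading it only if size or mtime changed.
sub cached_md5 {
    my ($cache, $file) = @_;
    my ($size, $mtime) = (stat $file)[7, 9];
    die "Cannot stat $file: $!" unless defined $size;
    my $entry = $cache->{$file};
    if ($entry && $entry->{size} == $size && $entry->{mtime} == $mtime) {
        return $entry->{md5};        # cache hit: the file is not read
    }
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    $cache->{$file} = { size => $size, mtime => $mtime, md5 => $md5 };
    return $md5;
}

# Usage: perl cached.pl *.jpg   (run again later with more files)
if (@ARGV) {
    my $cache = load_cache($cache_file);
    print cached_md5($cache, $_), "  $_\n" for @ARGV;
    store($cache, $cache_file);
}
```

On a second run only new or modified files are actually read; everything else is answered from the stored table.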
