Comparing duplicate pictures in different directories

by cajun (Chaplain)
on Jun 19, 2005 at 05:37 UTC

Recently I learned I had duplicate copies of pictures from my camera on my system, approximately 990 of them. Rather than compare the pictures side by side, I decided to compute the SHA-1 (160-bit) checksums of the 'duplicate' pictures and compare them to the SHA-1 checksums of the 'master' files.

I'm sure my code could use some polishing. Comments welcome.

Update - Thanks to all for the comments and suggestions for making it work more efficiently. When I wrote this, it was intended as a one-time shot. The duplicate files got there in the first place through some wizardry of the BOFH.

#!/usr/bin/perl -w
use strict;
use File::Find;

my $dupedir = '/Big-Drive/NIKON-Pictures/Duplicates/';
my $count;
my $calccount;
my @files;
my @calcsum;
my $compared;

open(OUT, ">results.txt") || die "Can't open file $!";

find(\&files, $dupedir);
&calcsum;
# &printsums;
&comparefiles;

print "$count total files found\n";
print "$calccount total files calculated SHA1 sums\n";
print "$compared total files compared to originals\n";

sub files {
    return unless -f $File::Find::name;
    $count++;
    push(@files, "$File::Find::name");
}

sub calcsum {
    foreach (@files) {
        print "Computing sha1sum for $_\n";
        push(@calcsum, `sha1sum $_`);
        $calccount++;
    }
}

sub printsums {
    foreach (@calcsum) {
        print;
    }
}

sub comparefiles {
    my $sum;
    my $file;
    my $calcsum;
    my $rest;
    foreach (@calcsum) {
        ($sum, $file) = split /\s+/;
        $file =~ s/Duplicates\///;
        if (-f $file) {
            print "Calculating SHA1 checksum for file $file";
            print OUT "Calculating SHA1 checksum for file $file";
            ($calcsum, $rest) = split /\s+/, `sha1sum $file`;
            if ($calcsum eq $sum) {
                print " ----> OK !\n";
                print OUT " ----> OK !\n";
                $compared++;
            } else {
                print "\n****** ERROR ****** Checksums do not match for $file\n";
                print OUT "\n****** ERROR ****** Checksums do not match for $file\n";
            }
        } else {
            print "$file not in master directory ... skipping\n";
            print OUT "$file not in master directory ... skipping\n";
        }
    }
}

close OUT;

Replies are listed 'Best First'.
Re: Comparing duplicate pictures in different directories
by polettix (Vicar) on Jun 19, 2005 at 12:38 UTC

    I'd use

    diff -r . Duplicates
    for this kind of stuff.

    If you want to go Perl, I wonder why using SHA1 checksums would make the whole process better than comparing the data in the files directly - speaking of photos, you can probably afford to slurp both files and use the eq operator. Things would change if you saved the checksums for the reference directory in a file that you load at the start of the program: subsequent invocations would then avoid re-reading the original files, which would be a real benefit.
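
    A minimal sketch of that slurp-and-eq comparison (the two file names are just placeholders):

    use strict;
    use warnings;

    # Slurp both files and compare their contents with eq.
    sub same_content {
        my ($left, $right) = @_;
        return 0 unless -f $left && -f $right && -s $left == -s $right;
        local $/;                                 # slurp mode
        open my $lfh, '<', $left  or die "$left: $!";
        open my $rfh, '<', $right or die "$right: $!";
        binmode $lfh;
        binmode $rfh;
        my $ldata = <$lfh>; $ldata = '' unless defined $ldata;
        my $rdata = <$rfh>; $rdata = '' unless defined $rdata;
        return $ldata eq $rdata;
    }

    print same_content('img_0001.jpg', 'Duplicates/img_0001.jpg')
        ? "identical\n" : "different\n";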

    Also note that you should be able to find modules dealing with SHA1, avoiding the need to call a subprocess for this.
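
    For instance, a small sketch using Digest::SHA (Digest::SHA1 would work much the same way):

    use strict;
    use warnings;
    use Digest::SHA;

    # Hex SHA-1 digest of a file, computed in-process instead of via `sha1sum`.
    sub sha1_of {
        my ($path) = @_;
        return Digest::SHA->new(1)->addfile($path, 'b')->hexdigest;
    }

    print sha1_of($_), "  $_\n" for @ARGV;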

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
      I would guess that the issue here is runtime?

      Perhaps 990 images isn't too much, but this method has to be applicable to larger numbers as well. Now, comparing thousands of images that weigh several MB each may take some time. Hashing seems like a sensible solution...

      You're absolutely right about using a module instead of a process - especially since that process is run once for each file - eeek !!

        Hashing demands that you read the entire file. Both diff and the File::Compare module read both files block by block, aborting if a difference is detected. I think you'll find that method most efficient, and clearly more efficient than computing a checksum of the entire file.
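
        A minimal sketch of that block-by-block approach (the two paths are just placeholders):

        use strict;
        use warnings;
        use File::Compare;

        # compare() returns 0 if the files are equal, 1 if they differ,
        # -1 on error; it reads block by block and stops at the first difference.
        my ($master, $dupe) = ('img_0001.jpg', 'Duplicates/img_0001.jpg');
        my $rc = compare($master, $dupe);
        if    ($rc == 0) { print "identical\n" }
        elsif ($rc == 1) { print "different\n" }
        else             { warn "compare failed: $!\n" }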

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

      If you compute the hash, you can save it and reuse it later when more files are added. If you go the diff or eq route, you have to re-read the files for every compare.

      He doesn't seem to take advantage of hash-caching here, but it would also speed up subsequent compares of the file against the next potential duplicate.
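
      A rough sketch of such hash-caching (the cache file name and its two-column format are just an assumption):

      use strict;
      use warnings;
      use Digest::SHA;

      my $cache_file = 'sha1-cache.txt';   # hypothetical cache location
      my %cache;                           # path => hex digest

      # Load digests computed on earlier runs.
      if (open my $in, '<', $cache_file) {
          while (my $line = <$in>) {
              chomp $line;
              my ($digest, $path) = split /\s+/, $line, 2;
              $cache{$path} = $digest;
          }
          close $in;
      }

      # Compute a digest only if we have not seen the file before.
      sub cached_sha1 {
          my ($path) = @_;
          $cache{$path} = Digest::SHA->new(1)->addfile($path, 'b')->hexdigest
              unless exists $cache{$path};
          return $cache{$path};
      }

      # Rewrite the cache when the program exits.
      END {
          if (open my $out, '>', $cache_file) {
              print $out "$cache{$_}  $_\n" for sort keys %cache;
              close $out;
          }
      }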
Re: Comparing duplicate pictures in different directories
by hawtin (Prior) on Jun 19, 2005 at 14:46 UTC

    Good one

    I have something that does a similar job; however, it works in a slightly different way. The assumption is that the exact size of the file gives a quicker hint than the checksum, so it keeps a hash that maps file sizes to lists of names and only looks at the contents when possible matches are detected.
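
    A rough sketch of that size-first filtering (the root directory is taken from the command line here):

    use strict;
    use warnings;
    use File::Find;

    my $root = shift @ARGV;   # root directory to scan
    my %by_size;              # size in bytes => list of files with that size

    find(sub {
        return unless -f;
        push @{ $by_size{ -s _ } }, $File::Find::name;
    }, $root);

    # Only groups with more than one file can contain duplicates,
    # so only those files ever need to be read or hashed.
    for my $size (sort { $a <=> $b } keys %by_size) {
        my @candidates = @{ $by_size{$size} };
        next unless @candidates > 1;
        print "possible duplicates ($size bytes):\n";
        print "    $_\n" for @candidates;
    }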

    This lets me specify a root directory, find candidates for cleaning up and interactively delete them using Tk.

    To answer the other question, I am running this under Windows (and I am not allowed to install a real environment), so diff is not available.

      Sounds like everyone has one, so I'll chip in with how I did it in *my* version :-)

      I populate a hash with a filelist and sort based on size, so that I only need to compare files that have the same filesize. But since I was comparing images from my webcam that were very small, there tended to be many files of the same size (87 KB or whatever), so I still had to do the hashing...

      Oh yeah, and the first thing to do after getting all the filesizes was to get rid of the zero-byte or corrupt files...
Re: Comparing duplicate pictures in different directories
by Jenda (Abbot) on Jun 23, 2005 at 23:15 UTC

    If I understand your code right, you are only looking for files that have not only the same content but also the same name. I don't think that's too common, though YMMV.

    Here's my version

    It scans several directories (with subdirectories), computes the MD5 hashes of the files, stores them in a hash, and reports duplicates. With certain parameters it even automatically deletes some duplicates. The duplicate images are opened in an image viewer so that I can choose which one to delete based on the name and path. It's Windows only, but I think it would be no big deal to port it to Unix; it's just that I need to create two processes and wait till they both exit, which AFAIK has to be implemented differently on the two OSes.
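
    A bare-bones sketch of the digest-keyed part of that idea (the viewer and delete logic are left out, and the directories are just examples):

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my %seen;   # MD5 digest => list of paths with identical content

    find(sub {
        return unless -f;
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @{ $seen{$digest} }, $File::Find::name;
    }, @ARGV);   # e.g. perl finddups.pl D:/photos E:/backup

    for my $digest (sort keys %seen) {
        next unless @{ $seen{$digest} } > 1;
        print "Duplicates ($digest):\n";
        print "    $_\n" for @{ $seen{$digest} };
    }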

    Jenda
    XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.
