Re^3: Comparing duplicate pictures in different directories

Hashing demands that you read the entire file. Both diff and the File::Compare module read both files block by block, aborting if a difference is detected. I think you'll find that method most efficient, and clearly more efficient than computing a checksum of the entire file.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Comment on Re^3: Comparing duplicate pictures in different directories

Replies are listed 'Best First'.
Re^4: Comparing duplicate pictures in different directories by graff (Chancellor) on Jun 19, 2005 at 22:58 UTC
On the other hand, if the task is to identify sets of files having duplicate content, it would seem most economical to read the full content of each file exactly once in order to compute its checksum, and then identify the file sets that identical checksums. If possible checksum collisions -- i.e. false-alarm matches -- is a concern, I think the potential for this is much reduced by supplementing the checksum with the file size (the likelihood of two files having the same size and same checksum, despite having different content, is comfortably small). With that, the old MD5 approach should suffice. So (untested): `use Digest::MD5 qw(md5_base64); # assume @files contains paths to all files (masters and possible dups +) my %cksum; for my $file ( @files ) { my $size = -s $file; local $/; open( I, $file ); my $md5 = md5_base64( <I> ); push @{$cksum{"$md5 $size"}}, $file; } for ( keys %cksum ) { print "dups: @{$cksum{$_}}\n" if ( @{$cksum{$_}} > 1 ); }` [download] Granted, if there are relatively few duplications among M masters and N files to test, then applying diff or File::Compare M*N times could be pretty quick. But if there are lots of masters that each have multiple duplications, then diff or File::Compare would have to do a lot of repeated full reads of files to find them all.	[reply] [d/l]
Re^5: Comparing duplicate pictures in different directories by merlyn (Sage) on Jun 20, 2005 at 00:34 UTC
The original task was only pairwise between two directories. Read up the thread. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply]


Welcome to the Monastery
	PerlMonks