http://qs321.pair.com?node_id=49198

So, I was running out of disk space on a partition, and I remembered that I had a perl script that would find all duplicated files for me, which I'd found somewhere about 6 or 7 years ago, when I was first playing with Perl 4 and didn't really know how to do much.

So I dug it out, read it, and realised how horrible it was. I was tempted to rewrite it, but instead I decided to google for "perl duplicate files" first. I found a couple of other scripts there, but they were pretty horrible too. In particular the first one, which is basically a comparison between doing it in Perl vs. in the shell, computes a checksum for every single file. So I decided I would indeed write my own, which turned out to be about 7 times faster than that one (which was in turn twice as fast as my original script):

#!/usr/bin/perl -w
use strict;
use File::Find;
use Digest::MD5;

my %files;
my $wasted = 0;

# Collect every file under the given directory (default: .), keyed by size.
find(\&check_file, $ARGV[0] || ".");

local $" = ", ";
foreach my $size (sort {$b <=> $a} keys %files) {
    next unless @{$files{$size}} > 1;
    # Only files that share a size get read and MD5ed.
    my %md5;
    foreach my $file (@{$files{$size}}) {
        open(FILE, $file) or next;
        binmode(FILE);
        push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}}, $file;
    }
    foreach my $hash (keys %md5) {
        next unless @{$md5{$hash}} > 1;
        print "$size: @{$md5{$hash}}\n";
        $wasted += $size * (@{$md5{$hash}} - 1);
    }
}

# Add thousands separators to the total.
1 while $wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "$wasted bytes in duplicated files\n";

sub check_file {
    -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}
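
For the curious, what the scripts I found were doing boils down to something like this (my own paraphrase for illustration, not their actual code): MD5 every single file, then report any digest that shows up more than once.

#!/usr/bin/perl -w
use strict;
use File::Find;
use Digest::MD5;

# Hash every file under the given directory, no matter what.
my %seen;
find(sub {
    return unless -f;
    open(FILE, $_) or return;
    binmode(FILE);
    push @{$seen{Digest::MD5->new->addfile(*FILE)->hexdigest}}, $File::Find::name;
}, $ARGV[0] || ".");

# Any digest with more than one file is a set of duplicates.
foreach my $hash (keys %seen) {
    print "@{$seen{$hash}}\n" if @{$seen{$hash}} > 1;
}

The difference is that my version only reads files whose size matches at least one other file, so anything with a unique size never gets opened at all.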

Tony