So, I was running out of disk space on a partition, and I remembered that I had a Perl script that would find all duplicated files for me. I'd found it somewhere about 6 or 7 years ago, when I was first playing with Perl 4 but didn't really know how to do much.
So I dug it out, read it, and realised how horrible it was. I was tempted to rewrite it, but instead I decided to google for "perl duplicate files" first. I found a couple of other scripts there, but they were pretty horrible too. In particular, the first one there, which is basically a comparison between doing it in Perl vs shell, computes a checksum for every single file. So I decided I would indeed write my own, which turned out to be about 7 times faster than that one (which was in turn twice as fast as my original script):
#!/usr/bin/perl -w
use strict;
use File::Find;
use Digest::MD5;

my %files;         # file size => list of paths with that size
my $wasted = 0;    # bytes taken up by redundant copies

find(\&check_file, $ARGV[0] || ".");

local $" = ", ";
foreach my $size (sort {$b <=> $a} keys %files) {
    # A file with a unique size cannot have a duplicate, so skip it.
    next unless @{$files{$size}} > 1;
    my %md5;    # MD5 digest => list of paths with that digest
    foreach my $file (@{$files{$size}}) {
        open(my $fh, '<', $file) or next;
        binmode($fh);
        push @{$md5{Digest::MD5->new->addfile($fh)->hexdigest}}, $file;
        close($fh);
    }
    foreach my $hash (keys %md5) {
        next unless @{$md5{$hash}} > 1;
        print "$size: @{$md5{$hash}}\n";
        $wasted += $size * (@{$md5{$hash}} - 1);
    }
}

# Insert thousands separators into the total.
1 while $wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "$wasted bytes in duplicated files\n";

# Called by File::Find for each entry; records plain files by size.
sub check_file {
    -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}
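
The speedup presumably comes from the order of the two tests: sizes come from a cheap stat, so only files that share a size with at least one other file ever need to be read and digested. Here is a minimal sketch of that two-pass idea as a standalone function. The name `duplicate_groups` and its list-of-paths interface are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: takes a list of paths and returns array refs,
# each holding two or more paths whose contents have the same MD5 digest.
sub duplicate_groups {
    my @paths = @_;

    # Pass 1: bucket plain files by size. This needs a stat, not a read.
    my %by_size;
    for my $path (@paths) {
        next unless -f $path;
        push @{ $by_size{ (stat(_))[7] } }, $path;
    }

    # Pass 2: digest only the files whose size is shared with another file.
    my @groups;
    for my $size (sort { $b <=> $a } keys %by_size) {
        my @candidates = @{ $by_size{$size} };
        next unless @candidates > 1;

        my %by_md5;
        for my $path (@candidates) {
            open(my $fh, '<', $path) or next;    # skip unreadable files
            binmode($fh);
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
            close($fh);
        }
        # Keep only digests seen more than once.
        push @groups, grep { @$_ > 1 } values %by_md5;
    }
    return @groups;
}
```

Calling `duplicate_groups(@paths)` returns one array ref per set of identical files, largest sizes first. Files with a unique size are never opened at all, which is where the saving over checksumming everything would come from.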
Tony