pr33 has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks, I am looking for an efficient way of printing the duplicate files within a directory.
Here is my code.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
##############
my $dir = "$ARGV[0]";
my %md5sum;
my @md5;
my $flag = 0;
my %seen;
opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";
while (my $file = readdir $dh) {
chomp $file;
next if $file =~ /^\.\.?$/;   # skip only the . and .. entries
if (-f $file) {
my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
$md5sum{$file} = $md;
push @md5, $md;
}
}
closedir($dh);
my @dup_md5s = grep { $seen{$_}++ } @md5;   # md5 values that occur more than once
foreach my $k (keys %md5sum) {
foreach my $md (@dup_md5s) {
if ($md eq $md5sum{$k}) {
$flag = 1;
last;
}
}
if ($flag) {
print "$k is a duplicate file with MD5 of $md5sum{$k}\n";
$flag = 0;
}else {
print "$k is not a duplicate file, It's md5sum is $md5sum{$k}\n"
+;
}
}
-bash-3.2$ ./duplicate_files.pl /users/scripts/perl/test/
file2 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file1 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file3 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file4 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file5 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file6 is not a duplicate file, its md5sum is d617c2deabd27ff86ca9825b2e7578d4
Re: List Duplicate Files in a given directory
by Laurent_R (Canon) on Jul 31, 2017 at 09:30 UTC
Just a side note.
Calculating the MD5 digest (or any other checksum) of a file can take quite a bit of time, especially if the file is large.
And there is no point in computing the MD5 of two files to see whether they are identical if their sizes differ. And, of course, finding the size of a file is much faster.
So I would suggest, as a possible performance enhancement, computing the MD5 only for those files that share their size with at least one other file.
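Just to illustrate, here is a minimal sketch of that size-first approach (my own illustration rather than anyone's posted solution; it assumes the core Digest::MD5 module and a flat directory with no subdirectories):
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $dir = shift or die "Usage: $0 directory\n";
opendir my $dh, $dir or die "Unable to open $dir: $!\n";

# Pass 1: group files by size (one cheap stat per file, no hashing yet).
my %by_size;
for my $name (readdir $dh) {
    my $path = "$dir/$name";
    next unless -f $path;
    push @{ $by_size{ -s $path } }, $path;
}
closedir $dh;

# Pass 2: hash only the files whose size occurs more than once.
my %by_md5;
for my $group (grep { @$_ > 1 } values %by_size) {
    for my $path (@$group) {
        open my $fh, '<', $path or die "Cannot read $path: $!\n";
        binmode $fh;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
        close $fh;
    }
}

for my $md5 (sort keys %by_md5) {
    my @files = @{ $by_md5{$md5} };
    print "Duplicates ($md5): @files\n" if @files > 1;
}
Only the size groups that contain more than one file ever get hashed, so a directory full of uniquely sized files costs just one stat per file.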
Exactly. That's also what my solution does.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: List Duplicate Files in a given directory
by huck (Prior) on Jul 31, 2017 at 04:43 UTC
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
##############
my $md5files={};
my $filesmd5={};
while (my $line=<DATA>){
chomp $line;
my ($fn,$md5)=split(' ',$line,2);
push @{$md5files->{$md5}},$fn;
$filesmd5->{$fn}=$md5;
} # line
for my $md5 (keys %$md5files){
my $md5list=$md5files->{$md5};
if (scalar(@$md5list) == 1 ) { print $md5."\n ".$md5list->[0]."\n";}
else {
print $md5."\n";
for my $file (sort @$md5list){
print ' '.$file."\n";
}
}
} # md5
for my $file (sort keys %$filesmd5) {
my $md5list=$md5files->{$filesmd5->{$file}};
if (scalar(@$md5list) == 1 ) { print $file." is unique\n";}
else {
print $file." is the same as\n";
for my $filed (sort @$md5list){
print ' '.$filed."\n" unless ($file eq $filed);
}
}
} # file
exit;
__DATA__
file2 d41d8cd98f00b204e9800998ecf8427e
file1 5bb062356cddb5d2c0ef41eb2660cb06
file3 d41d8cd98f00b204e9800998ecf8427e
file4 d41d8cd98f00b204e9800998ecf8427e
file5 5bb062356cddb5d2c0ef41eb2660cb06
file6 d617c2deabd27ff86ca9825b2e7578d4
Output:
d617c2deabd27ff86ca9825b2e7578d4
file6
d41d8cd98f00b204e9800998ecf8427e
file2
file3
file4
5bb062356cddb5d2c0ef41eb2660cb06
file1
file5
file1 is the same as
file5
file2 is the same as
file3
file4
file3 is the same as
file2
file4
file4 is the same as
file2
file3
file5 is the same as
file1
file6 is unique
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
##############
my $dir = "$ARGV[0]";
my %md5sum;
opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";
while (my $file = readdir $dh) {
chomp $file;
next if $file =~ /^\.{1,2}$/;    # skip . and ..
if (-f $file) {
my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
push @{$md5sum{$md}}, $file;   # autovivification creates the array on the first file with this md5
}
}
closedir($dh);
foreach my $ky (keys %md5sum) {
if (scalar( @{$md5sum{$ky}}) == 1) {
print "Unique File: @{$md5sum{$ky}} , Md5sum: $ky\n";
} else {
print "Duplicate Files: @{$md5sum{$ky}}, Md5sum: $ky\n";
}
}
-bash-3.2$ ./duplicate_files.pl directory
Duplicate Files: file4 file2 file3, Md5sum: d41d8cd98f00b204e9800998ecf8427e
Unique File: file6 , Md5sum: d617c2deabd27ff86ca9825b2e7578d4
Duplicate Files: file1 file5, Md5sum: 5bb062356cddb5d2c0ef41eb2660cb06
Re: List Duplicate Files in a given directory
by kevbot (Vicar) on Jul 31, 2017 at 05:18 UTC
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;
my $dir = shift or die 'No directory given';
my $dir_path = path($dir);
unless($dir_path->is_dir){
die "$dir is not a directory";
}
my %files_of;
foreach my $file_path ($dir_path->children){
my $digest = $file_path->digest; # default is SHA-256
#my $digest = $file_path->digest('MD5'); # use this if you want MD5
push @{$files_of{$digest}}, $file_path->basename;
}
foreach my $digest (keys %files_of){
my @files = @{$files_of{$digest}};
if( scalar @files > 1){
print join(', ', @files), " are duplicates.\n";
}
}
exit;
Re: List Duplicate Files in a given directory
by Marshall (Canon) on Jul 31, 2017 at 06:18 UTC
If I understand your objective correctly, I think you have the hash table backwards. Use the md5sum as the key and push each matching file name onto the array for that key, i.e. build a hash of arrays. Something like this (untested):
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
##############
my $dir = shift @ARGV;
my %md5_2file;
opendir(my $dh, $dir) || die "Unable to Open the Directory $dir $!\n";
chdir $dir or die "Cannot Change to directory $dir: $!\n";
while (my $file = readdir $dh) {
if (-f $file) { # takes care of the . and .. dirs, and also no need to chomp()
my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
push @{$md5_2file{$md}}, $file;
}
}
closedir($dh);
foreach my $md5 (keys %md5_2file)
{
print "$md5: @{md5_2file{$md5}}\n"; # md5: file2 file2 filen
# this md5 is unique if ( @{md5_2file{$md5}} == 1)
}
Re: List Duplicate Files in a given directory
by thanos1983 (Parson) on Jul 31, 2017 at 08:53 UTC
Hello pr33,
Well, it seems that the fellow monks have already provided you with wisdom, and most likely you have resolved your problem. But just for the record, I would suggest also reading the similar question Find duplicate files.
I would also recommend trying the module File::Find::Duplicates; it gives you the ability to compare files by their MD5.
Sample of code from the documentation:
use File::Find::Duplicates;
my @dupes = find_duplicate_files('/basedir1', '/basedir2');
foreach my $dupeset (@dupes) {
printf "Files %s (of size %d) hash to %s\n",
join(", ", @{$dupeset->files}), $dupeset->size, $dupeset->md5;
}
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: List Duplicate Files in a given directory
by perlancar (Hermit) on Jul 31, 2017 at 11:12 UTC
I see that there are already several solutions posted. Let me add another one :) App::UniqFiles, which can list duplicate or non-duplicate files. As expected, it first checks file sizes before calculating MD5 hashes, and you can turn off MD5 checking entirely. You can also list the number of times each file's content occurs (--count, -c), so 1 means the content is unique, 2 means there is one duplicate, and so on.
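For example, an invocation might look roughly like this (the uniq-files script name is my guess from the distribution name; only the --count/-c option comes from the description above, so check the App::UniqFiles documentation for the exact interface):
# Hypothetical usage: report each file with how many times its content occurs
uniq-files --count /users/scripts/perl/test/*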
Re: List Duplicate Files in a given directory
by talexb (Chancellor) on Jul 31, 2017 at 16:57 UTC
These are all great Perl solutions, but you can also just do
md5sum * | sort
and let your eye pick out the pairs of MD5 sums that are identical.
As already noted, if you have a slow machine and/or gigantic files, the MD5 process may take a while. My very simple test looked at about a dozen text files (tab-delimited tables) averaging 10K each in size, and hashing them all took 24 msec. On a set of larger files (90M in total), it still took only 1.7 seconds. Sometimes simple is best -- it depends on your situation.
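If you would rather not pick the matching digests out by eye, a small refinement (assuming GNU coreutils, whose uniq supports the -w and -D options) prints only the lines whose first 32 characters, i.e. the MD5 digest, are repeated:
md5sum * | sort | uniq -w32 -D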
Alex / talexb / Toronto
Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.