List Duplicate Files in a given directory

by pr33 (Scribe)
on Jul 31, 2017 at 04:02 UTC ( [id://1196324] )

pr33 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am looking for an efficient solution for printing the duplicate files within a directory.

Here is my code.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = "$ARGV[0]";
my %md5sum;
my @md5;
my $flag = 0;
my %seen;

opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";

while (my $file = readdir $dh) {
    chomp $file;
    next if $file =~ /^\.+$/g;
    if (-f $file) {
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        $md5sum{$file} = $md;
        push @md5, $md;
    }
}
closedir($dh);

my @uniq = grep { $seen{$_}++ } @md5;

foreach my $k (keys %md5sum) {
    foreach my $md (@uniq) {
        if ($md eq $md5sum{$k}) {
            $flag = 1;
            last;
        }
    }
    if ($flag) {
        print "$k is a duplicate file with MD5 of $md5sum{$k}\n";
        $flag = 0;
    }
    else {
        print "$k is not a duplicate file, It's md5sum is $md5sum{$k}\n";
    }
}
-bash-3.2$ ./duplicate_files.pl /users/scripts/perl/test/
file2 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file1 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file3 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file4 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file5 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file6 is not a duplicate file, It's md5sum is d617c2deabd27ff86ca9825b2e7578d4

Replies are listed 'Best First'.
Re: List Duplicate Files in a given directory
by Laurent_R (Canon) on Jul 31, 2017 at 09:30 UTC
    Just a side note.

    Calculating the MD5 digest (or any other checksum) of a file can take quite a bit of time, especially if the file is large.

    And there is no point in computing the MD5 of two files to see whether they're identical if their sizes aren't the same. And, of course, finding the size of a file is much faster.

    So I would suggest, as a possible performance enhancement, computing the MD5 only for files that have the same size.
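    A minimal sketch of that idea (hypothetical code, not from this thread): bucket the files by size with -s, which is cheap, and only hash files whose size is shared with at least one other file. Using Digest::MD5 directly also avoids spawning a separate /usr/bin/md5sum process per file, as the original script does.

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $dir = shift or die "Usage: $0 directory\n";
opendir my $dh, $dir or die "Cannot open $dir: $!\n";

# Bucket plain files by size first: a stat (-s) is far cheaper than hashing.
my %by_size;
while (my $file = readdir $dh) {
    my $path = "$dir/$file";
    next unless -f $path;
    push @{ $by_size{ -s $path } }, $path;
}
closedir $dh;

# Only files sharing a size can possibly be duplicates, so hash just those.
my %by_md5;
for my $group (grep { @$_ > 1 } values %by_size) {
    for my $path (@$group) {
        open my $fh, '<', $path or die "Cannot read $path: $!\n";
        binmode $fh;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
    }
}

for my $digest (keys %by_md5) {
    my @files = @{ $by_md5{$digest} };
    print "Duplicates ($digest): @files\n" if @files > 1;
}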

      Exactly. That's also what my solution does.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: List Duplicate Files in a given directory
by huck (Prior) on Jul 31, 2017 at 04:43 UTC

    When I do this I tend to be more interested in which files match, so I take one of these two approaches:

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $md5files = {};
my $filesmd5 = {};

while (my $line = <DATA>) {
    chomp $line;
    my ($fn, $md5) = split(' ', $line, 2);
    push @{ $md5files->{$md5} }, $fn;
    $filesmd5->{$fn} = $md5;
} # line

for my $md5 (keys %$md5files) {
    my $md5list = $md5files->{$md5};
    if (scalar(@$md5list) == 1) {
        print $md5."\n ".$md5list->[0]."\n";
    }
    else {
        print $md5."\n";
        for my $file (sort @$md5list) {
            print ' '.$file."\n";
        }
    }
} # md5

for my $file (sort keys %$filesmd5) {
    my $md5list = $md5files->{ $filesmd5->{$file} };
    if (scalar(@$md5list) == 1) {
        print $file." is unique\n";
    }
    else {
        print $file." is the same as\n";
        for my $filed (sort @$md5list) {
            print ' '.$filed."\n" unless ($file eq $filed);
        }
    }
} # file

exit;

__DATA__
file2 d41d8cd98f00b204e9800998ecf8427e
file1 5bb062356cddb5d2c0ef41eb2660cb06
file3 d41d8cd98f00b204e9800998ecf8427e
file4 d41d8cd98f00b204e9800998ecf8427e
file5 5bb062356cddb5d2c0ef41eb2660cb06
file6 d617c2deabd27ff86ca9825b2e7578d4
d617c2deabd27ff86ca9825b2e7578d4
 file6
d41d8cd98f00b204e9800998ecf8427e
 file2
 file3
 file4
5bb062356cddb5d2c0ef41eb2660cb06
 file1
 file5
file1 is the same as
 file5
file2 is the same as
 file3
 file4
file3 is the same as
 file2
 file4
file4 is the same as
 file2
 file3
file5 is the same as
 file1
file6 is unique

      Thanks huck and kevbot. From your solutions, it is clear that I should use the md5sums as the hash keys, with each value being a reference to an array of file names. I have changed my original code to build a hash of arrays to do this.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = "$ARGV[0]";
my %md5sum;

opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";

while (my $file = readdir $dh) {
    chomp $file;
    next if $file =~ /^\.{1,2}$/g;
    if (-f $file) {
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        push @{$md5sum{$md}}, $file;
    }
}
closedir($dh);

foreach my $ky (keys %md5sum) {
    if (scalar(@{$md5sum{$ky}}) == 1) {
        print "Unique File: @{$md5sum{$ky}} , Md5sum: $ky\n";
    }
    else {
        print "Duplicate Files: @{$md5sum{$ky}}, Md5sum: $ky\n";
    }
}
-bash-3.2$ ./duplicate_files.pl directory
Duplicate Files: file4 file2 file3, Md5sum: d41d8cd98f00b204e9800998ecf8427e
Unique File: file6 , Md5sum: d617c2deabd27ff86ca9825b2e7578d4
Duplicate Files: file1 file5, Md5sum: 5bb062356cddb5d2c0ef41eb2660cb06
Re: List Duplicate Files in a given directory
by kevbot (Vicar) on Jul 31, 2017 at 05:18 UTC
    Here is a solution that uses Path::Tiny. See Path::Tiny: The little module that keeps on giving for a nice introduction to Path::Tiny.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;

my $dir = shift or die 'No directory given';

my $dir_path = path($dir);
unless ($dir_path->is_dir) {
    die "$dir is not a directory";
}

my %files_of;
foreach my $file_path ($dir_path->children) {
    my $digest = $file_path->digest;          # default is SHA-256
    #my $digest = $file_path->digest('MD5');  # use this if you want MD5
    push @{$files_of{$digest}}, $file_path->basename;
}

foreach my $digest (keys %files_of) {
    my @files = @{$files_of{$digest}};
    if (scalar @files > 1) {
        print join(', ', @files), " are duplicates.\n";
    }
}

exit;
Re: List Duplicate Files in a given directory
by Marshall (Canon) on Jul 31, 2017 at 06:18 UTC
    If I understand your objective correctly, I think you have the hash table backwards. Use the md5sum as the key and push each matching file name onto an array, giving a hash of arrays. Something like this (untested):
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = shift @ARGV;
my %md5_2file;

opendir(my $dh, $dir) || die "Unable to Open the Directory $dir $!\n";
chdir $dir or die "Cannot Change to directory $dir: $!\n";

while (my $file = readdir $dh) {
    if (-f $file) {    # takes care of the . and .. dirs, and no need to chomp()
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        push @{$md5_2file{$md}}, $file;
    }
}
closedir($dh);

foreach my $md5 (keys %md5_2file) {
    print "$md5: @{$md5_2file{$md5}}\n";    # md5: file2 file3 filen
    # this md5 is unique if ( @{$md5_2file{$md5}} == 1 )
}
Re: List Duplicate Files in a given directory
by thanos1983 (Parson) on Jul 31, 2017 at 08:53 UTC

    Hello pr33,

    Well, it seems that the fellow monks have provided you wisdom and most likely you have resolved your problem. But just for the record, I would also suggest reading a similar question: Find duplicate files.

    I would also recommend trying the module File::Find::Duplicates; it gives you the ability to check the MD5.

    Sample of code from the documentation:

use File::Find::Duplicates;

my @dupes = find_duplicate_files('/basedir1', '/basedir2');

foreach my $dupeset (@dupes) {
    printf "Files %s (of size %d) hash to %s\n",
        join(", ", @{$dupeset->files}), $dupeset->size, $dupeset->md5;
}

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Thanks, Thanos, and thanks to all the Monks who have helped with this. I tried your solution and Choroba's. It is much faster, especially when the directory has a large number of files to compare, as well as files of larger size.

Re: List Duplicate Files in a given directory
by perlancar (Hermit) on Jul 31, 2017 at 11:12 UTC
    I see that there are already several solutions posted. Let me add another one :) App::UniqFiles, which can list duplicate or non-duplicate files. As expected, it first checks file size before calculating the MD5 hash. You can turn off MD5 checking. And you can also list the number of times each file's content occurs (--count, -c), so 1 means the content is unique, 2 means there is one duplicate, and so on.
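    For illustration only, a hypothetical invocation (assuming the distribution installs a uniq-files command-line script; the only option shown is the --count one described above):

# hypothetical usage: print each file with the number of times its content occurs
# (1 = unique, 2 = one duplicate, and so on)
uniq-files --count *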
Re: List Duplicate Files in a given directory
by talexb (Chancellor) on Jul 31, 2017 at 16:57 UTC

    These are all great Perl solutions, but you can also just run md5sum over the files from the shell and let your eye pick out the pairs of MD5 sums that are identical.
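    For example, a quick shell sketch (assuming GNU coreutils md5sum; sorting puts identical digests on adjacent lines):

md5sum * | sort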

    As already noted, if you have a slow machine and/or gigantic files, the MD5 process may take a while. My very simple test looked at about a dozen text files (tab-delimited tables) averaging about 10K each in size; that all took 24 msec. On a set of larger files (90M in total), it still only took 1.7 seconds. Sometimes simple is best -- it depends on your situation.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
