List Duplicate Files in a given directory

by pr33 (Scribe)
on Jul 31, 2017 at 04:02 UTC ( [id://1196324] )

pr33 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am looking for an efficient solution for printing the duplicate files within a directory.

Here is my code.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = "$ARGV[0]";
my %md5sum;
my @md5;
my $flag = 0;
my %seen;

opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";

while (my $file = readdir $dh) {
    chomp $file;
    next if $file =~ /^\.+$/g;
    if (-f $file) {
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        $md5sum{$file} = $md;
        push @md5, $md;
    }
}
closedir($dh);

my @uniq = grep { $seen{$_}++ } @md5;

foreach my $k (keys %md5sum) {
    foreach my $md (@uniq) {
        if ($md eq $md5sum{$k}) {
            $flag = 1;
            last;
        }
    }
    if ($flag) {
        print "$k is a duplicate file with MD5 of $md5sum{$k}\n";
        $flag = 0;
    }
    else {
        print "$k is not a duplicate file, It's md5sum is $md5sum{$k}\n";
    }
}
-bash-3.2$ ./duplicate_files.pl /users/scripts/perl/test/
file2 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file1 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file3 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file4 is a duplicate file with MD5 of d41d8cd98f00b204e9800998ecf8427e
file5 is a duplicate file with MD5 of 5bb062356cddb5d2c0ef41eb2660cb06
file6 is not a duplicate file, It's md5sum is d617c2deabd27ff86ca9825b2e7578d4

Replies are listed 'Best First'.
Re: List Duplicate Files in a given directory
by Laurent_R (Canon) on Jul 31, 2017 at 09:30 UTC
    Just a side note.

    Calculating the MD5 digest (or any other checksum) of a file can take quite a bit of time, especially if the file is large.

    And there is no point in computing the MD5 of two files to see whether they're identical if their sizes aren't the same. And, of course, finding the size of a file is much faster.

    So I would suggest, as a possible performance enhancement, computing the MD5 only for files that have the same size.
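    A minimal sketch of that idea (hypothetical code, not from this thread): bucket the files by size with -s, which is cheap, and only hash files whose size is shared with at least one other file. Using Digest::MD5 directly also avoids spawning a separate /usr/bin/md5sum process per file, as the original script does.

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $dir = shift or die "Usage: $0 directory\n";
opendir my $dh, $dir or die "Cannot open $dir: $!\n";

# Bucket plain files by size first: a stat (-s) is far cheaper than hashing.
my %by_size;
while (my $file = readdir $dh) {
    my $path = "$dir/$file";
    next unless -f $path;
    push @{ $by_size{ -s $path } }, $path;
}
closedir $dh;

# Only files sharing a size can possibly be duplicates, so hash just those.
my %by_md5;
for my $group (grep { @$_ > 1 } values %by_size) {
    for my $path (@$group) {
        open my $fh, '<', $path or die "Cannot read $path: $!\n";
        binmode $fh;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
    }
}

for my $digest (keys %by_md5) {
    my @files = @{ $by_md5{$digest} };
    print "Duplicates ($digest): @files\n" if @files > 1;
}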

      Exactly. That's also what my solution does.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: List Duplicate Files in a given directory
by huck (Prior) on Jul 31, 2017 at 04:43 UTC

    When I do this I tend to be more interested in which files match, so I take one of these two approaches:

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $md5files = {};
my $filesmd5 = {};

while (my $line = <DATA>) {
    chomp $line;
    my ($fn, $md5) = split(' ', $line, 2);
    push @{ $md5files->{$md5} }, $fn;
    $filesmd5->{$fn} = $md5;
} # line

for my $md5 (keys %$md5files) {
    my $md5list = $md5files->{$md5};
    if (scalar(@$md5list) == 1) {
        print $md5."\n ".$md5list->[0]."\n";
    }
    else {
        print $md5."\n";
        for my $file (sort @$md5list) {
            print ' '.$file."\n";
        }
    }
} # md5

for my $file (sort keys %$filesmd5) {
    my $md5list = $md5files->{ $filesmd5->{$file} };
    if (scalar(@$md5list) == 1) {
        print $file." is unique\n";
    }
    else {
        print $file." is the same as\n";
        for my $filed (sort @$md5list) {
            print ' '.$filed."\n" unless ($file eq $filed);
        }
    }
} # file

exit;

__DATA__
file2 d41d8cd98f00b204e9800998ecf8427e
file1 5bb062356cddb5d2c0ef41eb2660cb06
file3 d41d8cd98f00b204e9800998ecf8427e
file4 d41d8cd98f00b204e9800998ecf8427e
file5 5bb062356cddb5d2c0ef41eb2660cb06
file6 d617c2deabd27ff86ca9825b2e7578d4
d617c2deabd27ff86ca9825b2e7578d4
 file6
d41d8cd98f00b204e9800998ecf8427e
 file2
 file3
 file4
5bb062356cddb5d2c0ef41eb2660cb06
 file1
 file5
file1 is the same as
 file5
file2 is the same as
 file3
 file4
file3 is the same as
 file2
 file4
file4 is the same as
 file2
 file3
file5 is the same as
 file1
file6 is unique

      Thanks huck and kevbot. From your solutions, it is clear that I should use the md5sums as the hash keys, with each value being a reference to an array of file names. I have changed my original code to build a hash of arrays to do this.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = "$ARGV[0]";
my %md5sum;

opendir(my $dh, $dir) || die "Unable to Open the Directory: $!\n";
chdir $dir or die "Cannot Change directory: $!\n";

while (my $file = readdir $dh) {
    chomp $file;
    next if $file =~ /^\.{1,2}$/g;
    if (-f $file) {
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        push @{$md5sum{$md}}, $file;
    }
}
closedir($dh);

foreach my $ky (keys %md5sum) {
    if (scalar(@{$md5sum{$ky}}) == 1) {
        print "Unique File: @{$md5sum{$ky}} , Md5sum: $ky\n";
    }
    else {
        print "Duplicate Files: @{$md5sum{$ky}}, Md5sum: $ky\n";
    }
}
-bash-3.2$ ./duplicate_files.pl directory
Duplicate Files: file4 file2 file3, Md5sum: d41d8cd98f00b204e9800998ecf8427e
Unique File: file6 , Md5sum: d617c2deabd27ff86ca9825b2e7578d4
Duplicate Files: file1 file5, Md5sum: 5bb062356cddb5d2c0ef41eb2660cb06
Re: List Duplicate Files in a given directory
by kevbot (Vicar) on Jul 31, 2017 at 05:18 UTC
    Here is a solution that uses Path::Tiny. See Path::Tiny: The little module that keeps on giving for a nice introduction to Path::Tiny.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;

my $dir = shift or die 'No directory given';

my $dir_path = path($dir);
unless ($dir_path->is_dir) {
    die "$dir is not a directory";
}

my %files_of;
foreach my $file_path ($dir_path->children) {
    my $digest = $file_path->digest;          # default is SHA-256
    #my $digest = $file_path->digest('MD5');  # use this if you want MD5
    push @{$files_of{$digest}}, $file_path->basename;
}

foreach my $digest (keys %files_of) {
    my @files = @{$files_of{$digest}};
    if (scalar @files > 1) {
        print join(', ', @files), " are duplicates.\n";
    }
}

exit;
Re: List Duplicate Files in a given directory
by Marshall (Canon) on Jul 31, 2017 at 06:18 UTC
    If I understand your objective correctly, I think you have the hash table backwards. Use the md5sum as the key and push each matching file name onto an array, giving a hash of arrays. Something like this (untested):
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

##############
my $dir = shift @ARGV;
my %md5_2file;

opendir(my $dh, $dir) || die "Unable to Open the Directory $dir $!\n";
chdir $dir or die "Cannot Change to directory $dir: $!\n";

while (my $file = readdir $dh) {
    if (-f $file) {    # takes care of the . and .. dirs, and no need to chomp()
        my ($md) = (split /\s+/, qx(/usr/bin/md5sum $file))[0];
        push @{$md5_2file{$md}}, $file;
    }
}
closedir($dh);

foreach my $md5 (keys %md5_2file) {
    print "$md5: @{$md5_2file{$md5}}\n";    # md5: file2 file3 filen
    # this md5 is unique if ( @{$md5_2file{$md5}} == 1 )
}
Re: List Duplicate Files in a given directory
by thanos1983 (Parson) on Jul 31, 2017 at 08:53 UTC

    Hello pr33,

    Well, it seems that the fellow monks have provided you wisdom and most likely you have resolved your problem. But just for the record, I would also suggest reading a similar question: Find duplicate files.

    I would also recommend trying the module File::Find::Duplicates; it gives you the ability to check the MD5.

    Sample of code from the documentation:

use File::Find::Duplicates;

my @dupes = find_duplicate_files('/basedir1', '/basedir2');

foreach my $dupeset (@dupes) {
    printf "Files %s (of size %d) hash to %s\n",
        join(", ", @{$dupeset->files}), $dupeset->size, $dupeset->md5;
}

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Thanks, Thanos, and thanks to all the Monks who have helped with this. I tried your solution and Choroba's. It is much faster, especially when the directory has a large number of files to compare, as well as files of larger size.

Re: List Duplicate Files in a given directory
by perlancar (Hermit) on Jul 31, 2017 at 11:12 UTC
    I see that there are already several solutions posted. Let me add another one :) App::UniqFiles, which can list duplicate or non-duplicate files. As expected, it first checks file size before calculating the MD5 hash. You can turn off MD5 checking. And you can also list the number of times each file's content occurs (--count, -c), so 1 means the content is unique, 2 means there is one duplicate, and so on.
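    For illustration only, a hypothetical invocation (assuming the distribution installs a uniq-files command-line script; the only option shown is the --count one described above):

# hypothetical usage: print each file with the number of times its content occurs
# (1 = unique, 2 = one duplicate, and so on)
uniq-files --count *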
Re: List Duplicate Files in a given directory
by talexb (Chancellor) on Jul 31, 2017 at 16:57 UTC

    These are all great Perl solutions, but you can also just run md5sum over the files from the shell and let your eye pick out the pairs of MD5 sums that are identical.
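    For example, a quick shell sketch (assuming GNU coreutils md5sum; sorting puts identical digests on adjacent lines):

md5sum * | sort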

    As already noted, if you have a slow machine and/or gigantic files, the MD5 process may take a while. My very simple test looked at about a dozen text files (tab-delimited tables) averaging about 10K each in size; that all took 24 msec. On a set of larger files (90M in total), it still only took 1.7 seconds. Sometimes simple is best -- it depends on your situation.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
