Thanks to lemming for the code above that generates MD5 hashes; it became the first step in finding duplicates for me. I used the following code to find the duplicates and show them. Running the same code again with 'remove' moves all the duplicates to a ./trash/ subdirectory. It's a little too tailored to my own needs, but it might be a nice start for someone else needing the same.
It went through 25k files, found 11k duplicates, and moved them to the ./trash/ directory in about 60 seconds.
The code below takes the output of lemming's code above as its input.
#!/usr/bin/perl -w
# usage: dupDisplay.pl fileMD5.txt [remove]
# input file has the following form (fields are tab-separated):
# 8e773d2546655b84dd1fdd31c735113e 304048 /media/PICTURES-1/mymedia/pictures/pics/20041004-kids-camera/im001020.jpg im001020.jpg
# e01d4d804d454dd1fb6150fc74a0912d 296663 /media/PICTURES-1/mymedia/pictures/pics/20041004-kids-camera/im001021.jpg im001021.jpg
use strict;
use warnings;
my %seen;
my $fileCNT = 0;
my $origCNT = 0;
my $delCNT = 0;
my $failCNT = 0;
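die "usage: dupDisplay.pl fileMD5.txt [remove]\n" unless @ARGV;
# Any true second argument switches on removal mode. (A "my" declaration
# with a statement modifier, as in "my $x = ... if ...", has undefined behavior.)
my $remove = $ARGV[1] ? 'remove' : '';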
print "\n\n ... running in NON removal mode.\n\n" if !$remove;
# Three-argument open keeps odd characters in the file name from being
# read as a mode; $! reports why an open failed.
open IN, '<', $ARGV[0] or die ".. we don't see a file to read: $ARGV[0]: $!\n";
open OUT, '>', "$ARGV[0]_new.temp" or die ".. we can't write the file: $ARGV[0]_new.temp: $!\n";
open OUTdel, '>', "$ARGV[0]_deleted" or die ".. we can't write the file: $ARGV[0]_deleted: $!\n";
open OUTfail, '>', "$ARGV[0]_failed" or die ".. we can't write the file: $ARGV[0]_failed: $!\n";
print "\n ... starting to find duplicates in: $ARGV[0]\n";
if (! -d './trash/') {
    mkdir './trash/' or die " !! couldn't make trash directory.\n $! \n";
}
while (<IN>) {
    my $line = $_;
    chomp $line;
    $fileCNT++;
    my ($md5, $filesize, $pathfile, $file) = split /\t+/, $line, 4;
    next unless defined $file;    # skip lines that lack the four fields
    if (exists $seen{"$md5:$filesize"}) {
        # Duplicate: move it to ./trash/ with a timestamp extension. The
        # counter is appended as well, because rename() silently replaces
        # an existing target and same-named files moved within one second
        # would otherwise overwrite each other.
        my $timenow = time;
        my $trashFile = './trash/' . $file . "_" . $timenow . "_" . $delCNT;
        # To delete outright instead of moving:
        #if (! unlink($pathfile)) { print OUTfail "$pathfile\n"; $failCNT++; }
        if ($remove) {
            if (! rename($pathfile, $trashFile)) {
                print OUTfail "$pathfile\n";
                $failCNT++;
            }
        }
        $seen{"$md5:$filesize"} .= "\n $pathfile";
        $delCNT++;
    } else {
        # First file seen with this hash and size: keep it as the original.
        $seen{"$md5:$filesize"} = "$pathfile";
        printf OUT ("%32s\t%8d\t%s\t%s\n", $md5, $filesize, $pathfile, $file);
        $origCNT++;
    }
    print " files: $fileCNT originals: $origCNT files to delete: $delCNT failed: $failCNT \r";
}
# Each group holds the original first, followed by its duplicates.
foreach my $key (keys %seen) {
    print OUTdel " $seen{$key}\n";
}
print " files: $fileCNT originals: $origCNT files to delete: $delCNT failed: $failCNT \n\n";
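
The script expects a tab-separated list of md5, size, full path and file name, as shown in its header comments. If the hash generator isn't at hand, a rough sketch of producing that format could look like the one below. This is not lemming's code: the sub name md5_record and the script name md5list.pl are made up for illustration, and it assumes paths contain no tabs or newlines.

```perl
#!/usr/bin/perl
# Hypothetical helper (not lemming's code): print one line per file as
# md5<TAB>size<TAB>path<TAB>filename, the form dupDisplay.pl splits on.
use strict;
use warnings;
use Digest::MD5;
use File::Find;
use File::Basename qw(basename);

# Build the tab-separated record for one file.
# Returns undef (empty list in list context) if the file can't be opened.
sub md5_record {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;    # hash the raw bytes
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    my $size = -s $path;
    return join "\t", $md5, $size, $path, basename($path);
}

# Walk every directory named on the command line; do nothing without arguments.
if (@ARGV) {
    find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            return unless -f $_;    # plain files only
            my $rec = md5_record($_);
            print "$rec\n" if defined $rec;
        },
    }, @ARGV);
}
```

Something like "perl md5list.pl /media/PICTURES-1 > fileMD5.txt" followed by "perl dupDisplay.pl fileMD5.txt" would then give a dry run, and adding 'remove' as the second argument would do the actual move.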