Re: scalable duplicate file remover

sub file2sha1 {
my $file=$_[0];
return '' if -d $file; #have to find out if to prune when a directory is found that doesn't match the regex
open my $f,"<$file";
my $sha1 = Digest::SHA1->new;
$sha1->addfile(*$f);
return $sha1->hexdigest;
}

You should open the file in "binary" mode for the digest to be computed correctly.
You should verify that the file opened correctly.
*$f makes no sense because $f is a lexical variable that contains a reference to a filehandle.
You should probably use $sha1->digest instead, which is half the size of $sha1->hexdigest.

sub file2sha1 {
    my $file = $_[ 0 ];
    return '' if -d $file; #have to find out if to prune when a directory is found that doesn't match the regex
    open my $f, '<:raw', $file or do {
        warn "Cannot open '$file' $!";
        return;
    };
    my $sha1 = Digest::SHA1->new;
    $sha1->addfile( $f );
    return $sha1->digest;
}
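
To make the size difference concrete, here is a minimal sketch. It assumes the core Digest::SHA module (with algorithm 1) in place of Digest::SHA1, so that it runs without installing anything; sha1_both is a hypothetical helper, not part of either module.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: returns the SHA-1 of a string in binary and in hex form.
sub sha1_both {
    my ($data) = @_;
    my $sha = Digest::SHA->new(1);       # 1 selects SHA-1
    $sha->add($data);
    my $binary = $sha->digest;           # raw bytes; digest() also resets the object
    my $hex    = unpack 'H*', $binary;   # the same value written as hex text
    return ( $binary, $hex );
}

my ( $binary, $hex ) = sha1_both("hello\n");
printf "binary: %d bytes, hex: %d characters\n", length $binary, length $hex;
# prints "binary: 20 bytes, hex: 40 characters"
```

The binary form is the better choice for hash keys when many files are tracked; unpack 'H*' can turn it back into hex when something has to be printed.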

Re^2: scalable duplicate file remover by spx2 (Deacon) on Mar 03, 2008 at 08:53 UTC
First of all, thank you very much for the critique; it is very welcome. I will use it to improve the program.

1) Why do you think the current method of opening the files does not yield correct results? (I compared my SHA1 results against the Unix sha1sum utility and they came out OK, which is why I'm asking.)
2) You are right, I will do this.
3) OK, I understand. Where could I read more about this?
4) Having read the documentation, and thinking that a number in base 10 should always have more digits than its representation in base 16, I don't understand how it could be shorter in base 10. I don't get why they say I will get a shorter string in a lower base.

Also, they talk about using a single sha1 object and reusing it, because the reset() method can clear out the old data from it. Do you think this will speed things up?
Re^3: scalable duplicate file remover by jwkrahn (Abbot) on Mar 03, 2008 at 18:18 UTC
From the documentation for Digest::SHA1:

    $sha1->addfile($io_handle)
    [ SNIP ]
    In most cases you want to make sure that the $io_handle is in "binmode" before you pass it as argument to the addfile() method.

OK. :-)

Typeglobs and Filehandles
How do I pass filehandles between subroutines?
How can I use a filehandle indirectly?

`$sha1->digest` returns a digest in binary form while `$sha1->hexdigest` is in hexadecimal form. For example:

    $ perl -le'
    my $digest = "\x02\x07\xFA\x78";
    my $hex_digest = "0207FA78";
    print for length( $digest ), length( $hex_digest );
    '
    4
    8

Update: `reset()` may or may not speed things up. You would have to compare both methods with Benchmark to be sure.
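
One way to run that comparison is sketched below. It is only an illustration under a few assumptions: it uses the core Digest::SHA module (algorithm 1) instead of Digest::SHA1, it hashes a throwaway 64 KB temporary file, and digest_fresh and digest_reused are hypothetical helper names. Results will depend on file sizes and on the machine, so no timings are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Digest::SHA;
use File::Temp qw(tempfile);

# Create a small throwaway file to hash (the size is an arbitrary choice).
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} 'x' x 65_536;
close $tmp_fh or die "Cannot close '$tmp_name': $!";

# Variant 1: build a new digest object for every file.
sub digest_fresh {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot open '$file': $!";
    return Digest::SHA->new(1)->addfile($fh)->digest;
}

# Variant 2: keep one object and reset it before each file.
my $shared = Digest::SHA->new(1);

sub digest_reused {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot open '$file': $!";
    $shared->reset;    # explicit for clarity; digest() already resets the object
    return $shared->addfile($fh)->digest;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    fresh  => sub { digest_fresh($tmp_name) },
    reused => sub { digest_reused($tmp_name) },
} );
```

For files of any real size the time spent reading and hashing the data is likely to dominate object construction, so it is worth repeating the run with files typical of the directory tree being scanned.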


	PerlMonks