
Re: scalable duplicate file remover

by jwkrahn (Abbot)
on Mar 03, 2008 at 02:47 UTC

in reply to scalable duplicate file remover

sub file2sha1 {
    my $file=$_[0];
    return '' if -d $file; #have to find out if to prune when a directory is found that doesn't match the regex
    open my $f,"<$file";
    my $sha1 = Digest::SHA1->new;
    $sha1->addfile(*$f);
    return $sha1->hexdigest;
}

  1. You should open the file in "binary" mode to work correctly.
  2. You should verify that the file opened correctly.
  3. *$f makes no sense because $f is a lexical variable that contains a reference to a filehandle.
  4. You should probably use $sha1->digest instead which is half the size of $sha1->hexdigest.

sub file2sha1 {
    my $file = $_[ 0 ];
    return '' if -d $file; #have to find out if to prune when a directory is found that doesn't match the regex
    open my $f, '<:raw', $file or do {
        warn "Cannot open '$file' $!";
        return;
    };
    my $sha1 = Digest::SHA1->new;
    $sha1->addfile( $f );
    return $sha1->digest;
}
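
To show how the binary digest might be used for the duplicate search itself, here is a minimal sketch that groups files by digest. The names file_digest and group_by_digest are hypothetical, the demo files are created in a temporary directory, and the core module Digest::SHA (with algorithm 1) stands in for Digest::SHA1 so the example runs without extra installs. Unlike the sub above, this helper returns nothing for directories as well as for unreadable files, so the caller has a single case to skip.

```perl
use strict;
use warnings;
use Digest::SHA;               # assumption: core module used in place of Digest::SHA1
use File::Temp qw(tempdir);

# Hypothetical helper with the same shape as the corrected file2sha1.
sub file_digest {
    my ($file) = @_;
    return if -d $file;                    # skip directories
    open my $fh, '<:raw', $file or do {
        warn "Cannot open '$file': $!";
        return;
    };
    my $sha1 = Digest::SHA->new(1);        # 1 selects SHA-1
    $sha1->addfile($fh);
    return $sha1->digest;                  # 20 raw bytes, usable as a hash key
}

# Group paths by digest; a group with more than one path is a set of duplicates.
sub group_by_digest {
    my @files = @_;
    my %paths_for;
    for my $file (@files) {
        my $digest = file_digest($file);
        next unless defined $digest;       # directory or unreadable file
        push @{ $paths_for{$digest} }, $file;
    }
    return \%paths_for;
}

# Demo data: two identical files and one different file.
my $dir = tempdir( CLEANUP => 1 );
my %content = ( a => "same\r\n", b => "same\r\n", c => "different\n" );
for my $name ( sort keys %content ) {
    open my $out, '>:raw', "$dir/$name" or die "Cannot write '$dir/$name': $!";
    print {$out} $content{$name};
    close $out or die "Cannot close '$dir/$name': $!";
}

my $groups = group_by_digest( $dir, map { "$dir/$_" } sort keys %content );
for my $digest ( sort keys %$groups ) {
    my @paths = @{ $groups->{$digest} };
    next if @paths < 2;
    # Convert to hex only for display.
    printf "%s: %s\n", unpack( 'H*', $digest ), join ', ', @paths;
}
```

Keeping the raw digest as the key and calling unpack 'H*' only when printing halves the key size without losing readability in the output.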

Replies are listed 'Best First'.
Re^2: scalable duplicate file remover
by spx2 (Deacon) on Mar 03, 2008 at 08:53 UTC
    First of all, thank you very much for the critique; it is very welcome.
    I will use it to improve the program.
    1) Why do you think the current method of opening the files does not yield correct results?
    (I compared my SHA1 results against the Unix sha1sum utility and they matched, which is
    why I'm asking.)
    2) You are right, I will do this.
    3) OK, I understand. Where could I read more about this?
    4) From reading the documentation, and given that a number written in base 10 always needs at
    least as many digits as the same number written in base 16, I don't understand how it could be shorter in base 10.
    I don't get why they say I will get a shorter string in a lower base.

    Also, the documentation talks about creating a single SHA1 object and reusing it, since the reset() method
    can clear the old data out of it.
    Do you think this will speed things up?
      1. From the documentation for Digest::SHA1:

        [ SNIP ]
                In most cases you want to make sure that the $io_handle is in "binmode" before you pass it as argument to the addfile() method.

      2. OK.    :-)

      3. Typeglobs and Filehandles
        How do I pass filehandles between subroutines?
        How can I use a filehandle indirectly?

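      4. $sha1->digest returns the digest as raw bytes (binary, not base 10), while $sha1->hexdigest returns it as hexadecimal text. Each byte becomes two hexadecimal characters, so the hex string is twice as long. For example:

        $ perl -le'
        my $digest     = "\x02\x07\xFA\x78";
        my $hex_digest = "0207FA78";
        print for length( $digest ), length( $hex_digest );
        '
        4
        8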

      5. Update: reset() may or may not speed things up. You would have to compare both methods with Benchmark to be sure.
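
A comparison along those lines could be sketched as follows. This is only an illustration under some assumptions: the subs digest_fresh and digest_reused are hypothetical names, the input is a string in memory rather than real files, and the core module Digest::SHA (algorithm 1) stands in for Digest::SHA1. The Digest documentation also says the object is reset automatically once digest() has been called, so the explicit reset() here is mostly a safeguard.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Digest::SHA;    # assumption: core module used in place of Digest::SHA1

# Hypothetical payload; real timings depend on file sizes and disk I/O.
my $data = 'x' x 65536;

# One object shared across calls.
my $shared = Digest::SHA->new(1);

# Create a new object for every digest.
sub digest_fresh {
    my ($bytes) = @_;
    my $sha1 = Digest::SHA->new(1);
    $sha1->add($bytes);
    return $sha1->digest;
}

# Reuse the shared object, clearing any old state first.
sub digest_reused {
    my ($bytes) = @_;
    $shared->reset;
    $shared->add($bytes);
    return $shared->digest;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    new_each_time => sub { my $d = digest_fresh($data) },
    reuse_object  => sub { my $d = digest_reused($data) },
} );
```

With large inputs the hashing itself is likely to dominate, so any difference from object construction would probably show up only when hashing many small files.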
