PerlMonks  

data structure advice please

by anadem (Scribe)
on Nov 25, 2006 at 18:56 UTC ( [id://586037]=perlquestion )

anadem has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to clean up duplicate files on my hard drive. I'm recursing through directories and saving all the file names, with data on path etc., in a hash. When a filename is already in the hash, I save its name into a second hash that holds only the dupe names. My problem is storing the file data (more accurately, my problem is ignorance of Perl data structure management ...). Here's a snippet showing roughly what I'm trying to do:
    my %fileinfo = ();   # name -> (array-of-path+size+etc)
    my $href = \%fileinfo;
    my %dupes = ();      # name->count
    my $dref = \%dupes;
    #------------------------
    foreach   # for each file, recursing through directory tree
    {
        # here $_ is each file name
        if( exists %$href->{ $_ } )   # if filename seen already
        {
            %$dref->{ $_ } = 1;       # then record in %dupes
        }
        @filedata = ( $fpath, $fsize, $fdate );   # but using real data

        # this is where I'm lost -- don't know how to "savedata"
        @savedata = %$href->{ $_ };   # get data saved for filename
        push @savedata, @filedata;    # add new data to saved data
        %$href->{ $_ } = @savedata;   # put new data back in the hash
    }
I want to save each file's data as an array, then add that array to the hash. At the end of all this, for each file in the dupes, I'll display the array of its data, something like:
    myfile.mp3
        c:\dir1\dir2; date=12/3/45; size=12345
        c:\dir3\dir4; date=1/01/01; size=54321
Please tell me how to do the step of adding the new data, the one I've dummied as:
    @savedata = %$href->{ $_ };
    push @savedata, @filedata;
    %$href->{ $_ } = @savedata;
which doesn't work as I'd hoped :-( I'd thought @savedata would end up with lots of little @filedata arrays inside it. Thanks for any advice (hints on how to unpick the "@savedata" afterwards would be great too).

Replies are listed 'Best First'.
Re: data structure advice please
by clinton (Priest) on Nov 25, 2006 at 19:35 UTC
    You're getting a bit mixed up with all the sigils (%@$). First, there's no need to do this:

    my %dupes = ();   # name->count
    my $dref = \%dupes;

    You can, but there is no need. And in this case you're not passing the hash around, so I've left it as a hash, rather than a hash-ref.

    Where you DO need a ref is where you store the filedata in %fileinfo. You want an array containing all the files with name X, and you store an array as a hash value by using an array ref.

    @file_info = qw(data1 data2 data3);
    %files = ( X => \@file_info );

    If you want to add something to the array, then you need to use push, which means you need an array, not an array ref, so you need to dereference it:

    $file_info_ref = $files{X};
    push @$file_info_ref,'data5';

    or

    push @{$files{X}}, 'data5';
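To see why the original `%$href->{ $_ } = @savedata;` loses the data, and what the array-ref version does instead, here is a small self-contained sketch. The file names and values are made up purely for illustration.

```perl
use strict;
use warnings;

my %fileinfo;

# Assigning an array to a hash element evaluates the array in scalar
# context, so only the element COUNT is stored, not the data.
my @savedata = ( 'c:\dir1', 12345, '12/3/45' );
$fileinfo{'broken.mp3'} = @savedata;    # stores 3

# Storing a reference keeps the whole array reachable from the hash.
$fileinfo{'myfile.mp3'} = [];           # empty anonymous array
push @{ $fileinfo{'myfile.mp3'} }, [ 'c:\dir1\dir2', 12345, '12/3/45' ];
push @{ $fileinfo{'myfile.mp3'} }, [ 'c:\dir3\dir4', 54321, '1/01/01' ];

print "broken.mp3 holds: $fileinfo{'broken.mp3'}\n";
print "myfile.mp3 holds ", scalar @{ $fileinfo{'myfile.mp3'} }, " records\n";
print "second path: $fileinfo{'myfile.mp3'}[1][0]\n";
```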

    So:

    my %fileinfo;
    my %dupes;

    # for each file, recursing through directory tree
    foreach
    {
        # here $_ is each file name

        # if filename seen already
        if( exists $fileinfo{$_} )
        {
            # then record in %dupes
            $dupes{$_} = 1;
        }

        # but using real data; "my" gives a fresh array on every pass,
        # so each stored reference points at its own data
        my @filedata = ( $fpath, $fsize, $fdate );

        # if fileinfo doesn't yet have an entry for this $_,
        # then assign an empty array ref
        # saved_info is now an array ref which is
        # also stored in $fileinfo{$_}
        my $saved_info = $fileinfo{$_} ||= [];
        push @$saved_info, \@filedata;

        # No need to re-store it in the hash, because $fileinfo{$_}
        # and $saved_info both point to the same array
    }
      You could shorten
          $saved_info = $fileinfo{$_} ||= [];
          push @$saved_info,\@filedata;
      to
          push @{ $fileinfo{$_} }, \@filedata;
      because the arrayref will be autovivified if it didn't already exist.
        True. I thought it might have complained that 'undef' (the initial value of $fileinfo{$_}) was not an array ref, but after testing it, I see it doesn't.
      thanks, especially for the explanations of 'why' - useful in reducing my ignorance! (I should have said I'd used a reference so it could be passed around to other yet-to-be-done subroutines.)
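Putting the pieces of this sub-thread together, here is a minimal runnable sketch that also covers the "how to unpick it" part of the question. It assumes File::Find for the directory walk, which the thread does not specify. The sub names `collect_files` and `report_dupes` are hypothetical. Instead of a separate %dupes hash, it treats any name with two or more records as a duplicate.

```perl
use strict;
use warnings;
use File::Find;

# Walk $top and return a hash ref: file name => [ [dir, size, mtime], ... ]
sub collect_files {
    my ($top) = @_;
    my %fileinfo;
    find(
        sub {
            return unless -f $_;    # plain files only
            my ( $size, $mtime ) = ( stat _ )[ 7, 9 ];

            # A fresh anonymous array per file; the outer array is
            # autovivified on the first push for a given name.
            push @{ $fileinfo{$_} }, [ $File::Find::dir, $size, $mtime ];
        },
        $top
    );
    return \%fileinfo;
}

# Print every name that was seen more than once, one record per line.
sub report_dupes {
    my ($fileinfo) = @_;
    for my $name ( sort keys %$fileinfo ) {
        my @records = @{ $fileinfo->{$name} };
        next if @records < 2;       # name seen only once
        print "$name\n";
        for my $rec (@records) {
            my ( $dir, $size, $mtime ) = @$rec;
            print "    $dir; date=", scalar localtime($mtime),
                "; size=$size\n";
        }
    }
}

# Only runs when a directory is given on the command line.
report_dupes( collect_files( $ARGV[0] ) ) if @ARGV;
```

Because the data travels as a hash ref, it can be handed to other subroutines as-is, which is what the original $href was meant for.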
Re: data structure advice please
by mickeyn (Priest) on Nov 25, 2006 at 20:03 UTC
    If you're hunting duplicates - it would have been wiser to check md5 sums (Digest::MD5) rather than names.

    As far as data structure goes - I would recommend something in the form of:

    %files = (
        <md5sum1> => [ <path1>, <path2> ... ],
        <md5sum2> => [ <path1>, <path2> ... ],
    );
    It will allow you to easily iterate over your files, locate and count them.

    HTH,

    -- Mickey
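The structure above could be filled in roughly like this. This is a sketch only, assuming File::Find for the directory walk and whole-file digests via Digest::MD5's `addfile`. The sub names `md5_of` and `files_by_md5` are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;

# Return the hex MD5 digest of a file's contents, or undef if unreadable.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Walk $top and return a hash ref: md5sum => [ path1, path2, ... ]
sub files_by_md5 {
    my ($top) = @_;
    my %files;
    find(
        sub {
            return unless -f $_;    # plain files only
            my $sum = md5_of($_);
            push @{ $files{$sum} }, $File::Find::name if defined $sum;
        },
        $top
    );
    return \%files;
}

# Only runs when a directory is given on the command line.
if (@ARGV) {
    my $files = files_by_md5( $ARGV[0] );
    for my $sum ( sort keys %$files ) {
        my @paths = @{ $files->{$sum} };
        next if @paths < 2;         # content seen only once
        print "$sum\n";
        print "    $_\n" for @paths;
    }
}
```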

      It would probably be better to build a HoA keyed by file size ((stat ($file))[7]) rather than MD5 sum, with values being arrays of files of a particular size. Any two files of different size cannot be duplicates, obviously. Any hash element that contained just one file could then be discarded, thus avoiding the expense of MD5 sums or file comparisons for a proportion of the files you are testing.

      Once you have sets of files the same size you can compare them either by generating MD5 sums, by reading the files (slurping if small or in chunks if large) and doing string comparisons or by using external commands like cmp. (I would recommend against using external commands.) You can save a lot of time by avoiding re-doing comparisons when you have several files of the same size. For example, given fileA to fileE, you would logically start by comparing fileA to the other four in turn, then fileB to fileC, fileD and fileE, and so on. If fileA differs from fileB but is the same as fileE you can see that it is not necessary to compare fileB with fileE because you already know they differ.

      I hope these thoughts are of use.

      Cheers,

      JohnGG
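A rough sketch of the size-first approach described above, assuming File::Compare for the content comparison (an MD5 check would slot into the same place). All three sub names are hypothetical. Each file is compared only against one representative of each set found so far, so a file that matches one set is never compared against the rest.

```perl
use strict;
use warnings;
use File::Compare;

# Group paths by size; a size seen only once cannot have a duplicate.
sub group_by_size {
    my (@paths) = @_;
    my %by_size;
    for my $path (@paths) {
        my $size = ( stat $path )[7];
        push @{ $by_size{$size} }, $path if defined $size;
    }
    return \%by_size;
}

# Split same-size files into sets with identical content.
sub identical_sets {
    my (@paths) = @_;
    my @sets;    # each element: [ path, ... ]
  PATH:
    for my $path (@paths) {
        for my $set (@sets) {
            # compare() returns 0 when the contents are equal;
            # 1 (different) and -1 (error) both count as "no match"
            if ( compare( $path, $set->[0] ) == 0 ) {
                push @$set, $path;
                next PATH;
            }
        }
        push @sets, [$path];    # no match, so this file starts a new set
    }
    return @sets;
}

# Return only the sets holding two or more identical files.
sub find_duplicates {
    my (@paths) = @_;
    my $by_size = group_by_size(@paths);
    my @dupes;
    for my $group ( values %$by_size ) {
        next if @$group < 2;    # unique size, nothing to compare
        push @dupes, grep { @$_ > 1 } identical_sets(@$group);
    }
    return @dupes;
}

# Only runs when file paths are given on the command line.
if (@ARGV) {
    print join( "\n    ", @$_ ), "\n" for find_duplicates(@ARGV);
}
```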

      thanks, that will get binary dupes nicely. I wanted to start with duped filenames and leave binaries for a later pass - I find quite a bit of the music my kids leave around has different binary content but same filenames, but the vice-versa case happens too so I'll add md5 summing as an option. I also want to save the date and size as well as the path+name (hence the array). The bit I've been most unclear on is how to go from finding the first file and setting
      key1 => [ <path1> ]
      to adding the data for the second instance:
      key1 => [ <path1>, <path2> ]
      but hopefully I can put the suggestions above into practice now
