Re: data structure advice please

If you're hunting duplicates, it would be wiser to compare MD5 sums (Digest::MD5) rather than names.

As far as the data structure goes, I would recommend something of the form:

%files = (
   <md5sum1> => [ <path1>, <path2> ... ],
   <md5sum2> => [ <path1>, <path2> ... ],
);

It will allow you to easily iterate over your files, locate and count them.
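For what it's worth, here is a minimal sketch of how such a hash could be filled. The sub names `md5_of` and `index_by_md5` are made up for illustration, and the sketch assumes every file can simply be read in full; it is one possible approach, not a tuned implementation.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;

# Hypothetical helper: hex MD5 digest of a file's contents,
# or undef if the file cannot be opened.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or do {
        warn "Cannot open $path: $!\n";
        return;
    };
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper: walk the given directories and return a
# hash ref of the form  md5sum => [ path, path, ... ].
sub index_by_md5 {
    my @dirs = @_;
    my %files;
    find(
        sub {
            return unless -f $_;          # plain files only
            my $sum = md5_of($_);         # find() has chdir'd here
            return unless defined $sum;
            push @{ $files{$sum} }, $File::Find::name;
        },
        @dirs
    );
    return \%files;
}

# Report every digest that has more than one path.
if (@ARGV) {
    my $files = index_by_md5(@ARGV);
    for my $sum (sort keys %$files) {
        my @paths = @{ $files->{$sum} };
        next if @paths < 2;
        print scalar(@paths), " copies ($sum):\n";
        print "  $_\n" for @paths;
    }
}
```

Counting the copies of a file is then just `scalar @{ $files->{$sum} }`.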

HTH,

-- Mickey


Replies are listed 'Best First'.
Re^2: data structure advice please by johngg (Canon) on Nov 25, 2006 at 23:02 UTC
It would probably be better to build a HoA keyed by file size (`(stat ($file))[7]`) rather than MD5 sum, with values being arrays of files of a particular size. Any two files of different size cannot be duplicates, obviously. Any hash element that contains just one file can then be discarded, thus avoiding the expense of MD5 sums or file comparisons for a proportion of the files you are testing.

Once you have sets of files of the same size you can compare them either by generating MD5 sums, by reading the files (slurping if small or in chunks if large) and doing string comparisons, or by using external commands like `cmp`. (I would recommend against using external commands.)

You can save a lot of time by avoiding re-doing comparisons when you have several files of the same size. For example, given fileA to fileE, you would logically start by comparing fileA to the other four in turn, then fileB to fileC, fileD and fileE, and so on. If fileA differs from fileB but is the same as fileE, you can see that it is not necessary to compare fileB with fileE because you already know they differ.

I hope these thoughts are of use.

Cheers,

JohnGG
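A rough sketch of the size-first idea, assuming MD5 sums are used for the second pass (string comparison of file contents would do equally well). The sub name `find_duplicates` is hypothetical. Grouping the same-size files by digest means each file is read once, which sidesteps the repeated pairwise comparisons.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: given a list of paths, return a list of array
# refs, each holding two or more paths with identical size and MD5.
sub find_duplicates {
    my @paths = @_;

    # Pass 1: bucket by size; files of different size cannot match.
    my %by_size;
    for my $path (@paths) {
        my $size = (stat($path))[7];
        next unless defined $size;        # stat failed, skip the file
        push @{ $by_size{$size} }, $path;
    }

    # Pass 2: digest only the files that share a size with another.
    my @groups;
    for my $size (sort { $a <=> $b } keys %by_size) {
        my @same_size = @{ $by_size{$size} };
        next if @same_size < 2;           # unique size, no MD5 needed

        my %by_md5;
        for my $path (@same_size) {
            open my $fh, '<', $path or do {
                warn "Cannot open $path: $!\n";
                next;
            };
            binmode $fh;
            my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_md5{$sum} }, $path;
        }
        push @groups, grep { @$_ > 1 } map { $by_md5{$_} } sort keys %by_md5;
    }
    return @groups;
}
```

Called as `my @groups = find_duplicates(@paths);`, each element of `@groups` is one set of files that look identical.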
Re^2: data structure advice please by anadem (Scribe) on Nov 25, 2006 at 21:53 UTC
Thanks, that will get binary dupes nicely. I wanted to start with duplicated filenames and leave binaries for a later pass: I find quite a bit of the music my kids leave around has different binary content but the same filename. The reverse case happens too, so I'll add MD5 summing as an option. I also want to save the date and size as well as the path+name (hence the array).

The bit I've been most unclear on is how to go from finding the first file and setting `key1 => [ <path1> ]` to adding the data for the second instance, `key1 => [ <path1>, <path2> ]`, but hopefully I can put the suggestions above into practice now.
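On the first-versus-second-instance question: in Perl the two cases need no special handling, because `push @{ $hash{$key} }, ...` autovivifies the array the first time a key is seen and appends on every later call. Below is a minimal sketch keyed by base filename, storing path, size and modification time. The sub name `add_file` and the field names are only illustrative assumptions.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: record one file under its base name.
# $index is a hash ref of  filename => [ { path, size, mtime }, ... ].
sub add_file {
    my ($index, $path) = @_;
    my ($size, $mtime) = (stat($path))[7, 9];

    # The same statement covers the first and every later instance:
    # the array ref springs into existence on the first push.
    push @{ $index->{ basename($path) } },
        { path => $path, size => $size, mtime => $mtime };
    return;
}

# List every file name that was seen more than once.
sub report_dupes {
    my ($index) = @_;
    for my $name (sort keys %$index) {
        my @seen = @{ $index->{$name} };
        next if @seen < 2;
        print "$name:\n";
        print "  $_->{path} ($_->{size} bytes)\n" for @seen;
    }
    return;
}
```

Switching the key from `basename($path)` to an MD5 sum later would leave the rest of the structure unchanged.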

