Replace duplicate files with hardlinks

Replies are listed 'Best First'.
Re: Replace duplicate files with hardlinks by graff (Chancellor) on Aug 10, 2008 at 21:17 UTC
Monks not familiar with "hard links" would need to understand the following details: The concept of hard links applies only to unix/linux (including macosx). Hard links only work within a given disk volume (you can't have a hard link on one disk that points to a file on another disk). Hard links only apply to data files, not to directories or other file types (e.g. devices, symbolic links). Creating one or more hard links to a given file is really just a matter of having more directory entries describing/pointing to that file. Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i") *, but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links. With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run. There's nothing in the File::Find::Duplicates man page about how it determines files to be duplicates, and there is no reason to expect that it knows or cares about existing hard links (since these are not mentioned in the docs, and are OS-dependent anyway). So, existing hard links will probably look like duplicates, and will be (re)replaced on every run. For that matter, I wonder what that module would do if you were to replace duplicate files with symbolic links instead of hard ones. I think the nx notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link. In any case, I tend to prefer symlinks anyway -- there tends to be less confusion when it comes to figuring out actual vs. apparent disk space usage. And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? If the latter, you can get into various kinds of trouble, like trying to create hard links to files on different volumes (won't work) or even deleting the target of a symlink while leaving the symlink itself as the "unique version" -- which then becomes a stale link with no existing data file as the target. Note that a symlink can have a directory as its target (as well as files/directories on different disks), so if your script runs on a tree like this: `toplevel/ secondlevel_1/ thirdlevel_1/ thirdlevel_2/ file1.dat file2.dat secondlevel_2 -> secondlevel_1/thirdlevel_2 # directory symlin +k` [download] will there be an apparent duplication of file1.dat and file2.dat under two different paths? If so, ~~I think~~ what is the likelihood that your script will have (or cause) some trouble? * FOOTNOTE (UPDATE) ** Please note the very informative reply provided below by MidLifeXis. As he points out, my references to "ls -l" and "ls -i" should not be taken as implementation ideas for detecting hard links in a perl script. I mentioned these uses of "ls" merely to cite the easiest way for a person to look into the behaviors of hard links.	[reply] [d/l]
Re^2: Replace duplicate files with hardlinks by MidLifeXis (Monsignor) on Aug 11, 2008 at 16:55 UTC
Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i"), but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links. [emphasis added] I have a feeling that we will be speaking at different facets of the problem at hand, but when I read your response, it says to me that the program will have a hard time identifying that a file is a hard link. I would probably make clear that the program as written would have a hard time identifying the duplicates. The application could postprocess the F::F::D output and remove those files already hard linked by using the stat perl builtin. Given the device + inode + hash, you have a hardlink check. I just had the impression, even if it was not intended, that a reader of this response could come away with the feeling that you needed to poll `ls` to determine if a file was a hardlink of another. If you are interested in more detail on the hardlink stuff and how the underlying file system can implement them, see: Unix File System Modern Operating Systems^* Operating System Concepts^* ^*My college reference books on this topic are at home, the revisions have changed (as well as the covers), and my memory is, umm, rusty :). So beware, these books may not be the ones I am thinking of. --MidLifeXis	[reply] [d/l]
Re^2: Replace duplicate files with hardlinks by Anonymous Monk on Aug 10, 2008 at 22:09 UTC
With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run. You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results. I think the nx notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link. I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, wich returns 0 if the "element" is a directory or a soft link. However, I'm only now thinking about the posibility of creating soft links and the consequences that I might have. I hadn't considered the possibility of running the script in a non-unix envirnomnet either. And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? Luckily enough, it doesn't. F::F::Dups uses File::Find with somewhat default options, and in that regard the default is not to follow links. So the problem that you most correctly point out is not an issue here (but thanks for mentioning it because I hadn't considered it!).	[reply]
Re^3: Replace duplicate files with hardlinks by repellent (Priest) on Aug 11, 2008 at 04:04 UTC
I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, wich returns 0 if the "element" is a directory or a soft link. That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the non-softlink target linked file (whether it be a plain file, directory, special file, etc.) The only file test that is applicable to the softlink inode itself is the `-l` operator. Purpose: to find out if it's a softlink. Hence, directories are discarded as a result of the `-f file` test, but not softlinks. You may be thinking that softlinks are discarded because they're pointing to directories, perhaps? However, I'm only now thinking about the posibility of creating soft links and the consequences that I might have. I hadn't considered the possibility of running the script in a non-unix envirnomnet either. I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target.	[reply] [d/l] [select]
Re^4: Replace duplicate files with hardlinks by bruno (Friar) on Aug 11, 2008 at 04:55 UTC
Re^3: Replace duplicate files with hardlinks by repellent (Priest) on Aug 11, 2008 at 16:28 UTC
You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results. I'd like to add that you can distinguish hardlinks by inode numbers. When you have your group of duplicate-content files, hash them by inode numbers: `push @{ $hash{$inode} }, $path;` [download] When you have more than one key in the hash, decide which hardlink's inode you like and `link` the other paths to it. When you have only one key in the hash, you're done!	[reply] [d/l] [select]
Re^3: Replace duplicate files with hardlinks by bruno (Friar) on Aug 10, 2008 at 22:16 UTC
Again, me, sorry.	[reply]
Re: Replace duplicate files with hardlinks by betterworld (Curate) on Aug 10, 2008 at 20:59 UTC
Nice script :) The error-checking looks quite robust. However, may I suggest a change to the order of the system calls: Currently, you move one of the dupes to a temporary location, then link the original file to the old location. I'd suggest first link()ing the original to a temporary location, then rename()ing that temporary name to the location of the dupe (rename atomically overwrites its destination if it exists). The advantage would be that there is no time window where any file is not accessible by its original filename. If the script crashes (like between rename and link), the worst thing that can happen is that it leaves an unneeded temporary file. This might sound somewhat academic, but I like to avoid as much race conditions as possible in the scripts that I write :)	[reply]
Re^2: Replace duplicate files with hardlinks by bruno (Friar) on Aug 10, 2008 at 21:45 UTC
Thanks for the comment! I agree with you; your approach makes the script more secure, and it doesn't require any extra complexity. I'll change it as you suggest.	[reply]
Re: Replace duplicate files with hardlinks by ajt (Prior) on Aug 10, 2008 at 20:42 UTC
This is a common problem and your solution is not dissimilar from others. There is a nice section on Wikikpedia about the various duplicate finders and linking tools: fdupes. As the author of one of the tools (fdf) I have a vested interest and I can say that it's easy to make something fast, it's harder to make it fast and reliable. I've also had a conversation with the author of fslint - there are more than a few land-mines out there! For example lots of operating systems have primitive file-systems that don't do hard-links... Anyhow it's nice to see another version and it's always fun to try benchmarking them! -- ajt	[reply]
Re^2: Replace duplicate files with hardlinks by Anonymous Monk on Aug 10, 2008 at 21:28 UTC
Ajt: Thanks a lot for your comments! As as I said in my previous post, the main purpose of the script was to test some recently acquired skills; I wasn't aiming at trying to compete with a full featured program. I was aware that there were many other tools for this job already, but I just wanted to give it a go for the fun of it. Please forgive me if it sounded like it was a new solution to a not yet solved problem. I am the first to acknowledge that, in its current state, this little script is far from being robust and safe to use by anyone in any platform.	[reply]
Re^3: Replace duplicate files with hardlinks by bruno (Friar) on Aug 10, 2008 at 21:35 UTC
Sorry, I am the Anonymous Monk here. Still figuring things out...	[reply]
Re: Replace duplicate files with hardlinks by bruno (Friar) on Aug 12, 2008 at 04:54 UTC
Thanks to all of your fruitfull comments! I took in as much as I could, and tried to apply it to the script. The most importante changes were: Rewrite of the linking algorithm, as suggested by betterworld Now it only requires two steps instead of three, and is much safer. Addition of optional arguments. Now you can choose whether to create a soft link, a hard link, delete the file or just look at the report (thanks graff!) Consideration of hard links. I used repellent's idea to group the duplicates according to their inode. Now the program doesn't count a hardlink as a duplicate, and running the script after a first run of linking/deleting, the wasted space count is (as it should be) zero. Minor changes in the information displayed (wasted space / saved space). Documentation! Thanks to toolic for suggesting it. And last but not least, the name! I could not find a proper name, so I used toolic's take. Read more... (612 Bytes) Here's the code: Read more... (7 kB)	[reply] [d/l]
Re: Title Update: Rewrite! (title rewrites--) by tye (Sage) on Aug 12, 2008 at 05:13 UTC
This node of yours is not about "Title Update: Rewrite". The only place that your title choice makes sense is when viewed as part of its thread. So for most places where titles are displayed (the many places that list nodes including search results, Newest Nodes, RSS feeds, etc.) your title is mostly misleading. Please include part of the parent node's title in your reply's title, as it almost always makes the title more accurately reflect the subject matter of the node. It also means that people can likely recognize whether it is in reply to something they recently took an interest in. Please especially keep the "Re^N:" prefix unmodified as the depth of a reply actually conveys a lot of information about the likely character of a node. Thanks. - tye	[reply]
Re: Replace duplicate files with hardlinks by toolic (Bishop) on Aug 11, 2008 at 17:28 UTC
seeking for your opinions in every aspect Since you are posting this code here for others to use, add POD (perlpod). POD adds a standard way of describing to other users what the code does and how it should be used, as well as any known limitations. If you are unfamiliar with POD, then this is an opportunity to add this to your toolkit. Update: For example, a standard way to get help on a program is to use perldoc. If I downloaded your code and named it "dup2link", then, from my command line I could: `perldoc dup2link` [download]	[reply] [d/l]
Re: Replace duplicate files with hardlinks by myuserid7 (Scribe) on Jul 04, 2013 at 20:58 UTC
I get an error message Can't call method "inode" without a package or object reference at dup2link line 158. This happened after running this script for a minute or so. Mac OS X 10.8	[reply]
Re^2: Replace duplicate files with hardlinks by Anonymous Monk on Jul 05, 2013 at 10:00 UTC
Given ... `... push @{ $real_dups{ stat($filename)->inode } }, $filename; ...` [download] ... myuserid7 (*{), you may be using built-in `&stat` function and not `&File::Stat::stat`. That, or `&File::Stat::stat` failed for some reason and did not return an object as expected (to state the obvious). In which case print the file path being questioned and verify its properties yourself when you see the error message again. Could you verify that one way or another?	[reply] [d/l] [select]
Re: Replace duplicate files with hardlinks by islam (Initiate) on Nov 04, 2012 at 01:04 UTC
If you want to replace duplicates by Hard Links on mac or any UNIX based System, you can try SmartDupe http://sourceforge.net/projects/smartdupe/ am developing it	[reply]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks