Replace duplicate files with hardlinks

by bruno (Friar)
on Aug 10, 2008 at 19:37 UTC ( [id://703460] )

Greetings, Monks!

This is my first post here, so please feel free to redirect this to any other section if this is not the place where it belongs.

I am posting this little script seeking your opinions on every aspect: design, layout, readability, speed, etc. It uses File::Find::Duplicates to find duplicate files recursively in a directory and, instead of just reporting or deleting them, it creates hard links so that the disk space is freed but the files remain. I wrote it to practice some of the things that I'm trying to learn, but I found it quite useful for my /home directory (I could free 2 GB!).

I thought that creating a hard link might be a better idea than deleting the file, as sometimes one wants a certain file to be under a certain path.
It also helped me to find severe redundancies in some "dot directories". For instance, in a couple of icon packages, ~30% were duplicates with different names. In this case deleting them would have ruined the icon set, but creating hard links both freed space and kept the package functional.

I was also pleasantly surprised that it is quite fast. I haven't benchmarked it (I haven't read the Benchmark documentation yet), but it is noticeably faster than, for example, the fdupes program that comes with Ubuntu (and probably other Debian-based distros).
Of course the merit of this goes entirely to Tony Bowden, the author of the module.

Here's the code for it:

#!/usr/bin/perl -w
use strict;

use File::Find::Duplicates;
use File::Temp ();

my %stats = ( files_linked => 0, space_saved => 0 );
local $" = "\n";

# Read directory from command line, or default to current.
my $directory = $ARGV[0] || ".";

# Find duplicates recursively in that directory.
my @dupes = find_duplicate_files($directory);

# For each set of duplicate files, create the hardlinks and save the
# information in the stats hash.
foreach my $set (@dupes) {
    print $set->size, " bytes each:\n", "@{ $set->files }\n";
    my $original      = shift @{ $set->files };
    my $number_linked = fuse( $original, \@{ $set->files } );
    $stats{files_linked} += $number_linked;
    $stats{space_saved}  += $number_linked * $set->size;
}

# Report the stats.
print "Files linked: $stats{files_linked}\n";
print "Space saved:  $stats{space_saved} bytes\n";

sub fuse {

    # Replace duplicates with hard links and return the number
    # of links created.
    my $original     = shift;
    my $duplicates   = shift;
    my $files_linked = 0;

    foreach my $duplicate (@$duplicates) {

        # Step 1: hard-link the original under a temporary name.
        my $tempfile = File::Temp::tempnam( $directory, 'X' x 6 );
        link $original, $tempfile or next;

        # Step 2: rename the temporary link over the duplicate
        # (rename replaces its destination atomically).
        unless ( rename $tempfile, $duplicate ) {

            # Don't leave the temporary link behind if the rename failed.
            unlink $tempfile
              or die "Couldn't delete temporary file $tempfile: $!";
            next;
        }
        ++$files_linked;
    }
    return $files_linked;
}

Update: Subroutine fuse() changed following betterworld's suggestion.

Update 2: Added link filtering, soft link / remove support, and documentation. Here.

Replies are listed 'Best First'.
Re: Replace duplicate files with hardlinks
by graff (Chancellor) on Aug 10, 2008 at 21:17 UTC
    Monks not familiar with "hard links" would need to understand the following details:
    • The concept of hard links applies only to unix/linux (including macosx).
    • Hard links only work within a given disk volume (you can't have a hard link on one disk that points to a file on another disk).
    • Hard links only apply to data files, not to directories or other file types (e.g. devices, symbolic links).
    • Creating one or more hard links to a given file is really just a matter of having more directory entries describing/pointing to that file.
    • Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i") **, but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links.

    With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

    There's nothing in the File::Find::Duplicates man page about how it determines files to be duplicates, and there is no reason to expect that it knows or cares about existing hard links (since these are not mentioned in the docs, and are OS-dependent anyway). So, existing hard links will probably look like duplicates, and will be (re)replaced on every run.

    For that matter, I wonder what that module would do if you were to replace duplicate files with symbolic links instead of hard ones. I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

    In any case, I tend to prefer symlinks anyway -- there tends to be less confusion when it comes to figuring out actual vs. apparent disk space usage.

    And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? If the latter, you can get into various kinds of trouble, like trying to create hard links to files on different volumes (won't work) or even deleting the target of a symlink while leaving the symlink itself as the "unique version" -- which then becomes a stale link with no existing data file as the target. Note that a symlink can have a directory as its target (as well as files/directories on different disks), so if your script runs on a tree like this:

    toplevel/
        secondlevel_1/
            thirdlevel_1/
            thirdlevel_2/
                file1.dat
                file2.dat
        secondlevel_2 -> secondlevel_1/thirdlevel_2   # directory symlink
    will there be an apparent duplication of file1.dat and file2.dat under two different paths? If so, what is the likelihood that your script will have (or cause) some trouble?

    ** FOOTNOTE (UPDATE) ** Please note the very informative reply provided below by MidLifeXis. As he points out, my references to "ls -l" and "ls -i" should not be taken as implementation ideas for detecting hard links in a perl script. I mentioned these uses of "ls" merely to cite the easiest way for a person to look into the behaviors of hard links.

      Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i"), but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links. [emphasis added]

      I have a feeling that we are speaking to different facets of the problem at hand, but when I read your response, it says to me that the program will have a hard time identifying that a file is a hard link. I would make clear that it is the program as written that would have a hard time identifying the duplicates.

      The application could postprocess the F::F::D output and remove those files already hard linked by using the stat perl builtin. Given the device + inode + hash, you have a hardlink check.
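      For what it's worth, a minimal sketch of that check with the built-in stat (the helper name is purely illustrative): two paths already share their data when both the device and the inode numbers match.

      # Illustrative helper: true when both paths point at the same
      # on-disk data, i.e. they are already hard links to each other.
      sub already_hardlinked {
          my ( $path_a, $path_b ) = @_;
          my ( $dev_a, $ino_a ) = ( stat $path_a )[ 0, 1 ];
          my ( $dev_b, $ino_b ) = ( stat $path_b )[ 0, 1 ];
          return defined $ino_a && defined $ino_b
              && $dev_a == $dev_b
              && $ino_a == $ino_b;
      }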

      I just had the impression, even if it was not intended, that a reader of this response could come away with the feeling that you needed to poll ls to determine if a file was a hardlink of another.

      If you are interested in more detail on the hardlink stuff and how the underlying file system can implement them, see:

      *My college reference books on this topic are at home, the revisions have changed (as well as the covers), and my memory is, umm, rusty :). So beware, these books may not be the ones I am thinking of.

      --MidLifeXis

      With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

      You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

      I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

      I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns false if the "element" is a directory or a soft link.

      However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-unix environment either.

      And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups?

      Luckily enough, it doesn't.
      F::F::Dups uses File::Find with somewhat default options, and in that regard the default is not to follow links. So the problem that you most correctly point out is not an issue here (but thanks for mentioning it because I hadn't considered it!).

          I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns false if the "element" is a directory or a soft link.

        That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the link's target (whether that be a plain file, directory, special file, etc.).

        The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink.

        Hence, directories are discarded as a result of the -f file test, but not softlinks. You may be thinking that softlinks are discarded because they're pointing to directories, perhaps?
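        A tiny self-contained sketch of that difference (the file names here are made up): -f on a symlink reports on whatever the link points to, while -l is the test that looks at the link itself.

        use strict;
        use warnings;

        my $target  = "data.txt";    # plain file (made-up name)
        my $symlink = "data_link";   # symlink pointing at it

        open my $fh, '>', $target or die "open: $!";
        close $fh;
        symlink $target, $symlink or die "symlink: $!";

        print '-f on the symlink: ', ( -f $symlink ? 'yes' : 'no' ), "\n";  # yes: follows to the plain-file target
        print '-l on the symlink: ', ( -l $symlink ? 'yes' : 'no' ), "\n";  # yes: tests the link itself
        print '-l on the target:  ', ( -l $target  ? 'yes' : 'no' ), "\n";  # no: the target is not a symlink

        unlink $symlink, $target;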

          However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-unix environment either.

        I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target.
          You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

        I'd like to add that you can distinguish hardlinks by inode numbers. When you have your group of duplicate-content files, hash them by inode numbers:
        push @{ $hash{$inode} }, $path;
        When you have more than one key in the hash, decide which hardlink's inode you like and link the other paths to it.

        When you have only one key in the hash, you're done!
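        Spelled out a bit more, a sketch of that grouping for one set of same-content files (@duplicate_paths is assumed to come from File::Find::Duplicates, and relink() is a hypothetical stand-in for the link-and-rename steps in fuse() above; the key combines device and inode because inode numbers are only unique within one filesystem):

        # @duplicate_paths: one set of same-content files (assumed input).
        my %by_inode;
        for my $path (@duplicate_paths) {
            my ( $dev, $ino ) = ( stat $path )[ 0, 1 ];
            push @{ $by_inode{"$dev:$ino"} }, $path;
        }

        if ( keys %by_inode > 1 ) {
            # Keep one inode's first path and relink every path that
            # lives on the other inodes to it.
            my ( $keep, @rest ) = sort keys %by_inode;
            my $keeper = $by_inode{$keep}[0];
            relink( $keeper, $_ )    # hypothetical helper: link-and-rename as in fuse()
                for map { @{ $by_inode{$_} } } @rest;
        }
        # With a single key, all the paths are already hard links
        # to the same data, and there is nothing left to do.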
        Again, me, sorry.
Re: Replace duplicate files with hardlinks
by betterworld (Curate) on Aug 10, 2008 at 20:59 UTC

    Nice script :) The error-checking looks quite robust. However, may I suggest a change to the order of the system calls:

    Currently, you move one of the dupes to a temporary location, then link the original file to the old location. I'd suggest first link()ing the original to a temporary location, then rename()ing that temporary name to the location of the dupe (rename atomically overwrites its destination if it exists).

    The advantage would be that there is no time window where any file is not accessible by its original filename. If the script crashes (like between rename and link), the worst thing that can happen is that it leaves an unneeded temporary file. This might sound somewhat academic, but I like to avoid as many race conditions as possible in the scripts that I write :)
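    In code, the suggested ordering is just two calls (variable names as in the posted script; the updated fuse() above already works this way):

    # Hard-link the original under a temporary name first; the
    # duplicate's path is untouched until the rename, which replaces
    # it atomically with the new hard link.
    my $tempfile = File::Temp::tempnam( $directory, 'X' x 6 );
    link $original, $tempfile    or die "link failed: $!";
    rename $tempfile, $duplicate or die "rename failed: $!";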

      Thanks for the comment! I agree with you; your approach makes the script more secure, and it doesn't require any extra complexity. I'll change it as you suggest.
Re: Replace duplicate files with hardlinks
by ajt (Prior) on Aug 10, 2008 at 20:42 UTC

    This is a common problem and your solution is not dissimilar from others. There is a nice section on Wikipedia about the various duplicate finders and linking tools: fdupes.

    As the author of one of the tools (fdf) I have a vested interest and I can say that it's easy to make something fast, it's harder to make it fast and reliable. I've also had a conversation with the author of fslint - there are more than a few land-mines out there! For example lots of operating systems have primitive file-systems that don't do hard-links...

    Anyhow it's nice to see another version and it's always fun to try benchmarking them!


    --
    ajt
      Ajt: Thanks a lot for your comments! As I said in my previous post, the main purpose of the script was to test some recently acquired skills; I wasn't trying to compete with a full-featured program. I was aware that there were many other tools for this job already, but I just wanted to give it a go for the fun of it.

      Please forgive me if it sounded like it was a new solution to a not yet solved problem. I am the first to acknowledge that, in its current state, this little script is far from being robust and safe to use by anyone in any platform.

        Sorry, I am the Anonymous Monk here. Still figuring things out...
Re: Replace duplicate files with hardlinks
by bruno (Friar) on Aug 12, 2008 at 04:54 UTC
    Thanks for all of your fruitful comments! I took in as much as I could, and tried to apply it to the script. The most important changes were:

    1. Rewrite of the linking algorithm, as suggested by betterworld. Now it only requires two steps instead of three, and is much safer.
    2. Addition of optional arguments. Now you can choose whether to create a soft link, create a hard link, delete the file, or just look at the report (thanks, graff!).
    3. Consideration of hard links. I used repellent's idea to group the duplicates according to their inode. Now the program doesn't count a hardlink as a duplicate, and when the script is run again after a first pass of linking/deleting, the wasted-space count is (as it should be) zero.
    4. Minor changes in the information displayed (wasted space / saved space).
    5. Documentation! Thanks to toolic for suggesting it.
    6. And last but not least, the name! I could not find a proper name, so I used toolic's take.

    Here's the code:

      This node of yours is not about "Title Update: Rewrite". The only place that your title choice makes sense is when viewed as part of its thread. So for most places where titles are displayed (the many places that list nodes including search results, Newest Nodes, RSS feeds, etc.) your title is mostly misleading.

      Please include part of the parent node's title in your reply's title, as it almost always makes the title more accurately reflect the subject matter of the node. It also means that people can likely recognize whether it is in reply to something they recently took an interest in. Please especially keep the "Re^N:" prefix unmodified as the depth of a reply actually conveys a lot of information about the likely character of a node.

      Thanks.

      - tye        

Re: Replace duplicate files with hardlinks
by toolic (Bishop) on Aug 11, 2008 at 17:28 UTC
    seeking your opinions on every aspect
    Since you are posting this code here for others to use, add POD (perlpod). POD adds a standard way of describing to other users what the code does and how it should be used, as well as any known limitations. If you are unfamiliar with POD, then this is an opportunity to add this to your toolkit.

    Update: For example, a standard way to get help on a program is to use perldoc. If I downloaded your code and named it "dup2link", then, from my command line I could:

    perldoc dup2link
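    A minimal POD skeleton for the script could look like this (the wording is just a placeholder; dup2link is the name used above, and the directives must start in column 1 of the real file):

=head1 NAME

dup2link - replace duplicate files with hard links

=head1 SYNOPSIS

    dup2link [directory]

=head1 DESCRIPTION

Finds duplicate files recursively under the given directory (default:
the current directory) using File::Find::Duplicates and replaces each
duplicate with a hard link to one of the copies, reporting the number
of files linked and the space saved.

=head1 LIMITATIONS

Hard links require a filesystem that supports them and only work
within a single volume.

=cut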
Re: Replace duplicate files with hardlinks
by myuserid7 (Scribe) on Jul 04, 2013 at 20:58 UTC
    I get an error message Can't call method "inode" without a package or object reference at dup2link line 158. This happened after running this script for a minute or so. Mac OS X 10.8

      Given ...

      ... push @{ $real_dups{ stat($filename)->inode } }, $filename; ...

      ... myuserid7, you may be using the built-in &stat function and not &File::Stat::stat.

      That, or &File::Stat::stat failed for some reason and did not return an object as expected (to state the obvious). In that case, print the file path in question and verify its properties yourself when you see the error message again.

      Could you verify that one way or another?

Re: Replace duplicate files with hardlinks
by islam (Initiate) on Nov 04, 2012 at 01:04 UTC
    If you want to replace duplicates with hard links on Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/), which I am developing.
