PerlMonks  

Portable way to determine if two names refer to the same file?

by jcb (Parson)
on Aug 16, 2019 at 03:31 UTC

jcb has asked for the wisdom of the Perl Monks concerning the following question:

I have spent the past hour or so trying Super Search and not finding a clear answer, so I ask my fellow monks how best to portably determine if two seemingly different names actually refer to the same physical file?

I am not concerned about copies of the same file, only links, such that the same physical file appears under multiple names.

On POSIX, the solution is easy: compare dev:ino tuples from the stat builtin and declare "same file" if they match. I have no idea if this also works on Windows or even if the problem exists on Windows — how well does Windows handle symlinks anyway and does it even support hardlinks at all?

And what of the less-common platforms?
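The POSIX check described above can be sketched as follows. This is a minimal illustration, not a vetted implementation: the sub name `same_dev_ino` is hypothetical, and the code assumes a platform where the first two fields returned by the stat builtin (dev and ino) are meaningful.

```perl
use strict;
use warnings;

# Hypothetical helper: true if both names resolve to the same dev:ino pair.
# stat() follows symlinks, so a symlink and its target compare equal,
# as do two hard links to one file. Returns false if either stat fails.
sub same_dev_ino {
    my ($name_a, $name_b) = @_;
    my @stat_a = stat($name_a) or return;
    my @stat_b = stat($name_b) or return;
    # Fields 0 and 1 of the stat list are dev and ino.
    return $stat_a[0] == $stat_b[0] && $stat_a[1] == $stat_b[1];
}
```

A call such as `same_dev_ino('notes.txt', 'link-to-notes.txt')` (file names made up for illustration) would then be true for links and false for independent copies with identical contents.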

Cross-posted in Categorized Questions and Answers at How do I portably determine if two filenames refer to the same file? as a place to collect answers for future reference.

Edited 2019-08-16 by jcb: Clarify that stat refers to the Perl builtin. I had forgotten about the shell command with the same name.

Replies are listed 'Best First'.
Re: Portable way to determine if two names refer to the same file?
by LanX (Saint) on Aug 16, 2019 at 07:03 UTC
    Did you try Perl's own stat?

    You seem to be talking about a shell command.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

    Update

    OK, Windows doesn't support inodes:

    perlport#stat

        for the prevailing NTFS, there's a FileID, which seems to be reliable

        In theory, NTFS Master File Table ($MFT) record numbers are NTFS inode numbers, although I suspect that defragmentation tools might be able to sort the MFT, which would make them unstable, but still usable for comparing two files.

        Of course, Microsoft being Microsoft, a bit of research quickly uncovered at least two different API calls for handling this, one of which is new!shiny! in Windows Server 2012 — and apparently is only found in Windows Server and may or may not actually work on all files or may only work on files in ReFS volumes, whatever the hell those are.

        While the 64-bit FileID is not guaranteed to be stable on FAT, FAT does not support links of any type, so simply comparing absolute filenames will work.

        Microsoft claims that a VSN:FileID tuple uniquely identifies a file. GNU claims that a st_dev:st_ino tuple uniquely identifies a file.

        On POSIX systems, device numbers are guaranteed to uniquely identify mounted filesystems, since a device number is the "access path identifier" for a mounted filesystem, but are not guaranteed to remain stable over time. On Windows, the analogous value seems to be the "volume serial number", which is stable across time because it is in the volume header, but its uniqueness is simply assumed and it has no role whatsoever in actually mapping I/O to the underlying storage. I wonder what happens if a Windows box is presented with two disks with the same volume serial number and different contents?

        Back to the point, how to get that VSN:FileID tuple in Perl?

        On the other hand, I'd assume the inode reported by stat on Linux to be of no value for FAT file systems…

        Oddly enough, if I understand the kernel sources correctly, the inode number has no meaning in terms of the actual filesystem, but is consistent with the rule that only the same file has the same inode number. This is a trick the kernel plays by keeping track of every inode that anyone is "looking at" and ensuring that each file in the dcache from a FAT filesystem has a unique inode number within that filesystem (or possibly system-wide: I am not entirely certain whether that table is per-filesystem or global). Since any way of examining a file in Linux creates a dcache entry that persists until either the filesystem is unmounted or the kernel recycles the memory, the kernel is able to maintain the illusion that FAT files have stable inode numbers, provided that userspace refrains from "writing them down" and then checking again after the filesystem in question has been unmounted and remounted.

        In short, on Linux, st_dev:st_ino is unique for all immediately accessible disk files, but is not guaranteed to remain stable across reboots or unmounting and remounting a filesystem.

         

        Overall, it looks like the best solution to my problem might be:

        Load File::Spec and then read @File::Spec::ISA to find which implementation it selected, or directly ask perl with File::Spec->isa('File::Spec::Unix'). If File::Spec::Unix was chosen, use the stat builtin and the "file tag" is join(':',(stat($filename))[0,1]), otherwise assume no links and the "file tag" is Cwd::abs_path($filename). Document the caveat and wait for a bug report from someone that actually managed to cause confusion by making links on a non-Unix-like system. (Preceding code is untested.)
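A hedged sketch of the "file tag" idea above, with one deliberate deviation. The names `file_tag` and `same_file` are hypothetical. As far as I know, the other File::Spec implementations (File::Spec::Win32 among them) are subclasses of File::Spec::Unix, so `File::Spec->isa('File::Spec::Unix')` would be true on every platform; the sketch therefore takes the other route mentioned and reads `@File::Spec::ISA` directly. That choice is conservative: Unix-like platforms with their own File::Spec subclass fall through to the path comparison. Like the original, this is lightly tested at best.

```perl
use strict;
use warnings;
use Cwd ();
use File::Spec;

# Hypothetical helper: returns a string that should be equal for two
# names if and only if they refer to the same file.
sub file_tag {
    my ($filename) = @_;

    # File::Spec records the implementation it selected in @ISA.
    my $impl = $File::Spec::ISA[0] // '';

    if ($impl eq 'File::Spec::Unix') {
        # dev:ino pair from the stat builtin (fields 0 and 1).
        my @st = stat($filename)
            or die "Cannot stat '$filename': $!\n";
        return join(':', @st[0, 1]);
    }

    # Assumption carried over from the post: no links on this platform,
    # so the absolute path is the tag.
    my $path = Cwd::abs_path($filename);
    die "Cannot resolve '$filename'\n" unless defined $path;
    return $path;
}

# Hypothetical convenience wrapper around file_tag().
sub same_file {
    my ($name_a, $name_b) = @_;
    return file_tag($name_a) eq file_tag($name_b);
}
```

Given the point made earlier that dev:ino is not guaranteed stable across reboots or remounts, the tags are best computed and compared within a single run rather than stored for later.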

      The question was intended to refer to the Perl stat builtin and has been edited to clarify.

Re: Portable way to determine if two names refer to the same file?
by Anonymous Monk on Aug 19, 2019 at 00:14 UTC
    If one considers some of the filesystems that are now available in support of containers, and maybe even network filesystems such as NFS, a test of inodes will not work. I am not sure that there exists a 100%-certain way to do this that will work everywhere.

      If the system claims POSIX conformance and testing dev:ino is unreliable, the system is defective:

      The st_ino and st_dev fields taken together uniquely identify the file within the system.
      Note that st_dev must be unique within a Local Area Network (LAN) in a ``system'' made up of multiple computers' file systems connected by a LAN.
      Networked implementations of a POSIX-conforming system must guarantee that all files visible within the file tree (including parts of the tree that may be remotely mounted from other machines on the network) on each individual processor are uniquely identified by the combination of the st_ino and st_dev fields.
      — Above quotes from https://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html

      I would be very surprised if traditional Unix did not provide this guarantee, so I am fairly sure that all current "unix" platforms will meet it. If some container tool causes this to be violated, that tool is defective, end of story. There is a very strong expectation that modern "*nix" means POSIX.

      You know this based on personal experience? You wrote a program which attempted to test equality of files using inodes and found that the test doesn't work when one or more of the files are remote?

      Didn't think so.

      So you're basing it on other research you've read, where someone else did that test? Can you link us to at least one such research report?

      Didn't think so.

      As usual.... you are full of hot air.

Node Type: perlquestion [id://11104544]
Front-paged by haukex