http://qs321.pair.com?node_id=581984

davebaker has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks!

I would like to modify a file in place using the method recommended as the "best" method in the Perl Cookbook, 2d ed., recipe 7.15, Modifying a File in Place with a Temporary File. I don't understand something the authors say; it seems critical to understand it, though.

(The file to be modified in my case is an important flat-file database (one record per line); web users can use a CGI script to either add data to the file or to edit their own records; I'm concerned about possible file corruption when two or more users are submitting new or revised data at about the same instant. I know I could use a real database but I really want to figure out file locking using Perl. Seems like this issue must come up all the time in a multiuser environment, whether web or internal network.)

The code provided in the recipe is:

open( OLD, "<", $old )      or die "can't open $old: $!";
open( NEW, ">", $new )      or die "can't open $new: $!";
while (<OLD>) {
    # change $_, then ...
    print NEW $_            or die "can't write $new: $!";
}
close( OLD )                or die "can't close $old: $!";
close( NEW )                or die "can't close $new: $!";
rename( $old, "$old.orig" ) or die "can't rename $old to $old.orig: $!";
rename( $new, $old )        or die "can't rename $new to $old: $!";

Some discussion follows, then the authors say:

Note that rename won't work across filesystems, so you should create your temporary file in the same directory as the file being modified.

The truly paranoid programmer would lock the file during the update. The tricky part is that you have to open the file for writing without destroying its contents before you can get a lock to modify it. Recipe 7.18 shows how to do this.

(Emphasis supplied by me.)

Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?

Q2: Regarding the reason for being "truly paranoid" -- is this because we don't want another running instance of this script to be writing to $new while we are, so we ought to revise this script (and hence both instances) to get a LOCK_EX before writing to $new?

To get the desired file lock, the authors caution that the "tricky part" is to first open the file for writing without clobbering its contents. I have read elsewhere in the book that "open (OUT, ">", $out)" would "clobber" any existing file named $out before a script would have a chance to get a lock on the file, and I've read (p. 421 of Programming Perl, 3d ed.) that the best method for writing to a file is to use sysopen, which does not clobber any file that exists, as in:

use Fcntl qw( :flock :DEFAULT );
sysopen( OUT, $out, O_WRONLY|O_CREAT ) or die "can't sysopen $out: $!";
flock( OUT, LOCK_EX )                  or die "can't flock $out: $!";
truncate( OUT, 0 )                     or die "can't truncate $out: $!";
# now write to file...
close( OUT )                           or die "can't close $out: $!";

Q3: I'm not sure I completely understand the hazards of "clobbering." Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works, thereby causing the other instance (e.g., another web user submitting data via the same page's form) to produce mangled or empty data when that instance renames $new to $old? Yikes, there goes the database.

Q4: In a multi-user environment, does a careful programmer need to use "sysopen/flock LOCK_EX/truncate" every time a script needs to write a file? If a plain open ">" technique is used there would seem to be a potential clobbering problem.

Q5: A final wrinkle on the addition of a file lock for $new in the recipe: wouldn't we want to keep $new open (and hence the LOCK_EX in place) until after the "rename( $new, $old )"? Would that work, though? I'm concerned that the rename function implicitly closes the file being renamed and breaks the lock on it before doing something as drastic as renaming it.


Replies are listed 'Best First'.
Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by grep (Monsignor) on Nov 03, 2006 at 02:25 UTC
    Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?
    The $old file. This is data that would get clobbered (assuming you are not using the same name for the temp $new file). BTW I would name the files $orig and $tmp - that seems to make more sense.

    Q2: Regarding the reason for being "truly paranoid" -- is this because we don't want another running instance of this script to be writing to $new while we are,
    Nope. You're stuck on the $new temp file, when the $old original file is what you should be concerned about. You should be using File::Temp to get a uniquely named $new temp file.

    I'm not sure I completely understand the hazards of "clobbering." So...
    It's when this happens:

    UserA                    UserB                    Orig File
    Open $orig                                        Original Content
    Reads $orig              Opens $orig                  |
    Modify $orig in Memory   Reads $orig                  |
    Write $orig to FS        Modify $orig in Memory   UserA Content
                             Write $orig to FS        UserB Content
    There are two problems: UserA's changes only last a split second, but generally the more important problem is that UserB never saw the changes UserA made.

    Q3: Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works,
    Nope (at least if you use File::Temp). You only have to be concerned about the file that has the unchanging name. That is where 'clobbering' occurs.

    Q4: In a multi-user environment, does a careful programmer need to use "sysopen/flock LOCK_EX/truncate" every time a script needs to write a file? And now a final wrinkle on the addition of a file lock for $new in the recipe.
    Depends.

    • If it's really important then yes, you should.
    • If it's not critical and not changed very often, locking is not that critical.
    • If you are reasonably sure that only one instance of one program will be updating the file, locking is generally not required.

    The flip side is - If your data is important, changed by more than one source, and changed often - Then you should generally use a full database that supports locking. This is why file locking is not a huge problem.

    Q5: Wouldn't we would want to keep $new open (and hence the LOCK_EX in place) until after the "rename( $new, $old )"?
    You're still stuck on $new, but I'll rework your question into what I think you want to ask: 'When should I release a lock?'

    The best strategy IMO is to create a '.lock' file and flock that. Like this:

    • Once your program decides to modify the file 'foo.txt', check for a flocked 'foo.lock' file. If you're clear, create 'foo.lock' and lock it.
    • read 'foo.txt'
    • modify
    • write it to a unique temp file via File::Temp
    • rename temp file to 'foo.txt'
    • delete 'foo.lock'
    This prevents corruption from clobbering and from your program dying in mid-write.
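    A minimal sketch of that lock-file recipe, assuming the data file is small enough to slurp; the file name 'foo.txt' and the update callback here are placeholders, not code from the thread:

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(:flock);
    use File::Temp qw(tempfile);

    sub update_file {
        my ($file, $modify) = @_;    # $modify: coderef, content in, new content out

        # Take the advisory lock on a separate .lock file, not the data file.
        open my $lock, '>', "$file.lock" or die "can't open $file.lock: $!";
        flock $lock, LOCK_EX             or die "can't flock $file.lock: $!";

        # Read the current contents.
        open my $in, '<', $file or die "can't open $file: $!";
        my $content = do { local $/; <$in> };
        close $in;

        # Write the modified contents to a unique temp file in the same directory
        # (same filesystem, so the rename below stays atomic).
        my ($tmp, $tmpname) = tempfile(DIR => '.');
        print {$tmp} $modify->($content) or die "can't write $tmpname: $!";
        close $tmp                       or die "can't close $tmpname: $!";

        # Atomically replace the original, then release the lock.
        rename $tmpname, $file or die "can't rename $tmpname to $file: $!";
        close $lock;
        unlink "$file.lock";
    }
    ```

    For example, update_file('foo.txt', sub { uc shift }) would uppercase the whole file under the lock.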


    grep
    One dead unjugged rabbit fish later
Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by graff (Chancellor) on Nov 03, 2006 at 03:38 UTC
    That's a lot of questions... ;) And I was slow to respond, so I'm mostly reiterating what grep said. But let me back up a bit:
    The file to be modified in my case is an important flat-file database (one record per line); web users can use a CGI script to either add data to the file or to edit their own records; I'm concerned about possible file corruption when two or more users are submitting new or revised data at about the same instant.

    In that sort of scenario, there are a couple things to watch out for:

    • There's no locking. Bob pulls the data into his browser at 10:00, spends 15 minutes figuring out how to change it, then uploads his version. Meanwhile, Joe pulls the data at 10:05, spends 5 minutes working on his update, and uploads it. As of 10:15, Joe's updates are lost forever (or until he sees a problem and repeats his work).

    • There is locking, but Joe and Bob actually manage to beat the odds and both their updates hit the server within a few cpu cycles of each other (relatively speaking); Joe's thread opens the file for output, then Bob's opens it for output, then Joe tries to get the lock on the file, then... ouch! my brain!!

    Obviously, the first scenario is the one you really should worry about. It's not just a matter of using flock on the file; in fact, the more I think about it, the more unsuitable flock seems to be for web-based stuff. If you solve the first problem, the second one is a moot point.

    As the first reply points out, you need some sort of "check-out/check-in" mechanism to keep different users from stepping on each other's updates. A user needs to explicitly request write access to the data file, and when your cgi script services that request, it has to know whether someone else has already been given write access.

    And that's where you need to resolve any possible race condition: any given thread either gets the access (thereby blocking others), or else fails to do so because it is currently granted to someone else. For this purpose, checking for the existence of some "access.lock" file and creating it if it does not exist is almost atomic enough -- something like:

    my $fh = undef;
    ( -e "access.lock" || open( $fh, ">", "access.lock" ));
    if ( not defined( $fh )) {
        # report that someone else is editing the file
    }
    else {
        # write client/session-id data to access.lock and close it
        # so you can verify when this client sends the update
    }
    (The truly paranoid programmer will find a chink there, and will hopefully offer the correct way to seal it up tight.)
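    One way to seal that chink is sysopen with O_CREAT|O_EXCL, which makes "create only if absent" a single atomic operation at the OS level, so two processes can't both win the race. A sketch, with hypothetical names:

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

    sub try_checkout {
        my ($lockfile, $session_id) = @_;
        # O_EXCL: sysopen fails (EEXIST) if the file already exists, so the
        # existence test and the create cannot be split by another process.
        sysopen(my $fh, $lockfile, O_WRONLY | O_CREAT | O_EXCL)
            or return 0;    # someone else is editing the file
        print {$fh} "$session_id\n";   # record who holds the lock
        close $fh;
        return 1;
    }
    ```

    Only the first caller succeeds; everyone else gets a clean failure instead of a silently shared lock file.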

    But web interactions being what they are, you also need a policy: some upper bound on how long a client may hold the access lock. If Bob does a check-out at 10:00am and tries to upload his update at 10:00pm, it might be prudent to tell him at that point that he waited too long to submit the update and please try again using a fresh download (and please try to return it more quickly).

    Or the policy could be more flexible: a client may keep the lock for at least N minutes, or until someone else requests the lock after the minimum N minutes have passed -- that is, another client can "steal" the lock if it's more than N minutes old.
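    That steal-if-stale policy might be sketched like this (the 15-minute limit and the names are placeholders; note that the stat-then-unlink step reopens a tiny race of its own, so this is an illustration, not a hardened implementation):

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

    my $MAX_AGE = 15 * 60;    # policy: a lock older than 15 minutes is stale

    sub try_checkout_with_steal {
        my ($lockfile) = @_;
        # If the lock file is older than the policy allows, remove it so
        # the atomic create below can succeed ("stealing" the lock).
        if (-e $lockfile && time - (stat $lockfile)[9] > $MAX_AGE) {
            unlink $lockfile;
        }
        # Atomic create-if-absent, as before.
        sysopen(my $fh, $lockfile, O_WRONLY | O_CREAT | O_EXCL)
            or return 0;
        close $fh;
        return 1;
    }
    ```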

    I know I could use a real database but I really want to figure out file locking using Perl. Seems like this issue must come up all the time in a multiuser environment, whether web or internal network.

    It's good to make sure you understand file locking, even if it doesn't exactly apply to the current task. And yes, it's an old topic. Consider this old node, drawn from an even older article by Sean Burke, published in The Perl Journal back in 2001 (and sadly hard to find these days). Meanwhile, get started on using a real database for your current web app.

Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by fmerges (Chaplain) on Nov 03, 2006 at 01:14 UTC

    Hi,

    If you write all the programs that write to the files, you can use locks. But take care: a file lock is not a real restriction, and a process can ignore it if it wants to.

    The temporary file is the $new one, which, after you've done the work you need to do, is renamed to the old file's name. This isn't very clean; I would use some version control system. The well-known RCS would be enough; you don't need CVS or SVN for simple stuff.

    For this kind of problem, take a look at some wiki software, for example kwiki; they also needed to solve this issue of more than one client wanting to make an update.

    Here you can read more info about file locking with perl.

    Regards,

    fmerges at irc.freenode.net

      If you write all the programs that write to the files, you can use locks. But take care: a file lock is not a real restriction, and a process can ignore it if it wants to.

      Not necessarily. Some OSes (Windows, for example) have mandatory locks (as opposed to advisory locks).

        Even the "mandatory" locks on Windows' aren't infallible when you get into network shared filesystems, especially SMB/Samba connections, because the whole modification stack isn't under one machine's control. You are right: they profess to be mandatory and cause you grief if you ignore them, but they cause you grief anyway when the remote system does unexpected things.

        --
        [ e d @ h a l l e y . c c ]

        Hi,

        You're right, but I wasn't talking in a general sense, 'cause the code snippet pasted was written on some UN*X... ;-)

        Regards,

        fmerges at irc.freenode.net
Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by jbert (Priest) on Nov 03, 2006 at 09:05 UTC
    It's nowhere near as bad as the Pounding a nail, shoe or glass bottle article, but I am reminded of it a little. However, I would respectfully suggest that you consider using a different tool for this job, especially since you say the file is important.

    The tool which seems most suitable to me is SQLite - it essentially offers all (most) of the features of a server-based database, but just works on a file like you have - doing all the locking and concurrency you need, on Unix/Linux or Windows.

    The overhead for you will be:

    1. Converting your existing data to a SQL table
    2. Learning enough SQL to read and update the db (unless you know SQL already)
    but then you'll have something which will "just work".

    And the SQL isn't hard, and you can play at the sqlite prompt:

    $ sqlite3 foo.db
    SQLite version 3.2.8
    Enter ".help" for instructions
    sqlite> create table players (id integer, name varchar(256), score integer);
    sqlite> insert into players (id, name, score) values (1, "bob", 100);
    sqlite> insert into players (id, name, score) values (2, "sally", 200);
    sqlite> insert into players (id, name, score) values (3, "frank", 10);
    sqlite> select * from players;
    1|bob|100
    2|sally|200
    3|frank|10
    sqlite> select score from players where name='sally';
    200
    sqlite> update players set score=score+10 where name='sally';
    sqlite> select score from players where name='sally';
    210
    sqlite> delete from players where name='bob';
    sqlite> select * from players;
    2|sally|210
    3|frank|10
    sqlite>
    There is a little trickery in the update command. Rather than read the info in one SQL SELECT and then store it as a separate UPDATE, I did the read+update in one statement.

    This is because another process could have got in and altered the score between my read and my update. There is no chance of that happening if done with one statement.
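    For reference, the same one-statement update from Perl might look like this, using DBI with DBD::SQLite (both from CPAN; the players table is the example one above):

    ```perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=foo.db", "", "",
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do("CREATE TABLE IF NOT EXISTS players
              (id INTEGER, name VARCHAR(256), score INTEGER)");
    $dbh->do("INSERT INTO players (id, name, score) VALUES (?, ?, ?)",
             undef, 2, 'sally', 200);

    # Atomic read-modify-write in a single statement: SQLite serializes
    # writers, so no other process can slip in between the read and the
    # update the way it could with a separate SELECT then UPDATE.
    $dbh->do("UPDATE players SET score = score + 10 WHERE name = ?",
             undef, 'sally');

    my ($score) = $dbh->selectrow_array(
        "SELECT score FROM players WHERE name = ?", undef, 'sally');
    ```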

    If you want to get deeper into SQL you can get around this with transactions and/or locks, but the above is probably all you need.

    You can get very complicated with SQL if you need/want to, and in fact I've stuck in a (redundant?) id number in there out of habit - which I could have told the system was an indexed, unique key. But all of that stuff is really for bigger systems where you're giving hints to help performance and similar. If you're currently using a flat file you're probably not near worrying about that yet.

      Wow-- thanks very much for the link to the article. That's my situation exactly, with respect to the use of a "flat-file database" for a web app-- I need to go to the toolbox and get the right tool instead of using a file.

      I put too much emphasis on the Cookbook's statement that the recipe is the best way for modifying a file in place. Instead of trying to figure out how to add the file locking that the authors recommend to improve the recipe even further, I should have recognized that modifying a file in place is not the right recipe to solve my problem of creating a database that correctly handles updates.

      Thanks for not flogging me <g>

        Good luck with the SQL.

        And, to keep things perlish...

        You may already know all this, but there are a *lot* of perl approaches to accessing databases. At their base, they all use 'DBI'. That defines the interface and DBD::xxx module provides the back-end which talks to the database.

        There are an abundance of modules to layer on top of these if you choose (DBIx::Class, Class::DBI and others), which can avoid you having to actually use SQL. I'm not sure I'd recommend these if your needs are really simple. But have a play and see what suits you best. There is also plenty of DB-related stuff in the perlmonks [id://Tutorials] section.

Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by cdarke (Prior) on Nov 03, 2006 at 07:59 UTC
    I'm concerned that the rename function implicitly closes the file being renamed and breaks the lock on it before doing something as drastic as renaming it.
    rename is an operation on the directory, not on the file. This is why you need write access to a directory to rename a file - no access is required to the file itself (this applies to UNIX).
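      A quick way to convince yourself (on Unix): rename a file while holding an open, flocked handle on it. The handle and the lock both survive, because they're attached to the inode, not the name. A small demonstration, with placeholder file names:

      ```perl
      use strict;
      use warnings;
      use Fcntl qw(:flock);

      open my $fh, '>', 'demo.txt' or die "can't open demo.txt: $!";
      flock $fh, LOCK_EX           or die "can't flock demo.txt: $!";
      print {$fh} "still here\n";

      # Rename the file out from under the open, locked handle.
      rename 'demo.txt', 'demo.renamed' or die "can't rename: $!";

      # The handle and the lock are untouched: writes keep going to the
      # same inode, which is now reachable as demo.renamed.
      print {$fh} "second line\n";
      close $fh;
      ```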
      As a side-note, it is rename() that does not work across file systems. The "mv" command itself actually acts as a wrapper when the source / destination are on different file systems.

      Depending on the UNIX implementation, there are some considerations that may arise when moving a file across filesystems:

      1) The source file is copied to the target filesystem and then deleted. It is roughly equivalent to "rm -f DEST && cp -PRp SRC DEST && rm -rf SRC".
      2) mv must explicitly copy modification/access time, ownership and mode.
      3) on some Unix systems, setuid/setgid permissions are not preserved.
      4) ACLs may or may not be replicated.

      Hence the Cookbook warning on rename() across file systems. The "mv" command actually does work across filesystems on modern UNIX systems, since it is a requirement of IEEE Std 1003.1-2001.
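      In Perl, the portable counterpart of "mv" is move() from the core File::Copy module: it tries rename() first and falls back to copy-plus-delete when that fails (e.g. EXDEV, "cross-device link"). Note the fallback is not atomic, so it loses the crash-safety of a bare rename(). A small sketch with placeholder names:

      ```perl
      use strict;
      use warnings;
      use File::Copy qw(move);

      # Set up a file to move.
      open my $fh, '>', 'data.txt' or die "can't create data.txt: $!";
      print {$fh} "payload\n";
      close $fh;

      # move() uses rename() when source and destination are on the same
      # filesystem, and copies then deletes the source otherwise.
      move('data.txt', 'data.moved') or die "move failed: $!";
      ```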

      Regards,
      Niel

Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by Anonymous Monk on Nov 03, 2006 at 08:40 UTC

    Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?

    $old.

    Q2. ...

    I would do something like:

    use Fcntl qw(:DEFAULT :flock :seek);

    my $old = "...";
    my $new = "...";   # Assume it's safe to clobber this file.
    my ($o_fh, $n_fh);

    #
    # The sysopen doesn't use the O_CREAT flag. If the task is to
    # *change* $old, it should be an error if the file doesn't exist.
    #
    sysopen($o_fh, $old, O_RDWR) or die "...";
    flock($o_fh, LOCK_EX)        or die "...";

    # Now we have an exclusive lock.

    #
    # Open $new in read-write mode. Create if necessary.
    #
    sysopen($n_fh, $new, O_RDWR | O_CREAT) or die "...";

    #
    # Get rid of any existing data in the file.
    #
    truncate $n_fh, 0 or die "...";

    while (<$o_fh>) {
        ...
        print $n_fh $_;
    }

    #
    # Go back to beginning of files.
    #
    seek $n_fh, 0, SEEK_SET or die "...";
    seek $o_fh, 0, SEEK_SET or die "...";

    while (<$n_fh>) {
        print $o_fh $_;
    }

    #
    # Truncate any remaining garbage.
    #
    truncate $o_fh, tell $o_fh or die "...";
    close $o_fh or die "...";
    Note that you only have to go this way if $old is a large file. Otherwise, you could just suck in the content of $old, modify it, and write it back, without going to the trouble of using $new. You still need to flock $old and do the truncate, though.
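    That simpler route for small files might be sketched as follows (a sketch, not code from the thread: lock $old, slurp it, modify in memory, rewrite in place, truncate):

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(:flock :seek);

    sub edit_in_place {
        my ($file, $modify) = @_;    # $modify: coderef, content in, content out

        # '+<' opens read-write without clobbering, so we can lock first.
        open my $fh, '+<', $file or die "can't open $file: $!";
        flock $fh, LOCK_EX       or die "can't flock $file: $!";

        # Slurp the whole file and transform it in memory.
        my $content = do { local $/; <$fh> };
        $content = $modify->($content);

        # Rewind, rewrite, and cut off any leftover bytes if the new
        # content is shorter than the old.
        seek $fh, 0, SEEK_SET  or die "can't seek $file: $!";
        print {$fh} $content   or die "can't write $file: $!";
        truncate $fh, tell $fh or die "can't truncate $file: $!";
        close $fh              or die "can't close $file: $!";  # releases the lock
    }
    ```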

    Q3: Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works, thereby causing the other instance (e.g., another web user submitting data via the same page's form) to produce mangled or empty data when that instance renames $new to $old? Yikes, there goes the database.

    If you get an exclusive lock on $old before opening $new, and don't let the lock go (note that closing the file lets the lock go), there's no problem.

    Q4: In a multi-user environment, does a careful programmer need to use "sysopen/flock LOCK_EX/truncate" every time a script needs to write a file? And now a final wrinkle on the addition of a file lock for $new in the recipe.

    No. Only if it's possible that more than one process (or thread) tries to modify the file.

    Q5...

    The question rests on a wrong premise. It's not $new that needs to be locked - it's $old that needs to be locked.

Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by ftumsh (Scribe) on Nov 03, 2006 at 12:10 UTC
    WRT Linux, locks are advisory, i.e. even if you lock the file it doesn't keep other processes out of the file. I don't think it matters in your particular case, but one must bear this in mind and check that the particular file one wants to lock has no other processes writing to it. E.g., an ftp process is downloading the file and your process wants to write to the file (for some reason). If the ftp doesn't lock the file you will have to use "fuser" to test whether the file has any processes attached to it. This will probably require root privileges if the writing process runs as a different user to the reading one...
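    Since the locks are advisory, a cooperating process can at least test for one without blocking, via LOCK_NB. A small sketch (file name is a placeholder):

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    open my $fh, '>>', 'shared.dat' or die "can't open shared.dat: $!";

    # Non-blocking attempt: fails immediately (EWOULDBLOCK) if another
    # *cooperating* process holds the lock. A process that never calls
    # flock can still write to the file regardless - the lock is advisory.
    if (flock $fh, LOCK_EX | LOCK_NB) {
        print {$fh} "got the lock\n";
        flock $fh, LOCK_UN;
    }
    else {
        warn "shared.dat is locked by someone else: $!";
    }
    close $fh;
    ```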