Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

davebaker has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks!

I would like to modify a file in place using the method recommended as the "best" method in the Perl Cookbook, 2d ed., recipe 7.15, Modifying a File in Place with a Temporary File. I don't understand something the authors say; it seems critical to understand it, though.

(The file to be modified in my case is an important flat-file database (one record per line); web users can use a CGI script to either add data to the file or to edit their own records; I'm concerned about possible file corruption when two or more users are submitting new or revised data at about the same instant. I know I could use a real database but I really want to figure out file locking using Perl. Seems like this issue must come up all the time in a multiuser environment, whether web or internal network.)

The code provided in the recipe is:

open( OLD, "<", $old )         or die "can't open $old: $!";
open( NEW, ">", $new )         or die "can't open $new: $!";

while (<OLD>) {
    # change $_, then ...
    print NEW $_               or die "can't write $new: $!";
}

close( OLD )                   or die "can't close $old: $!";
close( NEW )                   or die "can't close $new: $!";

rename( $old, "$old.orig" )    or die "can't rename $old to $old.orig: $!";
rename( $new, $old )           or die "can't rename $new to $old: $!";

Some discussion follows, then the authors say:

Note that rename won't work across filesystems, so you should create your temporary file in the same directory as the file being modified.
The truly paranoid programmer would lock the file during the update. The tricky part is that you have to open the file for writing without destroying its contents before you can get a lock to modify it. Recipe 7.18 shows how to do this.

(Emphasis supplied by me.)

Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?

Q2: Regarding the reason for being "truly paranoid" -- is this because we don't want another running instance of this script to be writing to $new while we are, so we ought to revise this script (and hence both instances) to get a LOCK_EX before writing to $new?

To get the desired file lock, the authors caution that the "tricky part" is to first open the file for writing without clobbering its contents. I have read elsewhere in the book that "open (OUT, ">", $out)" would "clobber" any existing file named $out before a script would have a chance to get a lock on the file, and I've read (p. 421 of Programming Perl, 3d ed.) that the best method for writing to a file is to use sysopen, which does not clobber any file that exists, as in:

use Fcntl qw( :flock :DEFAULT );
sysopen( OUT, $out, O_WRONLY|O_CREAT ) or die "can't sysopen $out: $!";
flock( OUT, LOCK_EX )                  or die "can't flock $out: $!";
truncate( OUT, 0)                      or die "can't truncate $out: $!";
# now write to file...
close( OUT )                           or die "can't close $out: $!";

Q3: I'm not sure I completely understand the hazards of "clobbering." Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works, thereby causing the other instance (e.g., another web user submitting data via the same page's form) to produce mangled or empty data when that instance renames $new to $old? Yikes, there goes the database.

Q4: In a multi-user environment, does a careful programmer need to use "sysopen/flock LOCK_EX/truncate" every time a script needs to write a file? If a plain open ">" technique is used there would seem to be a potential clobbering problem.

Q5: A final wrinkle on the addition of a file lock for $new in the recipe: wouldn't we want to keep $new open (and hence the LOCK_EX in place) until after the "rename( $new, $old )"? Would that work, though? I'm concerned that the rename function implicitly closes the file being renamed and breaks the lock on it before doing something as drastic as renaming it.


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by grep (Monsignor) on Nov 03, 2006 at 02:25 UTC

Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?

Q2: Regarding the reason for being "truly paranoid" -- is this because we don't want another running instance of this script to be writing to $new while we are,
Nope. You're stuck on the $new temp file when the $old original file is what you should be concerned about. You should be using File::Temp to get a uniquely named $new temp file.

I'm not sure I completely understand the hazards of "clobbering." So...
It's when this happens:

UserA                    UserB                    Orig File
Open $orig                                        Original Content
                                                          |
Reads $orig              Opens $orig                      |
                                                          |
Modify $orig in Memory   Reads $orig                      |
                                                          |
Write $orig to FS        Modify $orig in Memory     UserA Content
                                                          |
                         Write $orig to FS          UserB Content

Q3: Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works,
Nope (at least if you use File::Temp). You only have to be concerned about the file that has the unchanging name. That is where 'clobbering' occurs.
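A minimal sketch of the File::Temp suggestion, for illustration only. The helper name rewrite_via_temp and the '.tmpXXXXXX' template are assumptions, not something from the Cookbook recipe. The temp file is created in the same directory as the target so that rename stays on one filesystem. Note that this sketch does no locking at all; it only removes the risk of two processes sharing one temp file name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: rewrite $path line by line through the $edit
# callback, going via a uniquely named temp file in the same directory.
sub rewrite_via_temp {
    my ($path, $edit) = @_;

    open(my $in, '<', $path) or die "can't open $path: $!";

    # UNLINK => 0: we rename the file ourselves, so File::Temp must not
    # remove it behind our back.
    my ($out, $tmp) = tempfile('.tmpXXXXXX', DIR => dirname($path), UNLINK => 0);

    while (my $line = <$in>) {
        print {$out} $edit->($line) or die "can't write $tmp: $!";
    }

    close($in)  or die "can't close $path: $!";
    close($out) or die "can't close $tmp: $!";

    rename($tmp, $path) or die "can't rename $tmp to $path: $!";
    return;
}
```

It could be called as rewrite_via_temp('foo.txt', sub { uc $_[0] }). File::Temp creates its files with restrictive permissions, so a chmod before the rename may be needed if other users must read the result.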

Q4: In a multi-user environment, does a careful programmer need to use "sysopen/flock LOCK_EX/truncate" every time a script needs to write a file? And now a final wrinkle on the addition of a file lock for $new in the recipe.
Depends.

If it's really important then yes, you should.
If it's not critical and not changed very often, locking is not that critical.
If you are reasonably sure that only one instance of one program will be updating the file, locking is generally not required.

Q5: Wouldn't we want to keep $new open (and hence the LOCK_EX in place) until after the "rename( $new, $old )"?
You're still stuck on $new, but I'll rework your question towards what I think you want to ask: 'When should I be releasing a lock?'

The best strategy IMO is to create a '.lock' file and flock that. Like this:

Once your program decides to modify the file 'foo.txt':

1) Check for a flocked 'foo.lock' file. If you're clean, create a 'foo.lock' and lock it.
2) Read 'foo.txt'.
3) Modify.
4) Write it to a unique temp file via File::Temp.
5) Rename the temp file to 'foo.txt'.
6) Delete 'foo.lock'.
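The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not grep's actual code: the helper name update_with_lockfile and the "$path.lock" naming are made up here. It also departs from the steps in two ways. It simply blocks in flock instead of checking first, and it leaves the lock file in place at the end, because deleting a file that other processes may already have open and be waiting on can leave two processes holding locks on two different files with the same name.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: apply $edit to every line of $path while holding
# an exclusive lock on a separate "$path.lock" file.
sub update_with_lockfile {
    my ($path, $edit) = @_;
    my $lock = "$path.lock";

    # '>>' creates the lock file if needed and never truncates anything.
    open(my $lock_fh, '>>', $lock) or die "can't open $lock: $!";
    flock($lock_fh, LOCK_EX)       or die "can't flock $lock: $!";

    # Read the current data only after the lock is held, so we never
    # work from a copy that another updater is about to replace.
    open(my $in, '<', $path) or die "can't open $path: $!";
    my ($out, $tmp) = tempfile('.tmpXXXXXX', DIR => dirname($path), UNLINK => 0);
    while (my $line = <$in>) {
        print {$out} $edit->($line) or die "can't write $tmp: $!";
    }
    close($in)  or die "can't close $path: $!";
    close($out) or die "can't close $tmp: $!";

    rename($tmp, $path) or die "can't rename $tmp to $path: $!";

    # Closing the handle releases the lock; the lock file itself stays.
    close($lock_fh) or die "can't close $lock: $!";
    return;
}
```

Every program that touches the data file has to go through the same lock file for this to help, since the lock is only advisory.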

grep

One dead unjugged rabbit fish later


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by graff (Chancellor) on Nov 03, 2006 at 03:38 UTC


The file to be modified in my case is an important flat-file database (one record per line); web users can use a CGI script to either add data to the file or to edit their own records; I'm concerned about possible file corruption when two or more users are submitting new or revised data at about the same instant.

In that sort of scenario, there are a couple things to watch out for:

There's no locking. Bob pulls the data into his browser at 10:00, spends 15 minutes figuring out how to change it, then uploads his version. Meanwhile, Joe pulls the data at 10:05, spends 5 minutes working on his update, and uploads it. As of 10:15, Joe's updates are lost forever (or until he sees a problem and repeats his work).
There is locking, but Joe and Bob actually manage to beat the odds and both their updates hit the server within a few cpu cycles of each other (relatively speaking); Joe's thread opens the file for output, then Bob's opens it for output, then Joe tries to get the lock on the file, then... ouch! my brain!!

Obviously, the first scenario is the one you really should worry about. It's not just a matter of using flock on the file; in fact, the more I think about it, the more unsuitable flock seems to be for web-based stuff. If you solve the first problem, the second one is a moot point.

As the first reply points out, you need some sort of "check-out/check-in" mechanism to keep different users from stepping on each other's updates. A user needs to explicitly request write access to the data file, and when your cgi script services that request, it has to know whether someone else has already been given write access.

And that's where you need to resolve any possible race condition: any given thread either gets the access (thereby blocking others), or else fails to do so because it is currently granted to someone else. For this purpose, checking for the existence of some "access.lock" file and creating it if it does not exist is almost atomic enough -- something like:

  my $fh = undef;
  ( -e "access.lock" || open( $fh, ">", "access.lock" ));
  if ( not defined( $fh )) {
      # report that someone else is editing the file
  } else {
      # write client/session-id data to access.lock and close it
      # so you can verify when this client sends the update
  }
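If "almost atomic" is not enough, one possible variation (an assumption on my part, not graff's code) is to let the operating system do the test and the creation in a single step with sysopen and O_CREAT|O_EXCL. The helper name try_checkout and its arguments are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT);
use Errno qw(EEXIST);

# Hypothetical helper: try to take the check-out lock for $session_id.
# Returns 1 if we created the lock file, 0 if it already existed.
sub try_checkout {
    my ($lock, $session_id) = @_;

    # O_CREAT|O_EXCL fails if the file exists, so "create only if
    # absent" happens as one step rather than as a test plus an open.
    if (sysopen(my $fh, $lock, O_WRONLY | O_CREAT | O_EXCL)) {
        print {$fh} "$session_id\n" or die "can't write $lock: $!";
        close($fh)                  or die "can't close $lock: $!";
        return 1;
    }
    return 0 if $! == EEXIST;    # someone else holds the check-out
    die "can't create $lock: $!";
}
```

A caller would report "someone else is editing the file" when try_checkout returns 0, and would unlink the lock file once the matching update has been accepted.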

But web interactions being what they are, you also need a policy: some upper bound on how long a client may hold the access lock. If Bob does a check-out at 10:00am and tries to upload his update at 10:00pm, it might be prudent to tell him at that point that he waited too long to submit the update and please try again using a fresh download (and please try to return it more quickly).

Or the policy could be more flexible: a client may keep the lock for at least N minutes, or until someone else requests the lock after the minimum N minutes have passed -- that is, another client can "steal" the lock if it's more than N minutes old.
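The age test behind such a policy might look like the sketch below. It assumes the lock is a plain file whose modification time records when the check-out happened; the helper name lock_is_stale is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $lock exists and was last modified more
# than $max_age seconds ago, i.e. another client may "steal" it.
sub lock_is_stale {
    my ($lock, $max_age) = @_;
    my @st = stat($lock) or return 0;     # no lock file: nothing to steal
    return (time - $st[9]) > $max_age;    # $st[9] is the mtime
}
```

For a ten-minute minimum, the check would be lock_is_stale('access.lock', 10 * 60). Actually taking over a stale lock is a separate step with its own race (two clients may both see it as stale), so that step needs the same care as the original check-out.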

I know I could use a real database but I really want to figure out file locking using Perl. Seems like this issue must come up all the time in a multiuser environment, whether web or internal network.

It's good to make sure you understand file locking, even if it doesn't exactly apply to the current task. And yes, it's an old topic. Consider this old node, drawn from an even older article by Sean Burke, published in The Perl Journal back in 2001 (and sadly hard to find these days). Meanwhile, get started on using a real database for your current web app.


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by fmerges (Chaplain) on Nov 03, 2006 at 01:14 UTC

Hi,

If you write all the programs that write to the files, you can use locks. But take care: file locks are not a real restriction; a program can ignore them if it wants.

The temporary file is the $new one, which, after you have done what you need to do, is renamed to the same name as the old one. This is not very clean. I would use a version control system; the well-known RCS would be enough, and you don't need CVS or SVN for simple stuff.

For this kind of problem, take a look at a wiki, for example Kwiki. They also needed to solve this issue, that is, more than one client wanting to make an update.

Here you can read more info about file locking with perl.

Regards,

fmerges at irc.freenode.net


Re^2: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by ikegami (Patriarch) on Nov 03, 2006 at 04:34 UTC

If you write all the programs that write to the files, you can use locks, but take care, file locks is not a real restriction, you can ignore it if you want

Not necessarily. Some OSes (Windows, for example) have mandatory locks (as opposed to advisory locks).


Re^3: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by halley (Prior) on Nov 03, 2006 at 14:16 UTC

Even the "mandatory" locks on Windows aren't infallible when you get into network shared filesystems, especially SMB/Samba connections, because the whole modification stack isn't under one machine's control. You are right: they profess to be mandatory and cause you grief if you ignore them, but they cause you grief anyway when the remote system does unexpected things.

--
[ e d @ h a l l e y . c c ]


Re^3: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by fmerges (Chaplain) on Nov 03, 2006 at 13:58 UTC

Hi,

You're right, but I wasn't talking in general sense, 'cause the code snippet pasted was written in some UN*X... ;-)

Regards,

fmerges at irc.freenode.net


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by jbert (Priest) on Nov 03, 2006 at 09:05 UTC

Pounding a nail, shoe or glass bottle

The tool which seems most suitable to me is SQLite - it essentially offers all (most) of the features of a server-based database, but just works on a file like you have - doing all the locking and concurrency you need, on Unix/Linux or Windows.

The overhead for you will be:

Converting your existing data to a SQL table
Learning enough SQL to read and update the db (unless you know SQL already)

And the SQL isn't hard, and you can play at the sqlite prompt:

$ sqlite3 foo.db
SQLite version 3.2.8
Enter ".help" for instructions
sqlite> create table players (id integer, name varchar(256), score integer);
sqlite> insert into players (id, name, score) values (1, "bob", 100);
sqlite> insert into players (id, name, score) values (2, "sally", 200);
sqlite> insert into players (id, name, score) values (3, "frank", 10);
sqlite> select * from players;
1|bob|100
2|sally|200
3|frank|10
sqlite> select score from players where name='sally';
200
sqlite> update players set score=score+10 where name='sally';
sqlite> select score from players where name='sally';
210
sqlite> delete from players where name='bob';
sqlite> select * from players;
2|sally|210
3|frank|10
sqlite>

Note that the update above changes the score with a single statement (score=score+10) rather than reading the score and then writing back a new value. This is because another process could have got in and altered the score between my read and my update. There is no chance of that happening if done with one statement.

If you want to get deeper into SQL you can get around this with transactions and/or locks, but the above is probably all you need.

You can get very complicated with SQL if you need/want to, and in fact I've stuck in a (redundant?) id number in there out of habit - which I could have told the system was an indexed, unique key. But all of that stuff is really for bigger systems where you're giving hints to help performance and similar. If you're currently using a flat file you're probably not near worrying about that yet.


Re^2: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by davebaker (Pilgrim) on Nov 03, 2006 at 15:37 UTC

I put too much emphasis on the Cookbook's statement that the recipe is the best way for modifying a file in place. Instead of trying to figure out how to add the file locking that the authors recommend to improve the recipe even further, I should have recognized that modifying a file in place is not the right recipe to solve my problem of creating a database that correctly handles updates.

Thanks for not flogging me <g>


Re^3: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by jbert (Priest) on Nov 03, 2006 at 17:30 UTC

And, to keep things perlish...

You may already know all this, but there are a *lot* of perl approaches to accessing databases. At their base, they all use 'DBI'. That defines the interface, and a DBD::xxx module provides the back-end which talks to the database.

There is an abundance of modules to layer on top of these if you choose (DBIx::Class, Class::DBI and others), which can save you from having to actually use SQL. I'm not sure I'd recommend these if your needs are really simple. But have a play and see what suits you best. There is also plenty of DB-related stuff in the PerlMonks Tutorials section.


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by cdarke (Prior) on Nov 03, 2006 at 07:59 UTC

I'm concerned that the rename function implicitly closes the file being renamed and breaks the lock on it before doing something as drastic as renaming it.
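One way to check this concern directly is a small experiment. The sketch below assumes a Unix-like system, where rename changes directory entries and leaves open handles alone; the helper name rename_while_open and the file names are made up for the test. It renames a file while a locked handle is still open on it, then keeps writing through that handle.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical experiment: rename a file while holding an open, locked
# handle on it. Returns (same_inode, content_found_under_new_name).
sub rename_while_open {
    my ($dir) = @_;
    my ($new, $old) = ("$dir/new.txt", "$dir/old.txt");

    open(my $fh, '>', $new) or die "can't open $new: $!";
    flock($fh, LOCK_EX)     or die "can't flock $new: $!";
    my $inode_before = (stat $fh)[1];

    # Rename while the handle is still open and locked.
    rename($new, $old) or die "can't rename $new to $old: $!";

    # The handle is still usable and writes land in the renamed file.
    print {$fh} "written after rename\n" or die "can't write: $!";
    close($fh) or die "can't close: $!";

    my $inode_after = (stat $old)[1];
    open(my $check, '<', $old) or die "can't open $old: $!";
    my $content = do { local $/; <$check> };
    close($check);

    return ($inode_before == $inode_after, $content);
}
```

Since the lock belongs to the open handle rather than to the name, it should stay in force until the handle is closed. Behaviour on other platforms, Windows in particular, may differ.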


Re^2: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function

by 0xbeef (Hermit) on Nov 03, 2006 at 12:01 UTC

Depending on the UNIX implementation, there are some considerations that may arise when moving a file across filesystems:

1) The source file is copied to the target filesystem and then deleted. It is roughly equivalent to "rm -f DEST && cp -PRp SRC DEST && rm -rf SRC".
2) mv must explicitly copy modification/access time, ownership and mode.
3) on some Unix systems, setuid/setgid permissions are not preserved.
4) ACLs may or may not be replicated.

Hence the Cookbook warning on rename() across file systems. The "mv" command actually does work across filesystems on modern UNIX systems, since it is a requirement of IEEE Std 1003.1-2001.
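In Perl, the closest equivalent to "mv" is probably move from the core File::Copy module, which renames when it can and otherwise copies the file and deletes the original. A small sketch, with move_anywhere as a hypothetical wrapper name:

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Hypothetical wrapper: move $src to $dst even when the two paths may
# be on different filesystems. File::Copy::move renames if possible and
# otherwise falls back to copying and then deleting the source.
sub move_anywhere {
    my ($src, $dst) = @_;
    move($src, $dst) or die "can't move $src to $dst: $!";
    return;
}
```

The copy fallback is not atomic, and the considerations listed above about ownership, modes and ACLs apply to it, so for replacing a live data file a temp file in the same directory plus rename remains the safer choice.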

Regards,
Niel


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by Anonymous Monk on Nov 03, 2006 at 08:40 UTC

Q1: In "The truly paranoid programmer would lock the file", which file are the authors referring to?

$old.

Q2. ...

I would do something like:

use Fcntl qw(:DEFAULT :flock :seek);
my $old = "...";
my $new = "...";  # Assume it's safe to clobber this file.
my ($o_fh, $n_fh);
#
# The sysopen doesn't use the O_CREAT flag. If the
# task is to *change* $old, it should be an error if the
# file doesn't exist.
#
sysopen($o_fh, $old, O_RDWR) or die "...";
flock($o_fh, LOCK_EX) or die "...";
# Now we have an exclusive lock.
#
# Open $new in read-write mode. Create if necessary.
#
sysopen($n_fh, $new, O_RDWR | O_CREAT) or die "...";
#
# Get rid of any existing data in the file.
#
truncate $n_fh, 0 or die "...";
while (<$o_fh>) {
    ...
    print $n_fh $_;
}
#
# Go back to beginning of files.
#
seek $n_fh, 0, SEEK_SET or die "...";
seek $o_fh, 0, SEEK_SET or die "...";
while (<$n_fh>) {
    print $o_fh $_;
}
#
# Truncate any remaining garbage.
#
truncate $o_fh, tell $o_fh or die "...";
close $o_fh or die "...";

Q3: Is the problem the fact that $new might exist already because another instance of this script running at the same time had created $new a split-second ago in connection with its own update of $old, and that our process will destroy the contents of that $new due to the way ">" works, thereby causing the other instance (e.g., another web user submitting data via the same page's form) to produce mangled or empty data when that instance renames $new to $old? Yikes, there goes the database.

If you get an exclusive lock on $old before opening $new, and don't let the lock go (note that closing the file lets the lock go), there's no problem.

Q4...

No. Only if it's possible that more than one process (or thread) will try to modify the file.

Q5...

The question uses a wrong premise. It's not $new that needs to be locked - it's $old that needs to be locked.


Re: Best practices for modifying a file in place: q's about opening files, file locking, and using the rename function
by ftumsh (Scribe) on Nov 03, 2006 at 12:10 UTC

With respect to Linux, locks are advisory, i.e. even if you lock the file, that doesn't keep other processes out of the file. I don't think it matters in your particular case, but one must bear this in mind and check that the file one wants to lock has no other processes writing to it. E.g., an ftp process is downloading the file and your process wants to write to the file (for some reason). If the ftp process doesn't lock the file, you will have to use "fuser" to test whether the file has any processes attached to it. This will probably require root privileges if the writing process runs as a different user from the reading one...
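What "advisory" means in practice can be seen in a short experiment. The sketch assumes a platform such as Linux where Perl's flock uses the system flock call, so that separately opened handles conflict even inside one process; where flock is emulated with fcntl locks, the second lock request would have to come from another process. The name advisory_demo is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical demo: returns (lock_refused, write_succeeded) for $path.
sub advisory_demo {
    my ($path) = @_;

    open(my $holder, '>>', $path) or die "can't open $path: $!";
    flock($holder, LOCK_EX)       or die "can't flock $path: $!";

    # A cooperating writer asks for the lock first and is refused...
    open(my $polite, '>>', $path) or die "can't open $path: $!";
    my $refused = flock($polite, LOCK_EX | LOCK_NB) ? 0 : 1;

    # ...but a writer that never calls flock is not stopped at all.
    open(my $rude, '>>', $path) or die "can't open $path: $!";
    my $wrote = print {$rude} "written without asking\n";
    close($rude) or die "can't close $path: $!";

    close($polite);
    close($holder);
    return ($refused, $wrote ? 1 : 0);
}
```

The lock only protects the file against programs that ask for it, which is why every writer has to follow the same locking protocol.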
