Perl Best Practices book: is this one a best practice or a dodgy practice?

by eyepopslikeamosquito (Archbishop)
on Sep 03, 2005 at 00:38 UTC ( [id://488824] )

I'm very much enjoying reading Perl Best Practices. I find myself scribbling notes after almost every page; mainly concrete ideas for improving the quality and especially the maintainability of production Perl code at work. To me, this is the most important Perl book to be published for years, because it helps me sell Perl as a maintainable language to management.

However, the In-situ Arguments ("Allow the same filename to be specified for both input and output") practice described on page 304 in chapter 14 has me scratching my head, for it seems to me to be more a "dangerous practice" than a "best practice".

Here is a test program, derived from the example given in the book:

# The idea is to use the Unix unlink trick to write the
# destination file without clobbering the source file
# (in the case where the source and destination are the same file).
use strict;
use warnings;

my $source_file      = 'fred.tmp';
my $destination_file = $source_file;

# Open both filehandles...
use Fatal qw( open );
open my $src, '<', $source_file;
unlink $destination_file;
open my $dest, '>', $destination_file;

# Read, process, and output data, line-by-line...
while (my $line = <$src>) {
    print {$dest} transform($line);
}

# This is my test version of the transform() function;
# the sleep is there for convenience in testing what happens
# if you interrupt proceedings mid stream by pressing CTRL-C.
sub transform {
    sleep 1;
    return "hello:" . $_[0];
}

My problems with this code are:

  • Consider what happens if the while loop is interrupted: by power failure, by user pressing CTRL-C, or because the print fails (due to disk full or disk quota exceeded, say). You've just corrupted your file. You've probably lost data. Worse, you don't know you've done it. And when you go to re-run the script after the interruption, you may spend a lot of time trying to figure out what's happened to your data ... That is, this idiom is not "re-runnable".
  • The unlink trick used to avoid clobbering the input file works on Unix, but may not work on other operating systems. In particular, when run on Windows, the above program clobbers the source file. That is, this idiom is not portable.

As discussed in Re-runnably editing a file in place, it seems sounder to first write a temporary file. Once you're sure the temporary file has been written without error (and after the permissions on the temporary are updated to match the original) you then (atomically) rename the temporary file to the original. In that way, if writing the new file is interrupted for any reason, you can simply re-run the program without losing any data.
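
For concreteness, here is a minimal sketch of that approach, in the spirit of the test program above. The temporary filename scheme, the permission copying via chmod, and the transform() stub are my own illustrative choices, not code from the book:

use strict;
use warnings;
use Fatal qw( open close );

my $source_file      = 'fred.tmp';
my $destination_file = $source_file;
my $temp_file        = "$destination_file.new.$$";

# Write the transformed output to a temporary file first...
open my $src,  '<', $source_file;
open my $dest, '>', $temp_file;
while (my $line = <$src>) {
    print {$dest} transform($line);
}
close $dest;
close $src;

# ...copy the original file's permission bits onto the temporary...
my $mode = (stat $source_file)[2] & 07777;
chmod $mode, $temp_file
    or die "Cannot set permissions on $temp_file: $!";

# ...then atomically replace the original. If anything above was
# interrupted, the original file is untouched and the script can
# simply be re-run.
rename $temp_file, $destination_file
    or die "Cannot rename $temp_file to $destination_file: $!";

sub transform {
    return "hello:" . $_[0];
}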

Please let me know what I've overlooked.

Replies are listed 'Best First'.
Re: Perl Best Practices book: is this one a best practice or a dodgy practice?
by TheDamian (Vicar) on Sep 03, 2005 at 02:54 UTC
    Please let me know what I've overlooked.
    Well, if the suggested idiom doesn't work under Windows, I'd argue that either Windows (or possibly Perl) is broken in that regard. But, of course, that doesn't solve your problem. I guess it's just as well that Chapter 18 suggests testing everything before you deploy it! ;-)

    Although the recommendation in question doesn't claim to solve the problem of rerunnability, as it happens, the second suggested solution--using the IO::InSitu module--does use a back-up strategy to ensure that data is not lost if the program abends.

      ... the second suggested solution--using the IO::InSitu module--does use a back-up strategy to ensure that data is not lost if the program abends.

      True. But it is still not re-runnable. Which makes it dangerous in the hands of naive users who interrupt a program with CTRL-C, then re-run it. If they do that, they may suffer permanent data loss without being aware of it.

      It seems to me that you can get re-runnability with little extra effort: simply write the temporary file first and only overwrite the original (via atomic rename) after the temporary has been successfully written.

      As a test, I pressed CTRL-C midway through running this test program:

      use strict;
      use warnings;
      use IO::InSitu;

      my $infile_name  = 'fred.tmp';
      my $outfile_name = $infile_name;

      my ($in, $out) = open_rw($infile_name, $outfile_name);

      for my $line (<$in>) {
          print {$out} transform($line);
      }

      # Try pressing CTRL-C while file is being updated.
      sub transform {
          sleep 1;
          return "hello:" . $_[0];
      }
      This is what I saw:
      total 20
      drwxrwxr-x 2 andrew andrew 4096 Sep 3 14:44 ./
      -rw-rw-r-- 1 andrew andrew 0 Sep 3 14:42 fred.tmp
      -rw-rw-r-- 1 andrew andrew 191 Sep 3 14:42 fred.tmp.bak
      drwxrwxr-x 11 andrew andrew 4096 Sep 3 14:42 ../
      -rw-rw-r-- 1 andrew andrew 288 Sep 3 14:41 tsitu1.pl
      Now, of course, blindly re-running the test program resulted in permanent data loss (an empty fred.tmp file in this example).

      Update: Just to clarify, this problem is broader than the naive user scenario given above and may bite you anytime a script is automatically rerun after an interruption -- a script that is run automatically at boot time, for example.

      Further update: More detail on Win32 rename, related to tye's response below, can now be found at Re^7: Read in hostfile, modify, output.

        Which makes it dangerous in the hands of naive users who interrupt a program with CTRL-C, then re-run it. If they do that, they may suffer permanent data loss without being aware of it.
        To quote Oscar Wilde's Miss Prism: "What a lesson for him! I trust he will profit by it." ;-)
        It seems to me that you can get re-runnability with little extra effort: simply write the temporary file first and only overwrite the original (via atomic rename) after the temporary has been successfully written.
        The IO::InSitu module could certainly be reworked to operate that way, except that it would then fail to preserve the inode of the original file. :-( Perhaps I will add an option to allow it to work whichever way (i.e. "inode-preserving" vs "rerunnable") the user prefers.

        Bear in mind though that an "atomic rename" isn't really atomic under most filesystems, so even this approach still isn't going to absolutely guarantee rerunnability.
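
        For contrast, here is a rough sketch of what an "inode-preserving" update looks like (this is not IO::InSitu's actual implementation; the filenames and the transform() stub are illustrative). The original file's directory entry and inode survive, at the price of a window during which it holds only partial output, which is why the backup copy matters:

        use strict;
        use warnings;
        use File::Copy qw( copy );

        my $file   = 'fred.tmp';
        my $backup = "$file.bak";

        # Keep a backup copy of the original...
        copy($file, $backup)
            or die "Cannot back up $file to $backup: $!";

        # ...then rewrite the original in place, reading from the backup.
        # Opening '>' truncates the file but keeps its inode (and any
        # hard links to it).
        open my $in,  '<', $backup or die "Cannot read $backup: $!";
        open my $out, '>', $file   or die "Cannot rewrite $file: $!";

        while (my $line = <$in>) {
            print {$out} transform($line);
        }

        close $out or die "Cannot close $file: $!";
        close $in;

        sub transform { return "hello:" . $_[0] }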

      Well, if the suggested idiom doesn't work under Windows, I'd argue that either Windows (or possibly Perl) is broken in that regard.

      Good luck getting Bill Gates or p5p to acknowledge that their unlink is broken. ;-) The Perl documentation for the unlink function is annoyingly vague: "Deletes a list of files. Returns the number of files successfully deleted.".

      The POSIX unlink semantics are clearer, if difficult to implement on non-Unix systems, as noted in the djgpp POSIX unlink specification:

      The POSIX specification requires this removal to be delayed until the file is no longer open. Due to problems with the underlying operating systems, this implementation of unlink does not fully comply with the specs; if the file you want to unlink is open, you're asking for trouble -- how much trouble depends on the underlying OS.

      I might add that the ANSI C remove function does not appear to mandate POSIX semantics: calling remove on an open file on Linux works (a la POSIX unlink), while on Windows it fails.
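
      For anyone who wants to check their own platform, a small stand-alone probe along these lines shows whether unlinking an open file behaves the POSIX way (the scratch filename is an arbitrary assumption):

      use strict;
      use warnings;

      my $file = 'unlink_probe.tmp';

      open my $out, '>', $file or die "Cannot create $file: $!";
      print {$out} "line 1\nline 2\n";
      close $out or die "Cannot close $file: $!";

      open my $in, '<', $file or die "Cannot open $file: $!";

      my $deleted = unlink $file;
      print "unlink returned: $deleted", ($deleted ? "" : " ($!)"), "\n";
      print "name still visible on disk? ", (-e $file ? "yes" : "no"), "\n";

      # On POSIX systems the inode survives until the last handle is
      # closed, so these reads still succeed after the unlink.
      while (my $line = <$in>) {
          print "read after unlink: $line";
      }
      close $in;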

      BTW, I found a related discussion of atomic file update in the "Atomic in-place edit" section in This Week on perl5-porters (18-24 November 2002). Not sure if anything was resolved, however. (Update: apart from MJD submitting a Bug report for the Universe :-).

      "guess it's just as well that Chapter 18 suggests testing everything before you deploy it! ;-)"

      It is kind of far-fetched to talk about testing here. The best practice is to find "bugs" as early as possible; the later they are found and fixed, the higher the cost.

      The particular bad practice that we are discussing in this case can easily be spotted as early as specification review (if the process requires one to document such details), or at least during peer code review. If a simple issue like this can actually pass all of those guarding processes and make it as far as testing, the processes themselves should be reviewed.

      Update: Having read TheDamian's reply, I have now seen that big ;-). Okay, my fault; please ignore the first sentence of this post.

        It is kind of far-fetched to talk about testing here.
        Hmmmmm. Maybe I didn't make the smiley big enough. Let's try again:
        guess it's just as well that Chapter 18 suggests testing everything before you deploy it! ;-)
        The point I was trying to make is not that you should rely on the testing to catch this problem, but that, if you don't catch the problem earlier, the testing phase should still do so.

        In other words, (just exactly as you say) best practices ought to be an integrated process, so even if one practice introduces a problem, another practice should catch it. Which is precisely what my book advocates.

        One hopes that programmers do at least some testing _before_ code review.
Re: Perl Best Practices book: is this one a best practice or a dodgy practice?
by spiritway (Vicar) on Sep 03, 2005 at 02:46 UTC

    I think you're right about that. It seems a risky way to accomplish what can easily be done with a temp file. I feel that unless there is some compelling reason to take such a risk, you're far better off just creating a temporary file and renaming it when you're done. I'm not even sure what the advantage might be to using the same filename for input and output. Just seems like you're asking for trouble. In my experience, trouble always manages to find me without my having to go looking for it.

      I'm not even sure what the advantage might be to using the same filename for input and output.
      I certainly don't suggest (either here or in the book) that you should use the same file for input and output. I merely observe that, if you allow people to independently specify input and output filenames on the command-line, some of them inevitably will use the same name for both files (either intentionally or accidentally). So you have to be aware of the possibility and cope with it somehow.

      In the book, I suggest two solutions, one of which is more efficient but apparently fails on certain obscure operating systems (and, yes, I will definitely update the book to reflect that limitation, as soon as I can).

        if you allow people to independently specify input and output filenames on the command-line, some of them inevitably will use the same name for both files (either intentionally or accidentally).

        Ah, that makes perfect sense now... it's preemptive programming. You're right - someone will eventually do that...

        In that case, might it not be better to make the "best practice" exposition something like this?
        When accepting command-line arguments that specify both an input file and an output file, be aware that eventually someone using your program will specify the same filename for both input and output. If your code looks like this:
        # Standard open for input, open for output code
        then the result is going to be a blank output file! (or whatever it is your program produces given a blank input file). This isn't very friendly to the user. Slightly more user-friendly is to check for this condition in your code:
        if ($infile eq $outfile) {
            die "Same filename given for both input and output!";
        }
        This protects against a user who accidentally uses the same filename for input and output, but doesn't protect against filenames which are different strings but name the same file (for example, "foo" and "./foo", or "foo" and "bar", if "bar" is a symbolic link to "foo").

        However, it may be the case that the occasional user actually does want the output file to replace the input file, much as perl's own -i option does. Certainly, if possible, we should allow this since we want our programs to be useful to the user. In this case: (... and here continue on with the unlink trick, etc.)
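
        Regarding the "different strings, same file" caveat above, one way to catch most of those cases is to compare the two names' stat entries rather than the strings themselves. A rough sketch, where same_file() is an illustrative helper rather than anything from the book, and the device/inode comparison is only meaningful on filesystems that report those fields (on Windows the inode is often reported as 0):

        use strict;
        use warnings;

        my ($infile, $outfile) = @ARGV;

        # Treat two names as naming the same file if they resolve to the
        # same device and inode numbers.
        sub same_file {
            my ($name_a, $name_b) = @_;
            my @stat_a = stat $name_a;
            my @stat_b = stat $name_b;
            return unless @stat_a && @stat_b;   # e.g. the output file doesn't exist yet
            return $stat_a[0] == $stat_b[0]     # same device...
                && $stat_a[1] == $stat_b[1];    # ...and same inode
        }

        if ( same_file($infile, $outfile) ) {
            die "Same file given for both input and output!\n";
        }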

        --
        @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Perl Best Practices book: is this one a best practice or a dodgy practice?
by pg (Canon) on Sep 03, 2005 at 03:29 UTC

    This is definitely not a best practice. One of the most important transformations in programming over the years is that small tricks are no longer encouraged. Even good tricks are discouraged, because we want programs to be easily understandable, let alone tricks that can cause various kinds of trouble.

    The problem in this case is not really about whether the program is re-runnable, but about the fact that it corrupts your data. To make this clearer, let's talk more about re-runnability:

    • The best type of program can be rerun from the point of failure with no extra manual action, or at most minimal manual action, such as removing or fixing the data that broke the program. Once you fix the bad data, you can rerun the program and it will pick up from where it failed. None of the data that has already been processed will be reprocessed, unless it is logically required.
    • The okay type is a program that does not know how to pick up from the point of failure, so you have to take manual action not only to fix or remove the bad data, but also to reverse whatever had been processed before it broke (much more manual work than the first type). After all those manual exercises, you can run the program again, but most likely it will start from the very beginning of the process, not the break point. If the program is time-consuming, you waste a lot of time. This kind of program is called "not re-runnable".
    • The worst type is what we have seen in this case: the problem is much more than whether it is re-runnable - it corrupts your data.
      The problem in this case is not really about whether the program is re-runnable, but about the fact that it corrupts your data.
      Neither of which is what the original recommendation is about. That recommendation, as it appears in "Perl Best Practices" is:
      In-situ Arguments

      Allow the same filename to be specified for both input and output.


      And that most definitely is a best practice.

      The issue raised here is that one of the two solutions suggested in the book is apparently more limited in applicability than I indicate in the text (a deficiency I have already mentioned that I'll be remedying as soon as I can).

      Now it may well be that "Perl Best Practices" would benefit from the addition of an extra guideline; one that suggests using a transactional approach to avoid corrupting data. But that's only peripherally related to this guideline. It should definitely be a separate suggestion, since the problem of data corruption is not unique to opening the same file twice.

        My copy is at home, as is my co-worker's copy. Hopefully I'm not saying something fundamentally stupid here.

        Perhaps the solution to this entire muddle is to have the first line of commentary after that best practice read:
        Allowing the same name ensures your program will operate correctly when users, inevitably, do specify the same filename. Be Prepared, as they teach Boy Scouts.1

        1 If you have trouble remembering the Scout's famous motto, seek out the Tom Lehrer recording and have that song play through your head while coding.

        Be Appropriate && Follow Your Curiosity
Re: Perl Best Practices book: is this one a best practice or a dodgy practice?
by TedPride (Priest) on Sep 03, 2005 at 10:17 UTC
    Mmm. The only reason to read the original file line by line would be that it's very large, meaning that data processing time is likely to be significant, meaning in turn that there's an increased risk of the process being interrupted. Given that the process may be interrupted, your script can't be designed so that an interruption loses you large chunks of data. I would agree that in this instance, it's necessary to create a temporary file.
      The only reason to read the original file line by line would be that it's very large...
      Surely not the "only" reason. Here are five other possible reasons, just off the top of my head. You might use a line-by-line approach:
      • to allow the application to be used interactively
      • to allow the application to be used as part of an unbuffered pipeline
      • to minimize memory usage even on small inputs, in an embedded (or otherwise memory-limited) environment
      • to restrict the transformation to individual lines (perhaps for backwards compatibility with a previous utility)
      • because the input data is always runtime-generated (or highly volatile in some other way), so rerunnability doesn't matter
      Note that I don't disagree with you that using temporary files is safer in general, if you can afford the costs involved. After all, using temporary files is precisely what IO::InSitu does. I only question your argument that it's possible to deduce the necessity for temporary files from the particular I/O characteristics of the original example (or from any piece of undocumented code, for that matter).
