Best method for failure recovery?

Maestro_007 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to implement a "recover" option in a script that may fail at any of a few given points. Currently we need to rerun the whole thing in the event of a failure, but it's a big data warehousing project, and as the data files grow, recovery will become more and more important.

I prototyped a system that seems to be pretty reliable, but it makes use of the dreaded "goto". I think that's okay, since it is used in a very controlled circumstance and won't wreak havoc, but I wanted to pose this to the group at large: what is the best way to insert "checkpoints" that leave behind the ability to resume a task at the point where it failed?

I've attached two scripts: the first is the main script that performs three tasks. Any time one of these tasks fails, a file is left behind, ensuring that the process can be picked up where it left off. The other is a "wrapper" script, which calls either the recovery file or the whole process from the beginning.

NOTE: There are a few housekeeping things that this obviously doesn't do, e.g. checking for a valid code label, etc. I left them out intending to do that when it comes time for real implementation.

The main script, recover.pl

#!/usr/local/bin/perl
use Getopt::Std;
our ($opt_r);
getopts('r:');

my $recover             = uc $opt_r;
$recover and goto ($recover);

my $rec_file_name       = 'start.rec';

FIRST:          first();
SECOND:         second();
THIRD:          third();
CleanUp();


# do the first task
sub first
{
        # some other stuff may happen here, 
        #  so we'll go ahead and write the recovery info
        write_recovery_file('first');
        print STDERR "This is first\n";
        sleep(1);
#       die "died in first";
}

sub second
{
        write_recovery_file('second');
        print STDERR "This is second\n";
        sleep(1);
#       die "died in second";
}

sub third
{
        write_recovery_file('third');
        print STDERR "This is third\n";
        sleep(1);
#       die "died in third";
}


sub CleanUp
{
        `rm start.rec`;
        print "Recovery file deleted\n" unless ($?);
}

sub write_recovery_file
{
        my $str = shift;
        open RECOVER, ">$rec_file_name";
        print RECOVER "$0 -r$str\n";
        close RECOVER;
}

And here's the wrapper script:

#!/usr/local/bin/perl
# This tests the recovery system

$recovery_file = check_for_recovery_file();

# execute the script at either the recovery step or the beginning
$recovery_file ? recover($recovery_file) : recover();

sub recover
{
        my $file_name = shift;
        my $cmd_line = 'recover.pl';
        if ($file_name)
        {
                open INFILE, "$file_name";
                # assumes the recovery file contains 
                #  no more than one line of text
                chomp($cmd_line = <INFILE>);    
                print STDERR "Resuming failed process: '$cmd_line'";
        }
        `$cmd_line`;
        die "Could not execute $file_name: $!" if ($?);
        print STDERR "'$cmd_line' successful\n";
}

sub check_for_recovery_file
{
        # won't hard-code this in real life
        $_ = 'start.rec';
        (-s) ? return $_ : return 0
}

Any thoughts?


Replies are listed 'Best First'.
Re: Best method for failure recovery? by dragonchild (Archbishop) on Sep 19, 2001 at 22:25 UTC
Instead of gotos, you could just use an array of subrefs.

my @dispatch = (
        \&first,
        \&second,
        \&third,
);

my $start = 0;
$start = $recover if $recover > $start;

foreach my $index ($start .. $#dispatch) {
        &{$dispatch[$index]};
}

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.
Re: Re: Best method for failure recovery? by Maestro_007 (Hermit) on Sep 19, 2001 at 22:36 UTC
That method had occurred to me, but in a way I think the problem would be similar from a maintenance perspective. On the one hand, it doesn't use a `goto`, but on the other hand, it uses symbolic coderefs. (Update: Ack! They're not symbolic, I just didn't read it right! Thanks!!) If the future maintainer isn't a Perl guy (and in some cases, even if he is), he'll have a much better chance of cursing my name but still getting through it if there's something as hated and feared, but at least as well known, as a `goto`, than if I stick him with something where the only clue is `no strict 'refs'` and some good comments.

Still, for validation of the parameter, your solution is much easier and more reliable. There is a "guaranteed" set of steps in a "guaranteed" order, and you can't just jump to any old arbitrary place in the script. From that point of view, I may go with it instead.

thanks!
MM
Re3: Best method for failure recovery? by dragonchild (Archbishop) on Sep 19, 2001 at 22:55 UTC
Actually, the list of coderefs does not use symbolic references. Using symbolic references would be something like `&{"$recovery"};`. What I do is use hard references instead.

And, as always, you should comment any use of an advanced feature, such as coderefs, if you expect your code to be maintained by people who know less than you do. This is for every language, not just Perl.
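For anyone skimming the thread, the distinction can be sketched as follows. This is a minimal illustration, not code from any poster; the subs `first` and `second` are stand-ins, and the variable names are made up for the example.

```perl
use strict;
use warnings;

sub first  { return "first ran" }
sub second { return "second ran" }

# Hard references: \&name yields a code ref, and calling through it
# involves no lookup by name, so strict 'refs' has nothing to object to.
my @dispatch    = ( \&first, \&second );
my $hard_result = $dispatch[1]->();

# Symbolic reference: the sub is found at run time from a string holding
# its name. This is only allowed with strict 'refs' switched off.
my $symbolic_result = do {
    no strict 'refs';
    my $name = 'second';
    &{$name}();
};

print "$hard_result / $symbolic_result\n";
```

Both calls reach the same sub; the difference is that the symbolic form will try to call whatever name the string happens to contain, which is why an unvalidated command-line value is a poor thing to feed it.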
Re: Best method for failure recovery? by derby (Abbot) on Sep 19, 2001 at 22:49 UTC
MM,

Sounds like a good use for eval. You could even pull everything into one script:

$step_1 = 0;
$step_2 = 0;
$step_3 = 0;
$working = 1;

while( $working ) {
        eval {
                $step_1 || first();
                $step_2 || second();
                $step_3 || third();
                clean_up();
        };
        if( $@ ) {
                print STDERR $@;
        } else {
                $working = 0;
        }
}

update: Err, make that $step_1 ||= first() ...

-derby
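A slightly fuller sketch of the same eval idea, under a few assumptions of my own: completion is tracked in a hash (`%done`) instead of relying on each sub returning a true value, a hypothetical `$max_tries` limit keeps a permanent failure from looping forever, and the `second` step is rigged to fail once so the retry is visible. Note that this only retries within a single process; it does not help if the process itself is killed.

```perl
use strict;
use warnings;

my %done;          # step name => true once that step has succeeded
my $attempts = 0;  # demo only: lets the second step fail exactly once
my @log;

sub first  { push @log, 'first' }
sub second {
    die "died in second\n" if ++$attempts == 1;   # simulated one-off failure
    push @log, 'second';
}
sub third  { push @log, 'third' }

my @steps = ( [ first => \&first ], [ second => \&second ], [ third => \&third ] );
my $max_tries = 3;   # hypothetical limit, not part of the original suggestion

for my $try (1 .. $max_tries) {
    my $ok = eval {
        for my $step (@steps) {
            my ($name, $code) = @$step;
            next if $done{$name};     # finished on an earlier pass
            $code->();
            $done{$name} = 1;         # recorded only after the step returns
        }
        1;
    };
    last if $ok;
    print STDERR "try $try failed: $@";
}

print join(',', @log), "\n";   # prints "first,second,third"
```

The first pass runs `first`, dies in `second`, and the second pass skips `first` and finishes the remaining two steps.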
Re (tilly) 1: Best method for failure recovery? by tilly (Archbishop) on Sep 20, 2001 at 16:50 UTC
Well I would suggest working it like this.

Divide the large job into a series of more manageable tasks which have dependencies between them. Arrange to set up each task as an item that can be restarted at any point within the task without having damaged the ability of the task to go forward. (Basically this means writing each task such that it doesn't wipe out its initial data, and can clean up or overwrite the previous partial run.)

Then set up a control table with the open tasks. In that control table you mark tasks that need to run, mark them as being run, run them, then mark them as done.

Now your script can be re-run as many times as you want, and will skip work that was already done. In fact you can even have your script do as much work as feasible on each run, skipping any trouble spots, so that after a human sees it the bulk of the work got done despite any issues. Plus as a bonus if you do this carefully you may get out of it the ability to run your script simultaneously on several machines.

I attest from personal experience that while writing everything in this fashion can be a lot of work, even some small steps towards the control table and distinct transactions idea do a lot towards simplifying your overall program and making it capable of handling all sorts of complex failure modes robustly. (Something doesn't look right? Abort, send notification, then continue with other stuff it can do!)

I can also attest from personal experience that the various goto solutions offered remind me of some really bad systems that I have worked with. Sure, if you do everything just right, it might work. But it is inherently a fragile approach and leads to fragile code. Not what I want in a production system! (And no, I have not merely heard vague rumor that goto is a bad idea. Give me credit for having done more homework on the topic than that.)
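To make the control table idea concrete, here is a rough sketch under heavy assumptions: the "table" is just a tab-separated file in a temporary directory (a real system would more likely use a database table), the task names `extract`, `load` and `report` are invented, dependencies between tasks are ignored, and there is no locking, so it would not be safe to run on several machines as it stands.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical control file: one "task<TAB>status" line per task.
my $control_file = tempdir(CLEANUP => 1) . '/control.tsv';

sub load_table {
    my %status;
    if (open my $in, '<', $control_file) {
        while (my $line = <$in>) {
            chomp $line;
            my ($task, $state) = split /\t/, $line;
            $status{$task} = $state;
        }
        close $in;
    }
    return \%status;
}

sub save_table {
    my ($status) = @_;
    open my $out, '>', $control_file or die "Cannot write $control_file: $!";
    print {$out} "$_\t$status->{$_}\n" for sort keys %$status;
    close $out or die "Cannot close $control_file: $!";
}

# Run every task not yet marked done; a failure is recorded and skipped
# so that later tasks still get their chance.
sub run_tasks {
    my ($tasks) = @_;              # array ref of [ name, code ref ] pairs
    my $status = load_table();
    for my $task (@$tasks) {
        my ($name, $code) = @$task;
        next if defined $status->{$name} && $status->{$name} eq 'done';
        $status->{$name} = 'running';
        save_table($status);
        $status->{$name} = eval { $code->(); 1 } ? 'done' : 'failed';
        save_table($status);
    }
    return $status;
}

my %runs;                   # how many times each task body was entered
my $load_should_fail = 1;   # demo only: "load" fails on the first run
my @tasks = (
    [ extract => sub { $runs{extract}++ } ],
    [ load    => sub { $runs{load}++; die "load failed\n" if $load_should_fail } ],
    [ report  => sub { $runs{report}++ } ],
);

my $first_run = run_tasks(\@tasks);    # load fails, report still runs
$load_should_fail = 0;
my $second_run = run_tasks(\@tasks);   # only load is retried
print "$_: $second_run->{$_}\n" for sort keys %$second_run;
```

A task left in the `running` state by a crashed process is simply picked up again on the next run, which is why each task has to be written so that it can safely overwrite its own partial output.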
Re: Best method for failure recovery? by tommyw (Hermit) on Sep 20, 2001 at 12:50 UTC
What's wrong with:

my $recover = $opt_r;
first() unless $recover>1;
second() unless $recover>2;

etc. And then simply writing a numeric value to the recovery file. I'd hope this was readable by anybody who knows English, without needing to know perl at all.

Incidentally, with your setup, I think that you're not going to set $rec_file_name, if your recovery flag is set, and since you've declared it with my, it's not going to be visible within the subroutines anyway. Of course, you'd find this when writing production code: remember -w and strict are your friends :-)
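A sketch of how the numeric version might hang together, with the recovery file name assigned before any step can be skipped. The helper `read_recovery_file`, the `run_from` wrapper and the temporary directory are assumptions added for the example. One detail worth noting: in the original script, subs defined later in the same file can see a file-scoped `my` variable, so the practical problem is that `goto` jumps past the assignment and leaves it undefined.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assigned up front, so every sub below sees a defined value.
# The temporary directory only keeps the demo self-contained.
my $rec_file_name = tempdir(CLEANUP => 1) . '/start.rec';

sub write_recovery_file {
    my ($step) = @_;
    open my $fh, '>', $rec_file_name or die "Cannot write $rec_file_name: $!";
    print {$fh} "$step\n";
    close $fh or die "Cannot close $rec_file_name: $!";
}

sub read_recovery_file {
    open my $fh, '<', $rec_file_name or return 1;   # no file: start at step 1
    my $step = <$fh>;
    close $fh;
    # Anything other than a known step number falls back to a full run.
    return 1 unless defined $step && $step =~ /^([123])\s*$/;
    return $1;
}

my @ran;
sub first  { write_recovery_file(1); push @ran, 1 }
sub second { write_recovery_file(2); push @ran, 2 }
sub third  { write_recovery_file(3); push @ran, 3 }

sub run_from {
    my ($recover) = @_;
    first()  unless $recover > 1;
    second() unless $recover > 2;
    third()  unless $recover > 3;
    unlink $rec_file_name;   # clean finish: nothing left to resume
}

write_recovery_file(2);              # pretend an earlier run died in step 2
run_from(read_recovery_file());
print "ran steps: @ran\n";           # prints "ran steps: 2 3"
```

Whether an unreadable recovery file should trigger a full rerun or a hard stop is a judgment call; the fallback here is only one option.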
Re: Best method for failure recovery? by MZSanford (Curate) on Sep 20, 2001 at 13:12 UTC
This is a problem I have faced many times, and I ended up writing some really odd code to reuse. But, most of the time, I found `goto` to be a fine solution. There will be people who will down vote this because they "dislike" `goto`, though I would guess they have never used it, and have only read that it is bad. But `goto` does require a moment's thought. What I have done is something like the following (untested):

use Getopt::Long;

my $RECOV_STEP = 'FTP_FILE';
my $optre = &GetOptions("-restart=s" => \$RECOV_STEP);
if ($optre == 0) {
        print "Invalid Option Processing\n";
        &usage();
}

eval { goto $RECOV_STEP; };
if ($@) { die "Invalid Recovery Step '$RECOV_STEP'\n" };

FTP_FILE: {
        ### get data
        print "FTP Step Started\n";
};

LOAD_FILE: {
        ### load database with file
        print "Load Step Started\n";
};

CLEANUP: {
        ### archive data
        print "Cleanup Step Started\n";
};

sub usage {
        print "$0 -restart <STEP>\n";
}

I have worked with production operation groups for some time now, and have found that they seem to prefer named restart "steps" as opposed to numbers. <minirant>While this may be different where you are, unless you want to be the person up at night on the phone telling them the numbers, I suggest named steps and a good document.</minirant>

{my own worst enemy}
-- MZSanford
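Named steps and the dispatch list suggested earlier in the thread are not mutually exclusive. The sketch below keeps the step names operators would type but looks them up in an ordered list instead of jumping to a label; `run_from` and `%position` are names invented for this example, and the step bodies only log a message.

```perl
use strict;
use warnings;

my @log;

# Ordered list of [ name, code ref ] pairs; names mirror the labels above.
my @steps = (
    [ FTP_FILE  => sub { push @log, 'FTP Step Started' } ],
    [ LOAD_FILE => sub { push @log, 'Load Step Started' } ],
    [ CLEANUP   => sub { push @log, 'Cleanup Step Started' } ],
);

# Map each step name to its position so a restart name can be validated.
my %position;
$position{ $steps[$_][0] } = $_ for 0 .. $#steps;

sub run_from {
    my ($restart) = @_;
    $restart = $steps[0][0] unless defined $restart;   # default: first step
    my $name = uc $restart;
    die "Invalid Recovery Step '$restart'\n" unless exists $position{$name};
    $_->[1]->() for @steps[ $position{$name} .. $#steps ];
    return;
}

run_from('load_file');
print "$_\n" for @log;
```

An unknown name is rejected before anything runs, and adding a step means adding one entry to `@steps` in the right place.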