bennymack:
While sgifford offers a good solution, I thought I'd offer up one I use periodically. I often have a "watchdog" process running whose only job is to send EMails when it detects that something has gone awry. This has the added benefit of catching (some) scripts that hang in infinite loops.
My usual method is to have the simple watchdog just watch a directory. If it ever notices a file more than X seconds old, it EMails the file's contents to you. So any script that wants to take advantage of it simply puts a formatted message in the watched directory and "touches" it periodically so that it never gets too old. If the script dies, the watchdog sends the message once the file ages out. If the script hangs without touching the file, the watchdog will still send you the EMail. But you're still screwed if the script hangs in a loop that keeps touching the file.
I don't have the exact code in front of me at the moment, but it's something like this:
#!/usr/bin/perl -w
use warnings;
use strict;
###############
# CONFIG VARS #
###############
my $dir_to_watch = "/cygdrive/c/WATCHDOG";
my $timeout = 120; # Max file age (sec)
my $awhile = 30; # Checking interval (sec)
#--------------------------------------------------
# Sends contents of specified file to admin
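sub send_msg {
    my $fn = shift;
    print "You'd send $fn via EMail here...\n";
}

# Bail out if the directory is missing, rather than watching the wrong one
chdir $dir_to_watch or die "Can't chdir to $dir_to_watch: $!";
while (1) {
    for (`ls`) {
        chomp;
        my $mtime = (stat $_)[9];
        next unless defined $mtime;   # file vanished between ls and stat
        my $age = time - $mtime;
        send_msg($_) if $age > $timeout;
    }
    sleep $awhile;
}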
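(The code above is tested, except it doesn't send EMail. Plug appropriate code into send_msg, and have it delete or move the file once it's sent; otherwise the same stale file gets reported on every pass.)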
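One possible way to fill in send_msg, as a rough sketch only: it assumes the watched file already holds a complete message (headers plus body) and that a sendmail-compatible binary lives at /usr/sbin/sendmail, which may not be true on your box (Cygwin in particular). The @MAILER variable is my own invention for illustration. The file is deleted only after the mailer reports success, so a failed send gets retried on the next pass.

```perl
use strict;
use warnings;

# Assumption: a sendmail-compatible mailer at this path. "-t" tells it to
# take the recipients from the To:/Cc:/Bcc: headers of the message itself.
our @MAILER = ('/usr/sbin/sendmail', '-t');

# Pipe the contents of the watched file to the mailer, then delete the
# file so the watchdog loop doesn't report it again on the next pass.
# Returns 1 on success, 0 on any failure (the file is left in place).
sub send_msg {
    my $fn = shift;

    open my $in, '<', $fn
        or do { warn "Can't read $fn: $!"; return 0 };
    my $text = do { local $/; <$in> };
    close $in;
    $text = '' unless defined $text;

    # Don't let a mailer that exits early kill the watchdog with SIGPIPE
    local $SIG{PIPE} = 'IGNORE';

    open my $mail, '|-', @MAILER
        or do { warn "Can't run $MAILER[0]: $!"; return 0 };
    print {$mail} $text;
    close $mail
        or do { warn "Mailer failed for $fn (status $?)"; return 0 };

    unlink $fn or warn "Can't remove $fn: $!";
    return 1;
}
```

If you'd rather keep a record of what was sent, move the file to a directory outside the watched one instead of unlinking it. Renaming it in place won't help, since the watchdog would pick up the renamed file too.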
Then my programs (C++, etc.) and scripts use it by formatting an appropriate message to deliver on failure and updating the file periodically, like:
#!/usr/bin/perl -w
use warnings;
use strict;
###############
# CONFIG VARS #
###############
my $awhile = 45; # Checking interval (sec)
my $my_watched_file = "/cygdrive/c/WATCHDOG/foobar";
my $EMailText=
'To: roboticus@a.fake.domain.com
Subject: JobToMonitor.pl fault!
Yecch!
';
#--------------------------------------------------
# Let watchdog know we're alive...
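sub still_alive {
    # Appending updates the file's mtime, which is what the watchdog checks
    open my $of, '>>', $my_watched_file
        or do { warn "Can't append to $my_watched_file: $!"; return };
    print {$of} $EMailText;
    close $of;
}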
unlink $my_watched_file;
my $next_time = 0;
my $count = 0;
while ($count < 999999999999) {
    # put this in a part that's likely to be OK
    if (time > $next_time) {
        still_alive();
        $next_time = time + $awhile;
    }

    # smallish chunks of job that shouldn't take
    # too much time
    ++$count;

    # You can even use it for periodic logging!
    $EMailText = "$count reached at " . (localtime) . "\n"
        if $count % 1000000 == 0;
}
# Don't alert if we complete successfully
unlink $my_watched_file;
Yeah, it's admittedly contrived, but it's a handy thing for servers running lots of odd jobs.
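If you don't want the logging side effect, a plain "touch" is enough to keep the watchdog quiet. Here's a minimal sketch of that variation, assuming Perl 5.8 or later; the heartbeat name and its arguments are hypothetical, not part of the script above. It writes the message once and afterwards only bumps the file's timestamps, so the alert file doesn't grow.

```perl
use strict;
use warnings;

# Hypothetical helper: create the alert file containing $text if it doesn't
# exist yet; otherwise just set its timestamps to "now" (i.e. touch it).
sub heartbeat {
    my ($file, $text) = @_;

    if (!-e $file) {
        open my $fh, '>', $file or die "Can't create $file: $!";
        print {$fh} $text;
        close $fh or die "Can't close $file: $!";
        return;
    }

    # With undef for both times, utime sets atime and mtime to the current time
    utime undef, undef, $file or die "Can't touch $file: $!";
}
```

You'd call it as heartbeat($my_watched_file, $EMailText) wherever still_alive is called, and still unlink the file on a clean exit.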
--roboticus