http://qs321.pair.com?node_id=431043

K_M_McMahon has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks (some of whom are esteemed),

I came across a problem at work with some scripts that were written (and are supposed to be maintained) by our software group. This post may be long-winded, so <readmore> tags will be used; please bear with me.

Background

I work in satellite operations (nothing cool like satellite TV, just NASA). One of the satellites I work on is not manned 24/7. We have software that is designed to manage the satellite when we are not staffed and to notify us whenever something is wrong, either with the spacecraft or the ground system. The spacecraft was launched in 1995 and built well before that, so much of the hardware that we have is outdated.

To make a long story short, we have 2 strings of independent computers (HP-UX systems) that are basically duplicates of each other. We also have 1 machine, not part of either string, that acts as a Failover Monitor: it keeps an eye on the other machines and, if a problem is detected, moves operations from the prime string to the backup string.

Problem

On Saturday night, a hard drive that was mounted on string 1 failed. This had the effect of locking all the machines on which it was mounted. They were still up and appeared to be okay, but processes on them were frozen. This is exactly the type of thing that the Failover Monitor was designed to detect and correct by forcing a failover to string 2. Instead, the Failover Monitor neither notified the Flight Operations Team (FOT) that anything was wrong nor took any action.


Luckily for me, most of the code is written in Perl, so I could look through it and figure out what happened. I have determined the cause of the problem, and I have a few ideas about how to go about fixing it, but there are several restrictions.

1) The code is controlled by the software group, so I cannot *directly* modify it myself. Fortunately, I work closely with the developer, so I do get to write some of the code, or at the least tell her how I want it done.

2) As stated above, this is an outdated system. We are currently running Perl Version 5.004_01. I have attempted in the past to get them to upgrade but to no avail.

3) Pure-Perl modules that do not need to be installed I can add by putting them in my own directory and referencing them. Non-pure-Perl modules that are not included in the standard 5.004_01 release will not be available for use, and the system admins will not install them for me.

The offending code is one of the following two commands. Either would create the same problem; I just don't know exactly where in the code the script was when it froze.
chomp($pse=`remsh $status[0] ps -ef | grep 'pse' | grep -v grep | wc -l`);
or
`rcp $status[0]:alive.log alive.log.p`;

One of these two commands was issued while string 1 was locked. The remsh or rcp connection was opened (or did not fail) but the process never completed. It sat at this position holding the calling script hostage, hence it was never able to notify the FOT that anything was wrong.

I have come up with a few ideas about how to get around this problem:
1) Eliminate the system calls and use Net::FTP and Net::Telnet where I can set a timeout period so that if the command does not complete in X time, the failover monitor will realize something is wrong and can contact the FOT.

2) Eliminate the system calls and use a socket connection. I am not familiar with sockets, so a pointer to a good tutorial would be helpful (I can't find one on here).

3) If the developers insist on keeping their system calls, I can at least get them to modify the code so this problem does not recur. A sloppy, inelegant example that still has problems:

# Start the remote shell in the background and get the PID of the process.
# Pipe the output into a file. "echo \$!" prints the PID of the background job.
chomp($pse_id=`remsh $status[0] ps -ef | grep 'pse' | grep -v grep | wc -l >pse_status & echo \$!`);
my $not_done=1;
my $error_count=0;
while ($not_done==1) {
    my $running=`ps -ef | grep '$pse_id' | grep -v grep | wc -l`;
    if ($running>0) {
        # process is still running: increment the error counter, sleep, then try again
        if ($error_count>10) {
            &notify_FOT;
            last;
        }
        else {
            $error_count++;
        }
        sleep(5);
    }
    else {
        open(TEMP,"<pse_status") or &notify_FOT;
        my @temp=<TEMP>;
        close(TEMP);
        chomp($pse=$temp[0]);   # chomp returns a count, so assign first
        last; # or $not_done=0;
    }
}

4) Some other method that I am not thinking of.
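
One candidate for that "other method", offered only as a rough sketch: keep the external command but wrap it in an alarm()/eval timeout, which should be available on 5.004 without extra modules. The helper name run_with_timeout is hypothetical and not part of the existing monitor. The sketch also assumes that the local remsh or rcp process can still be killed while the remote side is frozen, which has not been verified on this system.

```perl
use strict;

# Hypothetical helper (not part of the existing monitor): run a command,
# but give up after $timeout seconds.
# Returns (1, $output) on completion, (0, undef) on timeout or start failure.
sub run_with_timeout {
    my ($cmd, $timeout) = @_;
    my ($pid, $output);
    local *CMD;    # old-style filehandle, so this works on 5.004

    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        $pid = open(CMD, "$cmd |") or die "cannot start: $!\n";
        $output = join('', <CMD>);    # blocks here if the command hangs
        alarm(0);
    };
    my $err = $@;
    alarm(0);      # make sure no alarm is left pending

    if ($err) {
        if ($pid) {
            kill('TERM', $pid);   # stop the child so close() does not wait forever
            close(CMD);
        }
        return (0, undef);
    }
    close(CMD);
    return (1, $output);
}

# Harmless demonstration command; the monitor would pass its remsh line instead.
my ($ok, $out) = run_with_timeout("echo alive", 5);
print $ok ? "got: $out" : "timed out\n";
```

In the Failover Monitor, a false first return value would be the cue to call &notify_FOT. One caveat: when the command string contains pipes, Perl runs it through a shell, so the PID that gets killed is the shell's and the remsh underneath may linger. Running the grep and wc steps in Perl on the captured output would avoid that.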

Questions:

If you read this far, thanx!
1) Which method do you think is the best for preventing this sort of problem?

2) Have you found any simple problems in someone else's code that caused BIG problems where you work?


-Kevin
my $a='62696c6c77667269656e6440676d61696c2e636f6d'; while ($a=~m/(^.{2})/s) {print unpack('A',pack('H*',"$1"));$a=~s/^.{2}//s;}