Not getting the expected result when using eval/alarm

McDarren has asked for the wisdom of the Perl Monks concerning the following question:

Howdy,

I have a script that runs periodically via crond, which connects to a number of remote MySQL databases (around 250 hosts) to gather data. The MySQL connections are made via an SSH tunnel. It all works fine, and has been running for about two years now, except for one problem which I haven't been able to resolve. Occasionally, one or more of the remote hosts will be uncontactable (for whatever reason), and the script just hangs. To get around this, the logical thing to do seemed to be to wrap the initial connection in an eval block and set a timeout. So I have some code that looks like so:

HOST: foreach $host (@hostlist) {
    my $timeout = 30;
    #connect to the remote site
    Log(LOGFILE, "INFO: $0::Connecting to $host");

    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $timeout;
        $ibisSite=openRemoteSql($host,$mySocket,"$user","$pass");
        alarm 0;
    };

    if ($@) {
        Log(LOGFILE, "ERROR: $0:$host:Cant Connect to site:$@");
        next HOST;
    }

The above is not working as expected though - as the script still hangs whenever it cannot reach any of the remote hosts, and I have to manually kill off the sub-process to get things moving again.

openRemoteSql is a home-grown routine which is called from an external library and establishes the MySQL connection over the ssh tunnel. I'm ~~quite~~ fairly certain that there is nothing wrong with it, as it's used successfully in several other scripts.

Can anybody see what I'm doing wrong here?

Thanks,
Darren

Update 1: Based on the replies I have so far, and my own niggling suspicions, it seems that perhaps openRemoteSql is the culprit. I've had a closer look at it and it does fork. The problem is that it's inherited code that was written several years ago, and I'm loath to mess with it as it has dependencies all over the place. But *sigh*.. I may have no choice - unless somebody can suggest some other workaround?

Update 2: Here is a copy of the openRemoteSql routine, and the openRemote routine that it in turn calls...

#
#   Use openRemote to create a tunnel to the remote mysql server.
#   Then try to connect to it every 3 seconds for 15 seconds (I made these numbers
#   up ... they may need to increase for China ! )
#
sub openRemoteSql
{
    $hostname = $_[0] ;
    $port = $_[1] ;
    $user = $_[2] ;
    $pass = $_[3] ;

    #
    #   Look for a free "port" to use
    $cnt = 0 ;
    $tmp = -1 ;
    while( $tmp == -1 )
    {
        if( $cnt > 10 )
        {
            print "can't find a free port in $_[1] .. $port\n" ;
            return 0;
        }
        $port = $_[1] + $cnt ;
        #print $port , "\n" ;
        $tmp = openRemote( $hostname,$port,3306,$user,$pass ) ;
        $cnt ++;
    }

    #
    #   if openRemote returns '0' then the ssh failed, probably can't contact
    #   host or some other shyte
    #
    if( $tmp < 1 )
    {
        return $tmp ;
    }

    $cnt = 0 ;
    while( $cnt < 5 )
    {
        $db = DBI->connect("dbi:mysql:database=$remotedb;host=127.0.0.1;port=$port", "$user", "$pass");
        if( $db )
        {
            #
            #   Wow .. it all worked .....
            #
            return $db ;
        }
        $cnt ++ ;
        sleep( 3 ) ;
    }
    closeRemote($port) ;
    return 0 ;
}

and...

#
#   Open the SSH Connection to the site.... this provides us with a tunnel for
#   any other services !
#
#   Return
#       -1 Lock File Exists
#       0 == Failure
#       1 = allOk
#

sub openRemote
{
    use FileHandle;
    use IPC::Open2;

    $hostname = $_[0] ;
    $lport = $_[1] ;
    $rport = $_[2] ;
    $user = $_[3] ;
    $pass = $_[4] ;
    $timeout = $_[5] ;  # mod 001 add timeout to open ssh
    if( ! $timeout )
    {
        $timeout = 120 ;
    }

    if( ! testSite( $hostname ) )
    {
        return 0 ;
    }

    ##################################################
    #
    #   If the Lock file exists return -1
    #
    ##################################################
    $lockFile = LockFileName( $lport ) ;
    if ( -e $lockFile )
    {
        return -1 ;
    }

    $g_RemotePort = $lport  ;

    ##################################################
    #
    #   create the ssh gateway
    #
    ##################################################
    #
    #   Save current alarm handler
    #
    $saved = $SIG{ ALRM } ;
    $SIG{ALRM} = sub { die 'Open Remote : timeout' } ;
    eval{

        alarm( $timeout ) ;

        $s = " /usr/bin/ssh -T root@" . $hostname . "  -L ". $lport .":127.0.0.1:" . $rport . " -g  " ;
        $pid = open2( \*SSHRead , \*SSHWrite , $s ) || die "Can't open a ssh connection to " . $hostname ;
        $debug = 0;
        if( $debug == 1 ) {
            print $hostname . "Opened Connection returned pid = $pid \n" ;
        }


        #
        #   For some reason the constants are not defined ...
        #   open mode is create/write/exclusive
        #   if file exists .. then this will explode !
        #
        sysopen(WTMP, $lockFile, 0301)
            || die "Exclusive Access to $lockFile failed" ;

        print WTMP " kill -9 $pid\n" ;
        close WTMP ;

        SSHWrite->autoflush();
        #
        #   we execute the following commands so we can tell when the
        #   ssh has "REALLY" connected, and then we can check that we
        #   are on the host we are supposed to be on .. this is an unnecessary
        #   step but I don't mind 'cos there'd be real problems if
        #   something went wrong, such as an old process hanging around
        #
        print SSHWrite "echo XXXXX Started OK \n" ;
        print SSHWrite "echo \$HOSTNAME\n" ;

        SSHWrite->autoflush();

        #
        #   read what is being returned from the "remote end"
        #   timeout processing means that this will die if the
        #   above echos are not returned "pretty quick"
        #
        while( 1 )
        {
            $s = <SSHRead> ;

            if( ! defined($s))
            {
                print "SSH Failure\n";
                return 0 ;
            }

            if( $s =~ /bind:/ )
            {
                print "Failure $s\n" ;
                $s = <SSHRead> ;
                print "$s\n" ;
                return 0 ;
            }

            if( $s =~ /XXXXX Started OK/ )
            {
                $s = <SSHRead> ;
                return 1 ;
            }

        }
    } ;

    #
    #   if we timed out then set the pid to 0, code later on then handles this
    #
    if ($@) {
            if (  $@ =~ /timeout/ ){
                print "FATAL ERROR : Timeout when ssh-ing to $hostname\n" ;
                $pid = 0 ;
            }
            $globalErrorMessage = $@ ;
    }

    #
    #   restore alarm handler
    #
    alarm(0) ;
    if( $saved )
    {
        $SIG{ ALRM } = $saved ;
    }

    #
    #   if the connection failed ... then return 0 --> Failure
    #
    if( ! $pid )
    {
        closeRemote( $g_RemotePort ) ;
        return 0 ;
    }

    return 1 ;

}


Replies are listed 'Best First'.
Re: Not getting the expected result when using eval/alarm by revdiablo (Prior) on Jan 08, 2006 at 06:30 UTC
I modified your example so it would work standalone (since you didn't provide your openRemoteSql routine, or example values for @hostlist, I improvised):

use strict;
use warnings;

my @hostlist = (1 .. 10);

HOST: foreach my $host (@hostlist) {
    my $timeout = 2;

    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $timeout;
        openRemoteSql();
        alarm 0;
    };

    if ($@) {
        print "Timed out\n";
        next HOST;
    }
    else {
        print "Didn't time out\n";
    }
}

sub openRemoteSql {
    if (int rand 2) {
        print "Blocking\n";
        <STDIN>;
    }
}

I have it randomly block -- simulating a long-running process -- and in those cases, the timeout appears to work as expected. This leads me to believe the code you pasted is fine, and your problem might lie elsewhere. If nothing else, this might help you narrow down the problem further.

Update: added strict and warnings, modified code to pass
Re: Not getting the expected result when using eval/alarm by GrandFather (Saint) on Jan 08, 2006 at 07:32 UTC
The following code:

use strict;
use warnings;

my $timeout = 2;

eval {
    local $SIG{ALRM} = sub { die "alarm\n" };
    alarm $timeout;
    here: goto here;
    printf "Got past the goto!\n";
    alarm 0;
};

print "ERROR: $@" if $@;

Prints:

ERROR: alarm

which indicates that a simple non-terminated loop is not the problem. Methinks you need to look closer at what `openRemoteSql` is doing.

DWIM is Perl's answer to Gödel
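One thing that may be worth checking in the posted openRemote routine: a process has only one alarm timer, so a nested alarm() call replaces the caller's pending alarm, and a later alarm(0) cancels it entirely. openRemote arms its own alarm (120 seconds by default) and then calls alarm(0), which would leave the DBI->connect retry loop in openRemoteSql running with no timer at all. The sketch below only illustrates that interaction; inner_with_own_alarm is a hypothetical stand-in for openRemote, not the real routine.

```perl
use strict;
use warnings;

# Hypothetical stand-in for openRemote: it arms its own, longer alarm and
# cancels it afterwards, as the posted routine appears to do.
sub inner_with_own_alarm {
    local $SIG{ALRM} = sub { die "inner timeout\n" };
    eval {
        alarm 120;    # replaces the caller's pending 30-second alarm
        # ... work that happens to finish quickly ...
    };
    alarm 0;          # cancels the only timer the process has
    return 1;
}

my $remaining;
eval {
    local $SIG{ALRM} = sub { die "alarm\n" };
    alarm 30;
    inner_with_own_alarm();
    # alarm() returns the seconds left on the previously pending timer,
    # or 0 if no timer was pending.
    $remaining = alarm 0;
};

print "Seconds left on the outer alarm: $remaining\n";
```

If the outer timer were still pending, the reported value would be close to 30; a value of 0 means anything that blocks after the inner routine returns is no longer covered by the outer timeout.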
Re: Not getting the expected result when using eval/alarm by GrandFather (Saint) on Jan 08, 2006 at 11:02 UTC
I'd be inclined to sprinkle `here: goto here;` type code through openRemote where you do SSH stuff to see if you can reproduce the error. I'd guess though that it's a problem with the SSH stuff. Read the IPC::Open2 docs, especially from the para starting "open2() does not wait for and reap the child process ...". Following paras warn of deadlock situations that may or may not apply in your case.

The next debugging step is probably to emit a log line after each SSHWrite/SSHRead operation and see where things are hanging up that way. You might change:

print SSHWrite "echo XXXXX Started OK \n" ;
print SSHWrite "echo \$HOSTNAME\n" ;
SSHWrite->autoflush();

to:

print SSHWrite "echo XXXXX Started OK \necho \$HOSTNAME\n" ;
SSHWrite->autoflush();

or:

print SSHWrite "echo XXXXX Started OK \n" ;
SSHWrite->autoflush();
print SSHWrite "echo \$HOSTNAME\n" ;
SSHWrite->autoflush();

DWIM is Perl's answer to Gödel
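As a rough illustration of the logging and reaping points, here is a minimal sketch that logs after each write and read on an open2 pipe and then reaps the child with waitpid. It assumes a Unix-like system and uses `cat` as a stand-in for the ssh command, so the "remote end" simply echoes back whatever is written to it; the log messages are made up for the example.

```perl
use strict;
use warnings;
use IPC::Open2;
use IO::Handle;

# 'cat' is only a stand-in for the ssh command; it echoes its input back.
my $pid = open2(my $from_child, my $to_child, 'cat');

# Turn on autoflush before writing, so each print is sent immediately.
$to_child->autoflush(1);

print {$to_child} "XXXXX Started OK\n";
warn "LOG: wrote marker to child $pid\n";

my $line = <$from_child>;
warn "LOG: read from child: $line";

# Closing the write side lets the child see EOF and exit.
close $to_child;
close $from_child;

# open2() does not reap the child, so do it explicitly.
waitpid($pid, 0);
my $exit_status = $? >> 8;
warn "LOG: child $pid exited with status $exit_status\n";
```

With a log line after every step, the last message written before a hang shows which read or write never returned.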
Re: Not getting the expected result when using eval/alarm by JamesNC (Chaplain) on Jan 08, 2006 at 14:33 UTC
try doing your dbi connect stuff like this:

eval {
    $db = DBI->connect(
        "dbi:mysql:database=$remotedb;host=127.0.0.1;port=$port",
        "$user", "$pass",
        { RaiseError => 1, PrintError => 0 }
    );
};
if ($@) {
    # handle db error (i.e., host not available .. blah blah)
}
else {
    return $db;
}

Notice I added {RaiseError=>1, PrintError=>0} to your dbi call and then eval the call to DBI.

JamesNC
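If the hang turns out to be inside DBI->connect itself, an eval with RaiseError may not be enough on its own. Since Perl 5.8, signals are deferred until the interpreter reaches a safe point, so a plain $SIG{ALRM} handler may not run while a C library call is blocked. The DBI documentation describes installing the handler through POSIX::sigaction instead. Below is a minimal sketch of that pattern; blocking_connect is a hypothetical stand-in (a plain sleep) for the real DBI->connect call, and the timeout values are arbitrary.

```perl
use strict;
use warnings;
use POSIX qw(:signal_h);

# Hypothetical stand-in for a DBI->connect call that blocks on a dead host.
sub blocking_connect {
    sleep 10;
    return 'handle';
}

sub connect_with_timeout {
    my ($seconds) = @_;

    # Install the handler via sigaction so it is delivered immediately
    # rather than deferred by Perl's safe-signal mechanism.
    my $mask      = POSIX::SigSet->new(SIGALRM);
    my $action    = POSIX::SigAction->new(sub { die "connect timeout\n" }, $mask);
    my $oldaction = POSIX::SigAction->new();
    sigaction(SIGALRM, $action, $oldaction);

    my $handle;
    my $ok = eval {
        eval {
            alarm $seconds;
            $handle = blocking_connect();
        };
        alarm 0;          # cancel the alarm whether or not the call died
        die $@ if $@;     # re-throw the timeout or connect error
        1;
    };
    my $error = $ok ? '' : $@;

    sigaction(SIGALRM, $oldaction);    # restore the previous handler
    return ($handle, $error);
}

my $start = time;
my ($handle, $error) = connect_with_timeout(1);
my $elapsed = time - $start;
print defined $handle ? "Connected\n" : "Failed: $error";
```

The nested eval is there so that alarm 0 always runs before the error is examined, which avoids a stray alarm firing after the block has been left.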
Re: Not getting the expected result when using eval/alarm by jesuashok (Curate) on Jan 08, 2006 at 06:33 UTC
Hi,

If you are sure that no fork happens in openRemoteSql, then the code as posted should not have any problem. You still need to consider which operating system you are using, though; since you mentioned crond, I assumed it is Linux.

To be on the safe side, you could call openRemoteSql from a child process and check the child's PID status when the alarm time expires. That should help solve your problem. Sometimes the alarm status is only returned by the time the process has already lost its PID, and that may cause problems for you.

"Keep pouring your ideas"
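The child-process idea could be sketched roughly as follows. This is only an outline under some assumptions: run_with_timeout is a hypothetical helper, the code reference stands in for whatever work is done for one host, and since a database handle cannot be handed back from the child to the parent, the child would have to do the whole job for that host itself.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code in a child process and kill the child
# if it has not finished within $seconds.
sub run_with_timeout {
    my ($seconds, $code) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: do the work and report success or failure via exit status.
        my $ok = $code->();
        exit($ok ? 0 : 1);
    }

    # Parent: wait for the child, but give up when the alarm fires.
    my $timed_out = 0;
    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $seconds;
        waitpid($pid, 0);
        alarm 0;
        1;
    } or do {
        alarm 0;
        die $@ unless $@ eq "alarm\n";    # propagate unexpected errors
        $timed_out = 1;
        kill 'KILL', $pid;                # stop the stuck child
        waitpid($pid, 0);                 # and reap it
    };

    return 'timeout' if $timed_out;
    return $? == 0 ? 'ok' : 'failed';
}

my $slow = run_with_timeout(1, sub { sleep 10; 1 });
my $fast = run_with_timeout(5, sub { 1 });
print "slow job: $slow, fast job: $fast\n";
```

Because the parent only ever blocks in waitpid, a child that is stuck inside ssh or the database driver cannot stall the main loop; the cost is that any results have to be passed back some other way, for example through a file or a pipe.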

