Howdy,
I have a script that runs periodically via crond and connects to a number of remote MySQL databases (around 250 hosts) to gather data. The MySQL connections are made via an ssh tunnel. It all works fine, and has been running for about two years now, except for one problem which I haven't been able to resolve. Occasionally, one or more of the remote hosts will be unreachable (for whatever reason), and the script just hangs. To get around this, the logical thing to do seemed to be to wrap the initial connection in an eval block and set a timeout. So I have some code that looks like this:
HOST: foreach $host (@hostlist) {
my $timeout = 30;
#connect to the remote site
Log(LOGFILE, "INFO: $0::Connecting to $host");
eval {
local $SIG{ALRM} = sub { die "alarm\n" };
alarm $timeout;
$ibisSite=openRemoteSql($host,$mySocket,"$user","$pass");
alarm 0;
};
if ($@) {
Log(LOGFILE, "ERROR: $0:$host:Cant Connect to site:$@");
next HOST;
}
The above is not working as expected, though: the script still hangs whenever it cannot reach one of the remote hosts, and I have to manually kill the sub-process to get things moving again.
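One thing worth ruling out with this pattern: a process has only one alarm timer, so any call to alarm inside the wrapped routine replaces (or cancels) the one set by the caller. The sketch below shows the effect in isolation. inner_routine is a hypothetical stand-in for a library call that manages its own alarm; it is not openRemoteSql itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a library routine that manages its own alarm.
sub inner_routine {
    alarm 120;   # replaces whatever timer the caller had set
    alarm 0;     # cancels the timer entirely, including the caller's
    return 1;
}

# Returns 1 if the caller's alarm fired, 0 if it was lost.
sub outer_alarm_survives {
    my $fired = 0;
    eval {
        local $SIG{ALRM} = sub { $fired = 1; die "alarm\n" };
        alarm 1;
        inner_routine();
        # Simulate a hang that happens after the inner routine returns.
        select(undef, undef, undef, 2);
        alarm 0;
    };
    return $fired;
}

print outer_alarm_survives() ? "outer alarm fired\n"
                             : "outer alarm was cancelled\n";
```

If the wrapped routine behaves like inner_routine, the 1-second timer never fires and anything that blocks afterwards runs with no timeout at all.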
openRemoteSql is a home-grown routine which is called from an external library and establishes the MySQL connection over the ssh tunnel. I'm fairly certain that there is nothing wrong with it, as it's used successfully in several other scripts.
Can anybody see what I'm doing wrong here?
Thanks,
Darren
Update 1: Based on the replies I have so far, and my own niggling suspicions, it seems that perhaps openRemoteSql is the culprit. I've had a closer look at it and it does fork. The problem is that it's inherited code that was written several years ago, and I'm loath to mess with it as it has dependencies all over the place. But *sigh*.. I may have no choice - unless somebody can suggest some other workaround?
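One possible workaround that leaves the inherited routine alone is to do the per-host work in a child process and let the parent enforce the deadline, so nothing the child does to its own alarm can cancel the parent's timer. This is only a sketch under that assumption: run_with_timeout is a hypothetical helper, and the child is put in its own process group so that anything it spawns (such as ssh) can be killed along with it.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code in a child process and give up after
# $timeout seconds. Returns 'ok', 'failed' or 'timeout'.
sub run_with_timeout {
    my ($timeout, $code) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: own process group, so spawned helpers die with it.
        setpgrp(0, 0);
        my $ok = eval { $code->() };
        POSIX::_exit($ok ? 0 : 1);
    }

    # Parent: the only alarm in this process is the one set here.
    my $status = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        waitpid($pid, 0);
        alarm 0;
        $? == 0 ? 'ok' : 'failed';
    };
    return $status if defined $status;

    alarm 0;
    die $@ unless $@ eq "timeout\n";
    kill '-KILL', $pid;   # negative signal: kill the child's process group
    kill 'KILL',  $pid;   # and the child itself, in case setpgrp had not run
    waitpid($pid, 0);     # reap it so no zombie is left behind
    return 'timeout';
}
```

In the host loop this would mean moving everything done for a host (connecting and querying) into the code reference, since a database handle opened in the child cannot be handed back to the parent.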
Update 2: Here is a copy of the openRemoteSql routine, and the openRemote routine that it in turn calls...
#
# Use openRemote to create a tunnel to the remote mysql server.
# Then try to connect to it every 3 seconds for 15 seconds (I made these numbers
# up ... they may need to increase for China ! )
#
sub openRemoteSql
{
$hostname = $_[0] ;
$port = $_[1] ;
$user = $_[2] ;
$pass = $_[3] ;
#
# Look for a free "port" to use
$cnt = 0 ;
$tmp = -1 ;
while( $tmp == -1 )
{
if( $cnt > 10 )
{
print "can't find a free port in $_[1] .. $port\n" ;
return 0;
}
$port = $_[1] + $cnt ;
#print $port , "\n" ;
$tmp = openRemote( $hostname,$port,3306,$user,$pass ) ;
$cnt ++;
}
#
# if openRemote returns '0' then the ssh failed, probably can't contact
# host or some other shyte
#
if( $tmp < 1 )
{
return $tmp ;
}
$cnt = 0 ;
while( $cnt < 5 )
{
$db = DBI->connect("dbi:mysql:database=$remotedb;host=127.0.0.1;port=$port", "$user", "$pass");
if( $db )
{
#
# Wow .. it all worked .....
#
return $db ;
}
$cnt ++ ;
sleep( 3 ) ;
}
closeRemote($port) ;
return 0 ;
}
and...
#
# Open the SSH Connection to the site.... this provides us with a tunnel for
# any other services !
#
# Return
# -1 Lock File Exists
# 0 == Failure
# 1 = allOk
#
sub openRemote
{
use FileHandle;
use IPC::Open2;
$hostname = $_[0] ;
$lport = $_[1] ;
$rport = $_[2] ;
$user = $_[3] ;
$pass = $_[4] ;
$timeout = $_[5] ; # mod 001 add timeout to open ssh
if( ! $timeout )
{
$timeout = 120 ;
}
if( ! testSite( $hostname ) )
{
return 0 ;
}
##################################################
#
# If the Lock file exists return -1
#
##################################################
$lockFile = LockFileName( $lport ) ;
if ( -e $lockFile )
{
return -1 ;
}
$g_RemotePort = $lport ;
##################################################
#
# create the ssh gateway
#
##################################################
#
# Save current alarm handler
#
$saved = $SIG{ ALRM } ;
$SIG{ALRM} = sub { die 'Open Remote : timeout' } ;
eval{
alarm( $timeout ) ;
$s = " /usr/bin/ssh -T root@" . $hostname . " -L ". $lport .":127.0.0.1:" . $rport . " -g " ;
$pid = open2( \*SSHRead , \*SSHWrite , $s ) || die "Can't open a ssh connection to " . $hostname ;
$debug = 0;
if( $debug == 1 ) {
print $hostname . "Opened Connection returned pid = $pid \n" ;
}
#
# For some reason the constants are not defined ...
# open mode is create/write/exclusive
# if file exists .. then this will explode !
#
sysopen(WTMP, $lockFile, 0301)
|| die "Exclusive Access to $lockFile failed" ;
print WTMP " kill -9 $pid\n" ;
close WTMP ;
SSHWrite->autoflush();
#
# we execute the following commands so we can tell when the
# ssh has "REALLY" connected, and then we can check that we
# are on the host we are supposed to be on .. this is an unnecessary
# step but I don't mind 'cos there'd be real problems if
# something went wrong, such as an old process hanging around
#
print SSHWrite "echo XXXXX Started OK \n" ;
print SSHWrite "echo \$HOSTNAME\n" ;
SSHWrite->autoflush();
#
# read what is being returned from the "remote end"
# timeout processing means that this will die if the
# above echos are not returned "pretty quick"
#
while( 1 )
{
$s = <SSHRead> ;
if( ! defined($s))
{
print "SSH Failure\n";
return 0 ;
}
if( $s =~ /bind:/ )
{
print "Failure $s\n" ;
$s = <SSHRead> ;
print "$s\n" ;
return 0 ;
}
if( $s =~ /XXXXX Started OK/ )
{
$s = <SSHRead> ;
return 1 ;
}
}
} ;
#
# if we timed out then set the pid to 0, code later on then handles this
#
if ($@) {
if ( $@ =~ /timeout/ ){
print "FATAL ERROR : Timeout when ssh-ing to $hostname\n" ;
$pid = 0 ;
}
$globalErrorMessage = $@ ;
}
#
# restore alarm handler
#
alarm(0) ;
if( $saved )
{
$SIG{ ALRM } = $saved ;
}
#
# if the connection failed ... then return 0 --> Failure
#
if( ! $pid )
{
closeRemote( $g_RemotePort ) ;
return 0 ;
}
return 1 ;
}
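A detail in openRemote that may matter here: in Perl, a return inside an eval { } block leaves only the eval, not the enclosing sub. So the return 0 / return 1 statements in the read loop drop out of the eval and execution continues with the code after it. The miniature below illustrates the behaviour; both subs are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# 'return' inside eval { } exits the eval block only, so the sub
# carries on with whatever follows the block.
sub check_inside_eval {
    my ($line) = @_;
    eval {
        return 0 if !defined $line;   # leaves the eval, not the sub
        return 1;
    };
    return 'fell through';
}

# Capturing the eval's value makes the intended result explicit.
sub check_with_capture {
    my ($line) = @_;
    my $ok = eval {
        return 0 if !defined $line;
        return 1;
    };
    return $ok;
}

print check_inside_eval(undef), "\n";    # fell through
print check_with_capture(undef), "\n";   # 0
```

Applied to openRemote, this suggests that an "SSH Failure" or "bind:" result still reaches the final return 1 as long as $pid is set, which might explain why the caller then goes on to attempt the database connection.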