I am now using your code posted above (using open3 instead of exec/system, etc.):
use strict;
use warnings;
use Parallel::ForkManager;
use IPC::Open3 qw( open3 );
use POSIX qw( WNOHANG );
use constant TIMEOUT => 120;
my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
my ($pid, $exitCode, $ident);
my $currentTime;
my $forkMgr = Parallel::ForkManager->new(3);
$forkMgr->run_on_start(
    sub {
        ($pid, $ident) = @_;
        print "$currentTime Started ==> $ident\n";
    }
);
$forkMgr->run_on_finish(
    sub {
        ($pid, $exitCode, $ident) = @_;
        print "$currentTime Ended ==> $ident\n";
    }
);
while (1) {
    $currentTime = localtime();
    for my $runCommand (@runArray) {
        $forkMgr->start($runCommand) and next;
        my $pid = open3('<&STDIN', '>&STDOUT', '>&STDERR',
            "/usr/localcw/opt/patrol/nagios/libexec/$runCommand");
        wait_for_test_to_end($pid);
        $forkMgr->finish($? & 0x7F ? 0x80 | ($? & 0x7F) : $? >> 8);
    }
    $forkMgr->wait_all_children;
    sleep 10;
}
exit;
sub wait_for_test_to_end {
    my ($pid) = @_;
    my $abs_timeout = time() + TIMEOUT;
    while (1) {
        return if waitpid($pid, WNOHANG) > 0;
        last if time() > $abs_timeout;
        sleep(1);
    }
    kill(ALRM => $pid);
    $abs_timeout = time() + 15;
    while (1) {
        return if waitpid($pid, WNOHANG) > 0;
        last if time() > $abs_timeout;
        sleep(1);
    }
    kill(KILL => $pid);
    waitpid($pid, 0);
}
I still see the same behavior.
The looping test4.sh still hangs everything up.
Here's some trace output.
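The "Started"/"Ended" lines come from the callbacks, and the "I am running..." lines come from the test1-5.sh scripts. Note that the callback timestamps come from $currentTime, which is set only once at the top of each pass, so they do not show when a callback actually fired.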
Thu May 14 16:52:40 2015 Started ==> test1.sh
Thu May 14 16:52:40 2015 Started ==> test2.sh
Thu May 14 16:52:41 CDT 2015 I am running test1.sh
Thu May 14 16:52:41 CDT 2015 I am running test2.sh
Thu May 14 16:52:41 CDT 2015 I am running test3.sh
Thu May 14 16:52:40 2015 Started ==> test3.sh
Thu May 14 16:52:40 2015 Ended ==> test3.sh
Thu May 14 16:52:40 2015 Ended ==> test2.sh
Thu May 14 16:52:40 2015 Ended ==> test1.sh
Thu May 14 16:52:40 2015 Started ==> test4.sh
Thu May 14 16:52:43 CDT 2015 I am running test4.sh
Thu May 14 16:52:43 CDT 2015 I am running test5.sh
Thu May 14 16:52:53 CDT 2015 I am running test4.sh
Thu May 14 16:53:03 CDT 2015 I am running test4.sh
Thu May 14 16:53:13 CDT 2015 I am running test4.sh
Thu May 14 16:53:23 CDT 2015 I am running test4.sh
Thu May 14 16:53:33 CDT 2015 I am running test4.sh
Thu May 14 16:53:43 CDT 2015 I am running test4.sh
Thu May 14 16:53:53 CDT 2015 I am running test4.sh
Thu May 14 16:54:03 CDT 2015 I am running test4.sh
Thu May 14 16:54:13 CDT 2015 I am running test4.sh
Thu May 14 16:54:23 CDT 2015 I am running test4.sh
Thu May 14 16:54:33 CDT 2015 I am running test4.sh
Thu May 14 16:54:43 CDT 2015 I am running test4.sh
Thu May 14 16:52:40 2015 Started ==> test5.sh
Thu May 14 16:52:40 2015 Ended ==> test5.sh
Thu May 14 16:52:40 2015 Ended ==> test4.sh
Thu May 14 16:54:56 2015 Started ==> test1.sh
Thu May 14 16:54:56 2015 Started ==> test2.sh
Thu May 14 16:54:56 CDT 2015 I am running test1.sh
Thu May 14 16:54:56 CDT 2015 I am running test2.sh
Thu May 14 16:54:56 CDT 2015 I am running test3.sh
Thu May 14 16:54:56 2015 Started ==> test3.sh
Thu May 14 16:54:56 2015 Ended ==> test2.sh
Thu May 14 16:54:56 2015 Ended ==> test1.sh
Thu May 14 16:54:56 2015 Ended ==> test3.sh
Thu May 14 16:54:56 2015 Started ==> test4.sh
Thu May 14 16:54:58 CDT 2015 I am running test4.sh
Thu May 14 16:54:58 CDT 2015 I am running test5.sh
Thu May 14 16:55:08 CDT 2015 I am running test4.sh
Thu May 14 16:55:18 CDT 2015 I am running test4.sh
etc.
While the 2-minute timeout for test4.sh runs down, nothing else happens (not even the run_on_start/run_on_finish callbacks for test5.sh). Two of the three process slots I defined sit unused, I believe because Parallel::ForkManager is waiting for all the children to finish.
I recognize that one process will be tied up for the timeout value, but I need the other two to continue processing available work (test1-3.sh and test5.sh). I'll take care to ensure test4 doesn't run again while one is already running (using a hash of running jobs managed by the callbacks).
That is the crux of my problem.
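For the hash of running jobs mentioned above, this is roughly what I have in mind. It is only a sketch: %running, mark_started, mark_finished and runnable_jobs are placeholder names I made up, and the subs only mirror the argument order that the run_on_start and run_on_finish callbacks receive.

```perl
use strict;
use warnings;

# Hypothetical registry of jobs currently in flight, keyed by the
# identifier that is passed to $forkMgr->start($ident).
my %running;

# Same argument order as a run_on_start callback: ($pid, $ident).
sub mark_started {
    my ($pid, $ident) = @_;
    $running{$ident} = $pid;
}

# Same leading arguments as a run_on_finish callback:
# ($pid, $exit_code, $ident).
sub mark_finished {
    my ($pid, $exit_code, $ident) = @_;
    delete $running{$ident};
}

# Return only those jobs that are not already running.
sub runnable_jobs {
    my @jobs = @_;
    return grep { !exists $running{$_} } @jobs;
}
```

The idea would be to register these with $forkMgr->run_on_start(\&mark_started) and $forkMgr->run_on_finish(\&mark_finished), and then loop over runnable_jobs(@runArray) instead of @runArray. On its own this only keeps a second test4.sh from being queued; it does not stop the parent from blocking while it waits for the children.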