PerlMonks  

Parallel::ForkManager and wait_all_children

by rgren925 (Beadle)
on May 13, 2015 at 00:03 UTC ( [id://1126480]=perlquestion )

rgren925 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I have spent days searching this site and elsewhere trying to resolve this issue.

I have a script that uses Parallel::ForkManager to fork off 3 concurrent processes. It loops through an array of 5 commands, and the behavior I am looking for is to have three running all the time. The problem is that if one of them hangs or is stuck in a loop, processing stops because wait_all_children is never satisfied.

What I am looking for is to let the looping process do its thing (I would add code to report that this is happening and, potentially, to kill it automatically) while the other two available forks continue processing. So wait_all_children doesn't cut it.

In my test script, below, I am forking 5 different scripts. Scripts test[1235].sh just echo "I am running testx.sh", while test4.sh echoes the same in an endless loop with a sleep 10 in it. Everything stops, leaving just test4.sh running indefinitely, and since wait_all_children isn't satisfied, the others are never forked off again.

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
    my ($pid, $exitCode, $ident);
    my $forkMgr = Parallel::ForkManager->new(3);

    $forkMgr->run_on_start( sub {
        ($pid, $ident) = @_;
        print "Started ==> $ident\n";
    } );
    $forkMgr->run_on_finish( sub {
        ($pid, $exitCode, $ident) = @_;
        print "Ended ==> $ident\n";
    } );

    while (1) {
        for my $runCommand (@runArray) {
            $forkMgr->start($runCommand) and next;
            system("/usr/localcw/opt/patrol/nagios/libexec/$runCommand");
            $forkMgr->finish;
        }
        $forkMgr->wait_all_children;
        sleep 10;
    }
    exit;

Even commenting out wait_all_children doesn't help. This is a sample from a critical application I am working on and I am at wit's end. Thanks for your consideration, Rick

Replies are listed 'Best First'.
Re: Parallel::ForkManager and wait_all_children
by ikegami (Patriarch) on May 13, 2015 at 02:24 UTC

    fork+system+exit is wasteful. It creates a process whose entire purpose is to launch another process and wait for it to finish. Let's start by simplifying to fork+exec.

        while (1) {
            for my $runCommand (@runArray) {
                $forkMgr->start($runCommand) and next;
                exec("/usr/localcw/opt/patrol/nagios/libexec/$runCommand")
                    or die("exec: $!");
            }
            $forkMgr->wait_all_children;
            sleep 10;
        }

    Now, on to your problem. Your proposed solution of letting hung workers stay hung doesn't make much sense. You'll eventually end up with three hung workers. You gotta kill them if they become hung. Unless you have a better condition based on knowledge of the test programs, a process can be considered hung if it has been running more than some configured amount of time.

    Since we can execute code in the process that will execute the test program, all we need to do is call alarm from that process.

        use constant TIMEOUT => 60;

        while (1) {
            for my $runCommand (@runArray) {
                $forkMgr->start($runCommand) and next;
                alarm(TIMEOUT);
                exec("/usr/localcw/opt/patrol/nagios/libexec/$runCommand")
                    or die("exec: $!");
            }
            $forkMgr->wait_all_children;
            sleep 10;
        }

      How's that supposed to work? If you exec another program, the running perl is terminated so can't send the alarm. Don't you have to use system() and kill the child process if it times out?

        How's that supposed to work? If you exec another program, the running perl is terminated so can't send the alarm.

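        The alarm belongs to the process, and exec() replaces the program running in that process without creating a new one. So I think that a SIGALRM is delivered to the program started via exec(). Unless that program changes its signal handler for SIGALRM, the signal will kill the process.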

        Let's test that:

            #!/usr/bin/perl
            use strict;
            use warnings;

            sub helper {
                # forked process, wastes 10 seconds
                for (1..10) {
                    print "helper: start of second $_\n";
                    select(undef,undef,undef,1); # poor man's sleep, without messing with alarm
                    print "helper: end of second $_\n";
                }
            }

            sub main {
                # main process
                print "Helper will die in 5 seconds\n";
                alarm(5); # kill me in five seconds ...
                exec($^X,$0,"dummy argument") # start perl with this script and a parameter
                    or die "Could not start helper: $!";
            }

            if (@ARGV) {
                helper();
            } else {
                main();
            }

        Output:

            >perl alarmtest.pl
            Helper will die in 5 seconds
            helper: start of second 1
            helper: end of second 1
            helper: start of second 2
            helper: end of second 2
            helper: start of second 3
            helper: end of second 3
            helper: start of second 4
            helper: end of second 4
            helper: start of second 5
            Alarm clock
            >

        Just for fun, let's add a signal handler for SIGALRM in the helper process:

            sub helper {
                $SIG{'ALRM'}=sub { print "I am immortal, you fool!\n" };
                # forked process, wastes 10 seconds
                for (1..10) {
                    print "helper: start of second $_\n";
                    select(undef,undef,undef,1); # poor man's sleep, without messing with alarm
                    print "helper: end of second $_\n";
                }
            }

        Output:

            >perl alarmtest.pl
            Helper will die in 5 seconds
            helper: start of second 1
            helper: end of second 1
            helper: start of second 2
            helper: end of second 2
            helper: start of second 3
            helper: end of second 3
            helper: start of second 4
            helper: end of second 4
            helper: start of second 5
            I am immortal, you fool!
            helper: end of second 5
            helper: start of second 6
            helper: end of second 6
            helper: start of second 7
            helper: end of second 7
            helper: start of second 8
            helper: end of second 8
            helper: start of second 9
            helper: end of second 9
            helper: start of second 10
            helper: end of second 10
            >

        Alexander

        Updates:

        1. changed links from [man://...] (FreeBSD) to http://linux.die.net/... (Linux)
        2. added second example with non-default signal handler

        If you exec another program, the running perl is terminated so can't send the alarm.

        alarm causes the system to send SIGALRM to the current process.

        Don't you have to use system() and kill the child process if it times out?

        You can't kill a child if you're waiting for it to exit using system, so you'd have to replace system.

            use IPC::Open3 qw( open3 );
            use POSIX qw( WNOHANG );

            use constant TIMEOUT => 60;

            sub wait_for_test_to_end {
                my ($pid) = @_;

                my $abs_timeout = time() + TIMEOUT;
                while (1) {
                    return if waitpid($pid, WNOHANG) > 0;
                    last if time() > $abs_timeout;
                    sleep(1);
                }

                kill(ALRM => $pid);

                $abs_timeout = time() + 15;
                while (1) {
                    return if waitpid($pid, WNOHANG) > 0;
                    last if time() > $abs_timeout;
                    sleep(1);
                }

                kill(KILL => $pid);
                waitpid($pid, 0);
            }

            while (1) {
                for my $runCommand (@runArray) {
                    $forkMgr->start($runCommand) and next;
                    my $pid = open3('<&STDIN', '>&STDOUT', '>&STDERR',
                        "/usr/localcw/opt/patrol/nagios/libexec/$runCommand");
                    wait_for_test_to_end($pid);
                    $forkMgr->finish($? & 0x7F ? 0x80 | ($? & 0x7F) : $? >> 8);
                }
                $forkMgr->wait_all_children;
                sleep 10;
            }

        And you're back to having a useless process between the manager and the test.

        On the plus side, you can use more complex conditions than a simple timeout. You can also forcibly kill the process if it doesn't respond to SIGALRM as the above demonstrates.
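
For readers puzzling over the expression passed to finish() above, here is a minimal sketch of what it computes, pulled out into a helper. The name exit_code_from_status and the sample statuses are made up for illustration. It assumes the usual Unix layout of $?, with the terminating signal in the low 7 bits and the exit code in the high byte.

```perl
use strict;
use warnings;

# Turn a raw wait status (the value of $? after waitpid) into a
# shell-style exit code: 128 + signal number if a signal ended the
# child, otherwise the child's own exit code.
sub exit_code_from_status {
    my ($status) = @_;
    my $signal = $status & 0x7F;          # low 7 bits: terminating signal, if any
    return $signal ? (0x80 | $signal)     # killed by signal N -> 128 + N
                   : ($status >> 8);      # normal exit -> high byte
}

# Hypothetical statuses, for illustration only.
printf "%d\n", exit_code_from_status(0);        # clean exit, prints 0
printf "%d\n", exit_code_from_status(3 << 8);   # exit(3), prints 3
printf "%d\n", exit_code_from_status(14);       # killed by signal 14 (SIGALRM on Linux), prints 142
```

With this convention, a run_on_finish callback can tell "test exited with code N" apart from "test was killed by the timeout", since the latter shows up as a value above 128.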

Re: Parallel::ForkManager and wait_all_children ( wait_for_available_procs )
by Anonymous Monk on May 13, 2015 at 00:14 UTC

    don't use wait_all_children :) problem solved :P

    Try wait_for_available_procs

      Aside from starting the 10s cooldown while tests are still running, that would just postpone when the harness will hang.
Re: Parallel::ForkManager and wait_all_children
by rgren925 (Beadle) on May 13, 2015 at 23:02 UTC

    Thanks all for the replies.

    I've tried to synthesize the different approaches.

    To flesh this out a bit, I'm planning on using callbacks (run_on_wait) to manage notifying/killing a hung process. Using the alarm(TIMEOUT) doesn't really solve my problem. If I set the timeout to 60, no other processes will run until that 60 seconds has elapsed as the wait_all_children still isn't satisfied.

    wait_for_available_procs (which was newer than my version of Parallel::ForkManager--so I upgraded) didn't seem to make any difference.

    The callbacks indicate that everything stalls until the looping test4.sh script is killed.

        use strict;
        use warnings;
        use Parallel::ForkManager;

        use constant TIMEOUT => 60;

        my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
        my ($pid, $exitCode, $ident);
        my $forkMgr = Parallel::ForkManager->new(3);

        $forkMgr->run_on_start( sub {
            ($pid, $ident) = @_;
            print "Started ==> $ident\n";
        } );
        $forkMgr->run_on_finish( sub {
            ($pid, $exitCode, $ident) = @_;
            print "Ended ==> $ident\n";
        } );

        while (1) {
            for my $runCommand (@runArray) {
                $forkMgr->start($runCommand) and next;
                alarm(TIMEOUT);
                system("/usr/localcw/opt/patrol/nagios/libexec/$runCommand")
                    or die ("exec: $!\n");
            }
            $forkMgr->wait_all_children;
            sleep 10;
        }
        exit;

      If I set the timeout to 60, no other processes will run until that 60 seconds has elapsed as the wait_all_children still isn't satisfied.

      Only in the rare instances when it hangs, and only because it takes that long for my method to detect that a process has become hung. That's as good as it gets without inside knowledge of the tests being run. If you know more about the tests being run (especially if you have the power to change them), then a much more responsive solution can be created.


      wait_for_available_procs (which was newer than my version of Parallel::ForkManager--so I upgraded) didn't seem to make any difference.

      wait_for_available_procs(3) won't make a difference.

      wait_for_available_procs(1) will make a difference, but it introduces a bug and merely postpones the problem.


      You introduced some major bugs in the code. Check your process list when it runs.

      1. Call finish after system

      2. You're killing the wrong process. You're not killing the child that's running the test. You're going to end up with lots of hung processes running. I already showed how to send the signal to the right process, and I already showed a much, much simpler solution.

        I am now using your code posted above (using open3 instead of exec/system, etc.):

            use strict;
            use warnings;
            use Parallel::ForkManager;
            use IPC::Open3 qw( open3 );
            use POSIX qw( WNOHANG );

            use constant TIMEOUT => 120;

            my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
            my ($pid, $exitCode, $ident);
            my $currentTime;
            my $forkMgr = Parallel::ForkManager->new(3);

            $forkMgr->run_on_start( sub {
                ($pid, $ident) = @_;
                print "$currentTime Started ==> $ident\n";
            } );
            $forkMgr->run_on_finish( sub {
                ($pid, $exitCode, $ident) = @_;
                print "$currentTime Ended ==> $ident\n";
            } );

            while (1) {
                $currentTime = localtime();
                for my $runCommand (@runArray) {
                    $forkMgr->start($runCommand) and next;
                    my $pid = open3('<&STDIN', '>&STDOUT', '>&STDERR',
                        "/usr/localcw/opt/patrol/nagios/libexec/$runCommand");
                    wait_for_test_to_end($pid);
                    $forkMgr->finish($? & 0x7F ? 0x80 | ($? & 0x7F) : $? >> 8);
                }
                $forkMgr->wait_all_children;
                sleep 10;
            }
            exit;

            sub wait_for_test_to_end {
                my ($pid) = @_;

                my $abs_timeout = time() + TIMEOUT;
                while (1) {
                    return if waitpid($pid, WNOHANG) > 0;
                    last if time() > $abs_timeout;
                    sleep(1);
                }

                kill(ALRM => $pid);

                $abs_timeout = time() + 15;
                while (1) {
                    return if waitpid($pid, WNOHANG) > 0;
                    last if time() > $abs_timeout;
                    sleep(1);
                }

                kill(KILL => $pid);
                waitpid($pid, 0);
            }

        Still same behavior.
        The looping test4.sh is still hanging everything up.
        Here's some trace output. The "Started/Ended" statements are coming from the callbacks and the "I am running..." are coming from the test1-5.sh scripts.

            Thu May 14 16:52:40 2015 Started ==> test1.sh
            Thu May 14 16:52:40 2015 Started ==> test2.sh
            Thu May 14 16:52:41 CDT 2015 I am running test1.sh
            Thu May 14 16:52:41 CDT 2015 I am running test2.sh
            Thu May 14 16:52:41 CDT 2015 I am running test3.sh
            Thu May 14 16:52:40 2015 Started ==> test3.sh
            Thu May 14 16:52:40 2015 Ended ==> test3.sh
            Thu May 14 16:52:40 2015 Ended ==> test2.sh
            Thu May 14 16:52:40 2015 Ended ==> test1.sh
            Thu May 14 16:52:40 2015 Started ==> test4.sh
            Thu May 14 16:52:43 CDT 2015 I am running test4.sh
            Thu May 14 16:52:43 CDT 2015 I am running test5.sh
            Thu May 14 16:52:53 CDT 2015 I am running test4.sh
            Thu May 14 16:53:03 CDT 2015 I am running test4.sh
            Thu May 14 16:53:13 CDT 2015 I am running test4.sh
            Thu May 14 16:53:23 CDT 2015 I am running test4.sh
            Thu May 14 16:53:33 CDT 2015 I am running test4.sh
            Thu May 14 16:53:43 CDT 2015 I am running test4.sh
            Thu May 14 16:53:53 CDT 2015 I am running test4.sh
            Thu May 14 16:54:03 CDT 2015 I am running test4.sh
            Thu May 14 16:54:13 CDT 2015 I am running test4.sh
            Thu May 14 16:54:23 CDT 2015 I am running test4.sh
            Thu May 14 16:54:33 CDT 2015 I am running test4.sh
            Thu May 14 16:54:43 CDT 2015 I am running test4.sh
            Thu May 14 16:52:40 2015 Started ==> test5.sh
            Thu May 14 16:52:40 2015 Ended ==> test5.sh
            Thu May 14 16:52:40 2015 Ended ==> test4.sh
            Thu May 14 16:54:56 2015 Started ==> test1.sh
            Thu May 14 16:54:56 2015 Started ==> test2.sh
            Thu May 14 16:54:56 CDT 2015 I am running test1.sh
            Thu May 14 16:54:56 CDT 2015 I am running test2.sh
            Thu May 14 16:54:56 CDT 2015 I am running test3.sh
            Thu May 14 16:54:56 2015 Started ==> test3.sh
            Thu May 14 16:54:56 2015 Ended ==> test2.sh
            Thu May 14 16:54:56 2015 Ended ==> test1.sh
            Thu May 14 16:54:56 2015 Ended ==> test3.sh
            Thu May 14 16:54:56 2015 Started ==> test4.sh
            Thu May 14 16:54:58 CDT 2015 I am running test4.sh
            Thu May 14 16:54:58 CDT 2015 I am running test5.sh
            Thu May 14 16:55:08 CDT 2015 I am running test4.sh
            Thu May 14 16:55:18 CDT 2015 I am running test4.sh
            etc.

        During the 2-minute timeout wait to kill test4.sh, nothing else is happening (not even the run_on_start/finish for test5.sh). I still have 2 forkable processes (of the defined 3) that are not being used, I believe, because Parallel::ForkManager is waiting for all the children to be done. I recognize that one process will be tied up for the timeout value, but I need the other two to continue processing available work (test1-3.sh and test5.sh). I'll take care to ensure test4 doesn't run again while there is one already running (using a hash of running jobs managed by the callbacks).

        That is the crux of my problem.
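
One way to get the behaviour asked for here (the free slots keep taking work while a hung test waits out its timeout) is to poll for finished children instead of blocking until all of them are done. The following is only a sketch under stated assumptions: it uses plain fork and waitpid from core Perl rather than Parallel::ForkManager, the function run_scheduler and its option names (jobs, max_procs, timeout, run_for, poll) are invented for illustration, and the stand-in jobs are tiny perl one-liners rather than the real test scripts. It also has no 10-second cooldown between rounds; a job is restarted as soon as a slot is free and no earlier instance of it is still running.

```perl
use strict;
use warnings;
use POSIX qw( WNOHANG );
use Time::HiRes qw( time sleep );

# Keep up to max_procs children busy for run_for seconds. A job is never
# started while an earlier instance of it is still running, and any child
# older than timeout seconds is killed. All option names are hypothetical.
sub run_scheduler {
    my (%opt) = @_;
    my @queue   = @{ $opt{jobs} };      # each job: { name => ..., cmd => [...] }
    my $stop_at = time() + $opt{run_for};
    my (%running, %started, %killed);   # %running: pid => { name, deadline, killed }

    while (time() < $stop_at) {
        # Reap whatever has finished, without blocking on the rest.
        while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
            delete $running{$pid};
        }

        # Kill (once) any child that has outlived its deadline.
        for my $pid (keys %running) {
            my $child = $running{$pid};
            next if $child->{killed} || time() <= $child->{deadline};
            kill(KILL => $pid);
            $child->{killed} = 1;
            $killed{ $child->{name} }++;
        }

        # Fill free slots, rotating the queue so every job gets a turn.
        my %busy = map { $_->{name} => 1 } values %running;
        for (1 .. @queue) {
            last if keys(%running) >= $opt{max_procs};
            my $job = shift @queue;
            push @queue, $job;
            next if $busy{ $job->{name} };

            my $pid = fork();
            die "fork: $!" if !defined $pid;
            if ($pid == 0) {
                my @cmd = @{ $job->{cmd} };
                exec { $cmd[0] } @cmd or POSIX::_exit(127);
            }
            $running{$pid} = { name => $job->{name}, deadline => time() + $opt{timeout} };
            $busy{ $job->{name} } = 1;
            $started{ $job->{name} }++;
        }

        sleep($opt{poll});
    }

    # Shut down: kill whatever is left and collect it.
    kill(KILL => keys %running) if %running;
    1 while waitpid(-1, 0) > 0;
    return { started => \%started, killed => \%killed };
}

# Demo with stand-in jobs: three that exit at once and one that hangs.
my @jobs = (
    { name => 'quick1', cmd => [ $^X, '-e', '1' ] },
    { name => 'hang',   cmd => [ $^X, '-e', 'sleep 60' ] },
    { name => 'quick2', cmd => [ $^X, '-e', '1' ] },
    { name => 'quick3', cmd => [ $^X, '-e', '1' ] },
);
my $stats = run_scheduler(jobs => \@jobs, max_procs => 3, timeout => 0.5,
                          run_for => 2, poll => 0.1);
printf "%-6s started %d, killed %d\n",
    $_, $stats->{started}{$_} // 0, $stats->{killed}{$_} // 0
    for map { $_->{name} } @jobs;
```

The %busy hash plays the role of the "hash of running jobs" mentioned above: the hung job occupies one slot until its deadline passes, while the quick jobs keep cycling through the other two. Note that waitpid(-1, ...) reaps any child of the process, so this loop would need adjusting in a program that forks for other reasons too.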

Re: Parallel::ForkManager and wait_all_children
by BrowserUk (Patriarch) on May 16, 2015 at 10:08 UTC

    Now the noise of irrelevant side-issues has abated, have you tried setting $pm->set_waitpid_blocking_sleep(0); and see how that changes your results?


Re: Parallel::ForkManager and wait_all_children
by sundialsvc4 (Abbot) on May 13, 2015 at 13:27 UTC

    Another way to do it, at the expense of having twice as many child processes, is to initially fork a child which sets an alarm and then forks the actual child. If the alarm goes off, this process kills its child and then exits (with a return code of 1). It does nothing else but wait, either for its child to exit or for the alarm to go off.

    The way I saw it done was with a small command: timed_exec -t timeout command. That's what the parent process actually executed and waited for. But it was such a handy thing that I saw it being used in a lot of shell scripts, too.

      He already has the extra process, so there wouldn't be any expense.
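
The watchdog idea described above can be sketched in a few lines of core Perl. This is not the timed_exec command mentioned in the post (its source isn't shown here); it is a hypothetical function of the same name that forks the real command, arms an alarm in the waiting process, and kills the command if the alarm fires first. Returning 1 on timeout follows the description above. The use of SIGKILL, and mapping a signal death to 128 + signal number, are assumptions of this sketch.

```perl
use strict;
use warnings;
use POSIX ();

# Run @cmd, killing it if it is still alive after $timeout whole seconds.
# Returns 1 on timeout, otherwise the command's exit code
# (or 128 + signal number if some other signal ended it).
sub timed_exec {
    my ($timeout, @cmd) = @_;

    my $pid = fork();
    die "fork: $!" if !defined $pid;
    if ($pid == 0) {
        # Child: become the command; exit 127 if it cannot be started.
        exec { $cmd[0] } @cmd or POSIX::_exit(127);
    }

    # Parent: wait for the child, but not for longer than $timeout.
    my $timed_out = 0;
    local $SIG{ALRM} = sub {
        $timed_out = 1;
        kill(KILL => $pid);
    };
    alarm($timeout);
    waitpid($pid, 0);
    alarm(0);                       # cancel the alarm if the child won the race
    my $status = $?;

    return 1 if $timed_out;
    return $status & 0x7F ? 0x80 | ($status & 0x7F) : $status >> 8;
}

print timed_exec(5, $^X, '-e', 'exit 3'), "\n";     # finishes in time, prints 3
print timed_exec(1, $^X, '-e', 'sleep 30'), "\n";   # killed after 1 second, prints 1
```

Called from inside a Parallel::ForkManager child, the return value could be handed straight to finish(), so the extra process that is already there does the watching.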
