http://qs321.pair.com?node_id=832365


in reply to Re^2: No child processes - system limit?
in thread No child processes - system limit?

select(8, [3], NULL, NULL, {172, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
waitpid(14232, 0xbfb45be8, WNOHANG) = 0
waitpid(14233, 0xbfb45be8, WNOHANG) = 0
waitpid(14225, 0xbfb45be8, WNOHANG) = -1 ECHILD (No child processes)
...

My interpretation of this would be (as you already figured) that $! is being modified in the signal handler before the interrupted select call gets a chance to be restarted, i.e. the redo SELECT doesn't execute because of that very modification of $!.

(Note that because of Perl's deferred (aka safe) signal handling, the sigreturn() (which is called at the end of the "real" system/C-level signal handler) happens immediately, before the Perl signal handler runs all the waitpid calls. Still, they do run before the next Perl opcode executes, which presumably means before the if ($!{EINTR} || $!{EAGAIN}) check.)
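To make the failure mode concrete, here is a minimal sketch of the kind of retry loop I have in mind. The helper name wait_readable and its arguments are my own invention; only the SELECT label and the $!{EINTR} || $!{EAGAIN} test come from your code as I understand it.

```perl
use strict;
use warnings;

# Hypothetical helper: wait until $fh is readable or $timeout seconds pass.
sub wait_readable {
    my ($fh, $timeout) = @_;
    my $rin = '';
    vec($rin, fileno($fh), 1) = 1;

    SELECT: {
        my $rout   = $rin;    # select() modifies its bit vector in place
        my $nfound = select($rout, undef, undef, $timeout);
        if ($nfound < 0) {
            # Retry only while errno still says "interrupted". A signal
            # handler that overwrites $! (e.g. with ECHILD) defeats this.
            redo SELECT if $!{EINTR} || $!{EAGAIN};
            die "select failed: $!";
        }
        return $nfound;
    }
}
```

If a SIGCHLD handler runs between the failed select and the test of $!, and a waitpid in it fails with ECHILD, the redo is skipped and the code falls through to the die.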

What I find a little surprising is that the ECHILD occurs at all, because your $Children{$pid} should've been set to zero in the previous call to the signal handler:

waitpid(14225, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 14225

where the waitpid did return 14225 (i.e. $res > 0). In other words, you shouldn't be calling waitpid(14225,...) again thereafter, because 14225 is no longer supposed to be in the hash... (Update: err, wait, this is nonsense of course, as you're iterating over the keys, not the values. OTOH, this raises the question of what would happen if you set the values to the PIDs too, and then iterated over the values instead, as you seem to be getting that panic when deleting the keys.)

Maybe you could try to figure out why this is — in addition to trying to localize $! as a workaround, of course.
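For the workaround, something along these lines is what I mean. This is only a sketch: the handler name reap_children and the "value 0 means reaped" convention are my guesses at your code, not a copy of it.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %Children;    # keys are child PIDs; value 0 marks a reaped child (assumed)

sub reap_children {
    local $!;    # a failing waitpid must not leak ECHILD into the interrupted code
    for my $pid (keys %Children) {
        my $res = waitpid($pid, WNOHANG);
        $Children{$pid} = 0 if $res > 0;    # child was reaped just now
    }
}

$SIG{CHLD} = \&reap_children;
```

With local $! in place, whatever value $! held when the handler was entered is restored when it returns, so the EINTR test after the interrupted select still sees EINTR.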

Re^4: No child processes - system limit?
by clinton (Priest) on Apr 01, 2010 at 19:44 UTC
    local'ising $! seems to have sorted out that issue, revealing the real error that is happening on the remote process.

    Re your other point, yes - deleting keys in the hash causes a panic, but I'll change the loop to call waitpid only for those keys that have true values, which should help.

    thanks
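A loop that only polls keys with true values might look roughly like this. It is a sketch under assumptions: the helper name reap_live and the hashref layout (pid => true while running) are hypothetical, and a finished child is flagged with 0 rather than deleted, to stay clear of the panic mentioned above.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: poll only children still marked as running and
# flag the finished ones instead of deleting their keys.
sub reap_live {
    my ($children) = @_;    # hashref: pid => true while running
    local $!;               # keep waitpid errors out of the caller's $!
    my @reaped;
    for my $pid (grep { $children->{$_} } keys %$children) {
        my $res = waitpid($pid, WNOHANG);
        next if $res == 0;                 # still running
        $children->{$pid} = 0;             # reaped, or not our child (-1)
        push @reaped, $pid if $res > 0;
    }
    return @reaped;
}
```

Called as reap_live(\%Children) from the handler, it skips PIDs that were already reaped, so the repeated waitpid that produced ECHILD in the trace should no longer happen.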