http://qs321.pair.com?node_id=832300

clinton has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

When running multiple HTTP requests via LWP, I occasionally get this error: select failed: no child processes. If I catch the error, and wait a little, I can repeat the request successfully.

The code from LWP::Protocol::http which throws this error is this:

    SELECT:
    {
        my $nfound = select($rbits, $wbits, undef, $sel_timeout);
        if ($nfound < 0) {
            if ($!{EINTR} || $!{EAGAIN}) {
                if ($time_before) {
                    $sel_timeout = $sel_timeout_before - (time - $time_before);
                    $sel_timeout = 0 if $sel_timeout < 0;
                }
                redo SELECT;
            }
            die "select failed: $!";
        }
    }

It seems I am running into some system limit, but I can't figure out which one. ulimit -a on this (Linux) system outputs this:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 27968
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 16384
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 27968
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

And the limits as reported by BSD::Resource are as follows:

    RLIMIT_CPU      : -1
    RLIMIT_OPEN_MAX : 16384
    RLIMIT_LOCKS    : -1
    RLIMIT_VMEM     : -1
    RLIMIT_FSIZE    : -1
    RLIMIT_STACK    : 10485760
    RLIMIT_MEMLOCK  : 32768
    RLIMIT_NOFILE   : 16384
    RLIMIT_DATA     : -1
    RLIMIT_NPROC    : 27968
    RLIMIT_OFILE    : 16384
    RLIMIT_AS       : -1
    RLIMIT_CORE     : 0
    RLIMIT_RSS      : -1

Any ideas what I can change to get around this?

thanks

Clint

Replies are listed 'Best First'.
Re: No child processes - system limit?
by almut (Canon) on Apr 01, 2010 at 13:57 UTC

    Normally, you'd get this error (ECHILD) if you wait for a child, but there is no child, e.g.

    $ perl -e 'die $! if wait == -1'
    No child processes at -e line 1.

    In other words, I'm not sure this is (directly) related to a resource limit at all... (though, of course, it might be a follow-up error of some code waiting for a child that was never created because of a resource limit such as memory or max children per user).

      Well, the reason I'm thinking resource limit is that this only occurs when busy, and then a couple of seconds later it works fine again.

      The docs for select indicate that this is the select(2) system call, but the docs for that say the following:

      ...On error, -1 is returned, and errno is set appropriately;...

      and lists the following errors:
      • EBADF
        An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.)
      • EINTR
        A signal was caught; see signal(7).
      • EINVAL
        nfds is negative or the value contained within timeout is invalid.
      • ENOMEM
        unable to allocate memory for internal tables.
      ... none of which corresponds to "No child processes", which leaves me at a bit of a loss.

        What do you get for

        $ getconf CHILD_MAX

        (or getconf -a, just in case...)

Re: No child processes - system limit?
by ikegami (Patriarch) on Apr 01, 2010 at 16:52 UTC

    Do you have any signal handlers?

    Are you using fork, system, threads or some means of parallelising?

      Yes - in the parent process, I'm reading 5000 records from a source, then forking off a child to reindex each of those 5000 records. The parent forks $max_kids processes, recording the PIDs in a hash, then waits until there are fewer than $max_kids active.

      My reaper looks like this:

      #===================================
      sub _REAPER {
      #===================================
          my $params = shift;
          foreach my $pid ( keys %Children ) {
              my $res = waitpid( $pid, WNOHANG );
              if ( $res > 0 ) {
                  $Children{$pid} = 0;
                  die "Error in child" if $?;
              }
          }
          $SIG{'CHLD'} = \&_REAPER;
      }

      Note, in the reaper, I set $Children{$pid} = 0 instead of deleting the key, as that was causing panic: freed scalar errors. I now clean up the %Children hash in the main loop of the parent.

      The error I'm seeing is at the stage in the parent when I'm reading the 5,000 records from the source

      thanks

      Clint

      At the suggestion of moritz, I ran the script with strace, the relevant bits of which are as follows:

      Here is where the parent child makes the request:

      At this stage, my code catches the select failed: no child processes error in an eval, issues a warning, then sleeps before retrying:

      I'm not sure what most of this means, but is the value of $! being set to "no child processes" by one of my waitpid calls, which is interfering with the code in LWP::Protocol::http? Would it help if I localised $! in my reaper sub?

        Would it help if I localised $! in my reaper sub?

        I believe so. That's exactly where I was going with my question.

        select(8, [3], NULL, NULL, {172, 0})       = ? ERESTARTNOHAND (To be restarted)
        --- SIGCHLD (Child exited) @ 0 (0) ---
        sigreturn()                                = ? (mask now [])
        rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
        waitpid(14232, 0xbfb45be8, WNOHANG)        = 0
        waitpid(14233, 0xbfb45be8, WNOHANG)        = 0
        waitpid(14225, 0xbfb45be8, WNOHANG)        = -1 ECHILD (No child processes)
        ...

        My interpretation of this would be (as you already figured) that $! is being modified in the signal handler before the interrupted select call gets a chance to be restarted, i.e. the redo SELECT doesn't execute because of that very modification of $!.

        (Note that because of Perl's deferred (aka safe) signal handling, the sigreturn(), which is called at the end of the "real" system/C-level signal handler, happens immediately, before the Perl signal handler runs all the waitpid calls. Still, they do run before the next Perl opcode executes, which presumably means before if ($!{EINTR} || $!{EAGAIN}) is evaluated.)
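        The effect described above can be reproduced without LWP or a real signal. The following is only a minimal sketch: naive_reaper, safe_reaper and errno_after are hypothetical names made up for this illustration, and setting $! to EINTR by hand merely simulates an interrupted select. It assumes the script has no child processes, so waitpid fails with ECHILD.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Errno qw(EINTR);

# Hypothetical handlers for illustration only. This process has no
# children, so waitpid returns -1 and sets $! to ECHILD.
sub naive_reaper { my $res = waitpid( -1, WNOHANG ); return $res }
sub safe_reaper  { local ( $!, $? ); my $res = waitpid( -1, WNOHANG ); return $res }

# Simulate an interrupted select by setting $! to EINTR, run the
# handler, then report which errno the interrupted code would see.
sub errno_after {
    my ($handler) = @_;
    $! = EINTR;
    $handler->();
    return $!{EINTR} ? 'EINTR' : $!{ECHILD} ? 'ECHILD' : 'other';
}

print "naive: ", errno_after( \&naive_reaper ), "\n";
print "safe:  ", errno_after( \&safe_reaper ),  "\n";
```

        With the naive handler the caller sees ECHILD instead of EINTR, so a check like the one in LWP::Protocol::http would fall through to die. With local ($!, $?) the caller still sees EINTR and can redo the select.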

        What I find a little surprising is that the ECHILD does occur at all, because your $Children{$pid} should've been set to zero in the previous call to the signal handler

        waitpid(14225, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 14225

        where the waitpid did return 14225 (i.e. $res > 0). In other words, you shouldn't be calling waitpid(14225,...) again thereafter, because 14225 is no longer supposed to be in the hash...  (Update: err, wait, this is nonsense of course, as you're iterating over the keys, not the values. OTOH, this raises the question of what would happen if you set the values to the PIDs, too, and then iterated over the values instead, as you seem to be getting that panic when deleting the keys.)

        Maybe you could try to figure out why this is — in addition to trying to localize $! as a workaround, of course.
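        For reference, here is a minimal sketch of the reaper with the workaround applied. It assumes the %Children layout described earlier in the thread (PIDs as keys, value reset to 0 once a child has been reaped). The next unless guard is an addition of this sketch, not part of the posted code and not quite the iterate-over-values idea either; it simply skips children that have already been reaped, so waitpid is not called on them again.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Assumed layout: pid => true while running, 0 once reaped.
our %Children;

sub _REAPER {
    # Restore $! and $? on exit so the interrupted code (for example
    # the select loop in LWP::Protocol::http) still sees its own errno.
    local ( $!, $? );

    foreach my $pid ( keys %Children ) {
        next unless $Children{$pid};    # already reaped: do not wait again
        my $res = waitpid( $pid, WNOHANG );
        if ( $res > 0 ) {
            $Children{$pid} = 0;        # the main loop cleans up the hash later
            die "Error in child" if $?;
        }
    }
    $SIG{'CHLD'} = \&_REAPER;
}
```

        The handler would be installed as before with $SIG{'CHLD'} = \&_REAPER. The same local ($!, $?) idiom appears in the signal handler examples in perlipc.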