wait3 (waitpid alias) returns -1 with errno set to ECHILD when it should not
TLDR: you are currently relying on unspecified behaviour of signal(2); use sigaction (carefully) instead.
Firstly, SIGCHLD is strange. From the manual page for sigaction:
POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). Nevertheless, the historical BSD and System V behaviors for ignoring SIGCHLD differ, so that the only completely portable method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and perform a wait(2) or similar.
And here's the bit from wait(2)'s manual page:
POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.) Linux 2.6 conforms to this specification. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.
The upshot is that if the signal's handling behaves as though SIG_IGN were set, then (under Linux 2.6+) you will see exactly the behaviour you are seeing: wait() will return -1 with errno set to ECHILD, because the child will already have been reaped automatically.
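If you want to see that effect in isolation, here is a minimal single-threaded sketch. The helper name wait_result_with_sigchld is my own invention for illustration, and I'm assuming a Linux 2.6+ (or otherwise POSIX.1-2001 conforming) system; it is a demonstration of the documented behaviour, not your code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: set SIGCHLD to `disposition` (SIG_IGN or SIG_DFL),
 * fork one child, wait for it, and return 0 if waitpid() reaped the child
 * or the errno value if waitpid() failed. */
int wait_result_with_sigchld(void (*disposition)(int))
{
    struct sigaction sa, old;
    int result = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = disposition;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old) == -1)
        return errno;

    pid_t pid = fork();
    if (pid == -1) {
        result = errno;
    } else if (pid == 0) {
        _exit(0);                     /* child: terminate immediately */
    } else {
        int status;
        pid_t reaped;

        do {
            reaped = waitpid(pid, &status, 0);
        } while (reaped == -1 && errno == EINTR);

        if (reaped == -1)
            result = errno;           /* ECHILD if the child was auto-reaped */
        else
            assert(reaped == pid);    /* a specific-PID wait returns that PID */
    }

    sigaction(SIGCHLD, &old, NULL);   /* restore the previous disposition */
    return result;
}
```

Calling wait_result_with_sigchld(SIG_IGN) should give you ECHILD, whereas wait_result_with_sigchld(SIG_DFL) should give 0, even though the default action for SIGCHLD is nominally also "ignore".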
Secondly, signal handling with pthreads (which I think you are using here) is notoriously hard. The way it's meant to work (as I'm sure you know) is that process-directed signals get sent to an arbitrary thread within the process that has the signal unmasked. But whilst each thread has its own signal mask, the action handlers are process wide.
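To make the split between per-thread masks and process-wide actions concrete, here is a small sketch. All the names in it (on_usr1, worker, handler_shared_mask_private) are hypothetical, and I've used SIGUSR1 and SIGUSR2 purely as stand-ins; on older C libraries you will need to build it with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

static void on_usr1(int signo) { (void)signo; }   /* hypothetical no-op handler */

struct worker_view {
    int sees_handler;   /* worker sees the action installed by the main thread */
    int blocked_usr2;   /* SIGUSR2 is blocked in the worker's own mask */
};

static void *worker(void *arg)
{
    struct worker_view *view = arg;
    struct sigaction cur;
    sigset_t set, mask;

    assert(view != NULL);

    /* The action table is process wide: query what the main thread installed. */
    sigaction(SIGUSR1, NULL, &cur);
    view->sees_handler = (cur.sa_handler == on_usr1);

    /* The mask is per thread: this change affects only this thread. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    pthread_sigmask(SIG_BLOCK, &set, NULL);
    pthread_sigmask(SIG_SETMASK, NULL, &mask);    /* NULL set: just read the mask */
    view->blocked_usr2 = sigismember(&mask, SIGUSR2);
    return NULL;
}

/* Returns 1 if the handler was shared and the mask change stayed private. */
int handler_shared_mask_private(void)
{
    struct sigaction sa, old;
    struct worker_view view = { 0, 0 };
    sigset_t set, saved, mask;
    pthread_t tid;
    int main_blocked = 1;

    /* Make sure SIGUSR2 is unblocked here, remembering the old mask. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    pthread_sigmask(SIG_UNBLOCK, &set, &saved);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old);

    if (pthread_create(&tid, NULL, worker, &view) == 0) {
        pthread_join(tid, NULL);
        pthread_sigmask(SIG_SETMASK, NULL, &mask);
        main_blocked = sigismember(&mask, SIGUSR2);
    }

    /* Put everything back the way we found it. */
    sigaction(SIGUSR1, &old, NULL);
    pthread_sigmask(SIG_SETMASK, &saved, NULL);

    return view.sees_handler == 1 && view.blocked_usr2 == 1 && main_blocked == 0;
}
```

The worker sees the handler it never installed, while its mask change never leaks back to the thread that created it. Note that a new thread inherits a copy of its creator's mask at creation time, which is why the order in which you set masks and start threads matters.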
Putting these two things together, I think you are running across a problem I've run across before. I have had problems getting SIGCHLD handling to work with signal() (which is fair enough, as that was deprecated prior to pthreads), which were fixed by moving to sigaction and carefully setting per-thread signal masks. My conclusion at the time was that the C library was emulating (with sigaction) what I was telling it to do with signal(), but was getting tripped up by pthreads.
Note that you are currently relying on unspecified behaviour. From the manual page of signal(2):
The effects of signal() in a multithreaded process are unspecified.
Here's what I recommend you do:
- Move to sigaction() and pthread_sigmask(). Explicitly set the handling of all the signals you care about (even if you think that's the current default), even when setting them to SIG_IGN or SIG_DFL. I block signals whilst I do this (possibly an overabundance of caution, but I copied the example from somewhere).
Here's what I am doing (roughly):
    sigset_t set;
    struct sigaction sa;

    /* block all signals */
    sigfillset (&set);
    pthread_sigmask (SIG_BLOCK, &set, NULL);

    /* Set up the structure to specify the new action. */
    memset (&sa, 0, sizeof (struct sigaction));
    sa.sa_handler = handlesignal; /* signal handler for INT, TERM, HUP, USR1, USR2 */
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction (SIGINT, &sa, NULL);
    sigaction (SIGTERM, &sa, NULL);
    sigaction (SIGHUP, &sa, NULL);
    sigaction (SIGUSR1, &sa, NULL);
    sigaction (SIGUSR2, &sa, NULL);

    sa.sa_handler = SIG_IGN;
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction (SIGPIPE, &sa, NULL); /* I don't care about SIGPIPE */

    sa.sa_handler = SIG_DFL;
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction (SIGCHLD, &sa, NULL); /* I want SIGCHLD to be handled by SIG_DFL */

    pthread_sigmask (SIG_UNBLOCK, &set, NULL);
- Where possible, set up all your signal handlers and masks prior to any pthread operations. Where possible, do not change signal handlers and masks afterwards (you might need to do this prior to and subsequent to fork() calls).
- If you need a signal handler for SIGCHLD (rather than relying on SIG_DFL), if possible let it be received by any thread, and use the self-pipe method or similar to alert the main program.
- If you must have threads that do/don't handle certain signals, try to restrict yourself to pthread_sigmask in the relevant thread rather than sig* calls.
- Just in case you run headlong into the next issue I ran into, ensure that after you have fork()'d, you set up the signal handling again from scratch (in the child) rather than relying on whatever you might inherit from the parent process. If there's one thing worse than signals mixed with pthread, it's signals mixed with pthread with fork().
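For the self-pipe suggestion and the "reset everything in the child" suggestion, a rough single-threaded sketch might look like the following. The names self_pipe, on_sigchld and reap_child_via_self_pipe are hypothetical, and I'm assuming SIGCHLD is not blocked in the calling thread; a real program would normally poll() or select() on the read end from its main loop rather than block in read().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };   /* [0] read end, [1] write end */

/* Async-signal-safe handler: it only calls write(2) and preserves errno. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t n = write(self_pipe[1], &byte, 1);   /* non-blocking write end */
    (void)n;                                     /* a full pipe is harmless */
    errno = saved_errno;
}

/* Hypothetical demo: fork a child that exits with `code`, learn of its
 * termination through the self-pipe, reap it, and return its exit status
 * (or -1 on error). */
int reap_child_via_self_pipe(int code)
{
    struct sigaction sa, old;
    int result = -1;

    assert(code >= 0 && code <= 255);   /* exit statuses are 8 bits wide */

    if (pipe(self_pipe) == -1)
        return -1;
    /* Never let the handler block on a full pipe. */
    fcntl(self_pipe[1], F_SETFL, fcntl(self_pipe[1], F_GETFL) | O_NONBLOCK);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;

    if (sigaction(SIGCHLD, &sa, &old) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: rebuild signal state from scratch, drop inherited fds. */
            struct sigaction dfl;
            memset(&dfl, 0, sizeof dfl);
            dfl.sa_handler = SIG_DFL;
            sigemptyset(&dfl.sa_mask);
            sigaction(SIGCHLD, &dfl, NULL);
            close(self_pipe[0]);
            close(self_pipe[1]);
            _exit(code);
        } else if (pid > 0) {
            unsigned char byte;
            ssize_t n;

            /* Parent: wait for the handler's wake-up byte. */
            do {
                n = read(self_pipe[0], &byte, 1);
            } while (n == -1 && errno == EINTR);

            if (n == 1) {
                int status;
                pid_t reaped;
                do {
                    reaped = waitpid(pid, &status, 0);
                } while (reaped == -1 && errno == EINTR);
                if (reaped == pid && WIFEXITED(status))
                    result = WEXITSTATUS(status);
            }
        }
        sigaction(SIGCHLD, &old, NULL);   /* restore the previous action */
    }

    close(self_pipe[0]);
    close(self_pipe[1]);
    self_pipe[0] = self_pipe[1] = -1;
    return result;
}
```

The point of the design is that the handler does almost nothing, so it doesn't matter much which thread receives the signal; all the real work (the waitpid() call) happens in ordinary program context.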
Note that I cannot explain exactly why the first change (moving to sigaction() and pthread_sigmask()) works, but it has fixed what looks like a very similar issue for me, and the code was after all relying on something that was 'unspecified' previously. It's closest to your 'hypothesis 2', but I think it is really incomplete emulation of legacy signal functions (specifically, emulating the previously racy behaviour of signal(), which is what caused it to be replaced by sigaction() in the first place; but this is just a guess).
Incidentally, I suggest you use wait4() or (as you aren't using rusage) waitpid() rather than wait3(), so you can specify a specific PID to wait for. If you have something else that generates children (I've had a library do it), you may end up waiting for the wrong thing. That said, I don't think that's what's happening here.
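As a sketch of what waiting for a specific PID buys you, the hypothetical function below forks an extra child first (standing in for one spawned by a library) and then waits only for the child it actually cares about. The function name and exit codes are made up for illustration, and it leaves SIGCHLD set to SIG_DFL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the exit status of the child we care about
 * (which exits with `code`), or -1 on error. */
int wait_for_specific_child(int code)
{
    struct sigaction sa;
    int result = -1;

    /* Explicitly request default SIGCHLD handling so children stay waitable. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    pid_t other = fork();            /* stand-in for somebody else's child */
    if (other == -1)
        return -1;
    if (other == 0)
        _exit(1);

    pid_t target = fork();           /* the child we actually care about */
    if (target == 0)
        _exit(code);

    if (target > 0) {
        int status;
        pid_t reaped;

        /* Unlike wait3(), this cannot hand us the status of `other`. */
        do {
            reaped = waitpid(target, &status, 0);
        } while (reaped == -1 && errno == EINTR);

        if (reaped != -1) {
            assert(reaped == target);
            if (WIFEXITED(status))
                result = WEXITSTATUS(status);
        }
    }

    /* Reap the other child as well so it does not linger as a zombie. */
    while (waitpid(other, NULL, 0) == -1 && errno == EINTR)
        ;
    return result;
}
```

With wait3() or plain wait() in place of the first waitpid(), whichever child happened to terminate first would be returned, which is how you end up with the status of a process you never meant to wait for.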