Why would `killpg` return “not permitted” when ownership is correct?


I added some debugging too (slightly modified source). It happens when you try to kill a process group whose leader has already exited and is in zombie state. Oh, and it's easily repeatable with just [fast, fast].

    $ python so.py
    spawned pgrp 6035
    spawned pgrp 6036
    Reaped pid: 6036, status: 0
    6035  6034  6035 Z    (Python)
    6034   521  6034 S+   python so.py
    6037  6034  6034 S+   sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
    6039  6037  6034 R+   grep -i python
    killing pg 6035
    Error killing 6035: [Errno 1] Operation not permitted
    6035  6034  6035 Z    (Python)
    6034   521  6034 S+   python so.py
    6040  6034  6034 S+   sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
    6042  6040  6034 S+   grep -i python
    killing pg 6036
    Error killing 6036: [Errno 3] No such process

Not sure how to deal with that. Maybe you can put the waitpid in a while loop to reap all terminated child processes, and then proceed with killpg()ing the rest.
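The reap-then-kill idea could be sketched roughly like this (Python 3). The function names are made up for illustration, and the sketch assumes every child is the leader of its own process group, so that a child's PID doubles as its process group ID. A child can still exit between the reap and the kill, so this narrows the window rather than closing it.

```python
import os
import signal


def reap_finished_children():
    """Reap every child that has already exited, without blocking.

    Returns the set of PIDs that were reaped.
    """
    reaped = set()
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # ECHILD: this process has no children left at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.add(pid)
    return reaped


def reap_then_killpg(child_pids, sig=signal.SIGTERM):
    """Reap finished children first, then signal the groups of the rest.

    Assumes each PID in child_pids is also a process group ID.
    Returns (reaped, signalled).
    """
    reaped = reap_finished_children()
    signalled = []
    for pid in child_pids:
        if pid in reaped:
            continue  # already gone, nothing to signal
        try:
            os.killpg(pid, sig)
            signalled.append(pid)
        except OSError as e:
            print("Error killing %s: %s" % (pid, e))
    return reaped, signalled
```

In the script from the question, this would stand in for the single waitpid(-1, 0) call followed by the killpg loop.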

But the answer to your question is you're getting EPERMs because you're not allowed to killpg a zombie process group leader (at least on Mac OS).

Also, this is verifiable outside python. If you put a sleep in there, find the pgrp of one of those zombies, and attempt to kill its process group, you also get EPERM:

    $ kill -TERM -6115
    -bash: kill: (-6115) - Operation not permitted

Also confirmed that this doesn't happen on Linux.


You apparently can't kill a process group that consists of zombies. When a process exits, it becomes a zombie until someone calls waitpid on it. Typically, init will take ownership of children whose parents have died, to avoid orphan zombie children.

So, the process is still "around" in some sense, but it gets no CPU time and ignores any kill commands sent directly to it. If a process group consists entirely of zombies, however, the behaviour appears to be that killing the process group throws EPERM instead of silently failing. Note that killing a process group containing non-zombies still succeeds.

Example program demonstrating this:

    import os
    import time

    res = os.fork()
    if res:
        time.sleep(0.2)
        pgid = os.getpgid(res)
        print pgid
        while 1:
            try:
                print os.kill(-pgid, 9)
            except Exception, e:
                print e
                break
        print 'wait', os.waitpid(res, 0)
        try:
            print os.kill(-pgid, 9)
        except Exception, e:
            print e
    else:
        os.setpgid(0, 0)
        while 1:
            pass

The output looks like

    56621
    None
    [Errno 1] Operation not permitted
    wait (56621, 9)
    [Errno 3] No such process

The parent kills the child with SIGKILL, then tries again. The second time, it gets EPERM, so it waits for the child (reaping it and destroying its process group). So, the third kill produces ESRCH as expected.
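For checking a given platform, here is a bounded Python 3 sketch of the same experiment. The function name is hypothetical, and the short sleep is the same crude synchronisation the program above uses. Instead of looping until an error appears, it signals the zombie-only group once before and once after reaping, and reports what each attempt returned. Going by the reports in this thread, the first value would be EPERM on Mac OS and something else on Linux.

```python
import errno
import os
import signal
import time


def probe_zombie_group_kill():
    """Signal a process group whose only member is a zombie.

    Returns (before, after): the errno name raised by killpg before and
    after the zombie is reaped, or None where killpg raised nothing.
    """
    pid = os.fork()
    if pid == 0:
        os.setpgid(0, 0)  # child becomes leader of its own group
        os._exit(0)       # ... and exits at once, leaving a zombie

    time.sleep(0.2)  # crude: give the child time to exit

    def attempt():
        try:
            os.killpg(pid, signal.SIGKILL)
            return None
        except OSError as e:
            return errno.errorcode[e.errno]

    before = attempt()   # group contains only the zombie
    os.waitpid(pid, 0)   # reap it, which destroys the group
    after = attempt()    # group no longer exists
    return before, after


if __name__ == "__main__":
    print(probe_zombie_group_kill())
```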


From adding more logging, it looks like sometimes killpg returns EPERM instead of ESRCH:

    #!/usr/bin/python
    from signal import SIGTERM
    from sys import exit
    from time import sleep
    from os import *

    def slow():
        fork()
        sleep(10)

    def fast():
        sleep(1)

    child_pids = []
    for child_func in [fast, slow, slow, fast]:
        pid = fork()
        if pid == 0:
            setsid()
            print child_func, getpid(), getuid(), geteuid()
            child_func()
            exit(0)
        else:
            child_pids.append(pid)

    print waitpid(-1, 0)
    for child_pid in child_pids:
        try:
            print child_pid, getpgid(child_pid)
        except OSError as e:
            print "Error getpgid %s: %s" %(child_pid, e)
        try:
            killpg(child_pid, SIGTERM)
        except OSError as e:
            print "Error killing %s: %s" %(child_pid, e)

Whenever killpg fails with EPERM, getpgid has previously failed with ESRCH. For example:

    <function fast at 0x109950d70> 26561 503 503
    <function slow at 0x109950a28> 26562 503 503
    <function slow at 0x109950a28> 26563 503 503
    <function fast at 0x109950d70> 26564 503 503
    (26564, 0)
    26561 Error getpgid 26561: [Errno 3] No such process
    Error killing 26561: [Errno 1] Operation not permitted
    26562 26562
    26563 26563
    26564 Error getpgid 26564: [Errno 3] No such process
    Error killing 26564: [Errno 3] No such process

I have no idea why this happens, or whether it's legal behavior or a bug in Darwin (inherited from FreeBSD or otherwise).

It seems like you could work around it by double-checking an EPERM with kill(child_pid, 0): if that fails with ESRCH, there's no actual permission problem. Of course, this looks pretty ugly in the code:

    for child_pid in child_pids:
        try:
            killpg(child_pid, SIGTERM)
        except OSError as e:
            if e.errno != 3: # 3 == no such process
                if e.errno == 1:
                    try:
                        kill(child_pid, 0)
                    except OSError as e2:
                        if e2.errno != 3:
                            print "Error killing %s: %s" %(child_pid, e)
                else:
                    print "Error killing %s: %s" %(child_pid, e)
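The same check might read a little better as a helper with named errno constants. This is only a Python 3 sketch of the loop above: the name killpg_quiet is made up, and it keeps the same decision rule, treating an EPERM as a real error only when the kill(pid, 0) probe itself fails with something other than ESRCH.

```python
import errno
import os
import signal


def killpg_quiet(pgid, sig=signal.SIGTERM):
    """Send sig to process group pgid.

    Returns True if the signal was sent and False if the failure looks
    harmless (group already gone, or the spurious EPERM discussed above).
    Any other error is re-raised.
    """
    try:
        os.killpg(pgid, sig)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # no such process group
        if e.errno != errno.EPERM:
            raise
        try:
            # Signal 0 sends nothing; it only checks existence/permission.
            os.kill(pgid, 0)
        except OSError as probe_error:
            if probe_error.errno != errno.ESRCH:
                # The probe failed for another reason, so report the
                # original EPERM as a genuine permission problem.
                raise e from None
        return False
    return True
```

The loop then shrinks to calling killpg_quiet(child_pid) for each child and handling whatever it raises.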