
Defunct processes using CPU


A zombie process (i.e. one that is 'defunct') does not consume CPU: it is simply retained by the kernel so that the parent process can retrieve information about it (e.g. exit status, resource usage, etc.).

The CPU usage indicated by the ps command is the CPU time the process accumulated while it was running: that is, before it terminated and became a zombie.
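To make both points concrete, here is a minimal Python sketch (hypothetical and self-contained; the exit status 7 and the timings are arbitrary): the child burns some CPU and exits, the parent deliberately delays its wait, so ps shows a zombie whose TIME is frozen at the value accumulated before exit; wait4() then reaps it and returns the saved status and rusage.

import os
import time

pid = os.fork()
if pid == 0:
    # Child: burn roughly 2 seconds of CPU, then exit without being waited for.
    start = time.process_time()
    while time.process_time() - start < 2:
        pass
    os._exit(7)

# Parent: deliberately don't wait() yet; the child becomes a zombie once it exits.
time.sleep(3)
os.system(f"ps -o pid,stat,time,comm -p {pid}")   # STAT shows Z; TIME stays frozen

# wait4() reaps the zombie and hands back the status and rusage the kernel kept.
_, status, rusage = os.wait4(pid, 0)
print("exit status:", os.WEXITSTATUS(status))
print("child CPU seconds:", rusage.ru_utime + rusage.ru_stime)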


Those are zombie processes, as indicated by the Z in the STAT column. They won't be cleaned up until their parent process waits for them, or the parent itself terminates, at which point init adopts and reaps them. I don't know much about Python, but presumably you called fork or similar within your Python interpreter to spawn them. Kill the interpreter and the zombies will be reaped (cleaned up), as the sketch below demonstrates.
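A minimal sketch of that life cycle (hypothetical, assuming a Linux box with ps available): an intermediate parent leaves a child un-waited-for, so the child lingers as a zombie; once the parent is killed, init adopts the zombie and reaps it.

import os
import signal
import time

parent = os.fork()
if parent == 0:
    # Intermediate parent: fork a child that exits at once and never wait() for it.
    child = os.fork()
    if child == 0:
        os._exit(0)     # this child immediately becomes a zombie
    time.sleep(60)      # keep the zombie around by staying alive without waiting
    os._exit(0)

time.sleep(1)
os.system("ps -e -o pid,ppid,stat,comm | grep Z")   # zombie visible, PPID = intermediate parent
os.kill(parent, signal.SIGTERM)                     # terminate the parent
os.waitpid(parent, 0)
time.sleep(1)
os.system("ps -e -o pid,ppid,stat,comm | grep Z")   # zombie re-parented to init and reaped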

Try the "top" command if you want up to date info on CPU.

Also, as an aside, I prefer the output of "ps -ef" rather than "ps aux". aux always struck me as a nonstandard hack (hence the lack of a '-' separating command and argument), and it also fails to work on a lot of other Unix systems, such as HP-UX, AIX, etc.

"ps -ef" shows ppid (parent pid) which helps you track down problems like this.


Interestingly, and perhaps confusingly, I have a zombie process as of this moment which is accumulating CPU time on my system. So the question is: why? Common wisdom is that a zombie shown by ps consumes nothing but its process table entry; from Wikipedia: "...zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the 'Terminated state'." And from unix.stackexchange (https://unix.stackexchange.com/questions/11172/how-can-i-kill-a-defunct-process-whose-parent-is-init): "Zombie processes take up almost no resources so there is no performance cost in letting them linger."

So I have a zombie process:

# ps -e -o pid,ppid,stat,comm| grep Z
 7296     1 Zl   myproc <defunct>

Which appears to be using CPU time:

# ps -e -o pid,ppid,bsdtime,stat,comm| grep Z; sleep 10; ps -e -o pid,ppid,bsdtime,stat,comm | grep Z
 7296     1  56:00 Zl   myproc <defunct>
 7296     1  56:04 Zl   myproc <defunct>

So how can a Zombie process accumulate CPU time?

I changed my search:

# ps -eT -o pid,lwp,ppid,bsdtime,stat,comm| grep 7296
 7296  7296     1   1:29 Zl   myproc <defunct>
 7296  8009     1  56:11 Dl   myproc

and I see that I have a thread that is still running and doing system i/o. Indeed, if I watch /proc/8009/stat, I can see field 15 (stime) changing:

# watch -d -n 1 cat /proc/8009/stat
Every 1.0s: cat /proc/8009/stat                  Fri Jun  4 11:19:55 2021

8009 (myproc) D 1 7295 7295 0 -1 516 18156428 12281 37 0 11609 344755

(trimmed at field 15)
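The same per-thread view can be read straight out of /proc. Here's a sketch (assuming the Linux /proc/<pid>/task/<tid>/stat layout, where field 3 is the state and fields 14/15 are utime/stime in clock ticks; 7296 is the PID from the output above):

import os

CLK_TCK = os.sysconf("SC_CLK_TCK")   # clock ticks per second, typically 100

def thread_times(pid):
    for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
        with open(f"/proc/{pid}/task/{tid}/stat") as f:
            data = f.read()
        # comm (field 2) can contain spaces, so split after the closing ')'
        rest = data.rsplit(")", 1)[1].split()
        state = rest[0]                    # field 3: process state
        utime = int(rest[11]) / CLK_TCK    # field 14: user time
        stime = int(rest[12]) / CLK_TCK    # field 15: system time
        print(f"tid={tid} state={state} utime={utime:.2f}s stime={stime:.2f}s")

thread_times(7296)   # the main thread shows Z; the runaway thread shows D and a growing stime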

So I attempted to kill process 8009 with SIGTERM... that didn't work. Killing it with SIGKILL is fruitless as well. That fits the D state: a task in uninterruptible sleep doesn't act on signals, even SIGKILL, until it comes back out of the kernel.

Sounds like a kernel bug to me. I did try to strace it, which was foolish because now my strace won't exit.

This is on RHEL 7.7 with kernel 3.10.0-1062. Old by now, but recent enough for me to conclude (in my mind) that a zombie process can accumulate system resources due to a bug somewhere.

By the way, according to iotop our i/o was peaking at 4 GB/s, which is a lot. I think this thing is definitely having an impact on our system and I want to reboot.

ls output of /proc/8009 returns this:

# ls -l /proc/8009
ls: cannot read symbolic link /proc/8009/cwd: No such file or directory
ls: cannot read symbolic link /proc/8009/root: No such file or directory
ls: cannot read symbolic link /proc/8009/exe: No such file or directory

(normal /proc/pid output follows... but I trimmed it)

/proc/8009/fd is empty. So even though I have a significant amount of i/o taking place, it's not writing to any files; I don't see filesystem space being used, as shown by df -h output.
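One way to confirm and quantify i/o that bypasses any open file descriptor is /proc/<pid>/io, which counts at the kernel level (needs root, and a kernel with task i/o accounting compiled in). A sketch, sampling the busy thread over ten seconds:

import time

def io_counters(pid):
    # Parse the "name: value" lines of /proc/<pid>/io into a dict.
    with open(f"/proc/{pid}/io") as f:
        return {key: int(value) for key, value in (line.split(":") for line in f)}

before = io_counters(8009)
time.sleep(10)
after = io_counters(8009)
for key in ("rchar", "wchar", "read_bytes", "write_bytes"):
    print(key, (after[key] - before[key]) / 10, "bytes/s")

Per proc(5), write_bytes counts bytes the task caused to be sent to the storage layer, so writeback the task performs shows up here even with no open file descriptors.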

Finally: trying to reboot is proving impossible. shutdown -r now is not working. There are a couple of systemd processes that are stuck in i/o wait:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
22725 root       20   0  129M  2512  1548 R  0.0  0.0  0:00.19 htop
22227 root       20   0  195M  4776  2652 D  0.0  0.0  0:00.00 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
    1 root       20   0  195M  4776  2652 D  0.0  0.0  0:58.41 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

Here's shutdown output. I'd say init is quite confused at this point:

# shutdown -r now
Failed to open /dev/initctl: No such device or address
Failed to talk to init daemon.

reboot says the same thing. I'm gonna have to pull the plug on this machine.

...Update: Just as I logged into the console, the system rebooted! It probably took about 10 minutes. So I don't know what systemd was doing but it was doing something.

...Another update: So I have 3 machines that this happened to today, all sharing the same characteristics: same binary, same sort of behavior (no open file descriptors, but i/o taking place; two threads, with the child thread accumulating CPU time). As @Stephane Chazelas suggested, I captured the kernel stack traces. Here's a typical output; I'm not very kernel-savvy, but perhaps it's of interest to some interloper in the future... note that 242603 is the parent thread, 242919 is the child that's busy:

# grep -H . /proc/242919/task/*/stack
/proc/242919/task/242603/stack:[<ffffffff898a131e>] do_exit+0x6ce/0xa50
/proc/242919/task/242603/stack:[<ffffffff898a171f>] do_group_exit+0x3f/0xa0
/proc/242919/task/242603/stack:[<ffffffff898b252e>] get_signal_to_deliver+0x1ce/0x5e0
/proc/242919/task/242603/stack:[<ffffffff8982c527>] do_signal+0x57/0x6f0
/proc/242919/task/242603/stack:[<ffffffff8982cc32>] do_notify_resume+0x72/0xc0
/proc/242919/task/242603/stack:[<ffffffff89f8c23b>] int_signal+0x12/0x17
/proc/242919/task/242603/stack:[<ffffffffffffffff>] 0xffffffffffffffff
/proc/242919/task/242919/stack:[<ffffffffc09cbb03>] ext4_mb_new_blocks+0x653/0xa20 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc09c0a36>] ext4_ext_map_blocks+0x4a6/0xf60 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc098fcf5>] ext4_map_blocks+0x155/0x6e0 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc0993cfa>] ext4_writepages+0x6da/0xcf0 [ext4]
/proc/242919/task/242919/stack:[<ffffffff899c8d31>] do_writepages+0x21/0x50
/proc/242919/task/242919/stack:[<ffffffff899bd4b5>] __filemap_fdatawrite_range+0x65/0x80
/proc/242919/task/242919/stack:[<ffffffff899bd59c>] filemap_flush+0x1c/0x20
/proc/242919/task/242919/stack:[<ffffffffc099116c>] ext4_alloc_da_blocks+0x2c/0x70 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc098a4d9>] ext4_release_file+0x79/0xc0 [ext4]
/proc/242919/task/242919/stack:[<ffffffff89a4a9cc>] __fput+0xec/0x260
/proc/242919/task/242919/stack:[<ffffffff89a4ac2e>] ____fput+0xe/0x10
/proc/242919/task/242919/stack:[<ffffffff898c1c0b>] task_work_run+0xbb/0xe0
/proc/242919/task/242919/stack:[<ffffffff898a0f24>] do_exit+0x2d4/0xa50
/proc/242919/task/242919/stack:[<ffffffff898a171f>] do_group_exit+0x3f/0xa0
/proc/242919/task/242919/stack:[<ffffffff898b252e>] get_signal_to_deliver+0x1ce/0x5e0
/proc/242919/task/242919/stack:[<ffffffff8982c527>] do_signal+0x57/0x6f0
/proc/242919/task/242919/stack:[<ffffffff8982cc32>] do_notify_resume+0x72/0xc0
/proc/242919/task/242919/stack:[<ffffffff89f8256c>] retint_signal+0x48/0x8c
/proc/242919/task/242919/stack:[<ffffffffffffffff>] 0xffffffffffffffff
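If anyone wants to repeat that check, here is a small sketch that dumps the kernel stack of every thread of a given PID (needs root; /proc/<pid>/task/<tid>/stack only exists on kernels built with stack-trace support):

import glob
import sys

def dump_kernel_stacks(pid):
    # Print the kernel stack of every thread belonging to <pid>.
    for path in sorted(glob.glob(f"/proc/{pid}/task/*/stack")):
        print(f"== {path} ==")
        try:
            with open(path) as f:
                sys.stdout.write(f.read())
        except OSError as err:
            print(err)

dump_kernel_stacks(242919)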