How should I clean up hung grandchild processes when an alarm trips in Perl?



I've read the question a few times, and I think I sort of get what you are trying to do. You have a control script. This script spawns children to do some stuff, and these children spawn the grandchildren to actually do the work. The problem is that the grandchildren can be too slow (waiting for STDIN, or whatever), and you want to kill them. Furthermore, if there is one slow grandchild, you want the entire child to die (killing the other grandchildren, if possible).

So, I tried implementing this two ways. The first was to make the parent spawn a child in a new UNIX session, set a timer for a few seconds, and kill the entire child session when the timer went off. This made the parent responsible for both the child and the grandchildren. It also didn't work right.

The next strategy was to make the parent spawn the child, and then make the child responsible for managing the grandchildren. It would set a timer for each grandchild, and kill it if the process hadn't exited by expiration time. This works great, so here is the code.

We'll use EV to manage the children and timers, and AnyEvent for the API. (You can try another AnyEvent event loop, like Event or POE. But I know that EV correctly handles the condition where a child exits before you tell the loop to monitor it, which eliminates annoying race conditions that other loops are vulnerable to.)

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature ':5.10';
    use AnyEvent;
    use EV; # you need EV for the best child-handling abilities

We need to keep track of the child watchers:

    # active child watchers
    my %children;

Then we need to write a function to start the children. The things the parent spawns are called children, and the things the children spawn are called jobs.

    sub start_child($$@) {
        my ($on_success, $on_error, @jobs) = @_;

The arguments are a callback to be called when the child completes successfully (meaning its jobs were also a success), a callback for when the child did not complete successfully, and then a list of coderef jobs to run.

In this function, we need to fork. In the parent, we set up a child watcher to monitor the child:

        if(my $pid = fork){ # parent
            # monitor the child process, inform our callback of error or success
            say "$$: Starting child process $pid";
            $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
                my ($pid, $status) = @_;
                delete $children{$pid};
                say "$$: Child $pid exited with status $status";
                if($status == 0){
                    $on_success->($pid);
                }
                else {
                    $on_error->($pid);
                }
            });
        }

In the child, we actually run the jobs. This involves a little bit of setup, though.

First, we forget the parent's child watchers, because it doesn't make sense for the child to be informed of its siblings exiting. (Fork is fun, because you inherit all of the parent's state, even when that makes no sense at all.)

        else { # child
            # kill the inherited child watchers
            %children = ();
            my %timers;

We also need to know when all the jobs are done, and whether or not they were all a success. We use a counting condition variable (condvar) to determine when everything has exited. We increment on startup and decrement on exit, and when the count is 0, we know everything's done.

I also keep a boolean around to indicate error state. If a process exits with a non-zero status, error goes to 1. Otherwise, it stays 0. You might want to keep more state than this :)

            # then start the kids
            my $done = AnyEvent->condvar;
            my $error = 0;
            $done->begin;

(We also start the count at 1 so that if there are 0 jobs, our process still exits.)

Now we need to fork for each job, and run the job. In the parent, we do a few things. We increment the condvar. We set a timer to kill the child if it's too slow. And we set up a child watcher, so we can be informed of the job's exit status.

            for my $job (@jobs) {
                if(my $pid = fork){
                    say "[c] $$: starting job $job in $pid";
                    $done->begin;

                    # this is the timer that will kill the slow children
                    $timers{$pid} = AnyEvent->timer( after => 3, interval => 0, cb => sub {
                        delete $timers{$pid};
                        say "[c] $$: Killing $pid: too slow";
                        kill 9, $pid;
                    });

                    # this monitors the children and cancels the timer if
                    # it exits soon enough
                    $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
                        my ($pid, $status) = @_;
                        delete $timers{$pid};
                        delete $children{$pid};
                        say "[c] [j] $$: job $pid exited with status $status";
                        $error ||= ($status != 0);
                        $done->end;
                    });
                }

Using the timer is a little bit easier than alarm, since it carries state with it. Each timer knows which process to kill, and it's easy to cancel the timer when the process exits successfully -- we just delete it from the hash.

That's the parent (of the child). The child (of the child; or the job) is really simple:

                else {
                    # run kid
                    $job->();
                    exit 0; # just in case
                }

You could also close STDIN here, if you wanted to.

Now, after all the processes have been spawned, we wait for them to all exit by waiting on the condvar. The event loop will monitor the children and timers, and do the right thing for us:

            } # this is the end of the for @jobs loop

            $done->end;

            # block until all children have exited
            $done->recv;

Then, when all the children have exited, we can do whatever cleanup work we want, like:

            if($error){
                say "[c] $$: One of your children died.";
                exit 1;
            }
            else {
                say "[c] $$: All jobs completed successfully.";
                exit 0;
            }
        } # end of "else { # child"
    } # end of start_child

OK, so that's the child and grandchild/job. Now we just need to write the parent, which is a lot easier.

Like the child, we are going to use a counting condvar to wait for our children.

    # main program
    my $all_done = AnyEvent->condvar;

We need some jobs to do. Here's one that is always successful, and one that will be successful if you press return, but will fail if you just let it be killed by the timer:

    my $good_grandchild = sub {
        exit 0;
    };

    my $bad_grandchild = sub {
        my $line = <STDIN>;
        exit 0;
    };

So then we just need to start the child jobs. If you remember way back to the top of start_child, it takes two callbacks: a success callback and an error callback. We'll set those up; the error callback will print "not ok" and decrement the condvar, and the success callback will print "ok" and do the same. Very simple.

    my $ok  = sub { $all_done->end; say "$$: $_[0] ok" };
    my $nok = sub { $all_done->end; say "$$: $_[0] not ok" };

Then we can start a bunch of children with even more grandchildren jobs:

    say "starting...";
    $all_done->begin for 1..4;
    start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild);
    start_child $ok, $nok, ($good_grandchild, $good_grandchild, $bad_grandchild);
    start_child $ok, $nok, ($bad_grandchild, $bad_grandchild, $bad_grandchild);
    start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild, $good_grandchild);

Two of those will time out, and two will succeed. If you press enter while they're running, though, then they might all succeed.

Anyway, once those have started, we just need to wait for them to finish:

    $all_done->recv;
    say "...done";
    exit 0;

And that's the program.

One thing that we aren't doing that Parallel::ForkManager does is "rate limiting" our forks so that only n children are running at a time. This is pretty easy to implement manually, though:

    use Coro;
    use AnyEvent::Subprocess; # better abstraction than manually
                              # forking and making watchers
    use Coro::Semaphore;

    my $job = AnyEvent::Subprocess->new(
        on_completion => sub {}, # replace later
        code          => sub { the child process },
    );

    my $rate_limit = Coro::Semaphore->new(3); # 3 procs at a time

    my @coros = map { async {
        my $guard = $rate_limit->guard;
        $job->clone( on_completion => Coro::rouse_cb )->run($_);
        Coro::rouse_wait;
    }} ({ args => 'for first job' }, { args => 'for second job' }, ... );

    # this waits for all jobs to complete
    my @results = map { $_->join } @coros;

The advantage here is that you can do other things while your children are running -- just spawn more threads with async before you do the blocking join. You also have a lot more control over the children with AnyEvent::Subprocess -- you can run the child in a Pty and feed it stdin (like with Expect), and you can capture its stdin and stdout and stderr, or you can ignore those things, or whatever. You get to decide, not some module author that's trying to make things "simple".

Anyway, hope this helps.


Brian - it's a bit crude and non-idiomatic, but one approach I've seen taken is this: anytime you fork, you:

  1. Give the child process a first "-id" dummy parameter to the program, with a somewhat unique (per PID) value - a good candidate could be an up-to-the-millisecond timestamp plus the parent's PID.

  2. The parent records the child PID and the -id value into an (ideally persistent) registry, along with the desired timeout/kill time.

Then have a watcher process (either the ultimate grandparent or a separate process with the same UID) simply cycle through the registry periodically and check which processes due to be killed (per their to-kill time) are still hanging around - by matching both the PID and the "-id" parameter value from the registry against the PIDs and command lines in the process table - and send signal 9 to each such process (or be nice and try to kill gently first by sending signal 2).

The unique "-id" parameter is obviously intended to prevent killing some innocent process that just happened to reuse a prior process's PID by coincidence, which is quite likely given the scale you mentioned.

The idea of a registry helps with the problem of "already disassociated" grandchildren, since you no longer depend on the system to keep the parent/child association for you.
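For illustration, here's a minimal sketch of one watcher cycle against an in-memory registry. The hash layout, the `should_kill` helper, and the `/proc`-based command-line lookup are all assumptions made for the sketch, not anything prescribed above (a real version would use a persistent registry and escalate from signal 2 to signal 9):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical registry: PID => { id => the unique "-id" value,
# kill_at => epoch time after which the process should be killed }.
my %registry = (
    1234 => { id => '1618033988-42', kill_at => time() + 30 },
);

# Decide whether a registry entry should be signalled now.
# $cmdline is the process's current command line (undef if the PID is gone).
sub should_kill {
    my ($entry, $cmdline, $now) = @_;
    return 0 if $now < $entry->{kill_at};    # timeout not reached yet
    return 0 unless defined $cmdline;        # process already exited
    # Only match if the "-id" value is still on the command line, so an
    # innocent process that recycled the PID is left alone.
    return $cmdline =~ /-id\s+\Q$entry->{id}\E/ ? 1 : 0;
}

# Slurp a process's command line from /proc (Linux); args are NUL-separated.
sub read_cmdline {
    my ($pid) = @_;
    open my $fh, '<', "/proc/$pid/cmdline" or return undef;
    local $/;
    my $cmd = <$fh>;
    $cmd =~ tr/\0/ / if defined $cmd;
    return $cmd;
}

# One watcher cycle: be nice first (signal 2); a later pass could send 9.
while (my ($pid, $entry) = each %registry) {
    kill 'INT', $pid if should_kill($entry, read_cmdline($pid), time());
}
```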

This is kind of brute force, but since nobody has answered yet, I figured I'd throw my 3 cents' worth of an idea your way.


I have to solve this same problem in a module I've been working on. I'm not completely satisfied with all of my solution(s) either, but what generally works on Unix is to

  1. change a child's process group
  2. spawn grandchildren as necessary
  3. change the child's process group again (say, back to its original value)
  4. signal the grandchildren's process group to kill the grandchildren

Something like:

    use Time::HiRes qw(sleep);
    sub be_sleepy { sleep 2 ** (5 * rand()) }
    $SIGINT = 2;

    for (0 .. $ARGV[1]) {
        print ".";
        print "\n" unless ++$count % 50;
        if (fork() == 0) {
            # a child process
            # $ORIGINAL_PGRP and $NEW_PGRP should be global or package or object level vars
            $ORIGINAL_PGRP = getpgrp(0);
            setpgrp(0, $$);
            $NEW_PGRP = getpgrp(0);
            local $SIG{ALRM} = sub {
                kill_grandchildren();
                die "$$ timed out\n";
            };
            eval {
                alarm 2;
                while (rand() < 0.5) {
                    if (fork() == 0) {
                        be_sleepy();
                    }
                }
                be_sleepy();
                alarm 0;
                kill_grandchildren();
            };
            exit 0;
        }
    }

    sub kill_grandchildren {
        setpgrp(0, $ORIGINAL_PGRP);
        kill -$SIGINT, $NEW_PGRP;   # or  kill $SIGINT, -$NEW_PGRP
    }

This isn't completely foolproof. The grandchildren might change their process groups or trap signals.

None of this will work on Windows, of course, but let's just say that TASKKILL /F /T is your friend.


Update: This solution doesn't handle (for me, anyway) the case where the child process invokes system "perl -le '<STDIN>'". For me, this immediately suspends the process, and prevents the SIGALRM from firing and the SIGALRM handler from running. Is closing STDIN the only workaround?
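For what it's worth, a minimal sketch of that STDIN workaround: instead of closing STDIN outright, the child can reopen it from /dev/null before spawning anything. A descendant that reads STDIN then gets EOF immediately, rather than blocking (or, when it's in a background process group reading from the terminal, being stopped with SIGTTIN - which would match the "immediately suspends" symptom). This is a sketch of one possible fix, not a claim that it covers every case:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# In the child, detach STDIN from the terminal before running jobs.
# Reads of <STDIN> now return undef (EOF) at once instead of blocking.
open STDIN, '<', '/dev/null' or die "can't reopen STDIN: $!";

my $line = <STDIN>;    # undef right away: EOF, no blocking, no SIGTTIN
print defined $line ? "read: $line" : "EOF on STDIN\n";
```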