Handling sub-supervisor failures in nested supervision trees from Demi Marie Obenour on 2025-09-16 (supervision)

From: Demi Marie Obenour <demiobenour_at_gmail.com>
Date: Tue, 16 Sep 2025 14:00:22 -0400

I have been trying to figure out how to handle failures of
sub-supervisors in nested supervision trees. Right now, it seems that
if a sub-supervisor (like s6-supervise) dies, its supervisor (like
s6-svscan) will respawn it, but the respawned s6-supervise won't know
whether the job it was supposed to spawn is still running. This means
that it must either risk spawning a second instance or never restart
the job, neither of which is good.

One workaround is for the sub-supervisor and the process it supervises
to share a process group. The sub-supervisor and its parent can both
send signals to the entire group, and can wait on child processes in
that group to finish. The parent can kill the entire process group if
the sub-supervisor dies, and wait for all the processes in it to exit
before respawning the sub-supervisor. This does mean that the child
will wind up sending itself a SIGKILL if it is done with its job.
However, this can actually be okay, because the parent only re-spawns
the sub-supervisor after waiting for all of its children.
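
A minimal Python sketch of the process-group approach described above.
The helper names spawn_group and kill_group_and_reap are hypothetical,
and the forked child is only a stand-in for a real sub-supervisor such
as s6-supervise. Note that waitpid() only reaps the caller's own
children, so the parent here reaps the sub-supervisor but not the job;
the job is reaped by whichever process it gets reparented to.

```python
import os
import signal


def spawn_group(job_argv):
    """Fork a stand-in sub-supervisor that leads a new process group and
    runs one job inside it. Returns the group id (the stand-in's pid)."""
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)  # become leader of a fresh process group
            job = os.fork()
            if job == 0:
                try:
                    os.execvp(job_argv[0], job_argv)  # job inherits the group
                finally:
                    os._exit(127)  # only reached if exec failed
            os.waitpid(job, 0)
        finally:
            os._exit(0)
    try:
        # Also set the group from the parent side, so the group exists
        # before anyone tries to signal it.
        os.setpgid(pid, pid)
    except OSError:
        pass  # the child already changed its own group, or is gone
    return pid


def kill_group_and_reap(pgid):
    """SIGKILL every process in the group, then reap those group members
    that are direct children of the caller."""
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already empty
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-pgid, 0)  # any child in that group
        except ChildProcessError:
            break  # no children left in the group
        reaped.append((pid, status))
    return reaped
```

For example, kill_group_and_reap(spawn_group(["sleep", "60"])) kills
both the stand-in sub-supervisor and the sleep, and returns the
stand-in's pid together with its wait status.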

Unfortunately, this does not work for nested supervision trees. I did
figure out a very ugly workaround, but it requires support from init or
(possibly) the (Linux-specific) prctl(PR_SET_CHILD_SUBREAPER).
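
The workaround itself is not described here, so the following Python
sketch only illustrates what prctl(PR_SET_CHILD_SUBREAPER) does: once
a process marks itself as a subreaper, orphaned descendants are
reparented to it instead of to init, so it can wait on them. The
function names are hypothetical, the constant 36 is taken from
<linux/prctl.h>, and the code assumes Linux 3.4 or later.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Mark the calling process as a child subreaper (Linux-specific)."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def orphan_is_reparented_to_us():
    """Double-fork so that a grandchild is orphaned, then report whether
    this process is able to wait on it."""
    r, w = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Stand-in sub-supervisor: start a job, report its pid, then die.
        try:
            os.close(r)
            job = os.fork()
            if job == 0:
                time.sleep(0.2)  # outlive the stand-in sub-supervisor
                os._exit(0)
            os.write(w, str(job).encode())
        finally:
            os._exit(0)
    os.close(w)
    job_pid = int(os.read(r, 32))
    os.close(r)
    os.waitpid(middle, 0)  # the stand-in sub-supervisor is gone now
    try:
        pid, _ = os.waitpid(job_pid, 0)  # works only if we adopted the job
    except ChildProcessError:
        return False  # the orphan went to init (or another subreaper)
    return pid == job_pid
```

Calling become_subreaper() first is what makes the second waitpid()
succeed; without it, the orphaned job is not a child of the caller.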

How do other projects handle this? The best solution I can think of is
to use control groups, which are Linux-specific but are a perfect fit
for the job. Non-Linux systems don't allow replacing init and don't
provide prctl(PR_SET_CHILD_SUBREAPER), so the trick I came up with
doesn't work anyway.
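
As a rough illustration of why control groups fit, here is a Python
sketch that uses the cgroup v2 interface files cgroup.procs,
cgroup.events and cgroup.kill (the last one needs Linux 5.14 or
later). The function names are hypothetical, and the cgroup directory
is an assumption: the caller must already have created it, for example
somewhere under /sys/fs/cgroup, and placed the sub-supervisor in it.

```python
import os


def cgroup_pids(cgroup_dir):
    """Return the pids currently in the cgroup, read from cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def cgroup_is_populated(cgroup_dir):
    """True if cgroup.events reports live processes in the cgroup or in
    any of its descendants."""
    with open(os.path.join(cgroup_dir, "cgroup.events")) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "populated":
                return fields[1] == "1"
    return False


def kill_cgroup(cgroup_dir):
    """Ask the kernel to SIGKILL every process in the cgroup and in its
    descendant cgroups by writing "1" to cgroup.kill."""
    with open(os.path.join(cgroup_dir, "cgroup.kill"), "w") as f:
        f.write("1")
```

Because membership follows the processes no matter how they are
reparented, a supervisor could call kill_cgroup() and then watch
cgroup_is_populated() before respawning a dead sub-supervisor.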

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

Received on Tue Sep 16 2025 - 20:00:22 CEST
