Re: s6: something like runit's ./check script

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Tue, 8 Sep 2015 15:42:57 +0200

On 08/09/2015 15:04, Jan Bramkamp wrote:
> Not if something kills the polling script e.g. stray kill -9 $WRONG_PID.

  Yeah, yeah. It's a question of risk assessment.
  We supervise long-lived processes because in the course of their
lives, they may receive a stray signal, but more likely, they may
die from a bug or a temporary error. Stray signals are actually
very rare.
  The likelihood of a short-lived process receiving a stray signal
is very, very low. If it were any higher, Unix simply would not
work, because you could never count on any process staying alive
long enough to actually perform its job. Well, that's not the case:
processes don't die on a whim - they die because they're buggy or
they're lacking resources.

  We supervise processes when the cost of not supervising them is
higher than the cost of supervising them. It makes sense for daemons.
It would not make sense for a polling process.

  Say you're incredibly unlucky and a stray signal hits your poller.
What happens then? Your daemon doesn't get killed, and it never
gets reported as ready.
  Tough.
  The same situation can happen with any daemon you don't poll for
readiness. If a daemon uses notification and gets stuck, well, it
never notifies readiness, and that's it. It doesn't get killed for
it.

  If you, as an admin, estimate that it's a risk you cannot take,
i.e. the probability of your daemon getting stuck multiplied by
the cost of the consequences is too high a number, then you should
do something about it.

  And the great thing is that you already can. Set up a listener
on the service's notification channel that kills the daemon when
too much time elapses between the 'u' and the 'U' event. Done.
Another possibility, if your daemon is critical, and you have a
poller for it: set up a long-lived monitor for your daemon, and
restart the daemon whenever the monitor's check fails, without
even using the s6 notification channel.
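
  As a rough illustration of the first option, here is a minimal
sketch of the decision such a listener would make. It is not s6
code: the function name, the (timestamp, event) tuples and the
timeout value are all hypothetical, and treating 'd' as "the daemon
went down, stop waiting" is an assumption about what else the
channel reports. Only the 'u' and 'U' events come from the text
above.

```python
from typing import Iterable, Optional, Tuple


def readiness_overdue(
    events: Iterable[Tuple[float, str]],
    now: float,
    timeout: float,
) -> bool:
    """Return True if the daemon has been up but not ready for too long.

    `events` is a chronological sequence of (timestamp, event) pairs as a
    listener might record them (hypothetical representation).
    """
    pending_since: Optional[float] = None
    for timestamp, event in events:
        if event == "u":
            # Daemon started: readiness is now pending.
            pending_since = timestamp
        elif event in ("U", "d"):
            # Ready, or down again (assumed): nothing is pending anymore.
            pending_since = None
    return pending_since is not None and now - pending_since > timeout


# Up at t=0, still no 'U' at t=11 with a 10 second budget: kill it.
print(readiness_overdue([(0.0, "u")], now=11.0, timeout=10.0))
# Up at t=0, ready at t=3: leave it alone.
print(readiness_overdue([(0.0, "u"), (3.0, "U")], now=100.0, timeout=10.0))
```

  A real listener would feed this from the service's notification
channel and, when it returns True, kill the daemon so that the
supervisor restarts it. That wiring is left out here.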

  When you have a service that's critical enough to make you want
to protect against stray signals hitting short-lived processes,
that's the kind of thing you want to do anyway. You're not going
to rely on a ./check poll at the beginning of the run script.

  For everything else, a short-lived background process is more
than enough.

-- 
  Laurent
Received on Tue Sep 08 2015 - 13:42:57 UTC
