Re: runit kill runsv

From: Thomas Lau <tlau_at_tetrioncapital.com>
Date: Fri, 24 Jun 2016 08:33:50 +0800

Great, detailed explanation. You are right that the resource footprint of
running a supervision program isn't expensive, and that runsv dying would
be a separate problem. I was just trying to simulate a crash and see what
happens; that's what I observed, and I was wondering whether we could
fine-tune each part to make it more reliable for our case. That doesn't
seem possible, but that's fine.

I am wondering how Solaris does its supervision. Its supervision
program is well known for being solid.

On Jun 23, 2016 8:41 PM, "Laurent Bercot" <ska-supervision_at_skarnet.org>
wrote:

> On 23/06/2016 03:46, Thomas Lau wrote:
>
>> LOL, well, I am trying to do a drill test and see how resilient runit
>> can be; this is one of its minor downfalls.
>>
>
> Current supervisors have no way of knowing that they died and
> their child is still running. Hence, when they start again, they attempt
> to run their child again, which will probably fail since the old instance
> of the child is still running. So, they will periodically try and start
> the child again, only to fail again, and so on.
> On daemontools and s6, the period is 1 second. I'm not sure about runit,
> but it should be around 1s too.
>
> Yes, it is a problem, and I don't like that behaviour much, but the
> alternatives are actually worse. Currently, the consequences of the
> issue are that when a supervisor dies and restarts:
> - depending on the run script, the daemon's logs are flooded with error
> messages from the run script failing to exec into the daemon.
> - Every second, some CPU is used to try and start the daemon.
>
>
> I think those drawbacks are acceptable and trying to fix them is not a
> good idea:
>
> - Supervisors dying without their daemons dying are an extremely rare
> occurrence, not worth special-casing unless it causes systemic,
> unrecoverable failure, which is not the case.
> - What we'd want ideally: the new instance of the supervisor would "grab"
> the old instance of the daemon. But that is impossible under Unix, and
> any attempt to do that is doomed to use the same hacks that non-supervision
> systems use and that supervision aims to step away from.
> - Any attempt to kill the old instance of the daemon in order to properly
> start a new supervised instance is a policy decision, which belongs to the
> admin; the supervisor program can't make that decision automatically.
> - As is, even if the supervisor dies, the service keeps running; it's in
> "degraded mode" because the current instance isn't watched by a supervisor,
> but it's still running, and that's what's important. And if the daemon dies,
> a new, supervised instance will automatically take its place, as if the
> supervisor had never died: things will fix themselves on their own.
> - For critical services, the log flooding should trigger an alerting
> system that will notify the admins that there's a problem, and
> appropriate action can then be taken (i.e. either do nothing or kill the
> current instance of the daemon).
> - The periodic attempt to start a new instance of the daemon is generally
> not expensive. This is one of the reasons for the 1s respawning period: it
> gives the system time to breathe, without the "respawning too fast" problem
> that can be observed with, for instance, sysvinit. If the daemon uses a lot
> of resources before it notices it cannot succeed, that's a design issue
> in the daemon, not the supervisor; and even in that case, on critical
> machines there should be an alerting system that notices the spike in
> resource usage and notifies the admins.
> - Attempts to handle that edge case in the supervisor itself would add a
> lot (a real whole lot) of complexity, for very uncertain benefits.
>
> So, yeah. Even if your logs freak out, your memcached is still running,
> and that's what you want. And stop voluntarily killing your runsv for
> testing purposes: the day when your runsv accidentally dies before the
> daemon it's supervising is the day when something's seriously wrong with
> your system and you have much bigger problems than spurious log messages.
>
> --
> Laurent
>
>
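
The respawn behaviour described in the quoted message can be illustrated
with a rough sketch. This is not how runsv, supervise, or s6-supervise are
implemented; it is a toy loop in Python, and the names supervise,
max_spawns, and delay are made up for the example. The command it runs is
a hypothetical stand-in for a run script that fails at once because an
older, unsupervised instance of the daemon is still running.

```python
import subprocess
import sys
import time


def supervise(cmd, max_spawns, delay=1.0):
    """Run cmd repeatedly, pausing `delay` seconds after each exit.

    Returns the list of exit codes observed. A real supervisor loops
    forever; max_spawns exists only so that this sketch terminates.
    """
    exit_codes = []
    for _ in range(max_spawns):
        child = subprocess.Popen(cmd)
        exit_codes.append(child.wait())  # block until the child exits
        time.sleep(delay)                # fixed pause before respawning
    return exit_codes


if __name__ == "__main__":
    # Hypothetical stand-in for a run script that cannot start the daemon
    # because the old instance still holds its port or lock.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_spawns=3, delay=1.0))
```

With a 1 second delay, each failed attempt costs one short-lived process
per second, which is the "some CPU every second" cost mentioned above; the
fixed pause is what keeps a failing run script from respawning in a tight
loop.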
Received on Fri Jun 24 2016 - 00:33:50 UTC
