Great, detailed explanation. You are right that the resource footprint of
running the supervise program isn't expensive, and that runsv dying would
point to a different problem. I was just trying to simulate a crash and see
what happens; that's what I observed, and I was wondering whether we could
fine-tune each part to make it more reliable for our case. That doesn't seem
possible, but that's fine.
I am wondering how Solaris does its supervision? Its supervision
program is well known for running solidly.
On Jun 23, 2016 8:41 PM, "Laurent Bercot" <ska-supervision_at_skarnet.org>
wrote:
> On 23/06/2016 03:46, Thomas Lau wrote:
>
>> LOL, well I am trying to do a drill test and see how resilient runit
>> can be; this is one of its minor downsides.
>>
>
>  Current supervisors have no way of knowing that they died and
> their child is still running. Hence, when they start again, they attempt
> to run their child again, which will probably fail since the old instance
> of the child is still running. So, they will periodically try and start
> the child again, only to fail again, and so on.
>  On daemontools and s6, the period is 1 second. I'm not sure about runit,
> but it should be around 1s too.
>
>  Yes, it is a problem, and I don't like that behaviour much, but the
> alternatives are actually worse. Currently, the consequences of the
> issue are that when a supervisor dies and restarts:
>  - depending on the run script, the daemon's logs are flooded with error
> messages from the run script failing to exec into the daemon.
>  - Every second, some CPU is used to try and start the daemon.
>
>
>  I think those drawbacks are acceptable and trying to fix them is not a
> good idea:
>
>  - Supervisors dying without their daemons dying are an extremely rare
> occurrence, not worth special-casing unless it causes systemic,
> unrecoverable failure, which is not the case.
>  - What we'd want ideally: the new instance of the supervisor would "grab"
> the old instance of the daemon. But that is impossible under Unix, and
> any attempt to do that is doomed to use the same hacks that non-supervision
> systems use and that supervision aims to step away from.
>  - Any attempt to kill the old instance of the daemon in order to properly
> start a new supervised instance is a policy decision, which belongs to the
> admin; the supervisor program can't make that decision automatically.
>  - As is, even if the supervisor dies, the service keeps running; it's in
> "degraded mode" because the current instance isn't watched by a supervisor,
> but it's still running, and that's what's important. And if the daemon dies,
> a new, supervised instance will automatically take its place, as if the
> supervisor had never died: things will fix themselves on their own.
>  - For critical services, the log flooding should trigger an alerting
> system that will notify the admins that there's a problem, and appropriate
> action can then be taken (i.e. either do nothing or kill the current
> instance of the daemon).
>  - The periodic attempt to start a new instance of the daemon is generally
> not expensive. This is one of the reasons for the 1s respawning period: it
> gives the system time to breathe, without the "respawning too fast" problem
> that can be observed with, for instance, sysvinit. If the daemon uses a lot
> of resources before it notices it cannot succeed, that's a design issue
> in the daemon, not the supervisor; and even in that case, on critical
> machines there should be an alerting system that notices the spike in
> resource usage and notifies the admins.
>  - Attempts to handle that edge case in the supervisor itself would add a
> lot (a real whole lot) of complexity, for very uncertain benefits.
>
>  So, yeah. Even if your logs freak out, your memcached is still running,
> and that's what you want. And stop voluntarily killing your runsv for
> testing purposes: the day when your runsv accidentally dies before the
> daemon it's supervising is the day when something's seriously wrong with
> your system and you have much bigger problems than spurious log messages.
>
> --
>  Laurent
>
>
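For readers who want to picture the respawn behaviour described in the quoted
message, here is a minimal sketch in Python. It is not how runsv, daemontools
or s6 are actually implemented; it only illustrates "start the child, wait for
it to exit, and start it again no more often than once per period". The
function name supervise and the parameters min_period and max_starts are
made up for this illustration (real supervisors loop forever).

```python
import subprocess
import sys
import time


def supervise(cmd, min_period=1.0, max_starts=None):
    """Run cmd and restart it whenever it exits, at most once per min_period.

    max_starts is a hypothetical knob so the demonstration terminates;
    a real supervisor would loop forever. Returns the observed exit codes.
    """
    exit_codes = []
    while max_starts is None or len(exit_codes) < max_starts:
        started = time.monotonic()
        child = subprocess.Popen(cmd)    # start the "daemon"
        exit_codes.append(child.wait())  # block until it exits
        # If the child died quickly, sleep out the rest of the period so a
        # permanently failing run script cannot turn into a busy loop.
        remaining = min_period - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a run script that fails immediately, e.g. because an
    # unsupervised old instance of the daemon still holds the listening port.
    failing = [sys.executable, "-c", "import sys; sys.exit(111)"]
    print(supervise(failing, min_period=1.0, max_starts=3))
```

With a command that fails immediately, the loop costs one short-lived process
per period, which is the "some CPU every second" and repeated error messages
mentioned in the quoted message.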
Received on Fri Jun 24 2016 - 00:33:50 UTC