Re: s6-rc transition failures

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Thu, 15 Jun 2017 19:44:16 +0000

>I am facing questions regarding the way to correctly handle
>transition failures with s6-rc. The new permanent failure feature
>already clarifies some scenarios but I still have doubts about
>some cases. Below are two concrete examples. I would
>be happy to have remarks or suggestions about how to cope
>with them cleanly and nicely :).

  First, thank you for your mail. This is exactly the kind of feedback
that I'm looking for regarding s6-rc: identifying pain points and
usability issues.


>1. I start a longrun service with "s6-rc -u change svc". This
>service hangs and never reaches readiness notification. After
>timeout s6-rc will declare the transition a failure. But the process
>is actually running and I have no way to stop it through s6-rc.
>The only way is to issue "s6-svc -d /path/to/svc". But then I have
>the feeling I am doing something behind s6-rc's back to unblock
>the situation because s6-rc cannot handle it.

  That is a fair point. Normally, you should adjust the s6-rc
timeouts (both the global one and the service-specific one) to
make sure s6-rc does *not* time out before the service is ready -
but if there's an unexpected significant delay, the situation can
happen.
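
  As a rough sketch of the two knobs: the service-specific timeout
goes in a timeout-up (or timeout-down) file, in milliseconds, inside
the service's source definition directory, and the global one is
passed with s6-rc -t. The directory, the service name "svc" and the
values below are made-up examples, and the database has to be
recompiled and the live state updated before the files take effect.

```shell
#!/bin/sh
# Sketch only: SRC, the service name "svc" and the values are hypothetical.
# All timeouts are expressed in milliseconds.
SRC=$(mktemp -d)   # stand-in for your real s6-rc source directory

mkdir -p "$SRC/svc"
echo 60000 > "$SRC/svc/timeout-up"     # give svc 60 s to come up and be ready
echo 10000 > "$SRC/svc/timeout-down"   # give svc 10 s to go down

# Once the database has been recompiled (s6-rc-compile) and made live
# (s6-rc-update), a global timeout larger than the per-service one keeps
# s6-rc itself from giving up first:
#   s6-rc -t 120000 -u change svc
```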

  In general it does not matter that s6-rc is unaware that a service
is up: when s6-rc reports a (temporary) transition failure, the
expected user action is to run the command again. s6-rc then picks
up the correct service states during its second execution.
  But if you're running an "s6-rc -d change" operation right after a
transition failure, it is true that states could become inconsistent.
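
  The "run the command again" advice could be scripted along these
lines. This is only a sketch: retry is a hypothetical helper, not part
of s6-rc, and it does not try to tell a temporary failure from a
permanent one, so keep the attempt count small.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]
# Run CMD until it succeeds, at most MAX times.
# Returns 0 on success, otherwise the exit code of the last attempt.
retry() {
  max=$1; shift
  n=1
  while :; do
    "$@"
    rc=$?
    [ "$rc" -eq 0 ] && return 0        # transition succeeded
    [ "$n" -ge "$max" ] && return "$rc"  # out of attempts
    n=$((n + 1))
  done
}

# Example (assumes a service named "svc"):
#   retry 3 s6-rc -u change svc
```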

  What I can do is add an option to s6-rc to make it explicitly send
an "s6-svc -d" to a service that times out before reaching readiness,
so that a service is either ready in time or definitely down.
Would that help?
  The annoying thing is it can't be symmetrical: when a down
transition times out, there's no way I'm going to start the service
again. :) But generally, a down transition timing out signifies a
badly written finish script, or badly calibrated timeouts, and
it can be easily solved by running s6-rc -d change again.
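
  Until such an option exists, the same "ready in time, or definitely
down" behaviour can be approximated by hand. A minimal sketch, assuming
you know the service directory of the longrun (the "/path/to/svc" from
your mail); up_or_down and the S6RC/S6SVC overrides are hypothetical
names used for illustration only:

```shell
#!/bin/sh
# up_or_down NAME SVCDIR
# Try to bring NAME up through s6-rc. If the transition fails (for
# instance on a readiness timeout), ask the supervisor to bring the
# process down so it does not keep running unknown to s6-rc.
# S6RC and S6SVC may be overridden, e.g. for a dry run.
up_or_down() {
  name=$1
  svcdir=$2
  ${S6RC:-s6-rc} -u change "$name" && return 0
  rc=$?                          # exit code of the failed s6-rc call
  ${S6SVC:-s6-svc} -d "$svcdir"  # make sure the service is really down
  return "$rc"
}

# Example:
#   up_or_down svc /path/to/svc
```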


>2. Slightly related, I have an issue with system shutdown. I am
>working on a buildroot system and specifically I use the
>/etc/rc.tini which can be found here [1] and which is executed
>as part of the shutdown sequence of the system. The problem
>is with the invocation of "s6-rc -b -da change" (I added the -b).
>If there is already an s6-rc ongoing, the shutdown sequence will
>be blocked until the first s6-rc times out. And this kind of timeout
>is of the order of minutes as I have slow services depending
>on each other. I currently think the best thing to do is to
>"killall s6-rc" before calling "s6-rc -ad change".

  Yes. Since the state is global, it makes sense to refuse to start
a state change while another one is taking place, unless you're willing
to abort the ongoing operation by explicitly killing the running
s6-rc process.


>This leaves a little
>race condition possible, but more importantly, I have concerns
>about killing an ongoing s6-rc. This will leave longrun services
>in the middle of a state transition - there is the connection with
>the first scenario - and I expect the final effect is that the
>finish script will not be executed before the system goes down,
>which is precisely what I want to happen when I call
>"s6-rc -ad change". Secondly, I do not know what effect this will
>have on oneshots. I fear "/etc/init.d/S98xxx start" will still be
>running and "/etc/init.d/S98xxx stop" will be executed - the thought
>of which horrifies me beyond reasoning.

  And that's exactly why there's a lock preventing several state
changes from running concurrently. :)
  What I can do is add a bit of signal handling to s6-rc, so that if
it gets interrupted, say with a SIGINT or SIGTERM, it exits ASAP,
while still ensuring consistency of the service states.

  Unfortunately, for oneshots it would mean waiting for the current
transitions to finish before exiting - s6-rc has no way to interrupt
a running oneshot, and adding one (making s6rc-oneshot-runner kill
all its children) would not help, because until the oneshot script
exits, it is not visible from the outside whether it has accomplished
its transition or not - so the state would still be undetermined.

  Also, state consistency cannot be 100% ensured, because s6-rc could
still receive a SIGKILL - but if you kill -9 s6-rc, you deserve
trouble.
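
  To make the idea concrete, here is the general pattern in shell
form. It is not s6-rc code, only an illustration of "exit ASAP but
stay consistent": the signal handler merely sets a flag, the step in
progress is allowed to finish and its outcome is recorded, and nothing
new is started afterwards. run_steps and the state file are made up
for the example.

```shell
#!/bin/sh
# Pattern sketch, not s6-rc code: on SIGINT/SIGTERM only set a flag,
# let the step in progress finish, record its result, then stop.
stop_requested=0
trap 'stop_requested=1' INT TERM

# run_steps STATEFILE CMD...
# Run each CMD in order, appending "CMD ok" or "CMD failed" to
# STATEFILE once its outcome is known. Returns 1 if stopped early.
run_steps() {
  statefile=$1; shift
  for step in "$@"; do
    if [ "$stop_requested" -eq 1 ]; then
      return 1    # interrupted: remaining steps are left untouched
    fi
    if "$step"; then
      echo "$step ok" >> "$statefile"
    else
      echo "$step failed" >> "$statefile"
    fi
  done
  return 0
}
```

  Note that this only covers signals sent to the script itself; a
SIGKILL cannot be trapped, which is the remaining hole mentioned above.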

  What do you think?

--
  Laurent
Received on Thu Jun 15 2017 - 19:44:16 UTC
