How to run s6-svscan as process 1

Since 2015-06-17, if you're a Linux user, you can use the s6-linux-init package to help you do so! Please read this documentation page first, though; it will help you understand what s6-linux-init does.

It is possible to run s6-svscan as process 1, i.e. the init process. However, that does not mean you can directly boot on s6-svscan; that little program cannot do everything your stock init does. Replacing the init process requires a bit of understanding of what is going on.

The three stages of init

Okay, it's actually four, but the fourth stage is an implementation detail that users don't care about, so we'll stick with three.

The life of a Unix machine has three stages. Yes, three.

The early initialization phase. It starts when the kernel launches the first userland process, traditionally called init. During this phase, init is the only lasting process; its duty is to prepare the machine for the start of other long-lived processes, i.e. services. Work such as mounting filesystems, setting the system clock, etc. can be done at this point. This phase ends when process 1 launches its first services.
The cruising phase. This is the "normal", stable state of an up and running Unix machine. Early work is done, and init launches and maintains services, i.e. long-lived processes such as gettys, the ssh server, and so on. During this phase, init's duties are to reap orphaned zombies and to supervise services - also allowing the administrator to add or remove services. This phase ends when the administrator requires a shutdown.
The shutdown phase. Everything is cleaned up, services are stopped, filesystems are unmounted, the machine is getting ready to be halted. At the end of this phase, all processes are killed, first with a SIGTERM, then with a SIGKILL (to catch processes that resist SIGTERM). The only process that survives is process 1; if this process is s6-svscan and its scandir is not empty, then the supervision tree is restarted.
The hardware shutdown phase. The system clock is stored, filesystems are unmounted, and the system call that reboots the machine or powers it off is called.

Unless you're implementing a shutdown procedure over a supervision tree, you can absolutely consider that the hardware shutdown is part of stage 3.

As you can see, process 1's duties are radically different from one stage to the next, and init has the most work when the machine is booting or shutting down, which means a normally negligible fraction of the time it is up. The only common thing is that at no point is process 1 allowed to exit.

Still, all common init systems insist that the same init executable must handle these three stages. From System V init to launchd, via busybox init, you name it - one init program from bootup to shutdown. No wonder those programs, even basic ones, seem complex to write and complex to understand!

Even the runit program, designed with supervision in mind, remains as process 1 all the time; at least runit makes things simple by clearly separating the three stages and delegating every stage's work to a different script that is not run as process 1. (Since runit does not distinguish between stage 3 and stage 4, it needs very careful handling of the kill -9 -1 part of stage 3: getting /etc/runit/3 killed before it unmounts the filesystems would be bad.)

One init to rule them all? It ain't necessarily so!

The role of s6-svscan

init does not have the right to die, but fortunately, it has the right to execve()! During stage 2, why use precious RAM, or at best, swap space, to store data that are only relevant to stages 1 or 3-4? It only makes sense to have an init process that handles stage 1, then executes into an init process that handles stage 2, and when told to shut down, this "stage 2" init executes into a "stage 3" init which just performs shutdown. Just as runit does with the /etc/runit/[123] scripts, but exec'ing the scripts as process 1 instead of forking them.
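The fact that execve() replaces the program while keeping the process can be checked from any shell. The following is a minimal sketch, not a real init: the "stage 1" and "stage 2" labels are stand-ins, the function name is hypothetical, and the only thing it shows is that both stages report the same PID.

```shell
#!/bin/sh
# chain_demo: run a stand-in "stage 1" that execs into a stand-in "stage 2".
# Each stage prints its own PID; exec replaces the program but keeps the process.
chain_demo() {
  sh -c 'echo "stage 1 pid: $$"; exec sh -c "echo \"stage 2 pid: \$\$\""'
}

chain_demo
```

If process 1 did the same thing, the kernel would never see init exit: it would only see the same process running a different program.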

It becomes clear now that s6-svscan is perfectly suited to exactly fulfill process 1's role during stage 2.

It does not die
The reaper takes care of every zombie on the system
The scanner keeps services alive
It can be sent commands via the s6-svscanctl interface
It execs into a given script when told to

However, an init process for stage 1 and another one for stage 3 are still needed. Fortunately, those processes are very easy to design! The only difficulty here is that they're heavily system-dependent, so it's not possible to provide a stage 1 init and a stage 3 init that will work everywhere. s6 was designed to be as portable as possible, and it should run on virtually every Unix platform; but outside of stage 2 is where portability stops.

The s6-linux-init package provides a tool, s6-linux-init-maker, to automatically create a suitable stage 1 init (so, the /sbin/init binary) for Linux. It is also possible to write similar tools for other operating systems, but the details are heavily system-dependent.

For the adventurous and people who need to do this by hand, though, here are some general design tips.

How to design a stage 1 init

What stage 1 init must do

Prepare an initial scan directory, say in /run/service, with a few vital services, such as s6-svscan's own logger, and an early getty (in case debugging is needed). That implies mounting a read-write filesystem, creating it in RAM if needed, if the root filesystem is read-only.
Either perform all the one-time initialization, as stage 1 runit does;
or fork a process that will perform most of the one-time initialization once s6-svscan is in charge.
Be extremely simple and not fail, because recovery is almost impossible here.
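As an illustration of the first task, here is a rough sketch of a helper that lays out a minimal early scan directory. It makes several assumptions: the function name is hypothetical, the directory is passed as a parameter so that the sketch can be tried in a scratch location instead of /run/service, and the content of the logger's run script (including the /run/uncaught-logs log directory) is only one plausible choice. A real stage 1 would first mount the RAM filesystem that holds the directory.

```shell
#!/bin/sh
# prepare_scandir DIR: create a minimal early scan directory under DIR.
# In a real stage 1, DIR would live on a freshly mounted read-write filesystem.
prepare_scandir() {
  scandir=$1
  mkdir -p "$scandir/.s6-svscan" "$scandir/s6-svscan-log" || return 1

  # FIFO that will later carry s6-svscan's stdout and stderr to the logger.
  [ -p "$scandir/s6-svscan-log/fifo" ] ||
    mkfifo -m 0600 "$scandir/s6-svscan-log/fifo" || return 1

  # Run script for the catch-all logger (assumed content, adapt to your system).
  cat > "$scandir/s6-svscan-log/run" <<EOF
#!/bin/sh
exec s6-log t /run/uncaught-logs < "$scandir/s6-svscan-log/fifo"
EOF
  chmod 0755 "$scandir/s6-svscan-log/run"
}

# Try it in a scratch directory rather than on the live system.
scratch=$(mktemp -d)
prepare_scandir "$scratch/service"
ls -A "$scratch/service"
```

An early getty would be added to the scan directory in the same way, as one more service directory with its own run script.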

Unlike the /etc/runit/1 script, an init-stage1 script running as process 1 has nothing to back it up, and if it fails and dies, the machine crashes. Does that mean the runit approach is better? It's certainly safer, but not necessarily better, because init-stage1 can be made extremely small, to the point it is practically failproof, and if it fails, it means something is so wrong that you would have had to reboot the machine with init=/bin/sh anyway.

To make init-stage1 as small as possible, only this realization is needed: you do not need to perform all of the one-time initialization tasks before launching s6-svscan. Actually, once init-stage1 has made it possible for s6-svscan to run, it can fork a background "init-stage2" process and exec into s6-svscan immediately! The "init-stage2" process can then pursue the one-time initialization, with a big advantage over the "init-stage1" process: s6-svscan is running, as well as a few vital services, and if something bad happens, there's a getty for the administrator to log on. No need to play fancy tricks with /dev/console anymore! Yes, the theoretical separation in 3 stages is a bit more flexible in practice: the "stage 2" process 1 can be already running when a part of the "stage 1" one-time tasks are still being run.

Of course, that means that the scan directory is still incomplete when s6-svscan first starts, because most services can't yet be run, for lack of mounted filesystems, network etc. The "init-stage2" one-time initialization script must populate the scan directory when it has made it possible for all wanted services to run, and trigger the scanner. Once all the one-time tasks are done, the scan directory is fully populated and the scanner has been triggered, the machine is fully operational and in stage 2, and the "init-stage2" script can die.

Is it possible to write stage 1 init in a scripting language?

It is very possible, and if you are attempting to write your own stage 1, I definitely recommend it. If you are using s6-svscan as stage 2 init, stage 1 init should be simple enough that it can be written in any scripting language you want, just as /etc/runit/1 is if you're using runit. And since it should be so small, the performance impact will be negligible, while maintainability is enhanced. Definitely make your stage 1 init a script.

Of course, most people will use the shell as scripting language; however, I advocate the use of execline for this, and not only for the obvious reasons. Piping s6-svscan's stderr to a logging service before said service is even up requires some tricky fifo handling that execline can do and the shell cannot.

How to design a stage 3-4 init

If you're using s6-svscan as stage 2 init on /run/service, then stage 3 init is naturally the /run/service/.s6-svscan/finish program. Of course, /run/service/.s6-svscan/finish can be a symbolic link to anything else; just make sure it points to something in the root filesystem (unless your program is an execline script, in which case it is not even necessary).

What stage 3-4 init must do

Destroy the supervision tree and stop all services
Kill all processes save itself, first gently, then harshly, and reap all the zombies.
Up until that point we were in stage 3; now we're in stage 4.
Unmount all the filesystems
Halt or reboot the machine, depending on what root asked for

This is seemingly very simple, even simpler than stage 1, but experience shows that it's trickier than it looks.

One tricky part is the kill -9 -1 operation at the end of stage 3: you must make sure that process 1 regains control and keeps running after it, because it will be the only process left alive. If you are running a stage 3 script as process 1, it is almost automatic: your script survives the kill and continues running, up into stage 4. If you are using another model, the behaviour becomes system-dependent: your script may or may not survive the kill, so on systems where it does not, you will have to design a way to regain control in order to accomplish stage 4 tasks.
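The "first gently, then harshly" sequence can be sketched as a small helper. This is only an illustration under stated assumptions: the function name and the grace period are hypothetical, and the demonstration targets a single stand-in child that ignores SIGTERM, whereas a real stage 3 would address every process, as in the kill -9 -1 operation mentioned above.

```shell
#!/bin/sh
# term_then_kill TARGET GRACE: send SIGTERM to TARGET, wait GRACE seconds,
# then send SIGKILL to catch whatever resisted SIGTERM.
term_then_kill() {
  kill -TERM "$1" 2>/dev/null
  sleep "$2"
  kill -KILL "$1" 2>/dev/null
  return 0
}

# Stand-in for a stubborn process: a child shell that ignores SIGTERM.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
victim=$!
sleep 1                        # let the child install its trap

term_then_kill "$victim" 1
wait "$victim" 2>/dev/null     # reap the zombie, as stage 3 must do
kill -0 "$victim" 2>/dev/null || echo "stubborn child is gone"
```

The final wait matters as much as the signals: once everything else is dead, process 1 is the only one left to reap the zombies.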

Another tricky part, that is only apparent with practice, is solidity. It is even more vital that nothing fails during stages 3 and 4 than it is in stage 1, because in stage 1, the worst that can happen is that the machine does not boot, whereas in stages 3 and 4, the worst that can happen is that the machine does not shut down, and that is a much bigger issue.

For these reasons, I now recommend not tearing down the supervision tree for stages 3-4. It is easier to work in a stable environment, as a regular process, than it is to manage a whole shutdown sequence as pid 1: the presence of s6-svscan as pid 1, and of a working supervision tree, is a pillar you can rely on, and with experience I find it a good idea to keep the supervision infrastructure running until the end. Of course, that requires the scandir, and the active supervision directories, to be on a RAM filesystem such as tmpfs; that is good policy anyway.

Is it possible to write stage 3 init in a scripting language?

Yes, definitely, just like stage 1.

However, you really should leave /run/service/.s6-svscan/finish (and the other scripts in /run/service/.s6-svscan) alone, and write your shutdown sequence without dismantling the supervision tree. You will still have to stop most of the services, but s6-svscan should stay. For a more in-depth study of what to do in stages 3-4 and how to do it, you can look at the source of s6-linux-init-shutdownd in the s6-linux-init package.

How to log the supervision tree's messages

When the Unix kernel launches your (stage 1) init process, it does it with descriptors 0, 1 and 2 open and reading from or writing to /dev/console. This is okay for the early boot: you actually want early error messages to be displayed to the system console. But this is not okay for stage 2: the system console should only be used to display extremely serious error messages such as kernel errors, or errors from the logging system itself; everything else should be handled by the logging system, following the logging chain mechanism. The supervision tree's messages should go to the catch-all logger instead of the system console. (And the console should never be read, so no program should run with /dev/console as stdin, but this is easy enough to fix: s6-svscan will be started with stdin redirected from /dev/null.)

The catch-all logger is a service, and we want every service to run under the supervision tree. Chicken and egg problem: before starting s6-svscan, we must redirect s6-svscan's output to the input of a program that will only be started once s6-svscan is running and can start services.

There are several solutions to this problem, but the simplest one is to use a FIFO, a.k.a. named pipe. s6-svscan's stdout and stderr can be redirected to a named pipe before s6-svscan is run, and the catch-all logger service can be made to read from this named pipe. Only two minor problems remain:

If s6-svscan or s6-supervise writes to the FIFO before there is a reader, i.e. before the catch-all logging service is started, the write will fail (and a SIGPIPE will be emitted). This is not a real issue for an s6 installation because s6-svscan and s6-supervise ignore SIGPIPE, and they only write to their stderr if an error occurs; and if an error occurs before they are able to start the catch-all logger, this means that the system is seriously damaged (as if an error occurs during stage 1) and the only solution is to reboot with init=/bin/sh anyway.
Normal Unix semantics do not allow a writer to open a FIFO before there is a reader: if there is no reader when the FIFO is opened for writing, the open() system call blocks until a reader appears. This is obviously not what we want: we want to be able to actually start s6-svscan with its stdout and stderr pointing to the logging FIFO, even without a reader process, and we want it to run normally so it can start the logging service that will provide such a reader process.
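The first point can be observed with an ordinary pipe standing in for the FIFO. This is only an illustration under that substitution, with a hypothetical function name and marker file: a writer that ignores SIGPIPE sees its write fail but keeps running, which is the behaviour the text relies on for s6-svscan and s6-supervise.

```shell
#!/bin/sh
# survives_failed_write MARKER: write to a pipe whose reader is already gone,
# with SIGPIPE ignored, then create MARKER to show the writer is still alive.
survives_failed_write() {
  (
    trap '' PIPE                      # ignore SIGPIPE
    sleep 1                           # give the reader time to exit
    echo "error message" 2>/dev/null  # the write fails instead of killing us
    : > "$1"                          # only reached if the writer survived
  ) | true
}

tmp=$(mktemp -d)
survives_failed_write "$tmp/alive"
[ -e "$tmp/alive" ] && echo "writer survived the failed write"
rm -rf "$tmp"
```

Without the trap, the default action of SIGPIPE would terminate the writer before it reached the last command.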

This second point cannot be solved in a shell script, and that is why you are discouraged from writing your stage 1 init script in the shell language: you cannot properly set up a FIFO output for s6-svscan without resorting to horrible and unreliable hacks involving a temporary background FIFO reader process.

Instead, you are encouraged to use the execline language - or, at least, the redirfd command, which is part of the execline distribution. The redirfd command does just the right amount of trickery with FIFOs for you to be able to properly redirect process 1's stdout and stderr to the logging FIFO without blocking: redirfd -w 1 /run/service/s6-svscan-log/fifo blocks if there's no process reading on /run/service/s6-svscan-log/fifo, but redirfd -wnb 1 /run/service/s6-svscan-log/fifo does not.

This trick with FIFOs can even be used to avoid potential race conditions in the one-time initialization script that runs in stage 2. If forked from init-stage1 right before executing s6-svscan, depending on the scheduler's mood, this script may actually run a long way before s6-svscan is actually executed and running the initial services - and may do dangerous things, such as writing messages to the logging FIFO before there's a reader, and eating a SIGPIPE and dying without completing the initialization. To avoid that and be sure that s6-svscan really runs and initial services are really started before the stage 2 init script is allowed to continue, it is possible to redirect the child script's output (stdout and/or stderr) once again to the logging FIFO, but in the normal way without redirfd trickery, before it execs into the init-stage2 script. So, the child process blocks on the FIFO until a reader appears, while process 1 - which does not block - execs into s6-svscan and starts the logging service, which then opens the logging FIFO for reading and unblocks the child process, which then runs the initialization tasks with the guarantee that s6-svscan is running.
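The blocking half of this handshake needs no special tooling and can be tried in a scratch directory. The sketch below is a simulation under stated assumptions: the function name is hypothetical, a subshell stands in for the forked init-stage2 child, and a plain cat stands in for the catch-all logger. It does not involve s6-svscan at all.

```shell
#!/bin/sh
# fifo_handshake: show that a writer opening a FIFO "the normal way" blocks
# until a reader appears. Prints what the reader received.
fifo_handshake() {
  dir=$(mktemp -d) || return 1
  mkfifo "$dir/fifo" || return 1

  # Stand-in for the forked init-stage2 child: the redirection blocks in
  # open() until something opens the FIFO for reading.
  (
    exec > "$dir/fifo"
    echo "init-stage2: logger is up, starting one-time initialization"
  ) &
  child=$!

  sleep 1
  kill -0 "$child" 2>/dev/null && echo "child is still blocked" >&2

  # Stand-in for the catch-all logger starting up: open the FIFO for reading.
  cat "$dir/fifo"
  wait "$child"
  rm -rf "$dir"
}

fifo_handshake
```

The child's message can only be written after the reader has opened the FIFO, which is the guarantee the initialization script gets in the real setup.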

It really is simpler than it sounds. :-)

A working example

This whole page may sound very theoretical, dry, wordy, and hard to grasp without a live example to try things on; unfortunately, s6 cannot provide live examples without becoming system-specific.

However, the s6-linux-init package provides you with the s6-linux-init-maker command, which produces a set of working scripts, including a script that is suitable as /sbin/init, for you to study and edit. You can run the s6-linux-init-maker command even on non-Linux systems: it will produce scripts that do not work as is for another OS, but can still be used for study and as a basis for a working stage 1 script.