Re: fd_close() conceals errors

From: Jan Bramkamp <crest_at_rlwinm.de>
Date: Mon, 27 Apr 2020 14:19:38 +0200

On 27.04.20 13:30, Laurent Bercot wrote:

>> Hi Laurent. It's been a while. :)
>
>  Long time no see. :) We'd love to have you on #s6, if you can!
>
>
>> I'm updating my packages to catch up with skalibs, gcc warnings, etc.,
>> and fd_close() returning void is one of the changes that hit me.
>> Looking over past discussions, I agree that POSIX is a mess
>> here--nevertheless, given that close() is the way it is, I think it's
>> best for fd_close() to report errors other than EINTR to the caller.
>> What do you think?
>
>  Ha. This has been the subject of several discussions, and a long thought
> process for me. This change has probably taken the most thinking time by
> character in the diff.
>
>  The long and short of it is that close() is a destructor (it frees up
> resources that have previously been allocated), and with the years I've
> found the following rule to be really important:
>
>  *Destructors should never fail.*
>
>  Breaking that rule is an incitation to terrible programming, and can
> lead to absurd behaviour, a striking example of which is service
> management systems (read: systemd) that refuse to shut a machine down,
> and keep the machine looping in a dysfunctional state, when something
> goes wrong during the shutdown procedure.
>
>  What did it for me was to ask myself the question: what do you do if
> a destructor fails?
Report loudly and crash. Don't log success.
>
>  There is no good answer to that question. Do you consider the resource
> still allocated? You need to free it to keep going, but attempting to
> free it is obviously failing. Do you keep going and leak the resource?
> If not, how do you report failure? Say you manage to report the failure
> with a "resource cannot be freed" error code. What should the caller do?
> Same exact question. So you defer to the caller again, and the problem
> goes all the way back to main(), and then what should the program do?
> exit just because it could not free a resource? What if this program is
> your shutdown procedure?
Log the error, start an interruptable countdown giving the operator
chance to read the error message and enter a debugger, but reboot or
shutdown if nobody is looking at it. If the platform supports it record
the error or at least the fact that the last shutdown ran into an error.
> The only sane way to proceed is to keep trucking as if the destructor
> did not fail. You might leak a resource, but there's no way to free it
> and attempts to report it will just make you fail harder than if you
> ignore it. If you can't free a resource, it's not something you can
> handle, it's a system problem, and it's above your paygrade as a simple
> program that called a destructor.
>
>  And so, if the sanest thing for a caller to do is to ignore errors, it
> means that generating errors is a no-no for a destructor. They should not
> be allowed to fail, period.
>
>  So, yes, POSIX is terrible, and close() should return void in the first
> place. Note that at least on Linux, the fd is closed even when close()
> returns -1 EINTR! so in practice, at least on Linux it indeed never fails.
> (But even in that case the API is a problem, because you cannot assume
> that every OS will do that; if the fd is closed then you should
> return success; failure should mean that the fd still needs to be
> closed; attempting to close it again is only valid in single-threaded
> programs, because in multithreaded programs you could end up closing an
> unrelated fd that was opened by another thread in a race. Yes, that's
> the kind of mess you get yourself into when destructors fail.)
>
>  fd_close() catches EINTR just for safety and because skalibs is mostly
> used in single-threaded programs. EBADF is a programming error and should
> never happen. EIO and other potential error codes signal an underlying
> problem, it's best to ignore them here, they will be raised again next
> time there's a filesystem call.
>
>  fd_close() does the job of a safe wrapper, which is to hide infelicities
> of POSIX behaviour. And that includes providing a sane API, i.e. returning
> void. :)

Ignoring errors on close isn't good enough for reliably composable
programs. I agreee that POSIX semantics for close(2) are a mess, but
your assumption that close can't fail isn't true. Lets say a trival
program wants to append to a file so it opens the file, writes to the
end, closes the file. The network file systems like NFS and AFS flush
the caches on close for historic reasons. A simple if ( !close(fd) )
exit(1); at least admits there is a problem even it just passes the buck
to its parent. I can imagine few things worse in than *silient* data
corruption in an API.
Received on Mon Apr 27 2020 - 12:19:38 UTC

This archive was generated by hypermail 2.3.0 : Sun May 09 2021 - 19:38:49 UTC