We ran into a fun little bug this week at work that took a good while to track down to its source. Imagine this scenario:
We start to receive reports that Docker has restarted sporadically across our development and testing environments, causing the containers running on those hosts to restart. After some investigation, we tie this to when Puppet has run on these servers. It’s not immediately apparent why, and successive Puppet runs don’t cause the same behavior.
Eventually we realize that it’s related to a change we made to our systemd-journald configuration, which was being pushed out during the problematic Puppet runs.
Our RHEL7 servers had not been configured to maintain a persistent journal. That is, by default the journal is written to /run/log/journal and is refreshed (lost) with every reboot. We decided to configure the journal to retain logs across several boots, and did so by setting it up in Puppet and pushing out the change, complete with a notify to systemd-journald to restart the service. This was pushed to dev and, shortly after, test.
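The change itself is tiny. Here’s a minimal sketch of the kind of journald configuration we were pushing (the exact layout of our Puppet module is beside the point):

```ini
# /etc/systemd/journald.conf (or a drop-in under /etc/systemd/journald.conf.d/)
[Journal]
# "persistent" keeps journal files under /var/log/journal across reboots.
# The default, "auto", only persists if /var/log/journal already exists.
Storage=persistent
```

Once journald picks up the change (via a restart, in our case), the journal survives reboots.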
However, even though we knew it was related to the journald change, we could not reliably reproduce it. Converting a box to persistent journals and restarting journald wouldn’t immediately cause Docker to fall over; it would take a few minutes before the service died.
Then it got even weirder. We realized that no configuration change actually had to happen; we only needed to restart systemd-journald to cause the issues with Docker. And interestingly, we could then get Docker to crash by issuing any three Docker commands. One `docker ps`? Everything is fine. Two? No problem. Three? KA-BOOM!
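In other words, the reproduction boiled down to something like this (a reconstruction from memory, not a verbatim transcript):

```
$ systemctl restart systemd-journald
$ docker ps    # one: everything is fine
$ docker ps    # two: no problem
$ docker ps    # three: the Docker daemon dies
```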
After this behavior was finally identified (and it took a while - it’s tricky to troubleshoot something that only fails the *third time* you try it), some Googling led us to a bug report already filed with Docker, in which Dan Walsh (@rhatdan) explained:
> …when you run a unit file systemd hooks up stdout/stderr to the journal; if the journal goes away, these sockets will get closed, and when you write to them you will get a sigpipe… …Golang [less than v 1.6] has some wacky code that ignores SIGPIPE 10 times and then dies, no matter what you do in your code.
There’s the three-times-ish. Docker writes to STDOUT and STDERR when you issue a Docker command, and three commands cause Docker to crash. And yes, I know my math adds up to nine, not ten. From what I can tell, our automation was also calling the Docker API while we were testing, which is why we were seeing three as the limit.
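To make that concrete, here’s a contrived Go sketch of the failure mode (the file name and loop are my own invention, not Docker’s code): a process that keeps writing to a stdout whose reader has gone away, the way Docker’s stdout/stderr kept pointing at journald’s vanished sockets.

```go
// sigpipe_demo.go: a contrived stand-in for a daemon whose stdout pipe
// has been closed out from under it. Each write to the broken pipe makes
// the kernel deliver SIGPIPE (and the write returns EPIPE). Per the bug
// report above, a pre-1.6 Go runtime swallowed the first 10 SIGPIPEs and
// then killed the process outright.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	for i := 1; ; i++ {
		// Every iteration writes to fd 1, just as Docker's logging did
		// to the journald-backed socket after journald restarted.
		if _, err := fmt.Fprintf(os.Stdout, "write %d\n", i); err != nil {
			fmt.Fprintln(os.Stderr, "write failed:", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```

Run it as `go run sigpipe_demo.go | head -n 1`; once `head` exits, the pipe is broken, and on a pre-1.6 runtime the quoted behavior kicks in: ten SIGPIPEs are ignored and the next one kills the process, no matter what the program does. (As I understand it, Go 1.6 changed the rules so that an unhandled SIGPIPE from a write to stdout or stderr is immediately fatal, while writes to other descriptors simply return EPIPE.)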
The good news is that there appear to be a plethora of patches making their way into the world. A fix/workaround was added to systemd by Lennart Poettering back in January, Golang 1.6 will not suffer from the issue, and Red Hat has apparently patched Docker 1.9 and will be pushing that out, hopefully, early in April.