Docker daemon dies on boot when cgroupsPath used #23

Closed
ijc opened this issue Dec 8, 2017 · 13 comments

ijc commented Dec 8, 2017

After #14, Docker does not start. Shortly after boot, without taking any other action, the state becomes:

(ns: getty) linuxkit-d6247f201b73:~# ctr t ls
TASK                                           PID    STATUS    
rngd                                           808    STOPPED
sshd                                           851    RUNNING
docker                                         547    STOPPED
getty                                          586    RUNNING
kubelet                                        644    RUNNING
kubernetes-docker-image-cache-common           678    STOPPED
kubernetes-docker-image-cache-control-plane    730    STOPPED
ntpd                                           771    RUNNING

The logs are uninteresting:

(ns: getty) linuxkit-d6247f201b73:~# cat /var/log/docker.err.log 
[WARN  tini (547)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
time="2017-12-08T14:55:34.404476201Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2017-12-08T14:55:34.405395092Z" level=info msg="libcontainerd: new containerd process, pid: 591"
time="2017-12-08T14:55:36.515729941Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2017-12-08T14:55:36.557961200Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2017-12-08T14:55:36.558552846Z" level=info msg="Loading containers: start."
time="2017-12-08T14:55:36.717274716Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nmodprobe: module nf_nat not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:36.731672396Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: module xt_conntrack not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:37.065878091Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2017-12-08T14:55:37.171766223Z" level=info msg="Loading containers: done."
time="2017-12-08T14:55:37.274945197Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
time="2017-12-08T14:55:37.275544195Z" level=info msg="Daemon has completed initialization"
time="2017-12-08T14:55:37.288354509Z" level=info msg="API listen on /var/run/docker.sock"

It seems that reverting #14 fixes things. I'll double check and raise a PR to revert while we sort this out.

/cc @justincormack. This was also mentioned in #11 (comment).

ijc commented Dec 8, 2017

#14 says the change was based on https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources, but I don't see any reference to podruntime or systemreserved in the linked document. It doesn't go into that sort of detail; it's about a kubelet flag that we don't use.

ijc changed the title from "Docker does not start" to "Docker daemon dies on boot when cgroupsPath used" Dec 8, 2017

ijc commented Dec 8, 2017

Looks like removing the cgroupsPath from the docker and kubelet services, and leaving the others in place, is sufficient to fix things.
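
A quick way to confirm the effect (a sketch, assuming the `ctr t ls` output format shown above; PIDs vary per boot) is to check which cgroup each service's task actually landed in:

# List cgroup membership for every task containerd knows about.
# The second column of `ctr t ls` is the PID (see the listing above).
for pid in $(ctr t ls | awk 'NR>1 {print $2}'); do
  echo "== PID $pid =="
  cat /proc/$pid/cgroup
done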

ijc commented Dec 13, 2017

I had thought this was correlated with kubeadm being run, but actually it is something to do with the image cache loading: if I don't run kubeadm, it still dies a little while after boot. However, if I remove the image caches from the image, then it seems kubeadm can be run and things come up OK.

I added -D --log-level=debug to the daemon command line, producing docker.txt. I don't see anything interesting in there.

ijc commented Dec 13, 2017

It's a bit hard to be sure, but using strace it looks like the dockerd process is being sent a SIGKILL from somewhere -- I can't see where from, though.
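
For reference, the sort of strace invocation involved is roughly this (a sketch; the PID 547 is the dockerd task from the `ctr t ls` output above and will differ per boot; strace reports the SIGKILL arriving but not which process sent it):

# Attach to the running dockerd, follow its children, and trace only
# signal-related activity; the kill shows up as "+++ killed by SIGKILL +++".
strace -f -p 547 -e trace=signal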

ijc commented Dec 13, 2017

Putting docker and kubernetes-docker-image-cache-{control-plane,common} into three separate cgroups seems to resolve the issue.

I wonder if, with all of them sharing the same cgroup, they are between them exceeding some sort of cgroup memory limit and triggering something like an OOM kill? I'm seeing no sign of that in dmesg, though.
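
One way to test that theory directly (a sketch; the cgroup path below is an assumption, reusing the podruntime name mentioned above) is to look at the memory controller files for the shared cgroup:

# Check whether the shared cgroup carries a memory limit and whether the
# kernel has ever recorded the limit being hit.
CG=/sys/fs/cgroup/memory/podruntime
cat "$CG/memory.limit_in_bytes"   # a very large number means no limit is set
cat "$CG/memory.failcnt"          # number of times the limit was hit
cat "$CG/memory.oom_control"      # oom_kill_disable and under_oom flags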

@justincormack

That's really odd. The cgroups just create the groups, but no limits are applied. Hmm.

ijc commented Dec 14, 2017

I wondered if maybe there was some unhelpful default (e.g. 50% of RAM) in the kernel (which seems a little unlikely), or if something (kubelet itself, maybe?) was taking it upon itself to apply caps to existing cgroups. Perhaps we need to pass kubelet one or more of the --cgroup-* options it has, which can make reservations for system cgroups and the like, to tell it what is what?
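
For concreteness, these are the sort of kubelet flags I mean (a sketch; the cgroup names are assumptions, reusing the podruntime/systemreserved names from #14):

# Tell kubelet which cgroups hold the container runtime, other system
# daemons, and kubelet itself, rather than letting it guess.
kubelet \
  --runtime-cgroups=/podruntime \
  --system-cgroups=/systemreserved \
  --kubelet-cgroups=/podruntime/kubelet   # ...plus our existing kubelet arguments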

ijc commented Dec 14, 2017

It can't be kubelet because this happens even when that service is still sat in our kubelet.sh wrapper waiting for /etc/kubernetes/kubelet.conf to be populated by kubeadm.

ijc commented Dec 15, 2017

I'm playing with auditd trying to get a handle on the process doing the killing.

I'm not 100% sure yet, but it looks like a runc process is doing it (I'm now tracing fork+exec and friends to try to reconstruct the process tree). I wonder if, when the image cache services exit, runc is killing everything in the same cgroup?
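
The audit rules are along these lines (a sketch; the exact rule set and key names are assumptions):

# Record every execve (to recover the killer's command line) and every
# kill(2) with signal 9 (so the sending PID/PPID appears in the audit log).
auditctl -a always,exit -F arch=b64 -S execve -k proc-exec
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill9
ausearch -k sigkill9    # then follow the ppid= fields to rebuild the tree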

ijc commented Dec 15, 2017

The process which is killing docker-init + dockerd and docker-containerd, as well as all the subprocesses of the control-plane image cache service, is:

type=EXECVE msg=audit(1513345148.627:875): argc=11 a0="runc" a1="--root" a2="/run/containerd/runc/default" a3="--log" a4="/run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json" a5="--log-format" a6="json" a7="kill" a8="--all" a9="kubernetes-docker-image-cache-common" a10="9"

which is:

runc --root /run/containerd/runc/default \
     --log /run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json \
     --log-format json \
     kill \
     --all kubernetes-docker-image-cache-common 9

Its parent is the containerd-shim for the common cache service.

So it does appear that tearing down the common cache service (which is the first to exit) is tearing down all the processes in that cgroup.

ijc commented Dec 15, 2017

@justincormack: is it notable that 662d3d4 sets cgroupsPath on things but doesn't use runtime.cgroups on the kubelet service to ensure they exist (I thought you said something about needing that for some reason)? I'm not sure it relates to this issue, but I noticed it while investigating.

ijc commented Dec 15, 2017

runc kill --all ends up calling into a function, signalAllProcesses, which has this comment:

// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.

So it seems to be expected that this will nuke the whole cgroup.
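
At the cgroup level that amounts to roughly the following (an illustrative shell rendering only, not the real Go code; the freezer path is an assumption):

# Freeze the cgroup so no new children can appear mid-iteration, SIGKILL
# every member listed in cgroup.procs, then thaw so the kills take effect.
CG=/sys/fs/cgroup/freezer/default/kubernetes-docker-image-cache-common
echo FROZEN > "$CG/freezer.state"
for pid in $(cat "$CG/cgroup.procs"); do kill -9 "$pid"; done
echo THAWED > "$CG/freezer.state"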

ijc commented Jan 4, 2018

containerd/containerd#1932 (comment) says:

Its "supported", as in it will place them in the same one, but there are side effects. Just don't do it. The default cgroups config for containerd is to place each container in its own cgroup, under a namespace. There is no reason to put containers in the same cgroup and many more things at the runc level expect separate cgroups as well as most other container systems.

So I guess #14 needs to be redone using sub-cgroups under a namespace. @justincormack will you look into that?
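
With a nested layout each service would get its own cgroup under a shared parent, so runc kill --all for one service can no longer touch the others. Something like this (the names are an assumption, following the podruntime naming above):

# Expected hierarchy after the rework: one sub-cgroup per service.
ls -d /sys/fs/cgroup/memory/podruntime/*/
# e.g. .../podruntime/docker/  .../podruntime/kubelet/  .../podruntime/kubernetes-docker-image-cache-common/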

ijc pushed a commit to ijc/linuxkit-kubernetes that referenced this issue Jan 8, 2018
This replaces the intent of linuxkit#14 (which was reverted in linuxkit#24). Compared with
that:

- Use a separate (nested) cgroup for each component since having multiple
  containers in the same cgroup has caveats (see linuxkit#23).
- Tell kubelet about system and runtime cgroups which contain all the others.

Fixes linuxkit#23.

Signed-off-by: Ian Campbell <[email protected]>
ijc closed this as completed in #46 on Jan 9, 2018