Docker daemon dies on boot when cgroupsPath used #23

Closed
ijc opened this issue Dec 8, 2017 · 13 comments

ijc commented Dec 8, 2017

After #14, Docker does not start. Shortly after boot, without taking any other action, the state becomes:

(ns: getty) linuxkit-d6247f201b73:~# ctr t ls
TASK                                           PID    STATUS    
rngd                                           808    STOPPED
sshd                                           851    RUNNING
docker                                         547    STOPPED
getty                                          586    RUNNING
kubelet                                        644    RUNNING
kubernetes-docker-image-cache-common           678    STOPPED
kubernetes-docker-image-cache-control-plane    730    STOPPED
ntpd                                           771    RUNNING

The logs are uninteresting:

(ns: getty) linuxkit-d6247f201b73:~# cat /var/log/docker.err.log 
[WARN  tini (547)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
time="2017-12-08T14:55:34.404476201Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2017-12-08T14:55:34.405395092Z" level=info msg="libcontainerd: new containerd process, pid: 591"
time="2017-12-08T14:55:36.515729941Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2017-12-08T14:55:36.557961200Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2017-12-08T14:55:36.558552846Z" level=info msg="Loading containers: start."
time="2017-12-08T14:55:36.717274716Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nmodprobe: module nf_nat not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:36.731672396Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: module xt_conntrack not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:37.065878091Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2017-12-08T14:55:37.171766223Z" level=info msg="Loading containers: done."
time="2017-12-08T14:55:37.274945197Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
time="2017-12-08T14:55:37.275544195Z" level=info msg="Daemon has completed initialization"
time="2017-12-08T14:55:37.288354509Z" level=info msg="API listen on /var/run/docker.sock"

It seems that reverting #14 fixes things. I'll double check and raise a PR to revert while we sort this out.

/cc @justincormack. This was also mentioned in #11 (comment).

ijc commented Dec 8, 2017

#14 says the change was based on https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources, but I don't see any reference to podruntime or systemreserved in the linked document. It doesn't go into that sort of detail; it's about a kubelet flag that we don't use.

ijc changed the title from "Docker does not start" to "Docker daemon dies on boot when cgroupsPath used" Dec 8, 2017

ijc commented Dec 8, 2017

Looks like removing the cgroupsPath from the docker and kubelet services, and leaving the others in place, is sufficient to fix things.
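
A quick way to confirm the effect (a sketch, assuming the `ctr t ls` output format shown above; PIDs vary per boot) is to check which cgroup each service's task actually landed in:

# List cgroup membership for every task containerd knows about.
# The second column of `ctr t ls` is the PID (see the listing above).
for pid in $(ctr t ls | awk 'NR>1 {print $2}'); do
  echo "== PID $pid =="
  cat /proc/$pid/cgroup
done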

ijc commented Dec 13, 2017

I had thought this was correlated with kubeadm being run, but actually it is something to do with the image cache loading: if I don't run kubeadm, it still dies a little while after boot. However, if I remove the image caches from the image, then it seems kubeadm can be run and things come up OK.

I added -D --log-level=debug to the daemon command line, producing docker.txt. I don't see anything interesting in there.

ijc commented Dec 13, 2017

It's a bit hard to be sure, but using strace it looks like the dockerd process is being sent a SIGKILL from somewhere -- I can't see where from, though.
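
For reference, the sort of strace invocation involved is roughly this (a sketch; the PID 547 is the dockerd task from the `ctr t ls` output above and will differ per boot; strace reports the SIGKILL arriving but not which process sent it):

# Attach to the running dockerd, follow its children, and trace only
# signal-related activity; the kill shows up as "+++ killed by SIGKILL +++".
strace -f -p 547 -e trace=signal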

ijc commented Dec 13, 2017

Putting docker and kubernetes-docker-image-cache-{control-plane,common} into three separate cgroups seems to resolve the issue.

I wonder if, with all of them sharing the same cgroup, they are between them exceeding some sort of cgroup memory limit and triggering something like an OOM kill? I'm seeing no sign of that in dmesg, though.
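
One way to test that theory directly (a sketch; the cgroup path below is an assumption, reusing the podruntime name mentioned above) is to look at the memory controller files for the shared cgroup:

# Check whether the shared cgroup carries a memory limit and whether the
# kernel has ever recorded the limit being hit.
CG=/sys/fs/cgroup/memory/podruntime
cat "$CG/memory.limit_in_bytes"   # a very large number means no limit is set
cat "$CG/memory.failcnt"          # number of times the limit was hit
cat "$CG/memory.oom_control"      # oom_kill_disable and under_oom flags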

@justincormack

That's really odd. The cgroups just create the groups, but no limits are applied. Hmm.

ijc commented Dec 14, 2017

I wondered if maybe there was some unhelpful default (e.g. 50% of RAM) in the kernel (which seems a little unlikely), or if something (kubelet itself, maybe?) was taking it upon itself to apply caps to existing cgroups. Perhaps we need to pass kubelet one or more of the --cgroup-* options it has, which can make reservations for system cgroups and the like, to tell it what is what?
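
For concreteness, these are the sort of kubelet flags I mean (a sketch; the cgroup names are assumptions, reusing the podruntime/systemreserved names from #14):

# Tell kubelet which cgroups hold the container runtime, other system
# daemons, and kubelet itself, rather than letting it guess.
kubelet \
  --runtime-cgroups=/podruntime \
  --system-cgroups=/systemreserved \
  --kubelet-cgroups=/podruntime/kubelet   # ...plus our existing kubelet arguments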

ijc commented Dec 14, 2017

It can't be kubelet because this happens even when that service is still sat in our kubelet.sh wrapper waiting for /etc/kubernetes/kubelet.conf to be populated by kubeadm.

ijc commented Dec 15, 2017

I'm playing with auditd trying to get a handle on the process doing the killing.

I'm not 100% sure yet, but it looks like a runc process is doing it (I'm now tracing fork+exec and friends to try to reconstruct the process tree). I wonder if, when the image cache services exit, runc is killing everything in the same cgroup?
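
The audit rules are along these lines (a sketch; the exact rule set and key names are assumptions):

# Record every execve (to recover the killer's command line) and every
# kill(2) with signal 9 (so the sending PID/PPID appears in the audit log).
auditctl -a always,exit -F arch=b64 -S execve -k proc-exec
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill9
ausearch -k sigkill9    # then follow the ppid= fields to rebuild the tree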

ijc commented Dec 15, 2017

The process which is killing docker-init + dockerd and docker-containerd, as well as all the subprocesses of the control-plane image cache service, is:

type=EXECVE msg=audit(1513345148.627:875): argc=11 a0="runc" a1="--root" a2="/run/containerd/runc/default" a3="--log" a4="/run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json" a5="--log-format" a6="json" a7="kill" a8="--all" a9="kubernetes-docker-image-cache-common" a10="9"

which is:

runc --root /run/containerd/runc/default \
     --log /run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json \
     --log-format json \
     kill \
     --all kubernetes-docker-image-cache-common 9

Its parent is the containerd-shim for the common cache service.

So it does appear that tearing down the common cache service (which is the first to exit) is tearing down all the processes in that cgroup.

ijc commented Dec 15, 2017

@justincormack: is it notable that 662d3d4 sets cgroupsPath on things but doesn't use runtime.cgroups on the kubelet service to ensure they exist (I thought you said something about needing that for some reason)? I'm not sure it relates to this issue, but I noticed it while investigating.

ijc commented Dec 15, 2017

runc kill --all ends up calling into a function, signalAllProcesses, which has this comment:

// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.

So it seems to be expected that this will nuke the whole cgroup.
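
At the cgroup level that amounts to roughly the following (an illustrative shell rendering only, not the real Go code; the freezer path is an assumption):

# Freeze the cgroup so no new children can appear mid-iteration, SIGKILL
# every member listed in cgroup.procs, then thaw so the kills take effect.
CG=/sys/fs/cgroup/freezer/default/kubernetes-docker-image-cache-common
echo FROZEN > "$CG/freezer.state"
for pid in $(cat "$CG/cgroup.procs"); do kill -9 "$pid"; done
echo THAWED > "$CG/freezer.state"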

ijc commented Jan 4, 2018

containerd/containerd#1932 (comment) says:

Its "supported", as in it will place them in the same one, but there are side effects. Just don't do it. The default cgroups config for containerd is to place each container in its own cgroup, under a namespace. There is no reason to put containers in the same cgroup and many more things at the runc level expect separate cgroups as well as most other container systems.

So I guess #14 needs to be redone using sub-cgroups under a namespace. @justincormack will you look into that?
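
With a nested layout each service would get its own cgroup under a shared parent, so runc kill --all for one service can no longer touch the others. Something like this (the names are an assumption, following the podruntime naming above):

# Expected hierarchy after the rework: one sub-cgroup per service.
ls -d /sys/fs/cgroup/memory/podruntime/*/
# e.g. .../podruntime/docker/  .../podruntime/kubelet/  .../podruntime/kubernetes-docker-image-cache-common/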

ijc pushed a commit to ijc/linuxkit-kubernetes that referenced this issue Jan 8, 2018
This replaces the intent of linuxkit#14 (which was reverted in linuxkit#24). Compared with
that:

- Use a separate (nested) cgroup for each component since having multiple
  containers in the same cgroup has caveats (see linuxkit#23).
- Tell kubelet about system and runtime cgroups which contain all the others.

Fixes linuxkit#23.

Signed-off-by: Ian Campbell <[email protected]>
ijc closed this as completed in #46 on Jan 9, 2018