I have broken my Kubernetes cluster

Categories: Kubernetes

My Kubernetes cluster is down

My single-node bare-metal cluster, set up with kubeadm and the Calico networking plugin, is down (link to the article here). The machine won't start up anymore, and there is a very mysterious error from libceph reporting that it was unable to start. A few days ago I was playing with Rook (https://github.com/rook/rook). Rook uses Ceph to give you an easy-to-use persistent volume experience.

Everything was fine until I started researching advanced Prometheus configuration. After installing the Prometheus Helm chart, I noticed that Prometheus was not scraping my Docker metrics (cAdvisor metrics, to be more precise). I found out later that I had probably installed an old Docker version, one from well before 17.xx.

As every smart person would do (not really...), I decided to upgrade Docker "on the fly": I uninstalled the old version and installed the new one while kubelet and all the other Kubernetes components were still running on my machine.

Everything seemed fine until I restarted the machine. Then I started getting an error from libceph failing to start, and the laptop refused to boot, stuck trying to start libceph.

Lessons to learn

Despite this being a silly experiment, there are a few lessons to learn:

  1. Make sure to install docker-ce (version 17.03 is strongly recommended for Kubernetes 1.9), not docker. The docker package on most distros usually contains a very old version of Docker. Look for docker-ce instead. I will also try replacing Docker with CRI-O next time I install a new cluster.
  2. Never upgrade Docker while your cluster is running.
  3. Make sure to back up the cluster before playing with the infrastructure.
  4. Rook might still not be a good fit for production environments ;)
  5. Do not create bare-metal clusters if you do not feel like reinstalling the system every time you break something.
  6. Make sure to have a base image of the system with pre-installed tools, which you can clone later when you want to restore or expand the cluster.
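
To make lessons 1 and 2 a bit more concrete, here is a minimal Python sketch of the pre-flight check I wish I had run. It is not what I actually did, and it executes nothing. The helper names (parse_docker_version, is_recent_enough, upgrade_plan) are hypothetical, the 17.03 minimum is simply taken from lesson 1, and the drain / stop kubelet / stop docker ordering is my assumption about a safer sequence rather than an official procedure. On a single-node cluster, draining evicts every workload, so treat the plan as planned downtime.

```python
import re

# Assumption: minimum version taken from lesson 1 (17.03 for Kubernetes 1.9).
RECOMMENDED = (17, 3)


def parse_docker_version(output):
    """Extract (major, minor) from text such as
    'Docker version 17.03.2-ce, build f5ec1e2'."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version number found in: %r" % output)
    return int(match.group(1)), int(match.group(2))


def is_recent_enough(output, minimum=RECOMMENDED):
    """True if the reported version is at least the recommended one."""
    return parse_docker_version(output) >= minimum


def upgrade_plan(node):
    """Return the upgrade steps in order as plain strings.

    Hypothetical helper: it only builds a checklist and runs nothing.
    The package upgrade itself depends on the distro, so it stays a comment.
    """
    return [
        "kubectl drain %s --ignore-daemonsets" % node,
        "systemctl stop kubelet",
        "systemctl stop docker",
        "# upgrade the docker package with your distro's package manager",
        "systemctl start docker",
        "systemctl start kubelet",
        "kubectl uncordon %s" % node,
    ]


if __name__ == "__main__":
    # Paste the output of `docker --version` here; this value is an example.
    reported = "Docker version 1.13.1, build 092cba3"
    if not is_recent_enough(reported):
        for step in upgrade_plan("my-node"):  # "my-node" is a placeholder
            print(step)
```

The point of the ordering is that kubelet is stopped before Docker is touched and only started again once the new Docker is up, so kubelet never sees its container runtime disappear underneath it.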

Many sources point to CentOS as a better Linux distro for Kubernetes. It will also be the next distro on my laptop-server.

At this point I have decided not to try to revive the machine. Instead, the next step for me is to set up a KVM virtualization layer on top of CentOS on my laptop, so that I won't need to physically reinstall the OS next time something like this happens. Stay tuned for an update.
