Recovering from a major etcd failure

Etcd defines a "disastrous" failure as permanently losing more than (N-1)/2 members, where N is the number of cluster members. In a 3-member cluster that means losing 2 or more members: quorum requires 2 of the 3, so the surviving member can never re-establish it on its own. To recover from this type of failure, you essentially need to create a new etcd cluster.

The following steps assume a 3-node cluster in which m1, m2, and m3 are Kubernetes masters running etcd as static pods. They cover what I did to restore the etcd portion of one of my Kubernetes clusters.

  1. Identify the master that is going to be the progenitor of your new cluster. In our case, this will be m1.
  2. Stop etcd on all masters, even m1.
    1. Do this by moving the etcd manifest out of the /etc/kubernetes/manifests directory; the kubelet stops the etcd static pod once its manifest is removed
    2. mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml
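    3. If you prefer, the same move can be run from one place. This is only a sketch and assumes direct root SSH access to each master (hostnames as used in this walkthrough; adjust to yours):
      
       # The kubelet on each master notices the manifest is gone and stops the etcd static pod.
       for host in m1 m2 m3; do
         ssh root@$host 'mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml'
       done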
  3. Update the etcd manifest on m1 to force it to create a new cluster
    1. Add the --force-new-cluster flag to the etcd command in the manifest (now sitting at /root/etcd.yaml)
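    2. A quick sanity check before starting it back up. The excerpt in the comment is only illustrative; your manifest will carry more flags and possibly different paths:
      
       grep -n 'force-new-cluster' /root/etcd.yaml
       # The command section of /root/etcd.yaml should now contain something like:
       #   command:
       #   - etcd
       #   - --data-dir=/var/lib/etcd
       #   - --force-new-cluster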
  4. Start etcd on m1
    1. mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
  5. Verify that the etcd container started and is healthy
    1. docker container ls -a | grep etcd
    2. docker logs <container_id>
  6. Exec into the container on m1 to add a new member to the cluster
    1. docker exec -it <container_id> /bin/sh
    2. # Check the existing members. The list should contain only m1 right now. Replace m1 with the FQDN of your etcd endpoint.
      
       etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
        
       # Add the first new member. The first argument to "add" is the name of the cluster member. The name isn't terribly important, but make sure you can use it to distinguish cluster members. The second argument is the member's peer URL (scheme, IP, and port).
        
       etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member add m2 https://10.253.5.18:2380
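
    3. If the add succeeds, etcdctl prints the settings the new member should start with. The output will look roughly like this (the member ID and the ordering of the URLs will differ in your cluster):
      
       # Added member named m2 with ID <member_id> to cluster
       #
       # ETCD_NAME="m2"
       # ETCD_INITIAL_CLUSTER="m2=https://10.253.5.18:2380,m1=https://10.253.5.17:2380"
       # ETCD_INITIAL_CLUSTER_STATE="existing"
      
       These values map directly onto the --initial-cluster and --initial-cluster-state flags you'll set on m2 in the next step.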
      
  7. After adding m2 back, your etcd cluster will be unavailable until you start etcd on m2, since both registered members need to be up to establish quorum.
    1. SSH to m2 and remove the etcd data left over from the previous cluster (/var/lib/etcd is the kubeadm default; adjust the path if your manifest's --data-dir differs)
      • rm -rf /var/lib/etcd
    2. Ensure m2's etcd manifest has only m1 and m2 in the --initial-cluster flag
      • --initial-cluster=m2=https://10.253.5.18:2380,m1=https://10.253.5.17:2380
      • Also ensure that the --initial-cluster-state=existing flag is set
      • You'll get an error if --initial-cluster specifies more nodes than the cluster actually has members at this point.
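      • Before moving the manifest back, you can double-check both flags on m2 (assuming you parked the manifest at /root/etcd.yaml as in step 2; don't expect m3 in the list yet):
        
        grep -E 'initial-cluster' /root/etcd.yaml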
    3. Start etcd on m2 using the command from Step 4
    4. On m1, run the "member list" command from above to ensure that m2 joined successfully.
      • If m2 hasn't yet joined and taken part in a leader election, you'll get an error saying m1 has no leader.
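      • Once m2 has joined and an election has completed, the output should look roughly like this (the member IDs here are made up):
        
        1234567890abcdef: name=m1 peerURLs=https://10.253.5.17:2380 clientURLs=https://10.253.5.17:2379 isLeader=true
        fedcba0987654321: name=m2 peerURLs=https://10.253.5.18:2380 clientURLs=https://10.253.5.18:2379 isLeader=false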
  8. Now that m2 is added, we need to add m3 back in.
    1. Add the member
      • Repeat the steps in 6 above, but update the member name and peer address to reflect those of m3 (a concrete example follows at the end of this step)
    2. Start etcd on m3
      • Follow the steps from 7 above, but update the --initial-cluster flag to also include m3 now
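    3. Concretely, the add and the flags for m3 end up looking like the lines below; m3's peer IP is written as a placeholder here, so substitute the real one:
      
       etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt member add m3 https://<m3_peer_ip>:2380
      
       # And in m3's manifest:
       # --initial-cluster=m3=https://<m3_peer_ip>:2380,m2=https://10.253.5.18:2380,m1=https://10.253.5.17:2380
       # --initial-cluster-state=existing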
  9. Verify on m1 that the "member list" command from step 6 now shows all 3 members.
    • Everything should be OK now!
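    • As a final check, the v2 etcdctl also has a cluster-health command that takes the same TLS flags as above; it should report each member as healthy and finish with a "cluster is healthy" line:
      
      etcdctl --key-file=/etc/kubernetes/pki/etcd/server.key --cert-file=/etc/kubernetes/pki/etcd/server.crt --endpoints=https://m1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health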
