Puzzle Block Game Application Is Gone, Recover It from etcd

Scenario

It's 2:00 AM. Your phone rings — the on-call alert is firing. Someone ran kubectl delete namespace prod by mistake and the entire production application is gone. The Puzzle Block Game — a browser-based puzzle game running in the prod namespace — has been wiped out along with its Deployment, ConfigMap, and NodePort Service.

Fortunately, your team follows backup best practices. An etcd snapshot was taken before the incident and synced to an S3 bucket. The on-call engineer has already downloaded it to the node.

The backup is available at: /home/laborant/prod-etcd-backup/snapshot.db

Task

Your job is to bring everything back before the business wakes up.

Perform a full etcd disaster recovery and restore the production application:

Restore the etcd snapshot from /home/laborant/prod-etcd-backup/snapshot.db into /var/lib/prod-etcd
Stop the API server and etcd safely, update the etcd manifest to point to the restored data, and bring the cluster back up
Restart the necessary components so the application is reachable on NodePort 32222

Backup location : /home/laborant/prod-etcd-backup/snapshot.db
Restore target  : /var/lib/prod-etcd

This cluster runs etcd v3.6. The snapshot restore subcommand was removed from etcdctl in v3.6 and moved to etcdutl. Use etcdutl snapshot restore — not etcdctl snapshot restore — or you will get Error: unknown flag: --data-dir.

Hint 1 — Restore the etcd Snapshot

Use etcdutl snapshot restore — not etcdctl — and point --data-dir to the target path. This is a pure file operation; it does not touch the running etcd instance.

sudo etcdutl snapshot restore \
  /home/laborant/prod-etcd-backup/snapshot.db \
  --data-dir=/var/lib/prod-etcd

Confirm the directory structure was created correctly:

ls -la /var/lib/prod-etcd/member/

A successful restore produces:

/var/lib/prod-etcd/member/wal/ — write-ahead log
/var/lib/prod-etcd/member/snap/db — the restored database file
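
The layout check can also be scripted. This is a minimal sketch, not part of the lab's tooling: verify_restore is a hypothetical helper name, and it assumes the shell can read the restored directory (the restore ran under sudo, so it may need to run as root).

```shell
# Hypothetical helper: succeeds only if the restore target has the
# layout described above (a wal directory and a non-empty snap/db file).
verify_restore() {
  data_dir="$1"
  if [ ! -d "$data_dir/member/wal" ]; then
    echo "missing directory: $data_dir/member/wal" >&2
    return 1
  fi
  if [ ! -s "$data_dir/member/snap/db" ]; then
    echo "missing or empty file: $data_dir/member/snap/db" >&2
    return 1
  fi
  echo "restore layout OK: $data_dir"
}
```

For this lab it would be called as verify_restore /var/lib/prod-etcd. A non-zero exit status means the restore should be repeated before the manifest is changed.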

Documentation

Restoring an etcd cluster

Hint 2 — Stop, Update, and Restart etcd

Before editing the manifest, stop kube-apiserver and etcd safely by moving their manifests out of /etc/kubernetes/manifests/. The kubelet watches this directory and stops the corresponding static pods once the files are gone, usually within a few seconds rather than instantly.

cd /etc/kubernetes/manifests/
sudo mv etcd.yaml kube-apiserver.yaml /tmp
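
To confirm that both static pods have stopped before touching the manifest, the container runtime can be queried directly. This sketch assumes crictl is installed and configured for the node's runtime and that it runs as root. static_pods_stopped is a hypothetical helper name.

```shell
# Hypothetical helper: returns 0 once no running container matches
# either name. "crictl ps --name" filters by a name pattern and
# "-q" prints only container IDs, so empty output means "not running".
static_pods_stopped() {
  for name in etcd kube-apiserver; do
    if [ -n "$(crictl ps --name "$name" -q)" ]; then
      echo "$name is still running" >&2
      return 1
    fi
  done
  return 0
}
```

A simple polling loop such as until static_pods_stopped; do sleep 2; done blocks until both containers are gone.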

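Edit /tmp/etcd.yaml and update the hostPath volume that mounts the data directory into the etcd pod. This is the only change required: in a kubeadm-style manifest the container's mount path and the --data-dir flag stay as they are, so etcd finds the restored data at the same in-container location.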

volumes:
- hostPath:
    path: /var/lib/prod-etcd   # ← update this
    type: DirectoryOrCreate
  name: etcd-data
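
The same edit can be made non-interactively. This sketch assumes the manifest currently points at /var/lib/etcd (the kubeadm default, not stated in the lab), and set_etcd_data_hostpath is a hypothetical helper name. It rewrites only a line that consists of path: /var/lib/etcd, so the certificate volume and the --data-dir flag are left alone.

```shell
# Hypothetical helper: point the etcd-data hostPath at a new directory.
# Assumption: the existing hostPath is exactly /var/lib/etcd.
# Keeps a backup copy next to the manifest as <manifest>.bak.
set_etcd_data_hostpath() {
  manifest="$1"
  new_path="$2"
  sed -i.bak \
    "s|^\([[:space:]]*path:[[:space:]]*\)/var/lib/etcd[[:space:]]*\$|\1${new_path}|" \
    "$manifest"
  # Fail loudly if the expected line is not present after the edit.
  grep -q "path: ${new_path}\$" "$manifest"
}
```

It would be called as sudo-capable root with set_etcd_data_hostpath /tmp/etcd.yaml /var/lib/prod-etcd. Because the file is edited in /tmp, the .bak copy stays out of /etc/kubernetes/manifests/, where the kubelet might otherwise try to load it as a second manifest.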

Move both manifests back, then restart the kubelet to apply the change:

sudo mv /tmp/etcd.yaml /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo systemctl restart kubelet

Wait for the cluster to come back before running any other kubectl commands:

kubectl get nodes
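
The API server can take a little while to answer again, so a single kubectl get nodes may fail at first. A small retry wrapper avoids re-running it by hand. This is a sketch only: wait_for is a hypothetical helper, and the 5-second interval is an arbitrary choice.

```shell
# Hypothetical helper: retry a command until it succeeds or the
# timeout (in seconds) expires. Usage: wait_for <timeout> <command...>
wait_for() {
  timeout_s="$1"
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 5
  done
  return 0
}
```

For example, wait_for 300 kubectl get nodes returns as soon as the API server responds and gives up after five minutes otherwise.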

Hint 3 — Restart CoreDNS and kube-proxy

Even after the cluster comes back and kubectl get all -n prod shows everything restored, curl http://localhost:32222 may still fail. This is expected — CoreDNS and kube-proxy both need to resync with the restored etcd state before traffic can flow correctly.

kubectl rollout restart deployment -n kube-system
kubectl rollout restart daemonset -n kube-system
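
To block until the restarted workloads are ready, each Deployment and DaemonSet can be passed to kubectl rollout status in turn. This is a sketch under assumptions: wait_for_rollouts is a hypothetical helper, and the 180-second timeout per resource is an arbitrary value.

```shell
# Hypothetical helper: wait for every Deployment and DaemonSet in a
# namespace to finish rolling out. Stops at the first failure.
wait_for_rollouts() {
  ns="$1"
  # "-o name" prints entries such as deployment.apps/coredns,
  # which "rollout status" accepts as TYPE/NAME.
  resources=$(kubectl get deployment,daemonset -n "$ns" -o name) || return 1
  for res in $resources; do
    kubectl rollout status "$res" -n "$ns" --timeout=180s || return 1
  done
  return 0
}
```

Here it would be called as wait_for_rollouts kube-system after the two restart commands.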

Wait for both rollouts to finish, then verify everything is restored:

kubectl get namespace prod
kubectl get all -n prod
kubectl get configmap puzzle-block-config -n prod
kubectl get svc puzzle-block-game-deployment -n prod
curl http://localhost:32222

If pods are stuck in Pending or ContainerCreating, give them a minute — the scheduler and kubelet need to reconcile after the restore.

Documentation

Service NodePort
