Challenge, Hard, on Kubernetes

Puzzle Block Game Application Is Gone, Recover It from etcd

Scenario

It's 2:00 AM. Your phone rings — the on-call alert is firing. Someone ran kubectl delete namespace prod by mistake and the entire production application is gone. The Puzzle Block Game — a browser-based puzzle game running in the prod namespace — has been wiped out along with its Deployment, ConfigMap, and NodePort Service.

Fortunately, your team follows backup best practices. An etcd snapshot was taken before the incident and synced to an S3 bucket. The on-call engineer has already downloaded it to the node.

The backup is available at: /home/laborant/prod-etcd-backup/snapshot.db


Task

Your job is to bring everything back before the business wakes up.

Perform a full etcd disaster recovery and restore the production application:

  1. Restore the etcd snapshot from /home/laborant/prod-etcd-backup/snapshot.db into /var/lib/prod-etcd
  2. Stop the API server and etcd safely, update the etcd manifest to point to the restored data, and bring the cluster back up
  3. Restart the necessary components so the application is reachable on NodePort 32222

Backup location: /home/laborant/prod-etcd-backup/snapshot.db
Restore target: /var/lib/prod-etcd
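
Step 2 does not say how to stop the components. On a kubeadm-style node, one common approach is to move the static pod manifests out of the directory the kubelet watches, which makes the kubelet stop those pods. The sketch below assumes that layout: MANIFEST_DIR defaults to the kubeadm path, and PARK_DIR is a hypothetical holding directory that the lab does not define.

```shell
#!/usr/bin/env bash
set -eu

# Assumed kubeadm default; override for other layouts.
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
# Hypothetical holding directory for the parked manifests.
PARK_DIR="${PARK_DIR:-/root/parked-manifests}"

# Move the API server and etcd manifests out so the kubelet stops both pods.
park_control_plane() {
  mkdir -p "$PARK_DIR"
  for f in kube-apiserver.yaml etcd.yaml; do
    if [ -f "$MANIFEST_DIR/$f" ]; then
      mv "$MANIFEST_DIR/$f" "$PARK_DIR/$f"
    fi
  done
}

# Move them back (etcd first) so the kubelet recreates the pods.
unpark_control_plane() {
  for f in etcd.yaml kube-apiserver.yaml; do
    if [ -f "$PARK_DIR/$f" ]; then
      mv "$PARK_DIR/$f" "$MANIFEST_DIR/$f"
    fi
  done
}
```

Run both functions as root. While the manifests are parked, any edit to etcd.yaml has to be made on the copy in PARK_DIR, because the file is no longer in the manifest directory.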

This cluster runs etcd v3.6. The snapshot restore subcommand was removed from etcdctl in v3.6 and moved to etcdutl. Use etcdutl snapshot restore — not etcdctl snapshot restore — or you will get Error: unknown flag: --data-dir.


Hint 1 — Restore the Snapshot

Use etcdutl snapshot restore and point --data-dir to the target path.

sudo etcdutl snapshot restore \
  /home/laborant/prod-etcd-backup/snapshot.db \
  --data-dir=/var/lib/prod-etcd

# confirm the directory structure
ls -la /var/lib/prod-etcd/member/

A successful restore produces:

  • /var/lib/prod-etcd/member/wal/ — write-ahead log
  • /var/lib/prod-etcd/member/snap/db — the restored database file
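
The two paths above can also be checked by a script rather than by eye. The following is a minimal sketch; check_restored_dir is a hypothetical helper name, and it only tests that the expected entries exist, not that the data is valid.

```shell
#!/usr/bin/env bash
set -eu

# Return 0 if the given directory has the layout a snapshot restore produces.
check_restored_dir() {
  local data_dir="$1"
  if [ ! -d "$data_dir/member/wal" ]; then
    echo "missing $data_dir/member/wal" >&2
    return 1
  fi
  if [ ! -f "$data_dir/member/snap/db" ]; then
    echo "missing $data_dir/member/snap/db" >&2
    return 1
  fi
  echo "ok: $data_dir"
}

# Usage on the node (as root, since the restored directory is not world-readable):
#   check_restored_dir /var/lib/prod-etcd
# The snapshot file itself can be inspected with:
#   etcdutl snapshot status /home/laborant/prod-etcd-backup/snapshot.db -w table
```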


Hint 2 — Update the etcd Manifest

Edit the etcd static pod manifest:

sudo vi /etc/kubernetes/manifests/etcd.yaml

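Find and update the places below. The --data-dir value is a path inside the etcd container, so the etcd-data entry under volumeMounts must use the same path as its mountPath. If you change only the flag and the hostPath, etcd starts with an empty data directory.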

1. The --data-dir flag in the command args:

- --data-dir=/var/lib/prod-etcd

2. The hostPath volume that mounts the data directory into the pod:

volumes:
- hostPath:
    path: /var/lib/prod-etcd   # ← update this
    type: DirectoryOrCreate
  name: etcd-data
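
The manifest edits can be scripted instead of made by hand. The sketch below assumes the manifest currently uses the kubeadm default /var/lib/etcd and that GNU sed is available; repoint_etcd_manifest is a hypothetical helper. It rewrites the --data-dir flag, the etcd-data mountPath, and the hostPath together, so the in-container path named by --data-dir is backed by the restored host directory.

```shell
#!/usr/bin/env bash
set -eu

# Rewrite every reference to old_dir in an etcd static pod manifest.
# The backup is written outside the manifest directory so the kubelet
# does not treat it as a second pod manifest.
repoint_etcd_manifest() {
  local manifest="$1"
  local old_dir="$2"
  local new_dir="$3"
  local backup_dir="$4"
  mkdir -p "$backup_dir"
  cp "$manifest" "$backup_dir/$(basename "$manifest")"
  # Only lines that end exactly in old_dir are changed, so paths such as
  # /etc/kubernetes/pki/etcd are left alone.
  sed -E -i "s#(--data-dir=|mountPath: |path: )${old_dir}\$#\1${new_dir}#" "$manifest"
}

# Usage on the node (as root; /root/etcd-manifest-backup is an example path):
#   repoint_etcd_manifest /etc/kubernetes/manifests/etcd.yaml \
#     /var/lib/etcd /var/lib/prod-etcd /root/etcd-manifest-backup
```

Compare the result against the backup copy before bringing the cluster back up, because the pattern depends on the manifest matching the assumed layout.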

The kubelet picks up manifest changes on its own; restart it to apply them immediately:

sudo systemctl restart kubelet

Wait for etcd and the API server to recover. Until they do, kubectl commands fail with connection errors, so retry them if needed:

# watch etcd pod come back
kubectl get pod -n kube-system -l component=etcd -w

# confirm cluster is responding
kubectl get nodes
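
Because kubectl cannot connect while the API server is still starting, a small retry loop is more reliable than a single call. This is a minimal sketch: wait_for and the WAIT_INTERVAL variable are hypothetical names, and the 300-second timeout in the usage line is an arbitrary choice.

```shell
#!/usr/bin/env bash
set -eu

# Retry a command until it succeeds or the timeout (in seconds) expires.
wait_for() {
  local timeout="$1"
  shift
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "${WAIT_INTERVAL:-5}"
  done
}

# Usage on the node:
#   wait_for 300 kubectl get nodes
# While the API server is down, the container runtime can still be queried:
#   sudo crictl ps --name etcd
```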


Hint 3 — Verify the Application is Restored

# check prod namespace is back
kubectl get namespace prod

# check deployment and pods
kubectl get all -n prod

# check configmap
kubectl get configmap puzzle-block-config -n prod

# check the service and nodeport
kubectl get svc puzzle-block-game-deployment -n prod

# hit the app from the node
curl http://localhost:32222

If pods are in Pending or ContainerCreating, give them a minute — the scheduler and kubelet need to reconcile after the restore.
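
If waiting is not enough, restarting components forces them to re-sync with the restored state. Which components need this depends on the cluster, so treat the following as a sketch under assumptions: it presumes kube-proxy runs as the kubeadm-default DaemonSet named kube-proxy, and run and restart_after_restore are hypothetical helpers. Setting DRY_RUN=1 prints the commands without executing them.

```shell
#!/usr/bin/env bash
set -eu

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restart_after_restore() {
  # Restart the kubelet so it re-syncs its pods against the restored state.
  run systemctl restart kubelet
  # Assumption: kube-proxy is the kubeadm-default DaemonSet in kube-system.
  run kubectl -n kube-system rollout restart daemonset kube-proxy
  # Recreate the application pods from every Deployment in prod.
  run kubectl -n prod rollout restart deployment
}

# Usage on the node (as root):
#   DRY_RUN=1 restart_after_restore   # preview
#   restart_after_restore             # execute
```

After the rollout finishes, repeat the curl check against port 32222; it can fail for a short time while the new pods start.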


Test Cases