HA Kubernetes Cluster Playground
by Márk Sági-Kazár

A multi-node, highly available Kubernetes cluster provisioned with kubeadm.

📖 Overview

This playground provides a complete HA Kubernetes cluster with:

  • 3 Control Plane Nodes: cplane-01, cplane-02, cplane-03
  • 2 Worker Nodes: worker-01, worker-02
  • IP Failover: Keepalived moves a virtual IP (VIP) between control plane nodes so the API endpoint stays reachable (a configuration sketch follows this list)
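
To give a feel for how the VIP moves, here is a minimal keepalived configuration sketch. Every value in it (the instance name, interface, virtual_router_id, priority, and the example VIP 172.16.0.100 with its eth0:vip label) is illustrative only and not necessarily what this playground ships with:

vrrp_instance k8s_vip {
    state BACKUP                 # every node starts as BACKUP; the highest priority wins MASTER
    interface eth0               # interface carrying the VIP (assumption)
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        172.16.0.100/24 dev eth0 label eth0:vip   # example VIP, labeled so it is easy to grep for
    }
}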

🛠️ Tools

  • kubectl (alias: k): Kubernetes cluster management and debugging
  • nerdctl: Docker-compatible CLI for containerd
  • krew: kubectl plugin manager for extending functionality (see the example after this list)
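
For example, krew can pull in additional kubectl plugins on demand. The plugin names below (ctx and ns, the krew distributions of kubectx/kubens) are only illustrations and are not necessarily pre-installed here:

kubectl krew update
kubectl krew install ctx ns
kubectl ctx        # list and switch contexts via the newly installed plugin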

🔥 Testing High Availability and Failover

One of the key benefits of an HA cluster is its resilience to node failures. Here are several tests you can perform to verify your cluster's fault tolerance:
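
Before breaking anything, capture a healthy baseline so you have something to compare against:

kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide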

Test 1: Control Plane Node Failure

Figure out which control plane node currently holds the virtual IP

# Run this on each control plane node
ip a | grep inet | grep vip

# The node whose output shows the labeled VIP address is the active one
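
If the VIP is not labeled on the interface in this setup, a rougher cross-check is to ask keepalived itself. The sketch below assumes keepalived runs as the default systemd unit named keepalived, which may differ here:

# Service status on each control plane node
sudo systemctl status keepalived --no-pager

# Recent VRRP state transitions (MASTER/BACKUP) show up in its logs
sudo journalctl -u keepalived --no-pager | grep -i state | tail -n 5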

Stop kubelet on the Active Control Plane Node

# On the active control plane node (e.g., cplane-02)
sudo systemctl stop kubelet

# From another node, verify the cluster still functions
kubectl get nodes
kubectl get pods --all-namespaces

# The failed node should show as "NotReady" (may take a few minutes)
# but the cluster should remain operational
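
To confirm that the virtual IP itself still answers, you can probe the API server health endpoint through it. The address 172.16.0.100 and port 6443 below are assumptions (kubeadm's default secure port); substitute the playground's real VIP:

# Replace 172.16.0.100 with the actual VIP of this cluster
curl -ks https://172.16.0.100:6443/healthz
# A healthy API server typically answers with "ok"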

Stop API Server Container

# On the same control plane node, stop the API server container
# List containers in the k8s.io namespace
nerdctl --namespace k8s.io ps

# Stop the API server containers
nerdctl --namespace k8s.io stop $(nerdctl --namespace k8s.io ps -a --format "{{json .}}" | jq -r --arg name "k8s://kube-system/kube-apiserver-$(hostname)" 'select(.Names == $name or .Names == $name+"/kube-apiserver") | .ID')
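
# If crictl happens to be installed on the node (an assumption, it is not part
# of this playground's documented tooling), the same stop is a one-liner:
#   sudo crictl stop $(sudo crictl ps --name kube-apiserver -q)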

# Verify from another node that the API is still accessible
kubectl cluster-info
kubectl get componentstatuses   # deprecated since v1.19, but still fine for a quick check

# Find the new active control plane node
ip a | grep inet | grep vip

Recovery

# On the node where you stopped it, start the kubelet service again
sudo systemctl start kubelet

# Verify node returns to Ready state
kubectl get nodes
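
Once kubelet is running again, it also re-creates the static API server pod on that node. A quick way to check (kube-system and the kube-apiserver- name prefix are standard kubeadm conventions):

kubectl get pods -n kube-system -o wide | grep kube-apiserver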

Test 2: Worker Node Failure

# Deploy a test application
kubectl create deployment podinfo --image=ghcr.io/stefanprodan/podinfo --port=9898

# Figure out which node the pod is running on
kubectl get pods -o wide

# Stop kubelet on that worker node
sudo systemctl stop kubelet

# Watch pods get rescheduled to other nodes (it may take a few minutes)
kubectl get pods -o wide --watch

# Pods should be automatically rescheduled to healthy worker nodes
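
When you are done, restore the worker the same way as in Test 1, and optionally confirm that the rescheduled pod serves traffic. The port-forward below is just one convenient way to reach podinfo; any other route into the cluster works:

# On the worker node you stopped
sudo systemctl start kubelet

# From a control plane node (or anywhere with kubectl access)
kubectl get nodes
kubectl port-forward deployment/podinfo 9898:9898 &
curl -s http://localhost:9898
kill %1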


Happy learning! 🚀