Docker Containers vs. Kubernetes Pods - Taking a Deeper Look

A container could have become a lightweight VM replacement. However, the most widely used form of containers, popularized by Docker and later standardized by the OCI, encourages you to have just one process (service) per container. Such an approach has a few significant upsides - increased isolation, simplified horizontal scaling, higher reusability, etc. However, this design also has a major drawback - in the real world, virtual machines rarely run just one service. Thus, the container abstraction might often be too limited to serve as a fully-featured VM replacement.

Virtual Machine visualized.

While Docker was trying to offer workarounds to create multi-service containers, Kubernetes took a bolder step and chose as its smallest deployable unit not a single container but a group of cohesive containers, called a Pod.

For engineers with prior VM or bare-metal experience, it should be relatively easy to grasp the idea of Pods, or so it may seem... 🙈

One of the first things you learn when starting to work with Kubernetes is that each Pod is assigned a unique IP address and a hostname. Furthermore, containers within a Pod can communicate with each other via localhost. Thus, it quickly becomes clear that a Pod resembles a server in miniature.

After a while, though, you realize that every container in a Pod gets an isolated filesystem and that from inside one container, you don't see files and processes of the other containers of the same Pod. So, maybe a Pod is not a tiny little server but just a group of containers with shared network devices?

But then you learn that containers in one Pod can communicate via shared memory and other typical Linux IPC means! So, probably the network namespace is not the only shared thing...

That last finding was the final straw for me, and I decided to deep dive and see with my own eyes:

  • How Pods are implemented under the hood.
  • What the actual difference between a Pod and a Container is.
  • What it would take to create a Pod using standard Docker commands.

Sounds interesting? Then join me on the journey! At the very least, it may help you solidify your Linux, Docker, and Kubernetes skills.

The OCI Runtime Spec doesn't limit container implementations to only Linux containers, i.e., the ones implemented with namespaces and cgroups. However, unless explicitly stated otherwise, the word container in this article refers to this rather traditional form.

Examining a container

Let's start our guinea-pig container:

docker run --name foo --rm -d --memory='512MB' --cpus='0.5' nginx:alpine

Inspecting container's namespaces

Firstly, it'd be interesting to see what isolation primitives were created when the container started.

Here is how you can find the container's namespaces:

# The lowest PID belongs to the nginx master process.
NGINX_PID=$(pgrep nginx | sort -n | head -n 1)

sudo lsns -p ${NGINX_PID}
        NS TYPE   NPROCS   PID USER COMMAND
...
4026532253 mnt         3  1269 root nginx: master process nginx -g daemon off;
4026532254 uts         3  1269 root nginx: master process nginx -g daemon off;
4026532255 ipc         3  1269 root nginx: master process nginx -g daemon off;
4026532256 pid         3  1269 root nginx: master process nginx -g daemon off;
4026532258 net         3  1269 root nginx: master process nginx -g daemon off;
4026532319 cgroup      3  1269 root nginx: master process nginx -g daemon off;
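
By the way, instead of grepping for the nginx processes, the container's init PID can be taken from Docker itself - a tiny sketch, assuming the container is still named foo:

docker inspect --format '{{.State.Pid}}' foo
1269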

The namespaces used to isolate our nginx container are:

  • mnt (Mount) - the container has an isolated mount table.
  • uts (UNIX Time-Sharing) - the container is able to have its own hostname and domain name.
  • ipc (Interprocess Communication) - processes inside the container can communicate via system-level IPC only with processes inside the same container.
  • pid (Process ID) - processes inside the container are only able to see other processes inside the same container or inside the same pid namespace.
  • net (Network) - the container gets its own set of network devices, IP protocol stacks, port numbers, etc.
  • cgroup (Cgroup) - the container has its own virtualized view of the cgroup filesystem (not to be confused with the cgroups mechanism itself).

Notice how the user namespace wasn't used! The OCI Runtime Spec mentions user namespace support. However, while Docker can use this namespace for its containers, it doesn't do so by default due to the inherent limitations and extra operational complexity it would add. Thus, the root user in a container is likely the root user from your host system. Beware!
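
A quick way to see this with your own eyes (a sketch, assuming the foo container and the NGINX_PID variable from above; output abridged) is to compare the container's user with the UID mapping of its init process - an identity mapping means no user namespace remapping is in place:

docker exec foo id
uid=0(root) gid=0(root) groups=0(root),...

sudo cat /proc/${NGINX_PID}/uid_map
         0          0 4294967295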

Another namespace on the list that requires a special callout is cgroup. It took me a while to understand that the cgroup namespace is not the same as the cgroups mechanism. The cgroup namespace just gives a container an isolated view of the cgroup pseudo-filesystem (which is discussed below).

Inspecting container's cgroups

Linux namespaces make processes inside a container think they run on a dedicated machine. However, not seeing any neighbors doesn't mean being fully protected from them. Some hungry neighbors can accidentally consume an unfair share of the host's resources.

Cgroups to the rescue!

The cgroup limits for a given process can be checked by examining its node in the cgroup pseudo-filesystem (cgroupfs) that is usually mounted at /sys/fs/cgroup. But first, we need to figure out the path to the cgroupfs subtree for the process of interest:

sudo systemd-cgls --no-pager
Control group /:
-.slice
...
└─system.slice
  ...
  ├─docker-866191e4377b052886c3b85fc771d9825ebf2be06f84d0bea53bc39425f753b6.scope …
  │ ├─1269 nginx: master process nginx -g daemon off;
  │ └─1314 nginx: worker process
  ...

Then list the cgroupfs subtree:

ls -l /sys/fs/cgroup/system.slice/docker-$(docker ps --no-trunc -qf name=foo).scope/
...
-rw-r--r-- 1 root root 0 Sep 27 11:12 cpu.max
-r--r--r-- 1 root root 0 Sep 27 11:12 cpu.stat
-rw-r--r-- 1 root root 0 Sep 27 11:12 cpu.weight
...
-rw-r--r-- 1 root root 0 Sep 27 11:51 io.max
-r--r--r-- 1 root root 0 Sep 27 11:12 io.stat
...
-rw-r--r-- 1 root root 0 Sep 27 11:12 memory.high
-rw-r--r-- 1 root root 0 Sep 27 11:12 memory.low
-rw-r--r-- 1 root root 0 Sep 27 11:12 memory.max
-rw-r--r-- 1 root root 0 Sep 27 11:12 memory.min
...
-r--r--r-- 1 root root 0 Sep 27 11:42 pids.current
-r--r--r-- 1 root root 0 Sep 27 11:51 pids.events
-rw-r--r-- 1 root root 0 Sep 27 11:12 pids.max

And to see the memory limit in particular, one needs to read the value in the memory.max file:

cat /sys/fs/cgroup/system.slice/docker-$(docker ps --no-trunc -qf name=foo).scope/memory.max
536870912 # Exactly the 512MB requested at container start.
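
Similarly, the --cpus='0.5' flag from the docker run command above should have landed in the cpu.max file as a quota of 50000 microseconds per 100000-microsecond period:

cat /sys/fs/cgroup/system.slice/docker-$(docker ps --no-trunc -qf name=foo).scope/cpu.max
50000 100000 # I.e., at most half of one CPU.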

Interestingly, starting a container without explicitly setting any resource limits still configures a cgroup scope for it. I haven't really checked, but my guess is that while CPU and RAM consumption is unrestricted by default, cgroups might still be used to limit access to some other resources and devices (e.g., /dev/sda and /dev/mem) from inside a container.

Here is how the container can be visualized based on the above findings:

Docker container visualized.

Examining a Pod

Now, let's take a look at Kubernetes Pods. To keep the Containers to Pods comparison fair, the Pod examination will be done on a Kubernetes cluster that uses the same underlying container runtime as Docker does - containerd/runc.

Much like with containers, the implementation of Pods can vary. For instance, when Kata Containers are used as the CRI runtime (via setting the runtimeClassName attribute on the Pod spec), Pods become true virtual machines! Expectedly, VM-based Pods will differ in implementation and capabilities from Pods implemented with traditional Linux containers.

The following two-container Pod will be examined:

pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: app
      image: nginx:alpine
      ports:
        - containerPort: 80
      resources:
        limits:
          memory: "256Mi"
    - name: sidecar
      image: curlimages/curl:8.3.0
      command: ["/bin/sleep", "3650d"]
      resources:
        limits:
          memory: "128Mi"

You can start the Pod with the following command:

kubectl apply -f pod.yaml
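
Before poking at the node, it's worth double-checking that both containers are actually up (the age will obviously differ):

kubectl get pod foo
NAME   READY   STATUS    RESTARTS   AGE
foo    2/2     Running   0          60s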

Inspecting Pod's containers

The Pod inspection should be done on the Kubernetes cluster node where this Pod is running. Once on that node, let's try finding the Pod's processes:

ps auxf
 PID  USER      ELAPSED  CMD
...
1748  root        01:16  /var/lib/rancher/k3s/data/ab2055bc72380bad965b219e8
1769  65535       01:15   \_ /pause
1801  root        01:15   \_ nginx: master process nginx -g daemon off;
1846  message+    01:15   |   \_ nginx: worker process
1847  message+    01:15   |   \_ nginx: worker process
1859  _apt        01:15   \_ /bin/sleep 3650d

Based on the similarity of the uptime and the common parent, the above three (top-level) processes were likely created during the Pod startup. That's interesting because in the manifest, only two containers, nginx and sleep, were requested.

It's possible to cross-check the above finding using the containerd command-line client called ctr:

sudo ctr --namespace=k8s.io containers ls
CONTAINER      IMAGE                                 RUNTIME
1c0d4c94188aa  docker.io/library/nginx:alpine        io.containerd.runc.v2
3f2b45521b479  docker.io/curlimages/curl:8.3.0       io.containerd.runc.v2
fe99217fab7c5  docker.io/rancher/mirrored-pause:3.6  io.containerd.runc.v2
...

Indeed, three new containers were created by the Kubernetes container runtime - nginx, sleep, and pause. At the same time, crictl - another command-line client compatible with any Kubernetes CRI runtime - shows just two containers:

sudo crictl ps
CONTAINER      IMAGE          ...  NAME      POD ID         POD
d74ad720df223  ead0a4a53df89       coredns   bf8d6b16f6c10  coredns-6799fbcd5-6xksq
3f2b45521b479  99f2927cb384d       sidecar   fe99217fab7c5  foo
1c0d4c94188aa  433dbc17191a7       app       fe99217fab7c5  foo

But notice how the POD ID field above matches the pause container ID from the ctr's output! Well, seems like the Pod has an auxiliary container. So, what is it for?

I'm not aware of any OCI Runtime Spec equivalent for Pods. So, when I'm not satisfied with the information provided by the Kubernetes API spec, I usually go directly to the Kubernetes Container Runtime Interface (CRI) protobuf file:

// kubelet expects any compatible container runtime
// to implement the following gRPC methods:

service RuntimeService {
    ...
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ...
}

message CreateContainerRequest {
    // ID of the PodSandbox in which the container should be created.
    string pod_sandbox_id = 1;
    // Config of the container.
    ContainerConfig config = 2;
    // Config of the PodSandbox. This is the same config that was passed
    // to RunPodSandboxRequest to create the PodSandbox. It is passed again
    // here just for easy reference. The PodSandboxConfig is immutable and
    // remains the same throughout the lifetime of the Pod.
    PodSandboxConfig sandbox_config = 3;
}

So, Pods are actually defined in terms of sandboxes and containers that can be started in these sandboxes. The sandbox manages the resources shared by all of the Pod's containers, and the pause container is started during the RunPodSandbox() call. A little bit of Internet searching reveals that this container has just an idling process inside.
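
The sandbox can also be listed with crictl directly - a quick sketch from my cluster node (output abridged); notice how the POD ID column again matches the pause container's ID from the ctr listing:

sudo crictl pods --name foo
POD ID         CREATED        STATE   NAME   NAMESPACE   ATTEMPT   RUNTIME
fe99217fab7c5  2 minutes ago  Ready   foo    default     0         (default)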

Inspecting Pod's namespaces

Here is what the (relevant) namespaces look like on the cluster node:

sudo lsns
        NS TYPE   NPROCS    PID   USER     COMMAND
...
4026532333 net         5   1769   65535    /pause
4026532397 mnt         1   1769   65535    /pause
4026532398 uts         5   1769   65535    /pause
4026532399 ipc         5   1769   65535    /pause
4026532400 pid         1   1769   65535    /pause
4026532401 mnt         3   1801   root     nginx: master process nginx -g daemon off;
4026532402 pid         3   1801   root     nginx: master process nginx -g daemon off;
4026532403 cgroup      3   1801   root     nginx: master process nginx -g daemon off;
4026532404 mnt         1   1859   _apt     /bin/sleep 3650d
4026532405 pid         1   1859   _apt     /bin/sleep 3650d
4026532406 cgroup      1   1859   _apt     /bin/sleep 3650d

Much like the Docker container in the first part, the pause container gets its own net, mnt, uts, ipc, and pid namespaces. But apparently, the nginx and sleep containers get just three namespaces each - mnt, pid, and cgroup. How come?

Turns out, lsns might not be the best tool to check a process's namespaces - it lists every namespace only once, attributing it to the lowest PID in it. Instead, to check the namespaces used by a certain process, the /proc/${PID}/ns directory can be examined:

# 1801 is the PID of the nginx container
sudo ls -l /proc/1801/ns
...
lrwxrwxrwx 1 root root 0 Sep 28 12:00 cgroup -> 'cgroup:[4026532403]'
lrwxrwxrwx 1 root root 0 Sep 28 12:00 ipc -> 'ipc:[4026532399]'
lrwxrwxrwx 1 root root 0 Sep 28 12:00 mnt -> 'mnt:[4026532401]'
lrwxrwxrwx 1 root root 0 Sep 28 12:00 net -> 'net:[4026532333]'
lrwxrwxrwx 1 root root 0 Sep 28 12:00 pid -> 'pid:[4026532402]'
lrwxrwxrwx 1 root root 0 Sep 28 12:00 uts -> 'uts:[4026532398]'
...
# 1859 is the PID of the sleep container
sudo ls -l /proc/1859/ns
...
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 cgroup -> 'cgroup:[4026532406]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 ipc -> 'ipc:[4026532399]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 mnt -> 'mnt:[4026532404]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 net -> 'net:[4026532333]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 pid -> 'pid:[4026532405]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 12:00 uts -> 'uts:[4026532398]'
...

While it might be tricky to notice, the nginx and sleep containers actually reuse the net, uts, and ipc namespaces of the pause container!

Again, this can be cross-checked with crictl:

# 1c0d4c94188aa is the ID of the nginx container
sudo crictl inspect 1c0d4c94188aa | jq .info.runtimeSpec.linux.namespaces
[
  {
    "type": "pid"
  },
  {
    "type": "ipc",
    "path": "/proc/1769/ns/ipc"
  },
  {
    "type": "uts",
    "path": "/proc/1769/ns/uts"
  },
  {
    "type": "mount"
  },
  {
    "type": "network",
    "path": "/proc/1769/ns/net"
  },
  {
    "type": "cgroup"
  }
]
# 3f2b45521b479 is the ID of the sleep container
sudo crictl inspect 3f2b45521b479 | jq .info.runtimeSpec.linux.namespaces
[
  {
    "type": "pid"
  },
  {
    "type": "ipc",
    "path": "/proc/1769/ns/ipc"
  },
  {
    "type": "uts",
    "path": "/proc/1769/ns/uts"
  },
  {
    "type": "mount"
  },
  {
    "type": "network",
    "path": "/proc/1769/ns/net"
  },
  {
    "type": "cgroup"
  }
]

I think the above findings perfectly explain the ability of containers in the same Pod (see the quick check right after this list):

  • To talk to each other
    • via localhost and/or
    • using IPC means (shared memory, message queues, etc.)
  • To have a shared hostname and domain name.
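
The localhost part is easy to verify right from the sidecar container - a quick check, assuming the foo Pod from above is still running:

kubectl exec foo -c sidecar -- curl -s http://localhost:80 | head -n 4
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>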

However, after seeing how all these namespaces are freely reused between containers, I started to suspect that the default boundaries can be shaken. And indeed, a more thorough read of the Pod API spec showed that with the shareProcessNamespace flag set to true, the Pod's containers will have four common namespaces instead of the default three. But there was a more shocking finding - hostIPC, hostNetwork, and hostPID flags can make the containers use the corresponding host's namespaces!
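
For illustration, here is a hypothetical variation of the pod.yaml from above with shareProcessNamespace enabled (a sketch - only the fields relevant to the experiment are kept, and the busybox ps applet is assumed to be present in the curl image):

kubectl delete pod foo
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  shareProcessNamespace: true
  containers:
    - name: app
      image: nginx:alpine
    - name: sidecar
      image: curlimages/curl:8.3.0
      command: ["/bin/sleep", "3650d"]
EOF

# With the shared pid namespace, the sidecar should now
# see the pause and nginx processes as well:
kubectl exec foo -c sidecar -- ps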

Interestingly, the CRI API spec seems to be even more flexible. At least syntactically, it allows scoping the net, pid, and ipc namespaces to either CONTAINER, POD, or NODE. So, hypothetically, a Pod where containers cannot talk to each other via localhost can be constructed 🙈

Inspecting Pod's cgroups

Ok, what's up with Pod's cgroups? systemd-cgls can nicely visualize the cgroups hierarchy:

sudo systemd-cgls --no-pager
Control group /:
-.slice
...
└─kubepods.slice
  └─kubepods-burstable.slice
    ...
    └─kubepods-burstable-pode88e5109_51f9_4f39_8d99_ada6fb281137.slice
      ├─cri-containerd-1c0d4c94188aa402db8751db1301de1d3adcc6739ee2ca78c6738273ee8251a7.scope …
      │ ├─1801 nginx: master process nginx -g daemon off;
      │ ├─1846 nginx: worker process
      │ └─1847 nginx: worker process
      ├─cri-containerd-fe99217fab7c597c40f10b8086dfa2978394e9942b53d66569441b5b7d7d4ea0.scope …
      │ └─1769 /pause
      └─cri-containerd-3f2b45521b479bc1c07a8eebf9409e244d5fbf9eb431156d250a37f492599604.scope …
        └─1859 /bin/sleep 3650d

Seems like the Pod itself gets a parent cgroup node, and every container's limits can be tweaked separately as well. This matches my expectations based on the fact that in the Pod manifest, resource limits can be set for every container in the Pod individually.
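
And indeed, the 256Mi limit requested for the app container can be found in its own cgroup subtree - a quick check using the (cluster-specific) slice and scope names from the listing above:

cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode88e5109_51f9_4f39_8d99_ada6fb281137.slice/cri-containerd-1c0d4c94188aa402db8751db1301de1d3adcc6739ee2ca78c6738273ee8251a7.scope/memory.max
268435456 # Exactly the 256Mi requested in the Pod manifest.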

At this moment, a Pod in my head looks something like this:

Kubernetes Pod visualized.

Implementing Pods with Docker

If a Pod under the hood is implemented as a bunch of semi-fused containers with a common cgroup parent, will it be possible to reproduce a Pod-like construct using Docker?

Recently, I tried doing something similar to make multiple containers listen on the same socket, and I know that Docker allows creating a container that reuses an existing network namespace with the docker run --network container:<other-container-name> syntax. But I also know that the OCI Runtime Spec defines only create and start commands. So, when you execute a command inside an existing container with docker exec <existing-container> <command>, you actually run, i.e., create then start, a completely fresh container that just happens to reuse all the namespaces of the target container (proofs 1 & 2). This makes me pretty confident that Pods can be reproduced using standard Docker commands.

Firstly, a parent cgroup entry needs to be configured. Luckily, these days, it's relatively easy to accomplish, thanks to systemd. For the sake of brevity, I'll configure only the cpu and memory cgroup controllers:

sudo tee /etc/systemd/system/mypod.slice <<EOF
[Unit]
Description=My Pod Slice

[Slice]
MemoryLimit=512M
CPUQuota=50%
EOF
sudo systemctl daemon-reload
sudo systemctl start mypod.slice

Check if the slice was actually created:

sudo systemd-cgls --no-pager --all
Control group /:
-.slice
...
├─mypod.slice
...

Secondly, a sandbox container needs to be created:

docker run -d --rm \
  --name mypod_sandbox \
  --cgroup-parent mypod.slice \
  --ipc 'shareable' \
  alpine sleep infinity

Lastly, we need to start the payload containers reusing the namespaces of the sandbox container:

# app (nginx)
docker run -d --rm \
  --name app \
  --cgroup-parent mypod.slice \
  --network container:mypod_sandbox \
  --ipc container:mypod_sandbox \
  nginx:alpine
# sidecar (sleep)
docker run -d --rm \
  --name sidecar \
  --cgroup-parent mypod.slice \
  --network container:mypod_sandbox \
  --ipc container:mypod_sandbox \
  curlimages/curl sleep 365d

Have you noticed which namespace I omitted? Right, I couldn't share the uts namespace between containers. Seems like this possibility is currently not exposed via the docker run command. Well, that's a pity, of course. But apart from the uts namespace, it's a success!

The cgroups look much like the ones created by Kubernetes itself:

sudo systemd-cgls --no-pager --all

This time, the slice will have several active processes listed:

Control group /:
-.slice
...
├─mypod.slice
│ ├─docker-575fd1bbc28340fbd37c35374dd4ef8a91d796cf4abc2e97eaac42981ae2058a.scope …
│ │ └─1480 sleep infinity
│ ├─docker-c36d2f83cf53ebe5354f7a6f60770b8728772e6c788979d8a35338da102a2fd6.scope …
│ │ └─1312 sleep infinity
│ ├─docker-48dff78e59361aea6876385aa0677c1ad949b0951cb97b9cf7d1e8fba991dc3e.scope …
│ │ └─1669 sleep 365d
│ └─docker-85b436943b55fdb4666d384711ad3577f41c0d03e58987c639633a35a37bacf4.scope …
│   ├─1599 nginx: master process nginx -g daemon off;
│   └─1635 nginx: worker process

The global list of namespaces also looks familiar:

sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532322 mnt         1  1480 root    sleep infinity
4026532323 uts         1  1480 root    sleep infinity
4026532324 ipc         4  1480 root    sleep infinity
4026532325 pid         1  1480 root    sleep infinity
4026532327 net         4  1480 root    sleep infinity
4026532385 cgroup      1  1480 root    sleep infinity
4026532386 mnt         2  1599 root    nginx: master process nginx -g daemon off;
4026532387 uts         2  1599 root    nginx: master process nginx -g daemon off;
4026532388 pid         2  1599 root    nginx: master process nginx -g daemon off;
4026532389 cgroup      2  1599 root    nginx: master process nginx -g daemon off;
4026532390 mnt         1  1669 _apt    sleep 365d
4026532391 uts         1  1669 _apt    sleep 365d
4026532392 pid         1  1669 _apt    sleep 365d
4026532393 cgroup      1  1669 _apt    sleep 365d

And the app (nginx) and sidecar (curl) containers seem to share the ipc and net namespaces:

# app container
sudo ls -l /proc/1599/ns
lrwxrwxrwx 1 root root 0 Sep 28 13:09 cgroup -> 'cgroup:[4026532389]'
lrwxrwxrwx 1 root root 0 Sep 28 13:09 ipc -> 'ipc:[4026532324]'
lrwxrwxrwx 1 root root 0 Sep 28 13:09 mnt -> 'mnt:[4026532386]'
lrwxrwxrwx 1 root root 0 Sep 28 13:09 net -> 'net:[4026532327]'
lrwxrwxrwx 1 root root 0 Sep 28 13:09 pid -> 'pid:[4026532388]'
lrwxrwxrwx 1 root root 0 Sep 28 13:09 uts -> 'uts:[4026532387]'
...
# sidecar container
sudo ls -l /proc/1669/ns
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 cgroup -> 'cgroup:[4026532393]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 ipc -> 'ipc:[4026532324]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 mnt -> 'mnt:[4026532390]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 net -> 'net:[4026532327]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 pid -> 'pid:[4026532392]'
lrwxrwxrwx 1 _apt messagebus 0 Sep 28 13:09 uts -> 'uts:[4026532391]'
...
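
And, much like in a real Pod, the sidecar can reach nginx via localhost - a final sanity check:

docker exec sidecar curl -s http://localhost:80 | head -n 4
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>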

Yay! We just (almost) created a Pod using only the standard docker run command 🎉

Summarizing

Containers and Pods are alike. Under the hood, they heavily rely on Linux namespaces and cgroups. However, Pods aren't just groups of containers. A Pod is a self-sufficient higher-level construct. All of a Pod's containers run on the same machine (cluster node), their lifecycles are synchronized, and the mutual isolation is weakened to simplify inter-container communication. This makes Pods much closer to traditional VMs, bringing back familiar deployment patterns like sidecar or client-side service proxy.
