You are viewing documentation for Kubernetes version: v1.29

Kubernetes v1.29 documentation is no longer actively maintained. The version you are currently viewing is a static snapshot. For up-to-date information, see the latest version.

Image Filesystem: Configuring Kubernetes to store containers on a separate filesystem

By Kevin Hannon (Red Hat) | Tuesday, January 23, 2024

A common issue in running/operating Kubernetes clusters is running out of disk space. When the node is provisioned, you should aim to have a good amount of storage space for your container images and running containers. The container runtime usually writes to /var. This can be located as a separate partition or on the root filesystem. CRI-O, by default, writes its containers and images to /var/lib/containers, while containerd writes its containers and images to /var/lib/containerd.

In this blog post, we want to bring attention to ways that you can configure your container runtime to store its content separately from the default partition.
This allows for more flexibility in configuring Kubernetes and provides support for adding a larger disk for the container storage while keeping the default filesystem untouched.

One area that needs more explaining is where/what Kubernetes is writing to disk.

Understanding Kubernetes disk usage

Kubernetes has persistent data and ephemeral data. The base path for the kubelet and local Kubernetes-specific storage is configurable, but it is usually assumed to be /var/lib/kubelet. In the Kubernetes docs, this is sometimes referred to as the root or node filesystem. The bulk of this data can be categorized into:

ephemeral storage
logs
and container runtime

This is different from most POSIX systems as the root/node filesystem is not / but the disk that /var/lib/kubelet is on.

Ephemeral storage

Pods and containers can require temporary or transient local storage for their operation. The lifetime of the ephemeral storage does not extend beyond the life of the individual pod, and the ephemeral storage cannot be shared across pods.

Logs

By default, Kubernetes stores the logs of each running container, as files within /var/log. These logs are ephemeral and are monitored by the kubelet to make sure that they do not grow too large while the pods are running.

You can customize the log rotation settings for each node to manage the size of these logs, and configure log shipping (using a 3rd party solution) to avoid relying on the node-local storage.

Container runtime

The container runtime has two different areas of storage for containers and images.

read-only layer: Images are usually denoted as the read-only layer, as they are not modified when containers are running. The read-only layer can consist of multiple layers that are combined into a single read-only layer. There is a thin layer on top of containers that provides ephemeral storage for containers if the container is writing to the filesystem.
writeable layer: Depending on your container runtime, local writes might be implemented as a layered write mechanism (for example, overlayfs on Linux or CimFS on Windows). This is referred to as the writable layer. Local writes could also use a writeable filesystem that is initialized with a full clone of the container image; this is used for some runtimes based on hypervisor virtualisation.

The container runtime filesystem contains both the read-only layer and the writeable layer. This is considered the imagefs in Kubernetes documentation.

Container runtime configurations

CRI-O

CRI-O uses a storage configuration file in TOML format that lets you control how the container runtime stores persistent and temporary data. CRI-O utilizes the storage library.
Some Linux distributions have a manual entry for storage (man 5 containers-storage.conf). The main configuration for storage is located in /etc/containers/storage.conf and one can control the location for temporary data and the root directory.
The root directory is where CRI-O stores the persistent data.

[storage]
# Default storage driver
driver = "overlay"
# Temporary storage location
runroot = "/var/run/containers/storage"
# Primary read/write location of container storage 
graphroot = "/var/lib/containers/storage"

graphroot
- Persistent data stored from the container runtime
- If SELinux is enabled, this must match the /var/lib/containers/storage
runroot
- Temporary read/write access for container
- Recommended to have this on a temporary filesystem

Here is a quick way to relabel your graphroot directory to match /var/lib/containers/storage:

semanage fcontext -a -e /var/lib/containers/storage <YOUR-STORAGE-PATH>
restorecon -R -v <YOUR-STORAGE-PATH>

containerd

The containerd runtime uses a TOML configuration file to control where persistent and ephemeral data is stored. The default path for the config file is located at /etc/containerd/config.toml.

The relevant fields for containerd storage are root and state.

root
- The root directory for containerd metadata
- Default is /var/lib/containerd
- Root also requires SELinux labels if your OS requires it
state
- Temporary data for containerd
- Default is /run/containerd

Kubernetes node pressure eviction

Kubernetes will automatically detect if the container filesystem is split from the node filesystem. When one separates the filesystem, Kubernetes is responsible for monitoring both the node filesystem and the container runtime filesystem. Kubernetes documentation refers to the node filesystem and the container runtime filesystem as nodefs and imagefs. If either nodefs or the imagefs are running out of disk space, then the overall node is considered to have disk pressure. Kubernetes will first reclaim space by deleting unusued containers and images, and then it will resort to evicting pods. On a node that has a nodefs and an imagefs, the kubelet will garbage collect unused container images on imagefs and will remove dead pods and their containers from the nodefs. If there is only a nodefs, then Kubernetes garbage collection includes dead containers, dead pods and unused images.

Kubernetes allows more configurations for determining if your disk is full.
The eviction manager within the kubelet has some configuration settings that let you control the relevant thresholds. For filesystems, the relevant measurements are nodefs.available, nodefs.inodesfree, imagefs.available, and imagefs.inodesfree. If there is not a dedicated disk for the container runtime then imagefs is ignored.

Users can use the existing defaults:

memory.available < 100MiB
nodefs.available < 10%
imagefs.available < 15%
nodefs.inodesFree < 5% (Linux nodes)

Kubernetes allows you to set user defined values in EvictionHard and EvictionSoft in the kubelet configuration file.

EvictionHard: defines limits; once these limits are exceeded, pods will be evicted without any grace period.
EvictionSoft: defines limits; once these limits are exceeded, pods will be evicted with a grace period that can be set per signal.

If you specify a value for EvictionHard, it will replace the defaults.
This means it is important to set all signals in your configuration.

For example, the following kubelet configuration could be used to configure eviction signals and grace period options.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
    memory.available:  "100Mi"
    nodefs.available:  "10%"
    nodefs.inodesFree: "5%"
    imagefs.available: "15%"
    imagefs.inodesFree: "5%"
evictionSoft:
    memory.available:  "100Mi"
    nodefs.available:  "10%"
    nodefs.inodesFree: "5%"
    imagefs.available: "15%"
    imagefs.inodesFree: "5%"
evictionSoftGracePeriod:
    memory.available:  "1m30s"
    nodefs.available:  "2m"
    nodefs.inodesFree: "2m"
    imagefs.available: "2m"
    imagefs.inodesFree: "2m"
evictionMaxPodGracePeriod: 60s

Problems

The Kubernetes project recommends that you either use the default settings for eviction or you set all the fields for eviction. You can use the default settings or specify your own evictionHard settings. If you miss a signal, then Kubernetes will not monitor that resource. One common misconfiguration administrators or users can hit is mounting a new filesystem to /var/lib/containers/storage or /var/lib/containerd. Kubernetes will detect a separate filesystem, so you want to make sure to check that imagefs.inodesfree and imagefs.available match your needs if you've done this.

Another area of confusion is that ephemeral storage reporting does not change if you define an image filesystem for your node. The image filesystem (imagefs) is used to store container image layers; if a container writes to its own root filesystem, that local write doesn't count towards the size of the container image. The place where the container runtime stores those local modifications is runtime-defined, but is often the image filesystem. If a container in a pod is writing to a filesystem-backed emptyDir volume, then this uses space from the nodefs filesystem. The kubelet always reports ephemeral storage capacity and allocations based on the filesystem represented by nodefs; this can be confusing when ephemeral writes are actually going to the image filesystem.

Future work

To fix the ephemeral storage reporting limitations and provide more configuration options to the container runtime, SIG Node are working on KEP-4191. In KEP-4191, Kubernetes will detect if the writeable layer is separated from the read-only layer (images). This would allow us to have all ephemeral storage, including the writeable layer, on the same disk as well as allowing for a separate disk for images.

Getting involved

If you would like to get involved, you can join Kubernetes Node Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our #sig-node Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

Special thanks to all the contributors who provided great reviews, shared valuable insights or suggested the topic idea.

Peter Hunt
Mrunal Patel
Ryan Phillips
Gaurav Singh

←Previous