
Kubernetes Nodes need occasional maintenance. You could be updating the Node’s kernel, resizing its compute resources in your cloud account, or replacing physical hardware components in a self-hosted installation.

Kubernetes cordons and drains are two mechanisms you can use to safely prepare for Node downtime. They allow workloads running on a target Node to be rescheduled onto other ones. You can then shut down the Node or remove it from your cluster without impacting service availability.

Applying a Node Cordon

Cordoning a Node marks it as unavailable to the Kubernetes scheduler. The Node will be ineligible to host any new Pods subsequently added to your cluster.

Use the kubectl cordon command to place a cordon around a named Node:

$ kubectl cordon node-1
node/node-1 cordoned

Existing Pods already running on the Node won’t be affected by the cordon. They’ll remain accessible and will still be hosted by the cordoned Node.

You can check which of your Nodes are currently cordoned with the get nodes command:

$ kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
node-1     Ready,SchedulingDisabled   control-plane,master   26m   v1.23.3

Cordoned Nodes appear with the SchedulingDisabled status.
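Behind the scenes, a cordon simply sets the Node’s spec.unschedulable field to true. As a sketch, here’s how you could pick out cordoned Nodes from the JSON that kubectl get nodes -o json produces (the field names follow the real Node API schema; the sample data below is invented for illustration):

```python
# Sketch: finding cordoned Nodes in `kubectl get nodes -o json` output.
# A cordon sets spec.unschedulable to true on the Node object.
# The node list below is made-up sample data, not real cluster output.
nodes = {
    "items": [
        {"metadata": {"name": "node-1"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "node-2"}, "spec": {}},
    ]
}

cordoned = [
    n["metadata"]["name"]
    for n in nodes["items"]
    if n["spec"].get("unschedulable", False)
]
print(cordoned)  # ['node-1']
```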

Draining a Node

The next step is to drain remaining Pods out of the Node. This procedure will evict the Pods so they’re rescheduled onto other Nodes in your cluster. Pods are allowed to gracefully terminate before they’re forcefully removed from the target Node.

Run kubectl drain to initiate a drain procedure. Specify the name of the Node you’re taking out for maintenance:

$ kubectl drain node-1
node/node-1 already cordoned
evicting pod kube-system/storage-provisioner
evicting pod default/nginx-7c658794b9-zszdd
evicting pod kube-system/coredns-64897985d-dp6lx
pod/storage-provisioner evicted
pod/nginx-7c658794b9-zszdd evicted
pod/coredns-64897985d-dp6lx evicted
node/node-1 evicted

The drain procedure first cordons the Node if you’ve not already placed one manually. It will then evict running Kubernetes workloads by safely rescheduling them to other Nodes in your cluster.
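Under the hood, drain asks the API server to remove each Pod via the Eviction subresource rather than deleting it directly; this is what lets Pod disruption budgets veto the operation. As a sketch, the Eviction object for one of the Pods in the output above would look like this:

```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: nginx-7c658794b9-zszdd
  namespace: default
```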

You can shut down or destroy the Node once the drain’s completed: it’s been freed of its responsibilities to your cluster. The cordon ensures no new workloads are scheduled onto the Node after the drain completes.

Ignoring Pod Grace Periods

Drains can sometimes take a while to complete if your Pods have long grace periods. This might not be ideal when you need to urgently take a Node offline. Use the --grace-period flag to override Pod termination grace periods and force an immediate eviction:

$ kubectl drain node-1 --grace-period 0

This should be used with care: some workloads might not respond well if they’re stopped without being offered a chance to clean up.

Solving Drain Errors

Drains can sometimes result in an error depending on the types of Pod that exist in your cluster. Here are two common issues with their resolutions.

1. “Cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, or StatefulSet”

This message appears if the Node hosts Pods which aren’t managed by a controller. It refers to Pods that have been created as standalone objects, where they’re not part of a higher-level resource like a Deployment or ReplicaSet.

Kubernetes can’t automatically reschedule these “bare” Pods so evicting them will cause them to become unavailable. Either manually address these Pods before performing the drain or use the --force flag to permit their deletion:

$ kubectl drain node-1 --force
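How drain distinguishes these “bare” Pods can be sketched in a few lines: a Pod counts as unmanaged when nothing in its metadata.ownerReferences claims to be its controller. The field names below follow the real Pod API schema; the sample Pods are invented for illustration:

```python
# Sketch: a Pod is "bare" when no entry in metadata.ownerReferences has
# controller: true. Pods created by Deployments, ReplicaSets, Jobs, etc.
# always carry such a reference.
def is_bare_pod(pod):
    refs = pod.get("metadata", {}).get("ownerReferences", [])
    return not any(ref.get("controller", False) for ref in refs)

managed = {"metadata": {"ownerReferences": [{"kind": "ReplicaSet", "controller": True}]}}
bare = {"metadata": {"name": "standalone-pod"}}

print(is_bare_pod(managed))  # False
print(is_bare_pod(bare))     # True
```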

2. “Cannot Delete DaemonSet-managed Pods”

Pods that are part of DaemonSets pose a challenge to evictions. DaemonSet controllers disregard the schedulable status of your Nodes. Deleting a Pod that’s part of a DaemonSet will cause it to immediately return, even if you’ve cordoned the Node. Drain operations consequently abort with an error to warn you about this behavior.

You can proceed with the eviction by adding the --ignore-daemonsets flag. This will evict everything else while overlooking any DaemonSets that exist.

$ kubectl drain node-1 --ignore-daemonsets

You might need to use this flag even if you’ve not created any DaemonSets yourself. Internal components within the kube-system namespace could be using DaemonSet resources.

Minimizing Downtime With Pod Disruption Budgets

Draining a Node doesn’t guarantee your workloads will remain accessible throughout. Your other Nodes will need time to honor scheduling requests and create new containers.

This can be particularly impactful if you’re draining multiple Nodes in a short space of time. Draining the first Node could reschedule its Pods onto the second Node, which is itself then deleted.

Pod disruption budgets are a mechanism for avoiding this situation. You can use them with Deployments, ReplicationControllers, ReplicaSets, and StatefulSets.

Objects that are targeted by a Pod disruption budget are guaranteed to have a specific number of accessible Pods at any given time. Kubernetes will block Node drains that would cause the number of available Pods to fall too low.

Here’s an example of a PodDisruptionBudget YAML object:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: my-app

This policy requires there be at least four running Pods with the app=my-app label. Node drains that would leave only three Pods available will be blocked.

The level of disruption allowed is expressed with either the maxUnavailable or the minAvailable field; only one of the two can be set in a single PodDisruptionBudget object. Each accepts an absolute number of Pods or a percentage relative to the total number of Pods at full availability:

  • minAvailable: 4 – Require at least four Pods to be available.
  • maxUnavailable: 50% – Allow up to half of the total number of Pods to be unavailable.
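As a rough sketch of the arithmetic (a simplification of Kubernetes’ actual controller logic, including how percentages are rounded), either field can be translated into the number of evictions a drain is allowed to perform right now:

```python
import math

# Simplified sketch of how a disruption budget limits evictions.
# 'expected' is the Pod count at full availability; 'healthy' is how
# many Pods are currently available. Rounding here approximates, but is
# not guaranteed to match, Kubernetes' exact behavior.
def allowed_disruptions(expected, healthy, min_available=None, max_unavailable=None):
    if min_available is not None:
        if isinstance(min_available, str) and min_available.endswith("%"):
            # Percentages are resolved against the expected Pod count.
            min_available = math.ceil(expected * int(min_available[:-1]) / 100)
        return max(0, healthy - min_available)
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        max_unavailable = math.floor(expected * int(max_unavailable[:-1]) / 100)
    return max(0, max_unavailable - (expected - healthy))

print(allowed_disruptions(5, 5, min_available=4))        # 1
print(allowed_disruptions(6, 6, max_unavailable="50%"))  # 3
print(allowed_disruptions(5, 4, min_available=4))        # 0 - drain blocked
```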

Overriding Pod Disruption Budgets

Pod disruption budgets exist to protect your workloads. They shouldn’t be overridden unless you must immediately shut down a Node. The drain command’s --disable-eviction flag provides a way to achieve this.

$ kubectl drain node-1 --disable-eviction

This circumvents the regular Pod eviction process. Pods will be directly deleted instead, ignoring any applied disruption budgets.

Bringing Nodes Back Up

Once you’ve completed your maintenance, you can power the Node back up to reconnect it to your cluster. You must then remove the cordon you created to mark the Node as schedulable again:

$ kubectl uncordon node-1
node/node-1 uncordoned

Kubernetes will begin to allocate new workloads to the Node, returning it to active service.


Maintenance of Kubernetes Nodes shouldn’t be attempted until you’ve drained existing workloads and established a cordon. These measures help you avoid unexpected downtime when servicing actively used Nodes.

Basic drains are often adequate if you’ve got capacity in your cluster to immediately reschedule your workloads to other Nodes. Use Pod disruption budgets in situations where consistent availability must be guaranteed. They let you guard against unintentional downtime when multiple drains are commenced concurrently.

James Walker
James Walker is a contributor to How-To Geek DevOps. He is the founder of Heron Web, a UK-based digital agency providing bespoke software development services to SMEs. He has experience managing complete end-to-end web development workflows, using technologies including Linux, GitLab, Docker, and Kubernetes.