Kubernetes is a distributed system that's designed to scale replicas of your services across multiple physical environments. In many cases this works well out-of-the-box. The Kubernetes scheduler automatically places your Pods (container instances) onto Nodes (worker machines) that have enough resources to support them.

Despite its best efforts, sometimes the scheduler won't select a plan you agree with. You might want Pods to be colocated if they'll be regularly communicating over the network; alternatively, some compute-intensive Pods might be best allocated to separate Nodes wherever possible.

Kubernetes has several mechanisms which let you guide the scheduler's decision-making process so Pods end up on particular Nodes. In this article, we'll focus specifically on the "affinity" and "anti-affinity" concepts that give you granular control of scheduling. Affinities define rules that either must or should be met before a Pod can be allocated to a Node.

How Does Affinity Work?

Affinities are used to express Pod scheduling constraints that can match characteristics of candidate Nodes and the Pods that are already running on those Nodes. A Pod that has an "affinity" to a given Node is more likely to be scheduled to it; conversely, an "anti-affinity" makes it less likely to be scheduled there. The overall balance of these weights is used to determine the final placement of each Pod.

Affinity assessments can produce either hard or soft outcomes. A "hard" result means the Node must have the characteristics defined by the affinity expression. "Soft" affinities act as a preference, indicating to the scheduler that it should use a Node with the characteristics if one is available. A Node that doesn't meet the condition will still be selected if necessary.

Types of Affinity Condition

There are currently two different kinds of affinity that you can define:

  • Node Affinity - Used to constrain the Nodes that can receive a Pod by matching labels of those Nodes. Node Affinity can only be used to set positive affinities that attract Pods to the Node.
  • Inter-Pod Affinity - Used to constrain the Nodes that can receive a Pod by matching labels of the existing Pods already running on each of those Nodes. Inter-Pod Affinity can be either an attracting affinity or a repelling anti-affinity.

In the simplest possible example, a Pod that includes a Node Affinity condition of label=value will only be scheduled to Nodes with a label=value label. A Pod with the same condition defined as an Inter-Pod Affinity will be scheduled to a Node that already hosts a Pod with a label=value label.
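
To make the distinction concrete, here is a minimal sketch of both condition types for a hypothetical label=value pair. The field names are covered in the sections below, and the kubernetes.io/hostname topology key used here is one of Kubernetes' built-in Node labels; treat this as an illustrative fragment rather than a complete manifest:

# Sketch only: "label" and "value" are placeholder names
affinity:
  nodeAffinity:
    # matches a label on candidate Nodes themselves
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: label
              operator: In
              values:
                - value
  podAffinity:
    # matches a label on Pods already running on candidate Nodes
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: label
              operator: In
              values:
                - value
        topologyKey: kubernetes.io/hostname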

Setting Node Affinities

Node Affinity has two distinct sub-types:

  • requiredDuringSchedulingIgnoredDuringExecution - This is the "hard" affinity matcher that requires the Node to meet the constraints you define.
  • preferredDuringSchedulingIgnoredDuringExecution - This is the "soft" matcher that expresses a preference, which is ignored when it can't be fulfilled.

The IgnoredDuringExecution part of these verbose names makes it explicit that affinity is only considered while scheduling Pods. Once a Pod has made it onto a Node, affinity isn't re-evaluated: changes to the Node won't cause a Pod eviction due to changed affinity values. A future Kubernetes release could add support for this behavior via the reserved requiredDuringSchedulingRequiredDuringExecution phrase.

Node affinities are attached to Pods via their spec.affinity.nodeAffinity manifest field:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      # ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hardware-class
                operator: In
                values:
                  - a
                  - b
                  - c
              - key: internal
                operator: Exists

This manifest creates a hard affinity rule that schedules the Pod to a Node meeting both of the following criteria:

  • It has a hardware-class label with either a, b, or c as the value.
  • It has an internal label with any value.
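
For this rule to ever match, at least one Node in the cluster has to carry both labels. As a hypothetical example (the Node name and values below are made up), a qualifying Node's metadata might look like this:

apiVersion: v1
kind: Node
metadata:
  name: worker-1         # hypothetical Node name
  labels:
    hardware-class: b    # any of a, b, or c satisfies the In expression
    internal: "true"     # the Exists expression only requires the key to be present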

You can attach additional conditions by adding more expressions to the matchExpressions list. Supported operators for value comparisons are In, NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than).

The expressions grouped under a single matchExpressions clause are combined with a boolean AND: they all need to match for a Pod to gain affinity to a particular Node. You can also list several matchExpressions clauses under nodeSelectorTerms; each one is a separate term, and the terms are combined as a logical OR, so a Node only has to satisfy one of them. You can assemble complex scheduling criteria by using both of these structures together.
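
To illustrate the OR behavior, the two expressions from the manifest above could instead be split into separate nodeSelectorTerms entries; a Node would then only need to satisfy one of them:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: Nodes with an eligible hardware-class label
        - matchExpressions:
            - key: hardware-class
              operator: In
              values:
                - a
                - b
                - c
        # Term 2 (OR): Nodes that carry the internal label at all
        - matchExpressions:
            - key: internal
              operator: Exists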

"Soft" scheduling preferences are set up in a similar way. Use
nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution instead of, or as well as, requiredDuringSchedulingIgnoredDuringExecution to configure these. Define each of your optional constraints as a matchExpressions clause within a preference field:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      # ...
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware-class
                operator: In
                values:
                  - a
                  - b
                  - c

Preference-based rules have an additional field called weight that accepts an integer from 1 to 100. Each Node that matches a preference has its total affinity weight incremented by the set amount; the Node that ends up with the highest overall weight will be allocated the Pod.
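
For example, a Pod could express several preferences with different weights. The sketch below uses made-up weights with the labels from the earlier examples; a Node matching both preferences contributes 80 to the scheduler's affinity scoring for this Pod, while one matching only the second contributes 30:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
            - key: hardware-class
              operator: In
              values:
                - a
      - weight: 30
        preference:
          matchExpressions:
            - key: internal
              operator: Exists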

Setting Inter-Pod Affinities

Inter-Pod Affinities work very similarly to Node Affinities but have some important differences. The "hard" and "soft" modes are indicated using the same requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution fields. These should be nested under the spec.affinity.podAffinity or spec.affinity.podAntiAffinity field depending on whether you want to increase or reduce the Pod's affinity upon a successful match.

Here's a simple example that demonstrates both affinity and anti-affinity:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      # ...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: hardware-class
                operator: In
                values:
                  - a
                  - b
                  - c
          topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app-component
                  operator: In
                  values:
                    - background-worker
            topologyKey: topology.kubernetes.io/zone

The format differs slightly from Node Affinity. Each matchExpressions constraint needs to be nested under a labelSelector; for soft matches, this in turn should be located within a podAffinityTerm. Pod affinities also offer a reduced set of comparison operators: you can use In, NotIn, Exists, and DoesNotExist.

Pod affinities also need a topologyKey field. This groups Nodes by the value of the named label and applies the affinity to the whole group, rather than to individual Nodes, when the matchExpressions are evaluated. The rules above will schedule the Pod into a topology.kubernetes.io/zone that already hosts a Pod with the hardware-class label set to a, b, or c. Zones that also host a Pod with the app-component=background-worker label will be given a reduced affinity.
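
The choice of topologyKey sets how wide each matching domain is. The zone key above spreads the effect across every Node in a zone; swapping in the built-in kubernetes.io/hostname label would instead require colocation on the exact Node that runs a matching Pod. A sketch of that variation, using a hypothetical app-component=cache label:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app-component
              operator: In
              values:
                - cache
        # hostname domains contain a single Node, forcing same-Node colocation
        topologyKey: kubernetes.io/hostname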

Inter-Pod affinities are a powerful mechanism for controlling the colocation of Pods. However, they have a significant impact on performance: Kubernetes warns against using them in clusters with more than a few hundred Nodes. Each new Pod scheduling request needs to check every other Pod on all the other Nodes to assess compatibility.

Other Scheduling Constraints

While we've focused on affinities in this article, Kubernetes provides other scheduler constraint mechanisms too. These are typically simpler but less automated approaches that work well for smaller clusters and deployments.

The most basic constraint is the nodeSelector field. It's defined on Pods as a set of label key-value pairs that must exist on Nodes hosting the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: demo
      # ...
  nodeSelector:
    hardware-class: a
    internal: "true"

This manifest instructs Kubernetes to only schedule the Pod to Nodes with both the hardware-class: a and internal: "true" labels. Note that label values are always strings, so the boolean-looking value needs to be quoted in the manifest.

Node selection with the nodeSelector field is a good way to quickly scaffold static configuration based on long-lived attributes of your Nodes. The affinity system is much more flexible when you want to express complex rules and optional preferences.
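
For comparison, the nodeSelector constraints above could be expressed as an equivalent hard Node Affinity rule; starting from this form leaves room to later loosen either condition into a preference or add OR alternatives. This is a sketch reusing the same labels:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hardware-class
              operator: In
              values:
                - a
            - key: internal
              operator: In
              values:
                - "true"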

Conclusion

Affinities and anti-affinities are used to set up versatile Pod scheduling constraints in Kubernetes. Compared to other options like nodeSelector, affinities are more complex but give you many more ways to identify compatible Nodes.

Affinities can act as soft preferences that signal a Pod's "ideal" environment to Kubernetes, even if it can't be immediately satisfied. The system also has the unique ability to filter Nodes based on their existing workloads, so you can implement Pod colocation rules.

One final point to note is that affinity isn't the end of the scheduling process. A Pod with a strong computed affinity to a Node might still end up elsewhere because of Node taints. This mechanism lets you manage scheduling from the perspective of your Nodes: taints actively repel incoming Pods toward other Nodes, effectively the opposite of the magnetic attraction of affinities. Node selectors, affinities, taints, and tolerations are all balanced to determine the final in-cluster location of each new Pod.
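
As a rough sketch of that interplay (the dedicated=database taint, Node name, and Pod here are all hypothetical), a taint applied to a Node repels every Pod that doesn't declare a matching toleration, regardless of any affinity the Pod has computed for it:

apiVersion: v1
kind: Node
metadata:
  name: worker-1              # hypothetical Node name
spec:
  taints:
    - key: dedicated
      value: database
      effect: NoSchedule      # Pods without a matching toleration won't be scheduled here
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      # ...
  tolerations:
    - key: dedicated
      operator: Equal
      value: database
      effect: NoSchedule      # this Pod tolerates the taint, so its affinities apply as usual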