Ultimate Guide Of Pod Eviction On Kubernetes

May 13, 2021

By Prashant Ghildiyal, Co-Founder, Devtron Labs

By nature, pods in Kubernetes clusters are ephemeral: they can be created, killed, moved around by the scheduler, or evicted. This can occasionally disrupt microservices if pods are not configured properly.

In this article, we will look at two scenarios in which eviction affects the stability of pods:

Pod preemption
Out of resource eviction

and at how we can protect our pods by configuring:

Quality of Service
Pod Priority

Quality of Service

There is no direct method to specify the Quality of Service (QoS) of pods. Kubernetes determines the quality of service based on the resource request and limit of the pods.

Each container can specify a request for a resource, which is the amount of that resource Kubernetes guarantees to the container, and a limit, which is the maximum amount of that resource Kubernetes allows the container to use.

Pod-level requests and limits are computed by summing the per-resource requests and limits across all containers in the pod. Kubernetes currently provides three QoS classes based on the pod-level requests and limits.

Guaranteed
Every container in the pod has CPU request and limit with request == limit
Every container in the pod has memory request and limit with request == limit

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-nginx
  namespace: demo
spec:
  containers:
  - name: guaranteed-nginx
    image: nginx
    # requests equal limits for both CPU and memory, so the pod is Guaranteed
    resources:
      limits:
        memory: "512Mi"
        cpu: "1024m"
      requests:
        memory: "512Mi"
        cpu: "1024m"

Burstable
At least one container has a memory or CPU request or limit
The pod does not meet the criteria for Guaranteed described above

apiVersion: v1
kind: Pod
metadata:
  name: burstable-nginx
  namespace: demo
spec:
  containers:
  - name: burstable-nginx
    image: nginx
    # memory request is lower than the limit and no CPU is set, so the pod is Burstable
    resources:
      limits:
        memory: "1024Mi"
      requests:
        memory: "512Mi"

Best Effort
None of the containers have any memory or CPU request or limit

apiVersion: v1
kind: Pod
metadata:
  name: besteffort-nginx
  namespace: demo
spec:
  containers:
  # no resources section at all, so the pod is BestEffort
  - name: besteffort-nginx
    image: nginx
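
Because the class is derived from every container in the pod, a single container without requests or limits is enough to change the result. The following is a minimal sketch, assuming a hypothetical pod named mixed-nginx with an illustrative busybox sidecar, in a namespace that has no LimitRange injecting default values. The nginx container alone would qualify for Guaranteed, but the sidecar sets nothing, so the pod as a whole is classified as Burstable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mixed-nginx          # hypothetical name, for illustration only
  namespace: demo
spec:
  containers:
  - name: nginx
    image: nginx
    resources:               # request == limit for CPU and memory
      limits:
        memory: "512Mi"
        cpu: "500m"
      requests:
        memory: "512Mi"
        cpu: "500m"
  - name: sidecar            # illustrative sidecar, not from the article
    image: busybox
    command: ["sleep", "3600"]
    # no resources section: the pod no longer meets the Guaranteed criteria
```

Kubernetes records the class it assigned in the pod's status.qosClass field, which can be read with kubectl get pod mixed-nginx -n demo -o jsonpath='{.status.qosClass}'.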

Pod Priority

Kubernetes exposes two pod spec fields, priority and priorityClassName, to define the priority of pods. Priority is used together with preemptionPolicy, which can be set to Never or PreemptLowerPriority (the default).

A pod with higher priority is placed ahead of lower priority pods in the scheduling queue. If preemptionPolicy is set to PreemptLowerPriority and no node satisfies the requirements of the pod, the scheduler evicts lower priority pods to make room for it.

PriorityClass config with preemption disabled

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
   name: high-priority
preemptionPolicy: Never
value: 1000000
globalDefault: false
description: "This priority class will not preempt other pods."

Pod with priority class

apiVersion: v1
kind: Pod
metadata:
   name: nginx
   labels:
     env: demo
spec:
   containers:
   - name: nginx
     image: nginx
     imagePullPolicy: IfNotPresent
   priorityClassName: high-priority

How do Quality of Service (QoS) and pod priority relate to the stability of pods?

Let’s analyze the role of Quality of Service and Pod Priority concerning the stability of pods for preemption and eviction.

Pod preemption

When pods are created, they are placed into the scheduling queue based on their priority. The scheduler picks up a pod for scheduling and filters nodes based on the requirements specified by the pod. If the scheduler cannot find any suitable node for the pod, then preemption logic is invoked for the pending pod provided preemptionPolicy for the pending pod is not set to Never.

Preemption logic tries to find nodes with lower priority pods than the pending pod so that the pending pod can be scheduled on this node after removing low priority pods.

Quality of Service doesn’t have any impact on pod preemption; it is affected by the pod priority and preemptionPolicy.

Limitations of pod preemption

Setting up priority for the first time on an existing cluster

When you set up pod priority for the first time, you must start with the highest priority pods or keep preemptionPolicy as Never. The default priority of a pod is 0, so if you assign a priority to a low priority pod first, it may preempt a critical pod that does not yet have a priority set and cause an outage.
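
As a sketch of the safer rollout described above, a class intended for low priority workloads can be created with preemption disabled. The name batch-low and the value 100 are assumptions chosen for illustration. Note that any positive value still ranks above pods that have no priority set, which is why preemptionPolicy matters here.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low            # hypothetical name
value: 100                   # illustrative; still higher than the default priority of 0
preemptionPolicy: Never      # pods of this class queue ahead of priority-0 pods but never evict them
globalDefault: false
description: "Low priority workloads that must not preempt pods without a priority."
```

Once the critical workloads have their own, higher priority classes, the preemptionPolicy of the lower classes can be revisited.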

Grafana Labs faced an outage of about 30 minutes, described in a blog post, which was attributed to applying pod priorities in the wrong order.

PodDisruptionBudget (PDB) is not guaranteed

During preemption, PDBs are honored only on a best-effort basis. The scheduler tries to find a node where evicting lower priority pods will not violate a PDB. If it cannot find such a node, it evicts low priority pods to schedule the high priority pod anyway, even if the eviction violates a PDB and results in an outage.

So, what purpose is served by PodDisruptionBudget?

PodDisruptionBudget comes into the picture during voluntary disruptions, for example a node drain or a downscale during cluster autoscaling. A PodDisruptionBudget limits the number of pods of an application that can be down simultaneously, thereby ensuring the quality of service is not impacted.
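
A minimal PodDisruptionBudget might look like the following sketch. The name nginx-pdb, the label app: nginx, and the minAvailable value are assumptions for illustration; the selector must match the labels your pods actually carry.

```yaml
apiVersion: policy/v1        # use policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb            # hypothetical name
  namespace: demo
spec:
  minAvailable: 2            # illustrative: keep at least 2 matching pods running during voluntary disruptions
  selector:
    matchLabels:
      app: nginx             # assumed label on the application's pods
```

With this in place, a node drain waits rather than taking the application below two running pods.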

Affinity with low priority pod

If a high priority pod (H) has inter-pod affinity with a lower priority pod (L), the scheduler may end up evicting L from the node to make space for H. If that happens, the inter-pod affinity is no longer satisfied and H cannot be scheduled on this node. This loop can repeat and harm the availability of services.

You can avoid this by ensuring that a pod with preemptionPolicy PreemptLowerPriority has inter-pod affinity only with pods of equal or higher priority.
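
To make the scenario concrete, here is a sketch of a pod H that requires co-location with another pod. The class name critical-preempting and the label app: cache are hypothetical. The point is only that the pods selected by the affinity rule should themselves run with an equal or higher priority class.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                  # hypothetical pod H
spec:
  priorityClassName: critical-preempting   # assumed class with preemptionPolicy PreemptLowerPriority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache       # assumed label of pod L; give these pods equal or higher priority
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx
```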

Preemption may not follow strict priority order

To run a pending pod, the scheduler tries to select the node whose preemption victims have the lowest priority. If the pending pod still cannot run on that node after the eviction, or if those pods are protected by a PodDisruptionBudget, the scheduler may select another node and evict higher priority pods instead. These pods have a higher priority than the candidates on the other nodes, but still a lower priority than the pending pod.

Best practices for pod preemption

Always use a PriorityClass and never set priority directly
Don’t create too many levels of priorities
Set preemptionPolicy to PreemptLowerPriority only for critical services, and use system-cluster-critical or system-node-critical as their priorityClassName

Out of resource eviction

In over-committed nodes, pods will be killed if the system runs out of resources. Kubelet proactively monitors compute resources for eviction. It supports eviction decisions based on incompressible resources, namely

memory.available
nodefs.available
nodefs.inodesFree
imagefs.available
imagefs.inodesFree

Eviction doesn’t happen if pressure is on compressible resources, e.g., CPU.

Kubernetes allows us to define two thresholds to control the eviction policy of the pods.

Soft eviction threshold

If a soft eviction threshold is reached and stays exceeded for the configured soft eviction grace period, pods are evicted gracefully. The termination grace period given to each pod is the minimum of the pod's own termination grace period and the maximum pod grace period configured for eviction. If that maximum is not specified, pods are killed immediately.

Hard eviction threshold

If hard eviction threshold is reached, then pods are evicted immediately without any grace period.
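
Both thresholds are set in the kubelet configuration. The following sketch uses the KubeletConfiguration fields evictionSoft, evictionSoftGracePeriod, evictionMaxPodGracePeriod, and evictionHard. All values are illustrative assumptions, not recommendations, and should be tuned to the size of your nodes.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:                    # soft thresholds: eviction starts only after the grace period below
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:         # how long a soft threshold must stay exceeded
  memory.available: "1m30s"
  nodefs.available: "2m"
evictionMaxPodGracePeriod: 60    # upper bound (seconds) on pod termination grace during soft eviction
evictionHard:                    # hard thresholds: pods are evicted immediately
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```

Keeping the soft thresholds above the hard ones gives pods a chance to shut down cleanly before the kubelet has to kill them without a grace period.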

Eviction policy

In the case of imagefs or nodefs pressure, the kubelet sorts pods by their disk usage: local volumes + logs + writable layers of all containers.

In the case of memory pressure, pods are sorted first by whether their memory usage exceeds their request, then by pod priority, and then by memory consumption relative to memory requests. Pods that don’t exceed their memory requests are evicted last. As a result, a lower priority pod that stays within its memory request will generally outlive a higher priority pod that exceeds its memory request.

Best practices for out of resource eviction

Always define memory requests and limits for pods
For critical pods, over-provision the resource request and set the request equal to the limit, so that the pods get Guaranteed QoS and are the last candidates for eviction under memory pressure
For non-critical pods, keep resource requests at 80–90% of the limit. This allows Kubernetes to oversubscribe nodes and provides a good trade-off between cost and QoS.
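
As an illustration of the last recommendation, the sketch below sets requests to roughly 80–90% of the limits for a non-critical pod. The pod name and the concrete numbers are assumptions. Because requests are lower than limits, the pod is classified as Burstable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: noncritical-nginx    # hypothetical name
  namespace: demo
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "512Mi"
        cpu: "500m"
      requests:
        memory: "448Mi"      # 87.5% of the memory limit
        cpu: "400m"          # 80% of the CPU limit
```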

Node Out of Memory (OOM) kill

If a node experiences OOM behavior before Kubelet can reclaim memory, the node depends on oom_killer to respond.

oom_killer calculates oom_score such that containers with the lowest quality of service that are consuming the largest amount of memory relative to memory requests should be killed first.

Unlike an evicted pod, an OOM-killed container may be restarted by the kubelet, depending on the pod's restart policy.