How to use Spot to achieve Cost Savings with Stability on Kubernetes

Devtron
7 min read · May 13, 2021

by Prashant Ghildiyal, Co-Founder, Devtron Labs

Cost saving has always been an important objective for organizations, but it matters now more than ever. With uncertainty in the business environment, the earlier motto of growth at all costs has been replaced with responsible growth.

This post will focus on how you can leverage AWS spot instances for cost saving in Kubernetes clusters without compromising on stability.

You may be thinking this is trivial since Kubernetes supports it out of the box; hold that thought for a while. I promise you, by the end of this post, you will know the best possible way to use spot instances in Kubernetes clusters using the mechanisms Kubernetes provides.

AWS spot instances are usually available at around 10% of the cost of on-demand instances, but their reliability is lower. If the spot price goes above your bid price, AWS terminates the instance with only a 2-minute warning. We must therefore distribute the pods of our microservices judiciously across spot and on-demand instances.

Handling the termination notification and draining the node gracefully is also important for maintaining the SLA of your microservices, but it is beyond the scope of this article. We will cover that in a separate article.

How does autoscaling work in Kubernetes?

Before we go into details, let's understand the autoscaling of nodes in the Kubernetes cluster.

  • If the Kube scheduler cannot place a pod on any node, it marks the pod as unschedulable (an illustrative pod status is shown after this list).
  • The cluster autoscaler watches for unschedulable pods.
  • When the cluster autoscaler finds an unschedulable pod, it filters and prioritizes node groups to select one on which the pod can be scheduled.
  • The cluster autoscaler increases the desired instance count in the Auto Scaling Group (ASG) of the selected node group.
  • The ASG scales nodes based on its scaling strategy.
  • The Kube scheduler filters and prioritizes nodes and schedules the pod on the node with the highest priority score.
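For reference, a pod the scheduler could not place looks roughly like the minimal sketch below; the exact event message varies by cluster and is only illustrative here.

# Minimal sketch of an unschedulable pod's status: the PodScheduled
# condition is False with reason Unschedulable, which is what the
# cluster autoscaler watches for. The message text is illustrative.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."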

Without further ado, let's start our journey.

Spoiler Alert: First two attempts are failures, and the third attempt is successful.

Attempt 1

Based on my discussions, this is the second most popular approach to using spot instances with Kubernetes. It goes like this:

If nodes have the right spot-to-on-demand ratio, then pods will automatically have the right ratio.

Kops has supported mixed instance groups for AWS since version 1.14. Mixed instance groups can be used to achieve the right ratio of spot and on-demand instances.

Let’s look at a relevant part of a sample instance group configuration.

spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 30
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot

As per the above configuration, a minimum of 3 on-demand nodes will always be available. Beyond that base, 30% of the additional nodes will be on-demand and the remaining 70% will be spot.
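For context, a complete instance group using this policy would look roughly like the sketch below; the group name, sizes, instance types, cluster name, and subnet are illustrative assumptions, not part of the original configuration.

# Illustrative kops InstanceGroup sketch; names, sizes, subnets, and
# instance types are assumptions made for this example.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster.example.com
  name: mixed-nodes
spec:
  role: Node
  minSize: 3
  maxSize: 20
  machineType: m5.large
  subnets:
  - us-east-1a
  mixedInstancesPolicy:
    instances:
    - m5.large
    - m5a.large
    - m4.large
    onDemandBase: 3
    onDemandAboveBase: 30
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot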

For node affinity, the following is the relevant portion of the pod spec

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - Spot

Case 1: Scaling not Required

If the cluster can schedule the pod, the Kube scheduler will use its filter and priority algorithms to schedule the pod on the best possible node.

The priority algorithm of the Kube scheduler doesn’t differentiate between spot and on-demand nodes. Therefore, even though the nodes will be in an approximately 70:30 spot-to-on-demand ratio, the distribution of pods across these nodes may not follow this ratio.

Case 2: Scaling Required

If the cluster doesn’t have the capacity to schedule the pod, then the cluster autoscaler will increase the desired instance count in the ASG.

The ASG will then provision a new node such that the 70:30 spot-to-on-demand ratio is maintained.

After provisioning, the Kube scheduler will schedule the pod to the new node, assuming there were no pod evictions in between. So in case of a scaling event, the pod will be assigned to the right kind of node.

Can we do better?

We can use inter-pod anti-affinity for a better distribution of pods (see the sketch below), but it will still not guarantee a 70:30 distribution unless the number of pods equals the number of nodes.
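A minimal sketch of what that anti-affinity could look like, assuming the pods carry the label app: sample (the label and weight are assumptions for this example):

# Illustrative preferred pod anti-affinity: it spreads replicas of the
# "app: sample" pods across distinct nodes, but it does not enforce any
# particular spot-to-on-demand ratio.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: sample
          topologyKey: kubernetes.io/hostname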

Outcome: failure

Even though the nodes will have the desired spot-to-on-demand ratio, the pods may or may not be spread in this ratio, which can result in unstable services in case of a spot node outage.

Attempt 2

This is the most often cited approach to using spot instances in Kubernetes clusters.

Use node affinity to control the distribution of pods across spot and on-demand nodes.

For this to work, at least two node groups are needed: one with only spot instances and another with only on-demand instances.

Following are the relevant configurations from the two node groups; this can be done without a mixed instance node group.

For spot

spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot

Similarly, for on-demand

spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 100
  nodeLabels:
    lifecycle: OnDemand

Following is the relevant pod spec for node affinity

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - Spot
      - weight: 30
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - OnDemand

What does weight mean?

Weights of 70 and 30 do not mean the scheduler will distribute pods across these two node labels in a 70:30 ratio.

The scheduler adds the weight of each matching preference to the scores it computes using its other priority functions and assigns the pod to the node with the highest combined score.

Case 1: Scaling not required

If scaling is not required, the scheduler will prefer to place the pod on spot instances since they carry a weight of 70, though the actual placement also depends on the scores from the other priority functions used by the Kube scheduler.

Case 2: Scaling required

If the scaling of nodes is required to schedule the pod, then the cluster autoscaler will filter all node groups and prioritize the eligible node groups based on its priority algorithm.

The priority algorithm used by the cluster autoscaler is not the same as the Kube scheduler's. By default, it uses the random expander to pick one of the eligible node groups at random and then increases the desired instance count in that group's ASG (see the autoscaler snippet below).
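For reference, the expander is configured on the cluster autoscaler deployment; a minimal sketch is shown below, where the image tag and the exact set of flags are illustrative assumptions.

# Illustrative fragment of a cluster-autoscaler container spec on AWS.
# --expander=random is the default; least-waste, most-pods, and priority
# choose node groups differently, but none of them enforces a
# spot-to-on-demand ratio for pods.
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=random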

Once ASG has provisioned the node, the Kube scheduler will assign the pod to this new node.

Can we do better?

No. Pod anti-affinity will not help because the nodes themselves are not in the desired spot-to-on-demand ratio.

Outcome: failure

Neither the nodes nor the pods will have the desired spot-to-on-demand ratio. This turns out to be worse than attempt 1.

Attempt 3

This is the least-mentioned approach; it uses Pod Topology Spread Constraints, which were introduced in Kubernetes 1.16 and graduated to beta in 1.18. We will use pod topology spread constraints to control how pods are spread across the spot and on-demand instances in the cluster.

Following are the relevant configurations from the two node groups; again, this can be done without a mixed instance node group.

For spot

spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot

Similarly, for on-demand

spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 100
  nodeLabels:
    lifecycle: OnDemand

Following is the relevant portion of the pod spec for pod topology spread constraints.

metadata:
  labels:
    app: sample
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: lifecycle
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sample

How do topologySpreadConstraints work?

Topology spread constraints use node labels to identify the topology domain(s) of each node. topologyKey is the key of the node label to use. The Kube scheduler tries to place a balanced number of pods across all unique values of this node label (the topologyKey).

In our example, the topologyKey is lifecycle, which has two unique values: Spot and OnDemand. The Kube scheduler will place pods across nodes with these two values such that the difference in pod count between the two values is never more than maxSkew (1 in this case).

If this label were missing from a node group, the scheduler would not schedule the pod to that node group.

An important point to note is that maxSkew doesn’t favor any particular value of the topologyKey. The spread can skew in either direction based on the availability and priority of nodes, though the imbalance between label values will never exceed maxSkew.

When whenUnsatisfiable is set to DoNotSchedule, the Kube scheduler will not schedule the pod at all if placing it would violate maxSkew.
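The other allowed value is ScheduleAnyway, which turns the constraint into a soft preference; a minimal sketch of that variant (using the same labels as above) looks like this:

# Illustrative variant: with ScheduleAnyway the scheduler still tries to
# reduce skew, but it will place the pod even if maxSkew would be violated.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: lifecycle
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: sample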

Case 1: Scaling not required

If scaling is not required, the Kube scheduler will filter and prioritize nodes that honor maxSkew; pods will be scheduled in the desired ratio.

Case 2: Scaling required

When scaling is required, the cluster autoscaler will filter node groups that honor the topology spread constraints and increment the desired instance count in the related ASG.

After ASG has scaled the instance, the Kube scheduler will assign the pod to the node; therefore, pods will be scheduled in the desired ratio.

What’s the catch?

maxSkew is an absolute number, not a ratio, which means that when we use it with the HPA and the number of pods scales, the effective spot-to-on-demand ratio of pods will change.

maxSkew can be on either side; it is possible to have

  1. number of pods on spot = number of pods on ondemand + maxSkew
  2. number of pods on ondemand = number of pods on spot + maxSkew

This means that for a replica count of 5 and a maxSkew of 1, the spot-to-on-demand ratio of pods can be 3:2 or 2:3. For example, with 10 replicas and a maxSkew of 1, the only valid split is 5:5, nowhere near 70:30. The spread becomes more unpredictable as the value of maxSkew grows.

To achieve a more skewed ratio, it is better to create more buckets (more distinct values) for the topologyKey, as sketched below.
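A hedged sketch of that idea: label the node groups with three values of a hypothetical key capacity-bucket (two spot buckets and one on-demand bucket), so that an even spread across buckets approximates a 2:1 spot-to-on-demand ratio. The key and its values are assumptions for illustration, not from the original configuration.

# Illustrative topology spread over a hypothetical "capacity-bucket"
# node label with values spot-a, spot-b, and ondemand; an even spread
# across the three buckets gives roughly 2 spot pods for every
# on-demand pod.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-bucket
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sample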

Outcome: success

We cannot get an exact spot-to-on-demand ratio, but we do get a predictable ratio nonetheless.

The complete configuration of the samples used in this blog is available in this git repo.
