How To Scale in Kubernetes

One of the many selling points of Kubernetes is the ease with which you can scale your applications in a matter of seconds. For example, it took me less than 10 seconds to scale an nginx webserver from 1 to 10 replicas. However, before talking about scaling, let's first have a look at resource limits in Kubernetes.

Resource Limits

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
            limits:
              cpu: 500m
              memory: 200Mi

This is the manifest file for a deployment, where the image used is the standard nginx:latest one. However, you can see that in the spec for the nginx container, I added resource requests and limits. Requests are used for scheduling: the scheduler places the pod on a node that still has enough unreserved capacity to cover those requests. Limits, on the other hand, are "hard limits": if a container exceeds its memory limit, the kernel terminates the offending process (an OOM kill), while a container that hits its CPU limit is throttled rather than killed.

Something else you may not be totally familiar with is "100m" as a CPU spec. Here, CPU is allocated in milliCPUs: 1m is one-thousandth of a core, or vCPU, depending on your environment, so 100m is a tenth of a core. It is also an absolute quantity rather than a fraction of the machine, so 100m takes up the same slice of one core on a 2 core system as on a 128 core system. However, CPU orchestration is complicated and way beyond the scope of this blogpost.
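As a quick illustration, here's just the CPU request from the resources stanza above; the fractional form is another way of writing the same thing:

resources:
  requests:
    cpu: 100m   # milliCPU notation: a tenth of a core
    # the fractional form cpu: "0.1" means exactly the same thing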

Scaling

Alright, now that we understand the idea of resource limits in Kubernetes, I can start on scaling. The reason I covered resources first may not be immediately apparent, but it will make sense once we reach the autoscalers, which base their decisions on those resource figures.

You can do a simple scale on deployments. If you remember from the last blogpost, deployments wrap around replicasets, which control the number of replicas of a pod. This makes it very easy to scale up or down:
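For example, scaling the nginx-deployment from the manifest above up to 10 replicas (a quick sketch; the second command just confirms the new replica count):

kubectl scale deployment nginx-deployment --replicas=10
kubectl get deployment nginx-deployment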

See! It's as easy as kubectl scale deployment <deployment name> --replicas=<desired number>

Horizontal Pod Autoscaling

The Horizontal Pod Autoscaler (HPA) is one of the most powerful tools in Kubernetes, and it's something I sorely missed in Docker Swarm. As the name suggests, the HPA automatically scales the pods of a deployment (eg the nginx pods above) depending on load. In case you're unfamiliar, scaling horizontally means adding more servers of the same power, while scaling vertically means giving more compute power to the existing server.

The HPA is one of the factors which makes running Kubernetes a cost-efficient option, as not only will it scale up to match demand (eg peak hours) but it will scale down when there is less demand. This means you're only using the resources you need at a given time, and you waste less compute on resources which would otherwise remain underutilised or idle.

The HPA is fairly easy to set up too.

kubectl autoscale deployment nginx-deployment --cpu-percent=75 --min=2 --max=10

You can then "get" the HPA by running kubectl get hpa, where you can see details about the HPA just created, including the current CPU usage as a percentage and the target at which it will scale. Note that you cannot create an HPA that scales on both CPU and RAM through the command line like this. Instead, you need to create a manifest file which you then apply. The docs for this are in the official Kubernetes documentation.
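For reference, here is a minimal sketch of such a manifest; the nginx-hpa name is my own, it targets the nginx-deployment from earlier, and the 80% memory threshold is an assumed value for illustration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # scale out when average CPU utilisation (relative to requests) passes 75%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
    # or when average memory utilisation passes 80%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

When multiple metrics are set, the HPA goes with whichever one asks for the most replicas.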

Vertical Pod Autoscaling

This is somewhat similar to how you would upgrade your PC if you want to play the latest unoptimised AAA game. While a VPA is not built into Kubernetes in the same way as the HPA, it can be achieved manually (on newer Kubernetes versions with in-place pod resize) by adding the following:

resizePolicy:
  - resourceName: cpu
    restartPolicy: NotRequired
  - resourceName: memory
    restartPolicy: NotRequired

under each container entry in spec.containers in the YAML configuration, in addition to the requests and limits you saw before. Afterwards, you can patch a running pod through the kubectl patch command, with a bit of handy JSON:

'{"spec":{"containers":[{"nginx", "resources":{"requests":{"cpu":"800m"}, "limits":{"cpu":"800m"}}}]}}'

This will edit the requests and limits on the spot, thus vertically scaling the running pods. However, I haven't seen many people scale this way often, and I feel the HPA should do the trick as long as you know the requirements of your running container.

Conclusion

And on that note, you are now able to scale your deployments to manage the spike in traffic you will undoubtedly get when you write a hit blogpost that ends up ranked #1 on Hacker News. I figure this is particularly important in the cloud as opposed to bare metal, since you save quite a lot of cost by freeing unused resources, and of course the impact compounds the more services you put on Kubernetes (ie you're a company with loads of $$). Well, anything to keep that $1k AWS bill down, right?