
Multi-threading with Millicores in Kubernetes


It's best to see millicores as a way to express fractions: x millicores correspond to the fraction x/1000 (e.g. 250 millicores = 250/1000 = 1/4).
The value 1 represents the complete usage of 1 core (or hardware thread if hyperthreading or any other SMT is enabled).

So 100mcpu means the process is using 1/10th of a single CPU's time. This means it is using 1 second out of every 10, or 100ms out of every second, or 10us out of every 100us.
Just take any unit of time, divide it into ten parts: the process is running for only one of them.
Of course, if you take too short an interval (say, 1us), the overhead of the scheduler becomes non-negligible, but that's not important here.

If the value is above 1, then the process is using more than one CPU. A value of 2300mcpu means that out of, say, 10 seconds of wall-clock time, the process accumulates... 23 seconds of CPU time!
This is a way of saying that the process is using 2 whole CPUs plus 3/10 of a third one.
This may sound weird, but it's no different from saying "I work out 3.5 times a week" to mean "I work out 7 times every 2 weeks".

Remember: millicores represent a fraction of CPU time, not a count of CPUs. So 2300mcpu is 230% of the time of a single CPU.
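
To make the arithmetic concrete, here is a tiny Python sketch (the helper name is mine, not from any Kubernetes API) that converts a millicores value into the CPU time accumulated over an arbitrary wall-clock interval:

    def cpu_time_for(millicores: int, wall_seconds: float) -> float:
        """CPU time (in seconds) a workload at `millicores` accumulates
        over `wall_seconds` of wall-clock time, possibly across several CPUs."""
        return wall_seconds * millicores / 1000.0

    print(cpu_time_for(100, 10))   # 1.0  -> 1 second of CPU time out of 10
    print(cpu_time_for(250, 1))    # 0.25 -> a quarter of every second
    print(cpu_time_for(2300, 10))  # 23.0 -> 23 CPU-seconds in 10 wall seconds, i.e. 2.3 CPUs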


What I hate about technologies like Kubernetes and Docker is that they hide too much, confusing seasoned programmers.

The millicores unit arises, at its base, from the way the Linux scheduler works. It doesn't divide time into fixed quanta and assign each thread the CPU for one quantum; instead, it runs a thread until it would be unfair to keep it running. So a thread can run for a variable amount of time.

The current Linux scheduler, named CFS, works with the concept of waiting time.
Each thread has a waiting time: a counter that is incremented for each nanosecond (any sufficiently fine unit of time will do) the thread spends waiting to execute, and decremented for each nanosecond it spends executing.
The threads are then ordered by their wait time divided by the total number of threads; the thread with the greatest wait time is picked and run until its wait time (which is now decreasing) falls below the wait time of another thread (which will then be scheduled).
So if we have one core (without HyperThreading or any other SMT) and four threads, after, say, a second, the scheduler will have allocated 1/4 of that second (250ms) to each thread.
You can say that each thread used 250 millicores: it uses 250/1000 = 1/4 of the core's time on average. The "core time" can be any amount of time, provided it is far longer than the scheduler's granularity. So 250 millicores means 1 minute out of every 4, or 2 days out of every 8.
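
Here is a toy simulation of the wait-time idea described above (a deliberately simplified model, not the real CFS code): with one core and four threads, each thread ends up with roughly a quarter of the ticks.

    # Toy model of the wait-time scheduling described above (not real CFS):
    # every tick, waiting threads accumulate wait time, the running thread
    # spends it, and the thread with the largest wait time gets the core.
    def simulate(n_threads: int, ticks: int) -> list[int]:
        wait = [0] * n_threads      # per-thread wait-time counter
        runtime = [0] * n_threads   # ticks each thread actually ran
        for _ in range(ticks):
            running = max(range(n_threads), key=lambda t: wait[t])
            for t in range(n_threads):
                if t == running:
                    wait[t] -= 1    # running thread "spends" wait time
                    runtime[t] += 1
                else:
                    wait[t] += 1    # waiting threads accumulate wait time
        return runtime

    print(simulate(4, 1000))  # roughly [250, 250, 250, 250] -> 250 "millicores" each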

When a system has multiple CPUs/cores, the waiting time is scaled to account for that. Now, if a process is scheduled, over the course of 1 second, onto two CPUs for the whole second (say, two of its threads each keep a CPU busy), we have a usage of 1/1 for the first CPU and 1/1 for the second one. A total of 1/1 + 1/1 = 2, or 2000mcpu.
This way of counting CPU time, albeit weird at first, has the advantage of being absolute: 100mcpu means 1/10 of a CPU no matter how many CPUs there are, and this is by design.
If we counted time in a relative manner (i.e. where the value 1 means all the CPUs), then a value like 0.5 would mean 24 CPUs on a 48-CPU system and 4 on an 8-CPU system.
It would be hard to compare timings.
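
A quick sketch of that comparison problem (the numbers mirror the example above; the function name is just for illustration):

    def relative_share_to_cpus(share: float, n_cpus: int) -> float:
        """With a relative unit, 1.0 means 'all CPUs', so the absolute
        meaning of a value depends on the machine it runs on."""
        return share * n_cpus

    print(relative_share_to_cpus(0.5, 48))  # 24.0 CPUs on a 48-CPU box
    print(relative_share_to_cpus(0.5, 8))   # 4.0 CPUs on an 8-CPU box

    # With absolute millicores the meaning never changes:
    print(500 / 1000)  # 0.5 of one CPU, on any machine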

The Linux scheduler doesn't actually know about millicores; as we have seen, it uses the waiting time and doesn't need any other unit of measurement.
The millicores unit is, so far, just a unit we made up for our convenience.
However, it turns out this unit arises naturally from the way containers are constrained.

As implied by its name, the Linux scheduler is fair: all threads are equal. But you don't always want that: a process in a container should not hog all the cores on a machine.
This is where cgroups come into play: a kernel feature used, along with namespaces and union filesystems, to implement containers.
Their main goal is to restrict processes, including their CPU bandwidth.

This is done with two parameters: a period and a quota.
The restricted group of threads is allowed, by the scheduler, to run for at most quota microseconds (us) of CPU time every period us.
Here, again, a quota greater than the period means using more than one CPU. Quoting the kernel documentation:

  1. Limit a group to 1 CPU worth of runtime. If period is 250ms and quota is also 250ms, the group will get 1 CPU worth of runtime every 250ms.

  2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine. With 500ms period and 1000ms quota, the group can get 2 CPUs worth of runtime every 500ms.

We see how, given x millicores, we can compute a quota and a period.
We can fix the period at 100ms and set the quota to (100ms * x) / 1000, i.e. x/10 ms (for example, 250 millicores gives a 25ms quota every 100ms).
This is how Docker does it.
Of course, there are many possible pairs: we fixed the period at 100ms, but we could use other values (in practice the set of possible values isn't infinite, but still).
Larger values of the period mean the thread can run for a longer stretch but will also pause for a longer stretch.
Here is where Docker is hiding things from the programmer: it picks an arbitrary value for the period in order to compute the quota from the given millicores value (which the authors consider more "user-friendly").
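
Here is a minimal sketch of that computation, using the kernel's microsecond units for cpu.cfs_period_us / cpu.cfs_quota_us (the fixed 100ms period mirrors what Docker does; the helper itself is mine):

    def cfs_params_for(millicores: int, period_us: int = 100_000) -> tuple[int, int]:
        """Return (cfs_period_us, cfs_quota_us) for a given millicores value.
        quota/period = millicores/1000, i.e. the allowed fraction of CPU time."""
        quota_us = millicores * period_us // 1000
        return period_us, quota_us

    print(cfs_params_for(250))   # (100000, 25000)  -> 25ms of CPU time every 100ms
    print(cfs_params_for(2300))  # (100000, 230000) -> 230ms every 100ms, i.e. 2.3 CPUs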

Kubernetes is designed around Docker (yes, it can use other container runtimes, but they must expose an interface similar to Docker's), and the Kubernetes millicores unit matches the unit used by Docker's --cpus parameter (250m corresponds to --cpus=0.25).


So, long story short, millicores are fractions of the time of a single CPU (not fractions of the number of CPUs).
Cgroups, and hence Docker, and hence Kubernetes, don't restrict CPU usage by assigning cores to processes (like VMs do); instead, they restrict the amount of time (quota over period) the process can run on the CPUs, with each CPU contributing up to 1000mcpu worth of allowed time.
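
You can check this on a running container; here is a sketch assuming cgroup v1 with the cpu controller mounted at the usual /sys/fs/cgroup/cpu path (on cgroup v2 the same information lives in a single cpu.max file):

    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/cpu")  # adjust to the container's cgroup path

    def current_millicores(cgroup_dir: Path = CGROUP) -> int | None:
        """Recover the millicores limit from the CFS quota/period of a cgroup (v1)."""
        quota = int((cgroup_dir / "cpu.cfs_quota_us").read_text())
        period = int((cgroup_dir / "cpu.cfs_period_us").read_text())
        if quota < 0:  # -1 means "no limit"
            return None
        return quota * 1000 // period

    print(current_millicores())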


The scheduler of the kernel running the containers (e.g. Linux) has the means to reserve time slices for a process so it can run concurrently with other processes on the same CPU. You can throttle a process, giving it fewer time slices, if it uses too much CPU; this is what happens when a (hard) limit is hit. The pod can also be scheduled onto a different node if its CPU requests exceed the CPU resources available on the current node.

So the request is a hint for the Kubernetes scheduler on how to optimally place pods across nodes, and the limit is enforced by the kernel scheduler so that no more resources are actually used. If you only configure requests and no limits, all pods are scheduled according to the kernel scheduler's default policy, which tries to be fair and balance resources across all processes to maximize usage without starving any single process.
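
If you want to see the throttling happen when the limit is hit, the cpu controller exposes counters; here is a sketch again assuming cgroup v1 paths (cgroup v2 reports similar counters in cpu.stat with slightly different key names):

    from pathlib import Path

    def throttle_stats(cgroup_dir: str = "/sys/fs/cgroup/cpu") -> dict[str, int]:
        """Parse cpu.stat of a cgroup (v1): nr_periods, nr_throttled, throttled_time (ns)."""
        stats = {}
        for line in (Path(cgroup_dir) / "cpu.stat").read_text().splitlines():
            key, value = line.split()
            stats[key] = int(value)
        return stats

    print(throttle_stats())  # e.g. {'nr_periods': ..., 'nr_throttled': ..., 'throttled_time': ...}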