How to prioritize (or set scheduling policy for) the 'manager' and 'worker' threads of a process?



UPD 12.02.2015: I have run some experiments.

Theory

There is an obvious solution: switch the "manager" threads to the RT scheduler (the real-time scheduler that provides the SCHED_DEADLINE/SCHED_FIFO policies). In this case the "manager" threads will always have a higher priority than most threads in the system, so they will almost always get a CPU when they need it.
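For reference, a minimal sketch of switching the calling thread to SCHED_FIFO (the priority value 9 matches the experiments below; this needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, and the helper name is just illustrative):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Sketch: give the calling "manager" thread a real-time policy. */
    static void make_manager_realtime(void)
    {
        struct sched_param sp = { .sched_priority = 9 };  /* 1..99 for SCHED_FIFO */

        /* Switch the calling thread to SCHED_FIFO. */
        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (err != 0)
            fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
    }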

However, there is another solution that allows you to stay on the CFS scheduler. Your description of the purpose of the "worker" threads is similar to batch scheduling (in ancient times, when computers were large, a user had to put his job into a queue and wait hours until it was done). Linux CFS supports batch jobs via the SCHED_BATCH policy and dialog (interactive) jobs via the SCHED_NORMAL policy.
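A minimal sketch of putting a thread under SCHED_BATCH, assuming it is called from the worker thread itself (lowering the policy needs no special privileges; the helper name is made up):

    #define _GNU_SOURCE        /* SCHED_BATCH is Linux-specific */
    #include <sched.h>
    #include <stdio.h>

    /* Sketch: mark the calling "worker" thread as a batch job for CFS. */
    static void make_worker_batch(void)
    {
        struct sched_param sp = { .sched_priority = 0 };  /* must be 0 for SCHED_BATCH */

        /* pid 0 means "the calling thread" on Linux. */
        if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1)
            perror("sched_setscheduler");
    }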

There is also a useful comment in the kernel code (kernel/sched/fair.c):

    /*
     * Batch and idle tasks do not preempt non-idle tasks (their preemption
     * is driven by the tick):
     */
    if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
        return;

So when "manager" thread or some other event awake "worker", latter will get CPU only if there are free CPUs in system or when "manager" will exhaust its timeslice (to tune it change the weight of task).

It seems that your problem cannot be solved without changing the scheduler policies. If the "worker" threads are very busy and the "manager" threads wake up rarely, they would get the same vruntime bonus, so a "worker" would always preempt the "manager" threads (but you may increase their weight, so they would exhaust their bonus faster).
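For completeness, here is a rough sketch of changing the weight of a single thread by raising its nice value (on Linux the nice value, and hence the CFS weight, is a per-thread attribute; renice_self is an illustrative helper name):

    #define _GNU_SOURCE
    #include <sys/resource.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Sketch: raise the nice value of the calling thread only,
     * which lowers its CFS weight. */
    static void renice_self(int nice_value)
    {
        pid_t tid = (pid_t)syscall(SYS_gettid);   /* nice is per-thread on Linux */

        if (setpriority(PRIO_PROCESS, tid, nice_value) == -1)
            perror("setpriority");
    }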

Experiment

I have a server with 2 x Intel Xeon E5-2420 CPUs, which gives us 24 hardware threads. To simulate two threadpools I used my own TSLoad workload generator (and fixed a couple of bugs while running the experiments :) ).

There were two threadpools: tp_manager with 4 threads and tp_worker with 30 threads, both running busy_wait workloads (just for(i = 0; i < N; ++i);) but with different numbers of loop cycles. tp_worker works in benchmark mode, so it runs as many requests as it can and occupies 100% of the CPU.

Here is a sample config: https://gist.github.com/myaut/ad946e89cb56b0d4acde

3.12 (vanilla with debug config)

    EXP  |              MANAGER              |     WORKER
         |  sched            wait    service | sched            service
         |  policy           time     time   | policy            time
    33   |  NORMAL          0.045    2.620   |     WAS NOT RUNNING
    34   |  NORMAL          0.131    4.007   | NORMAL           125.192
    35   |  NORMAL          0.123    4.007   | BATCH            125.143
    36   |  NORMAL          0.026    4.007   | BATCH (nice=10)  125.296
    37   |  NORMAL          0.025    3.978   | BATCH (nice=19)  125.223
    38   |  FIFO (prio=9)  -0.022    3.991   | NORMAL           125.187
    39   |  core:0:0        0.037    2.929   | !core:0:0        136.719

3.2 (stock Debian)

    EXP  |              MANAGER              |     WORKER
         |  sched            wait    service | sched            service
         |  policy           time     time   | policy            time
    46   |  NORMAL          0.032    2.589   |     WAS NOT RUNNING
    45   |  NORMAL          0.081    4.001   | NORMAL           125.140
    47   |  NORMAL          0.048    3.998   | BATCH            125.205
    50   |  NORMAL          0.023    3.994   | BATCH (nice=10)  125.202
    48   |  NORMAL          0.033    3.996   | BATCH (nice=19)  125.223
    42   |  FIFO (prio=9)  -0.008    4.016   | NORMAL           125.110
    39   |  core:0:0        0.035    2.930   | !core:0:0        135.990

Some notes:

  • All times are in milliseconds
  • The last experiment is for setting affinities (advised by @PhilippClaßen): manager threads were bound to Core #0 while worker threads were bound to all cores except Core #0.
  • Service time for the manager threads roughly doubled, which is explained by concurrency inside the cores (the processor has Hyper-Threading!)
  • Using SCHED_BATCH + nice (TSLoad cannot set the weight directly, but nice can do it indirectly) slightly reduces wait time.
  • The negative wait time in the SCHED_FIFO experiment is OK: TSLoad reserves 30us so it can do preliminary work / the scheduler has time to do a context switch / etc. It seems that SCHED_FIFO is very fast.
  • Reserving a single core isn't that bad, and because it removed in-core concurrency, service time decreased significantly.


In addition to myaut's answer, you could also bind the manager to specific CPUs (sched_setaffinity) and the workers to the rest. Depending on your exact use case that can be very wasteful, of course.

Link: Thread binding to a CPU core
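A rough sketch of that binding, assuming the pthread_t handles for the manager and worker threads are at hand and that online cores are numbered contiguously (the split mirrors the core:0:0 / !core:0:0 experiment above):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: pin the manager thread to core #0 only. */
    static void bind_manager(pthread_t manager)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (pthread_setaffinity_np(manager, sizeof(set), &set) != 0)
            fprintf(stderr, "failed to bind manager\n");
    }

    /* Sketch: pin a worker thread to every online core except core #0
     * (assumes CPUs 0..N-1 are all online). */
    static void bind_worker(pthread_t worker)
    {
        cpu_set_t set;
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        CPU_ZERO(&set);
        for (long cpu = 1; cpu < ncpus; cpu++)
            CPU_SET((int)cpu, &set);
        if (pthread_setaffinity_np(worker, sizeof(set), &set) != 0)
            fprintf(stderr, "failed to bind worker\n");
    }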

Explicit yielding is generally not necessary, in fact often discouraged. To quote Robert Love in "Linux System Programming":

In practice, there are few legitimate uses of sched_yield() on a proper preemptive multitasking system such as Linux. The kernel is fully capable of making the optimal and most efficient scheduling decisions - certainly, the kernel is better equipped than an individual application to decide what to preempt and when.

The exception he mentions is when you are waiting on external events, for example those caused by the user, hardware, or another process. That is not the case in your example.


An addition to myaut's excellent answer is to consider trying a kernel with the CONFIG_PREEMPT_RT patch set applied. This makes some fairly heavy-weight changes to how the kernel does scheduling, the net result being that scheduling latency becomes a lot more deterministic.

Used in combination with getting the relative thread priorities correct (managers > workers) via either of myaut's suggestions (and especially with SCHED_FIFO), this can yield very good results.