
Tracking down mysterious high-priority thread suspend inside the kernel


Just a wild guess since there is no answer yet... You say the system is multicore. Do you set the affinity of the user thread so it runs on the same core where the interrupt occurs? And does the interrupt occur on one specific core only? I suspect a situation where the user thread runs on one core but the interrupt happens on another one and cannot resume it immediately (perhaps because it is not asleep yet?). Probably a data race allows the thread to fall asleep, e.g. just before the interrupt handler publishes the data which the thread polls. As a result, the thread stays suspended until the next system interrupt (e.g. the timer).

So, try to assign the interrupt and the thread to the same core in order to serialize them and avoid potential data races.
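
As an illustration (a minimal sketch of my own, not the OP's code), the polling thread can be tied to one core with sched_setaffinity(); TARGET_CPU is a placeholder for whichever core services your interrupt (check /proc/interrupts), and on Linux the IRQ itself can usually be steered via /proc/irq/<N>/smp_affinity:

    /* pin_poller.c -- sketch: pin the polling thread to the IRQ's core */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    #define TARGET_CPU 1   /* assumption: the core that services the device IRQ */

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(TARGET_CPU, &set);

        /* pid 0 = the calling thread; for another pthread use
         * pthread_setaffinity_np() instead. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... poll for the data published by the interrupt handler ... */
        return 0;
    }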

In response to update #1:

Looks like I was right about the data race between the cores, since raising an IRQ on the target core fixes the issue. I guess the kernel doesn't do this itself either because the extra reschedule IRQs would add scheduling overhead just for the sake of very rare cases, or because it can be done faster with ordinary synchronization, assuming a shared cache.

There is already some synchronization there which looks like the right direction, but apparently it misses something. I'd try running a reproducer on a different architecture/kernel version to understand whether it is a general bug or something specific to your platform/kernel version. I hope it's not a missing fence on the p->on_cpu load/store...

Anyway, returning to your specific problem: if you cannot, or don't want to, run a custom kernel build with your hot-fix, my suggestion about thread affinity still stands.

Additionally, if you cannot pin the interrupt to one particular core, you may want to run such a polling thread on each core (each explicitly pinned to its core) to ensure that at least one of the threads gets the event immediately after the IRQ. Of course, this puts an additional synchronization burden on the user-space code; a sketch of the per-core setup follows.
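
For illustration only, here is a minimal sketch (my assumption of the setup, not the OP's code) that starts one pinned poller per online CPU; poll_for_event() is a hypothetical placeholder for the real polling loop:

    /* per_core_pollers.c -- sketch: one pinned polling thread per CPU */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical placeholder for the real polling loop. */
    static void *poll_for_event(void *arg)
    {
        long cpu = (long)arg;
        printf("poller running, pinned to CPU %ld\n", cpu);
        /* ... wait for / consume the data published by the IRQ handler ... */
        return NULL;
    }

    int main(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        for (long cpu = 0; cpu < ncpus; cpu++) {
            pthread_t tid;
            pthread_attr_t attr;
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET((int)cpu, &set);

            pthread_attr_init(&attr);
            /* Pin the thread before it starts so it never migrates. */
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            if (pthread_create(&tid, &attr, poll_for_event, (void *)cpu) != 0) {
                perror("pthread_create");
                exit(1);
            }
            pthread_attr_destroy(&attr);
            pthread_detach(tid);
        }

        pause();  /* keep the process alive while the pollers run */
        return 0;
    }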


We wound up going with the following fixes:

  • Adding the smp_send_reschedule(task_cpu(p)); mentioned above to the scheduler to allow cross-CPU preemption. I'll follow up with a maintainer to see whether it's the correct fix.

  • Implementing a get_user_pages_fast() for our platform that does not take mmap_sem when it doesn't have to. This removed contention between mmap/munmap and futex_wait().

  • Switching to vfork() + execve() in a couple of places in user-space code where fork() was unnecessary. This removed contention between mmap/munmap and calls that spawn other processes (a minimal sketch follows this list).
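
As an illustration of that last point (a hedged sketch, not the actual code): vfork() borrows the parent's address space until the child calls execve() or _exit(), so it avoids the page-table copying and mmap_sem traffic that a full fork() incurs. The spawn() helper and the /bin/true target below are placeholders:

    /* vfork_spawn.c -- sketch: replace fork()+execve() with vfork()+execve() */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical helper: after vfork(), the child may only call
     * execve() or _exit(). */
    static pid_t spawn(const char *path, char *const argv[], char *const envp[])
    {
        pid_t pid = vfork();
        if (pid == 0) {
            execve(path, argv, envp);
            _exit(127);            /* reached only if execve() failed */
        }
        return pid;                /* parent: child pid, or -1 on error */
    }

    int main(void)
    {
        char *argv[] = { "/bin/true", NULL };
        char *envp[] = { NULL };

        pid_t pid = spawn("/bin/true", argv, envp);
        if (pid < 0) {
            perror("vfork");
            return 1;
        }

        int status;
        waitpid(pid, &status, 0);
        printf("child exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }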

It seems like everything is running smoothly now.

Thank you for all your help.