Profiling sleep times with perf Profiling sleep times with perf linux linux

Profiling sleep times with perf


There is a error message from your second perf command from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times - perf inject -s

$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.databuild id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6failed to write feature 2

failed to write feature 2 doesn't look too user-friendly...

... but it was added to perf to made errors more user-friendly: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian , 22 Sep 2011:

+static int do_write_feat(int fd, struct perf_header *h, int type,  ....+           pr_debug("failed to write feature %d\n", type);

All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13

 15         HEADER_TRACING_DATA     = 1, 16         HEADER_BUILD_ID,

So, it sounds like perf inject was not able to write information about build ids (error from function write_build_id() from util/header.c) if I'm not wrong. There are two cases which can lead to error: unsuccessful call to perf_session__read_build_ids() or failing in writing buildid table dsos__write_buildid_table (this is not our case because there is no "failed to write buildid table" error message; check write_build_id)

You may check, do you have all buildids needed for the session. Also it may be useful to clear your buildid cache (rm -rf ~/.debug), and check that you have up-to-date vmlinux with debugging info or kallsyms enabled in your kernel.

UPDATE: in comments Pavel says that his pref record had no any sched:sched_stat_sleep events written to perf.data:

sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out

As he explains in his answer, his default debian kernel have CONFIG_SCHEDSTATS option disabled with vendor's patch. The redhat did the same thing with the option in release kernels since 3.11, and this is explained in Redhat Bug 1013225 (Josh Boyer 2013-10-28, comment 4):

We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.

We can turn this on in debug kernels again, but I'm not sure it's worthwhile.

Josh Poimboeuf 2013-11-04 in comment 8 says that performance impact is detectable:

In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.

Fedora disabled CONFIG_SCHEDSTAT in non-debug kernels at 12 July 2013 "[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones. First kernel with disabled option: 3.11.0-0.rc0.git6.4.

In order to use any perf software tracepoint event with name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait) we must recompile kernel with CONFIG_SCHEDSTATS option enabled and replace default Debian, RedHat or Fedora kernels which have no this option.

Thank you, Pavel Davydov.


I finally found out how to make it work. The problem was that the default debian kernel is built without some config options, that perf needs to be able to monitor sleep times. It looks like CONFIG_SCHEDSTATS should be enabled to make kernel collect scheduler statistics. This is told to have some runtime overhead. Also I enabled CONFIG_SCHED_TRACER and some lock tracing options, but I'm not sure if they matter in my case. Anyway, no statistic data is collected in scheduler without CONFIG_SCHEDSTATS (see kernel/sched/ directory of kernel source).

Also, there is a very good article about perf written by Brendan Gregg, with a lot of usefull examples and some kernel options that are needed to make perf work properly.

Update: I checked the history of CONFIG_SCHEDSTATS in debian. I've checked out debian kernel patches and build scripts repo:

svn checkout svn://svn.debian.org/svn/kernel/dists/trunk/linux/debian

And then found CONFIG_SCHEDSTATS option there

$ grep -R CONFIG_SCHEDSTAT config/config/config:# CONFIG_SCHEDSTATS is not set

This string was added to the repo in commit 10837, on 2008-03-14, with comment "debian/config: Do complete reorganization". Also, in this and this (thanks to osgx) bug reports it is told that CONFIG_LATENCYTOP, CONFIG_SCHEDSTATS options are not enabled because they can affect kernel perfomance. So, I think it just was never switched on in default debian kernels. I haven't found the discussion about scheduler stats option, though. If I do, I will write back here.


This works for me for "perf version 3.11.1" on an "openSUSE 13.1 (x86_64)" box.

Here is the output if you care:

# ========# captured on: Sun Feb 16 09:49:38 2014# hostname : *****************# os release : 3.11.10-7-desktop# perf version : 3.11.1# arch : x86_64# nrcpus online : 8# nrcpus avail : 8# cpudesc : Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz# cpuid : GenuineIntel,6,58,9# total memory : 32945368 kB# cmdline : /usr/bin/perf inject -v -s -i perf.data.raw -o perf.data # event : name = sched:sched_stat_sleep, type = 2, config = 0x48, config1 = 0x0, config2 = 0x# event : name = sched:sched_switch, type = 2, config = 0x51, config1 = 0x0, config2 = 0x0, e# event : name = sched:sched_process_exit, type = 2, config = 0x4e, config1 = 0x0, config2 = # HEADER_CPU_TOPOLOGY info available, use -I to display# HEADER_NUMA_TOPOLOGY info available, use -I to display# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, # ========## Samples: 0  of event 'sched:sched_stat_sleep'# Event count (approx.): 0## Overhead        Period  Command  Shared Object  Symbol# ........  ............  .......  .............  ......## Samples: 8  of event 'sched:sched_switch'# Event count (approx.): 80099958776## Overhead        Period  Command      Shared Object             Symbol# ........  ............  .......  .................  .................#   100.00%   80099958776      bla  [kernel.kallsyms]  [k] thread_return                |                --- thread_return                    thread_return                    do_nanosleep                    hrtimer_nanosleep                    SyS_nanosleep                    system_call_fastpath                    0x7fbc0dec6570                    __GI___libc_nanosleep                    (nil)# Samples: 0  of event 'sched:sched_process_exit'# Event count (approx.): 0## Overhead        Period  Command  Shared Object  Symbol# ........  ............  .......  .............  ......### (For a higher level overview, try: perf report --sort comm,dso)#}