Measuring the latency of Unix domain sockets


I'd guess that instruction-cache misses for the kernel code involved are a big part of the slowdown the first time through. Data-cache misses on the kernel data structures that track the socket state probably contribute as well.

Lazy setup is a possibility, though.

You could test by doing a sleep(10) between trials (including before the first trial). Do something that will use all the CPU cache, like refresh a web page, between each trial. If it's lazy setup, then the first call will be extra slow. If not, then all calls will be equally slow when caches are cold.


In the Linux kernel, you can find the ___sys_sendmsg function that gets used by send. It is defined in net/socket.c.

The function has to copy the user message (in your case the 8KB buf) from user space to kernel space. After that recv can copy back the received message from the kernel space to the user space of the child process.

That means each send()/recv() pair needs two memcpy operations and one kmalloc.

The first pair is special because the space used to store the user message has not been allocated yet, which also means it is not in the data cache. So the first send()/recv() pair allocates the kernel memory that holds buf, and that memory ends up cached. The following calls just reuse that memory, using the used_address argument in the function's prototype.

So your assumption is correct. The first run allocates the 8KB in the kernel and uses cold caches while the others just use previously allocated and cached data.


It's not the data copy that takes 80 extra microseconds. That would be extremely slow (8 KB in 80 µs is only about 100 MB/s). The real cause is that you're using two processes: when the parent sends the data for the first time, the data has to wait for the child to finish forking and start executing.

If you absolutely want to use two processes, you should first perform a send in the other direction, so that the parent can wait for the child to be ready before starting to send.

For example:

Child:

  send();  recv();  send();

Parent:

  recv();  gettime();  send();  recv();  gettime();

You also need to realize that your test depends a lot on how the two processes are placed on the CPU cores. If both run on the same core, every round trip forces a task switch.

For this reason I would strongly recommend that you perform the measurement using a single process. Even without poll or anything similar, you can do it this way, provided that you keep the blocks small enough to fit into the socket buffers:

  gettime();  send();  recv();  gettime();

You should first perform a non-measured round trip to ensure buffers are allocated. I'm pretty sure you'll get much smaller times here.