
Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?


Hypothesis: Method 2 is somehow blocking the update from getting written by the server.

One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
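To make that pattern concrete, here's a minimal sketch (invented for illustration, not your actual code) of one writer and three spinning readers all hammering a single shared location:

    #include <atomic>
    #include <thread>
    #include <vector>

    // One writer ("server") and three spinning readers ("clients") all touch
    // the same location, so ownership of its cache line ping-pongs between
    // the writer's core (to write) and the readers' cores (to read).
    std::atomic<int> shared_state{0};

    void server() {
        for (int i = 1; i <= 1000000; ++i)
            shared_state.store(i, std::memory_order_release); // needs exclusive access to the line
    }

    void client() {
        int last = 0;
        while (last < 1000000)
            last = shared_state.load(std::memory_order_acquire); // needs shared (read) access
    }

    int main() {
        std::thread writer(server);
        std::vector<std::thread> readers;
        for (int i = 0; i < 3; ++i)
            readers.emplace_back(client);
        writer.join();
        for (auto &r : readers)
            r.join();
    }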

The intricacies of x86 cache performance are not something I'm deeply familiar with, but it seems entirely plausible (at least in theory) that having three different threads hammer this one memory location as hard as they can with read-access requests is approximately a denial-of-service attack on the server, preventing it from writing to that cache line for a few milliseconds on occasion.

You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and seeing whether the occasional write shows a delay corresponding to the latency you're observing on the client side.
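For instance (a hypothetical harness; the Node layout and the FILLED constant are guesses based on the state_/NOT_FILLED_YET fields in your code), you could bracket the server's store with timestamps and log the outliers:

    #include <chrono>
    #include <cstdio>

    // Guessed node layout -- your real structure will differ.
    enum State { NOT_FILLED_YET, FILLED };

    struct Node {
        int value_;
        volatile State state_;
    };

    // Time each publish; if reader contention is stalling the server's write,
    // the occasional publish should take far longer than its neighbours.
    void timed_publish(Node *node, int value) {
        auto t0 = std::chrono::steady_clock::now();
        node->value_ = value;
        node->state_ = FILLED;                 // the write the clients poll on
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns > 100000)                       // arbitrary 100us threshold
            std::printf("slow publish: %lld ns\n", ns);
    }

    int main() {
        Node n{0, NOT_FILLED_YET};
        timed_publish(&n, 42);
    }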

You might also try an experiment that removes the cache ping-pong from the equation: run everything on a single core, so the client and server threads are pulling everything out of the same L1 cache.
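On Linux, for example, you can pin every thread to the same core with pthread_setaffinity_np (platform-specific; other OSes have their own equivalents):

    // Linux-specific; compile with g++ (which defines _GNU_SOURCE by default)
    // and link with -pthread.
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to CPU 0 so that client and server threads all
    // share one core's L1 cache, removing cross-core coherence traffic.
    // Call this at the top of every thread function in the test.
    void pin_to_core_zero() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        pin_to_core_zero();   // the real test would do this in every thread
    }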


I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.

Indeed, Method 2 seems better here: because an id is generally smaller than the data, you don't have to make round trips to main memory as often (which is taxing).

However, what can actually happen is that you end up with a cache line like this:

    Line of cache = [ID1, ID2, ID3, ID4, ...]
                       ^          ^
                     client     server

Which then creates contention: the client reading one ID and the server writing a neighbouring ID are fighting over the very same line.

Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate each ID in the list so that it occupies an entire cache line by itself.
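In C++11 and later you can express that padding directly with alignas (64 bytes is the usual x86 cache line size, but verify it for your hardware):

    #include <cstdint>

    // Pad each ID out to a full cache line so adjacent entries never share a
    // line; a client reading one entry then can't contend with the server
    // writing its neighbour.
    struct alignas(64) PaddedId {
        std::int32_t id;
        // the rest of the line is unused padding
    };

    static_assert(sizeof(PaddedId) == 64, "expected exactly one cache line");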

Check out the other articles in the series while you're at it. Perhaps you'll get some ideas; I think there's a nice lock-free circular buffer in there that could help with your update list :)


I've noticed that in both Method 1 and Method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multithreading/race conditions?

Either way, Method 2 doesn't have any sleeps, which means the following code...

    while (true) {
        if (update_cursor->state_ == NOT_FILLED_YET) {
            continue;
        }

is going to hammer the processor. The typical way to handle this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multithreading should give you a large number of examples. The main idea is that this allows the thread to go to sleep while it's waiting for update_cursor->state_ to change, which prevents it from stealing all the CPU cycles.
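Here's a minimal sketch of that pattern using a std::condition_variable (a semaphore works just as well); the queue and the names are invented for the example, not taken from your update list:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<int> updates;              // stand-in for the update list

    void producer() {
        for (int i = 0; i < 10; ++i) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                updates.push(i);
            }
            cv.notify_one();              // wake the sleeping consumer
        }
    }

    void consumer() {
        for (int received = 0; received < 10; ++received) {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [] { return !updates.empty(); }); // sleeps instead of spinning
            int value = updates.front();
            updates.pop();
            lock.unlock();
            std::printf("got %d\n", value);
        }
    }

    int main() {
        std::thread c(consumer);
        std::thread p(producer);
        p.join();
        c.join();
    }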