Is VC++ still broken Sequentially-Consistent-wise?
The 'volatile' keyword will prevent that kind of optimization. That's exactly what it's for: every use of 'a' will be read or written exactly as shown, and won't be moved in a different order to other volatile variables.
The implementation of the mutex should include compiler-specific instructions to cause a "fence" at that point, telling the optimizer not to reorder instructions across that boundary. Since the implementation is not from the compiler vendor, maybe that's left out? I've never checked.
Since 'a' is global, I would generally think the compiler would be more careful with it. But, VS10 doesn't know about threads so it won't consider that other threads will use it. Since the optimizer grasps the entire loop execution, it knows that functions called from within the loop won't touch 'a' and that's enough for it.
I'm not sure what the new standard says about thread visibility of global variables other than volatile. That is, is there a rule that would prevent that optimization (even though the function can be grasped all the way down so it knows other functions don't use the global, must it assume that other threads can) ?
I suggest trying the newer compiler with the compiler-provided std::mutex, and checking what the C++ standard and current drafts say about that. I think the above should help you know what to look for.
—John
Almost a month later, Microsoft still hasn't responded to the bug in MSDN Connect.
To summarize the above comments (and some further tests), apparently it happens in VS2013 professional as well, but the bug only happens when building for Win32, not for x64. The generated assembly code in x64 doesn't have this problem.So it appears that it is a bug in the optimizer, and that there's no race condition in this code.
Apparently this bug also happens in GCC 4.8.1, but not in GCC 4.9.(Thanks to Voo, nosid and Chris Dodd for all their testing).
It was suggested to mark a
as volatile
. This indeed prevents the bug, but only because it prevents the optimizer from performing the loop register optimization.
I found another solution: Add another local variable b
, and if needed (and under lock) do the following:
- Copy
a
intob
- Increment
b
in the loop - Copy back to
a
if needed
The optimizer replaces the local variable with a register, so the code is still optimized, but the copies from and to a
are done only if needed, and under lock.
Here's the new main()
code, with arrows marking the changed lines.
int main(int argc, char* argv[]){ bool work = (argc == 1); int b = 0; // <---- if (work) { m.lock(); b = a; // <---- } std::thread th(other); for (int i = 0; i < 100000000; ++i) { if (i % 7 == 3) { if (work) { ++b; // <---- } } } if (work) { a = b; // <---- std::cout << a << "\n"; m.unlock(); } th.join();}
And this is what the assembly code looks like (&a == 0x000744b0
, b
replaced with edi
):
21: int b = 0;00071473 xor edi,edi 22: 23: if (work)00071475 test bl,bl 00071477 je main+5Bh (07149Bh) 24: { 25: m.lock(); ........00071492 add esp,4 26: b = a;00071495 mov edi,dword ptr ds:[744B0h] 27: } 28: ........ 33: { 34: if (work)00071504 test bl,bl 00071506 je main+0C9h (071509h) 35: { 36: ++b;00071508 inc edi 30: for (int i = 0; i < 100000000; ++i)00071509 inc esi 0007150A cmp esi,5F5E100h 00071510 jl main+0A0h (0714E0h) 37: } 38: } 39: } 40: 41: if (work)00071512 test bl,bl 00071514 je main+10Ch (07154Ch) 42: { 43: a = b; 44: std::cout << a << "\n";00071516 mov ecx,dword ptr ds:[73084h] 0007151C push edi 0007151D mov dword ptr ds:[744B0h],edi 00071523 call dword ptr ds:[73070h] 00071529 mov ecx,eax 0007152B call std::operator<<<std::char_traits<char> > (071A80h) ........
This keeps the optimization and solves (or works around) the problem.