Subtle bugs in reader/writer locks do not surprise me. I worked on an in-house implementation based on Win32 (before C++11 and std::shared_mutex), and my recollection is that although the implementation sounds simple, it is exceedingly easy to make subtle mistakes.
The experience left me with such a bad feeling for shared locks that I tend to avoid them unless absolutely required. When I last tested std::shared_mutex, performance was so poor compared to std::mutex that double-buffering the data protected by a simple mutex was much faster.
Agreed. I avoid reader-writer locks unless absolutely required and benchmarks prove it worthwhile.
They often fail to outperform a regular lock because of the additional overhead. They seem to make sense only in specific high-contention scenarios where the arrival rate is high and/or the critical section has a long duration [1].
SRWLock seems like a situation where it's small enough for you to construct proofs that what you did is correct, and important enough that the enormous expense of such a proof is justified.
The C++ std::mutex provided in Microsoft's STL is ludicrously big, whereas an SRWLOCK is the same size as a pointer. In one sense being ludicrously big is great: there is less risk that the std::mutex will mistakenly share a cache line with your unrelated data. But for most people it just makes std::mutex annoyingly expensive.
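If you want to see the sizes for yourself, something like this should do it (MSVC on 64-bit Windows; the exact numbers are implementation details and vary between toolset versions):

    // Print the footprint of MSVC's std::mutex next to a raw SRWLOCK.
    // The sizes are implementation details; the comments are rough expectations only.
    #include <windows.h>
    #include <cstdio>
    #include <mutex>

    int main() {
        std::printf("sizeof(std::mutex) = %zu\n", sizeof(std::mutex)); // dozens of bytes in MSVC's STL
        std::printf("sizeof(SRWLOCK)    = %zu\n", sizeof(SRWLOCK));    // pointer-sized: 8 on x64
    }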
Maybe my memory is faulty, but I believe it was analyzed with Leslie Lamport's formal verification work. Of course, you're building a model of how it should work, and the model itself might have faults.
C++ std::mutex provided in Microsoft's STL uses an SRWLOCK underneath, which is the same size as a pointer.
Sadly, SRWLocks have yet another bug: they are very unfair. Write a loop that locks a std::mutex inside the body, and no other thread will be able to grab the mutex, even though the loop repeatedly releases and re-acquires it. Critical sections are way more fair.
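A minimal sketch of the starvation pattern being described (not a rigorous benchmark; whether the second thread actually starves, and for how long, depends on the scheduler and on how the STL implements std::mutex):

    // One "hog" thread releases and immediately re-acquires the mutex in a tight
    // loop; the main thread then measures how long a single acquisition takes.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    std::mutex m;
    std::atomic<bool> stop{false};

    int main() {
        std::thread hog([] {
            while (!stop.load(std::memory_order_relaxed)) {
                std::lock_guard<std::mutex> g(m);
                // negligible work inside the critical section
            }
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        auto start = std::chrono::steady_clock::now();
        {
            std::lock_guard<std::mutex> g(m); // how long until we actually get in?
        }
        auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::printf("waited %lld ms for one acquisition\n", (long long)waited.count());

        stop = true;
        hog.join();
    }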
Some time ago I ran a test with 256 threads competing on a small number of cache lines, and found that all three, CreateMutex, CRITICAL_SECTION, and SRWLOCK, were quite fair.
The most successful thread was only 25%, 15%, and 9% ahead of the least successful one, respectively.
By contrast, with my simple user-mode spinlock the unfairness could be 1000% or even 2000%.
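For what it's worth, the shape of such a test is roughly this (a sketch, not the original code; std::mutex here, but the lock type is what you would swap out for CRITICAL_SECTION or SRWLOCK):

    // Every thread hammers the same lock and counts its own acquisitions;
    // at the end we compare the most and least successful threads.
    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kThreads = 256;
        std::mutex lock;                              // the lock under test
        std::vector<long long> counts(kThreads, 0);   // per-thread acquisition counts
        std::atomic<bool> stop{false};

        std::vector<std::thread> threads;
        for (int i = 0; i < kThreads; ++i) {
            threads.emplace_back([&, i] {
                while (!stop.load(std::memory_order_relaxed)) {
                    std::lock_guard<std::mutex> g(lock);
                    ++counts[i];
                }
            });
        }

        std::this_thread::sleep_for(std::chrono::seconds(2));
        stop = true;
        for (auto& t : threads) t.join();

        // Assumes every thread got the lock at least once during the run.
        auto [lo, hi] = std::minmax_element(counts.begin(), counts.end());
        std::printf("most successful thread was %.0f%% ahead of the least successful\n",
                    (double(*hi) / double(*lo) - 1.0) * 100.0);
    }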
It has been my experience that anything related to concurrency can be full of subtle edge cases. I tend to avoid it completely unless absolutely necessary.
They have slightly worse performance than the MS implementation under high contention, but in a test of this OP's case they work well. In my tests I did not find bugs in the implementation, but of course I can't be sure none exist; the logic really is very complex and nobody else has tested it. However, it is very simple to replace the SRW calls with my implementation via macros in a header file.
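For illustration only, that macro trick presumably looks something like this; the My* names are hypothetical stand-ins for the custom implementation:

    // srw_replacement.h (hypothetical): redirect the Win32 SRW calls to a custom
    // implementation without touching the call sites that include this header.
    #pragma once
    #include <windows.h>

    void MyAcquireSRWLockExclusive(PSRWLOCK lock);   // hypothetical replacements
    void MyReleaseSRWLockExclusive(PSRWLOCK lock);
    void MyAcquireSRWLockShared(PSRWLOCK lock);
    void MyReleaseSRWLockShared(PSRWLOCK lock);

    #define AcquireSRWLockExclusive MyAcquireSRWLockExclusive
    #define ReleaseSRWLockExclusive MyReleaseSRWLockExclusive
    #define AcquireSRWLockShared    MyAcquireSRWLockShared
    #define ReleaseSRWLockShared    MyReleaseSRWLockShared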
True, but in my experience the overhead of std::shared_mutex outweighs the benefits. Other approaches include:
* breaking up the lock so different threads can access different parts of your data structure concurrently.
* double-buffering (also called ping-pong buffers), where you effectively keep two copies of your data structure: readers access one without blocking while a single writer modifies the other, and then the two are swapped (see the sketch below).
* just accepting that reads will block each other with a std::mutex and working to minimize the time spent holding the lock. This can actually work out quicker, depending on your access patterns.
As always, careful profiling with real data is required to figure out what is better.
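A minimal sketch of the double-buffering idea, assuming a single writer: the expensive work of building the new version happens off the lock, and the plain mutex is held only for short reads and the swap.

    #include <mutex>
    #include <utility>

    // Two copies of the data: readers see "front", the single writer prepares "back".
    template <typename T>
    class DoubleBuffered {
    public:
        explicit DoubleBuffered(T initial) : front_(std::move(initial)) {}

        // Readers: short critical section on a plain mutex.
        template <typename Fn>
        auto read(Fn&& fn) const {
            std::lock_guard<std::mutex> g(m_);
            return fn(front_);
        }

        // Single writer only: build the new version off the lock, publish with a swap.
        template <typename Fn>
        void update(Fn&& fn) {
            fn(back_);                        // expensive work, no lock held
            {
                std::lock_guard<std::mutex> g(m_);
                std::swap(front_, back_);     // cheap publish
            }
            back_ = front_;                   // resync the back copy for the next update
        }

    private:
        mutable std::mutex m_;
        T front_;
        T back_{front_};
    };

The cost is the extra copy on every update, which is why it tends to pay off only when reads vastly outnumber writes and the update itself is expensive.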
No, you are right. One solution is to have the readers access the buffer through a shared_ptr, so they can hold onto the old version for as long as they need it while the writer makes changes and creates a wholly new data structure. It is also possible for the writer to block new readers while doing the swap. Tradeoffs everywhere, depending on whether you need readers to see changes that occurred after they started accessing the data.
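A minimal sketch of that shared_ptr variant (the names are made up; with C++20 you could also use std::atomic<std::shared_ptr<T>> instead of the mutex):

    #include <memory>
    #include <mutex>
    #include <vector>

    struct Config { std::vector<int> values; };   // stand-in for the shared data

    class Snapshotted {
    public:
        Snapshotted() : current_(std::make_shared<const Config>()) {}

        // Readers: copy the pointer under a short lock; no lock is held while the
        // snapshot is used, and the old version stays alive until the last reader
        // drops its pointer.
        std::shared_ptr<const Config> snapshot() const {
            std::lock_guard<std::mutex> g(m_);
            return current_;
        }

        // Writer: build a wholly new Config, then publish it with a quick swap.
        void publish(Config next) {
            auto fresh = std::make_shared<const Config>(std::move(next));
            std::lock_guard<std::mutex> g(m_);
            current_ = std::move(fresh);
        }

    private:
        mutable std::mutex m_;
        std::shared_ptr<const Config> current_;
    };

Readers that took a snapshot before a publish keep seeing the old version until they release it, which is exactly the trade-off mentioned above.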
This was a great post by the original Redditor.