Hungry Mind , Blog about everything in IT - C#, Java, C++, .NET, Windows, WinAPI, ...

ReaderWriterLockSlim fails on dual-socket environments

This is yet another story of orphaned ReaderWriterLockSlim.

Another dump, the same problem - ReaderWriterLockSlim object state is corrupted:

0:173> !do 0x0000000001c679f8
Name:        System.Threading.ReaderWriterLockSlim
MethodTable: 000007f87ec7c1d8
EEClass:     000007f87e999448
Size:        96(0x60) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007f880dbcbf0  4000755       50       System.Boolean  1 instance                1 fIsReentrant
000007f880dbe0c8  4000756       30         System.Int32  1 instance                0 myLock
000007f880db2308  4000757       34        System.UInt32  1 instance                1 numWriteWaiters
000007f880db2308  4000758       38        System.UInt32  1 instance               28 numReadWaiters
000007f880db2308  4000759       3c        System.UInt32  1 instance                0 numWriteUpgradeWaiters
000007f880db2308  400075a       40        System.UInt32  1 instance                0 numUpgradeWaiters
000007f880dbcbf0  400075b       51       System.Boolean  1 instance                0 fNoWaiters
000007f880dbe0c8  400075c       44         System.Int32  1 instance               -1 upgradeLockOwnerId
000007f880dbe0c8  400075d       48         System.Int32  1 instance               -1 writeLockOwnerId
000007f880db9138  400075e        8 ...g.EventWaitHandle  0 instance 000000000381eb38 writeEvent
000007f880db9138  400075f       10 ...g.EventWaitHandle  0 instance 00000000035a32e0 readEvent
000007f880db9138  4000760       18 ...g.EventWaitHandle  0 instance 0000000000000000 upgradeEvent
000007f880db9138  4000761       20 ...g.EventWaitHandle  0 instance 0000000000000000 waitUpgradeEvent
000007f880dd0398  4000763       28         System.Int64  1 instance 9 lockID
000007f880dbcbf0  4000765       52       System.Boolean  1 instance                0 fUpgradeThreadHoldingRead
000007f880db2308  4000766       4c        System.UInt32  1 instance       1073741824 owners
000007f880dbcbf0  4000767       53       System.Boolean  1 instance                0 fDisposed
000007f880dd0398  4000762      408         System.Int64  1   static 14118 s_nextLockID
000007f87ec99a20  4000764        8 ...ReaderWriterCount  0 TLstatic  t_rwc
0:173> .formats 0n1073741824 
Evaluate expression:
  Hex:     00000000`40000000

EnterReadLock, EnterWriteLock and other Enter operations waiting for an event which never goes off. Deadlock.

I must say that I checked possibilities of thread aborts in this code and found no signs of such scenarios happening. This made me desperately searching for another root cause of the problem.

So started searching ReaderWriterLockSlim.cs file for potential problems. I immediately became suspicious when I realyzed there is lack of synchronization when, for example, TryEnterUpgradeableReadLockCore method modified one of object fields:

uint owners;
...
private bool TryEnterReadLockCore(TimeoutTracker timeout)
{ 
...
       owners++;
}

Fields are not declared volatile, nor are they modified via interlocked operations. The only exception is the myLock field, which is used as a spin lock ang modified via Interlocked.CompareExchange:

[MethodImpl(MethodImplOptions.AggressiveInlining)] 
private void EnterMyLock()
{
   if (Interlocked.CompareExchange(ref myLock, 1, 0) != 0)
      EnterMyLockSpin(); 
}

Note, however, spin lock release method doesn't use Interlocked operation:

private void ExitMyLock()
{ 
   Debug.Assert(myLock != 0, "Exiting spin lock that is not held");
   myLock = 0; 
} 

This looks to be a mistake, possibly the root cause on one of root causes.

OK, lets go back to the problem - ReaderWriterLockSlim gets locked forever on 24-core dual socket Intel hardware. Threads are not aborted, the code is perfect. So what the hell is going on?

Well, the problem looks to be bad software (ReaderWriterLockSlim) on expensive hardware. Dell PowerEdge R720 has two psysical CPUs - 2x Intel Xeon E5-2620, 1200 MHz (12 x 100), 6 cores and 12 threads each. 24 logical cores total. And the problem is experienced only on such configurations.

I made a program that creates 24 (= Environment.ProcessorCount) threads with highest priority acquiring and releasing the lock in a tight loop:

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
namespace RWLSTest
{
   internal class Program
   {
      private static readonly ReaderWriterLockSlim slim = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);
      private static readonly List<object> objects = new List<object>();
      private static readonly Int32 processorCount = Environment.ProcessorCount;
      private static Int32 threadsCount;
      private static Int64 reads;
      private static Int64 writes;
      private static volatile Object[] threads = new Object[processorCount];
      private static Action loopAction;
      static Program()
      {
         // Let it JIT those methods
         using (var temp = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion)) {
            Thread.Yield();
            temp.EnterReadLock();
            temp.ExitReadLock();
         }
         var thread = new Thread(() =>
         {
            try {
               Thread.Sleep(Timeout.Infinite);
            }
            catch {
               return;
            }
            throw new InvalidOperationException();
         });
         thread.Start();
         try {
            thread.Abort();
         }
         catch (Exception e) {
            Console.WriteLine(e.Message);
         }
      }
      private static void LoopWithEmptryTryBlocks()
      {
         var random = new Random(Environment.TickCount);
         for (;;) {
            if (random.Next(processorCount) <= (processorCount / 4)) {
               Interlocked.Increment(ref writes);
               try {}
               finally {
                  slim.EnterWriteLock();
               }
               try {
                  ExclusiveLoop(random);
               }
               finally {
                  slim.ExitWriteLock();
               }
            }
            else {
               Interlocked.Increment(ref reads);
               try {}
               finally {
                  slim.EnterReadLock();
               }
               try {
                  SharedLoop(random);
               }
               finally {
                  slim.ExitReadLock();
               }
            }
         }
      }
      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      private static void SharedLoop(Random random)
      {
         foreach (var o in objects) {
            var i = (Int32)o;
            if ((i % processorCount) == (random.Next() % processorCount) && random.Next(37) == 3) {
               break;
            }
         }
      }
      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      private static void ExclusiveLoop(Random random)
      {
         if (objects.Count < 10240) {
            for (var i = 0; i < 19; ++i) {
               if (random.Next(13) == 7) {
                  objects.Add(random.Next());
               }
            }
         }
         for (var i = 0; i < 13; ++i) {
            if (objects.Count > 0 && random.Next(19) == 13) {
               objects.Remove(random.Next() % objects.Count);
            }
         }
      }
      private static void Loop()
      {
         var random = new Random(Environment.TickCount);
         for (;;) {
            if (random.Next(processorCount) <= (processorCount / 4)) {
               slim.EnterWriteLock();
               try {
                  ExclusiveLoop(random);
               }
               finally {
                  slim.ExitWriteLock();
               }
            }
            else {
               slim.EnterReadLock();
               try {
                  SharedLoop(random);
               }
               finally {
                  slim.ExitReadLock();
               }
            }
         }
      }
      private static void StartOneThread(Object state)
      {
         var thread = new Thread(() =>
         {
            try {
               Interlocked.Increment(ref threadsCount);
               loopAction();
            }
            catch (ThreadAbortException) {}
            finally {
               Interlocked.Decrement(ref threadsCount);
               ThreadPool.UnsafeQueueUserWorkItem(StartOneThread, state);
            }
         }) { Priority = ThreadPriority.Highest };
         thread.Start();
         Thread.VolatileWrite(ref threads[(Int32)state], thread);
      }
      private static void Main(string[] args)
      {
         var random = new Random(Environment.TickCount);
         var abortCycle = 0;
         if (args.Length > 0) {
            abortCycle = Int32.Parse(args[0]);
            loopAction = LoopWithEmptryTryBlocks;
         }
         else {
            loopAction = Loop;
         }
         for (var i = 0; i < processorCount; ++i) {
            StartOneThread(i);
         }
         for (var i = 0U;; ++i) {
            Thread.Sleep(1);
            if (abortCycle > 0 && i % abortCycle == 0) {
               var ti = random.Next(111) % processorCount;
               var thread = (Thread)Thread.VolatileRead(ref threads[ti]);
               if (thread != null) {
                  Console.WriteLine("Aborting thread #" + ti);
                  try {
                     thread.Abort();
                  }
                  catch (Exception e) {
                     Console.WriteLine(e.Message);
                  }
               }
            }
         }
      }
   }
}

I ran it several times and after about 1 hour all threads ended up waiting for lock event to fire. Voila! Have a look at the state of ReaderWriterLockSlim object:

0:000> !do 000000b343bd2860 
Name:        System.Threading.ReaderWriterLockSlim
MethodTable: 000007fbf887c1a8
EEClass:     000007fbf8599448
Size:        96(0x60) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fbfe6ac7b8  4000755       50       System.Boolean  1 instance                1 fIsReentrant
000007fbfe6adc90  4000756       30         System.Int32  1 instance                0 myLock
000007fbfe6a1ed0  4000757       34        System.UInt32  1 instance                1 numWriteWaiters
000007fbfe6a1ed0  4000758       38        System.UInt32  1 instance               23 numReadWaiters
000007fbfe6a1ed0  4000759       3c        System.UInt32  1 instance                0 numWriteUpgradeWaiters
000007fbfe6a1ed0  400075a       40        System.UInt32  1 instance                0 numUpgradeWaiters
000007fbfe6ac7b8  400075b       51       System.Boolean  1 instance                0 fNoWaiters
000007fbfe6adc90  400075c       44         System.Int32  1 instance               -1 upgradeLockOwnerId
000007fbfe6adc90  400075d       48         System.Int32  1 instance               -1 writeLockOwnerId
000007fbfe6a8d00  400075e        8 ...g.EventWaitHandle  0 instance 000000b343beb448 writeEvent
000007fbfe6a8d00  400075f       10 ...g.EventWaitHandle  0 instance 000000b343be2fd0 readEvent
000007fbfe6a8d00  4000760       18 ...g.EventWaitHandle  0 instance 0000000000000000 upgradeEvent
000007fbfe6a8d00  4000761       20 ...g.EventWaitHandle  0 instance 0000000000000000 waitUpgradeEvent
000007fbfe6bff60  4000763       28         System.Int64  1 instance 1 lockID
000007fbfe6ac7b8  4000765       52       System.Boolean  1 instance                0 fUpgradeThreadHoldingRead
000007fbfe6a1ed0  4000766       4c        System.UInt32  1 instance       1073741824 owners
000007fbfe6ac7b8  4000767       53       System.Boolean  1 instance                0 fDisposed
000007fbfe6bff60  4000762      408         System.Int64  1   static 2 s_nextLockID
000007fbf88999f0  4000764        8 ...ReaderWriterCount  0 TLstatic  t_rwc

There are 23 reader waiters, 1 writer waiter and owners field is 0x40000000 once again. All of 24 threads look like the following:

0:000> ~22e !CLRStack
OS Thread Id: 0xf28 (22)
        Child SP               IP Call Site
000000b361a1df78 000007fc137b315b [HelperMethodFrame_1OBJ: 000000b361a1df78] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
000000b361a1e0a0 000007fbfe5195c4 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
000000b361a1e0e0 000007fbf8af4c25 System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
000000b361a1e150 000007fbf8dd4c48 System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(TimeoutTracker)
000000b361a1e1b0 000007fbf8804d4a System.Threading.ReaderWriterLockSlim.TryEnterReadLock(TimeoutTracker)
000000b361a1e200 000007fbf8af55ad System.Threading.ReaderWriterLockSlim.TryEnterReadLock(Int32)
000000b361a1e250 000007fba0010a45 RWLSTest.Program.Loop()
000000b361a1e2c0 000007fba00106f7 RWLSTest.Program+<>c__DisplayClass4.<startonethread>b__3()
000000b361a1e330 000007fbfe4ff8a5 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000b361a1e490 000007fbfe4ff609 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000b361a1e4c0 000007fbfe4ff5c7 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000b361a1e510 000007fbfe512d21 System.Threading.ThreadHelper.ThreadStart()
000000b361a1e828 000007fbff6bf713 [GCFrame: 000000b361a1e828] 
000000b361a1eb58 000007fbff6bf713 [DebuggerU2MCatchHandlerFrame: 000000b361a1eb58] 

They all WaitOnEvent, but who and when will fire the event? This never happens. Deadlock.

Now lets get back to ExitMyLock. ReaderWriterLockSlim contains many fields with at least 88 bytes storage space required (without extra aligning, if needed). Modern Intel CPUs have cache lines of 64 bytes which is too small to entirely hold ReaderWriterLockSlim object instance. So each one requires at least two cache lines to hold its data. Since the distance between myLock and owners fields is more than 64 bytes (both x86 and x64), releasing myLock without a memory barrier (or interlocked instruction) causes only a portion of object's storage invalidated on demand between CPU cores and/or CPUs. Invalidation is forced by EnterMyLock's interlocked instruction. But only 64 bytes of aligned memory where myLock resides. Other cache line's changes might not be visible at that point. So the core acquiring the lock may see inconsistent object state.

Very important note: ReaderWriterLockSlim.cs is a part of 4.5Update1 reference source. Vanilla .NET 4.5 and probably several updates following it has this code, for example 4.0.30319.17929, 4.0.30319.18408. Recent versions, for example 4.0.30319.33440, has fixed this:

private void ExitMyLock()
{ 
   Volatile.Write(ref myLock, 0);
}

Volatile write inserts explicit memory barriers and makes any changes visible to other cores and CPUs.

Conclusion: do not use ReaderWriterLockSlim class without .NET Framework updated to at least 4.0.30319.33440. Its will eventually fail, at least on dual-socket Intel system.

Windows 8.1 and Windows Server 2012 R2 have this issue fixed. Windows Server 2012 (nor R2) seems to stuck with buggy implementation of ReaderWriterLockSlim class. After installing all available updates, ExitMyLock looks the same (no volatile write operation).

2 коммент.:

Анонимный комментирует...

Thank you for your article - this was a very enlightening read.

Анонимный комментирует...

Hi Andrii
Thank you for your article. It helped us troubleshoot an incident seen on a clients server.
However, we have observed the problem on .NET 4.5.2 We have reproduced the problem, using you code example, and verified the corrupted ReaderWriterLockSlim state using WinDbg. I have checked the loaded native version of System.Core and decompiled it and verified that Volatile.Write is used in ExitMyLock. So apparantly there are other circumstances for which this problem can occur? Any comment you may have is highly appreciated.
Best regards
Henrik

Отправить комментарий

Copyright 2007-2011 Chabster