Recently I got two dumps of a resource-intensive process. The customer complained about hangs in the web UI, so the application had been killed and restarted numerous times. A quick WinDbg analysis revealed thousands of worker threads in the pool:
    0:000> !ThreadPool
    CPU utilization: 6%
    Worker Thread: Total: 6304 Running: 6303 Idle: 1 MaxLimit: 12000 MinLimit: 24
    Work Request in Queue: 0
    --------------------------------------
    Number of Timers: 2
    --------------------------------------
    Completion Port Thread:Total: 2 Free: 1 MaxFree: 48 CurrentLimit: 1 MaxLimit: 12000 MinLimit: 24
Most of the threads are waiting for a ReaderWriterLockSlim read lock, blocked on its ManualResetEvent instance:
    System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
    System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
    System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
    System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(TimeoutTracker)
    System.Threading.ReaderWriterLockSlim.TryEnterReadLock(TimeoutTracker)
One thread was waiting for a write lock on the same object. I found no other stack that was executing while holding the lock, and all lock usages looked proper:
    s.EnterXXXLock();
    try
    {
        // Do the job
    }
    finally
    {
        s.ExitXXXLock();
    }
Yet the process is fucked up. What the hell is wrong here? Well, sometimes things get very complicated...
Let's take a look at the reader-writer lock instance:
    0:3444> !do 0x0000000001affe60
    Name:        System.Threading.ReaderWriterLockSlim
    MethodTable: 000007f87a91c1a8
    EEClass:     000007f87a639448
    Size:        96(0x60) bytes
    File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    000007f8802fc7b8  4000755       50       System.Boolean  1 instance                1 fIsReentrant
    000007f8802fdc90  4000756       30         System.Int32  1 instance                0 myLock
    000007f8802f1ed0  4000757       34        System.UInt32  1 instance                1 numWriteWaiters
    000007f8802f1ed0  4000758       38        System.UInt32  1 instance             6293 numReadWaiters
    000007f8802f1ed0  4000759       3c        System.UInt32  1 instance                0 numWriteUpgradeWaiters
    000007f8802f1ed0  400075a       40        System.UInt32  1 instance                0 numUpgradeWaiters
    000007f8802fc7b8  400075b       51       System.Boolean  1 instance                0 fNoWaiters
    000007f8802fdc90  400075c       44         System.Int32  1 instance               -1 upgradeLockOwnerId
    000007f8802fdc90  400075d       48         System.Int32  1 instance               -1 writeLockOwnerId
    000007f8802f8d00  400075e        8 ...g.EventWaitHandle  0 instance 00000000f8e8f9c0 writeEvent
    000007f8802f8d00  400075f       10 ...g.EventWaitHandle  0 instance 00000000fa23f040 readEvent
    000007f8802f8d00  4000760       18 ...g.EventWaitHandle  0 instance 0000000000000000 upgradeEvent
    000007f8802f8d00  4000761       20 ...g.EventWaitHandle  0 instance 0000000000000000 waitUpgradeEvent
    000007f88030ff60  4000763       28         System.Int64  1 instance                9 lockID
    000007f8802fc7b8  4000765       52       System.Boolean  1 instance                0 fUpgradeThreadHoldingRead
    000007f8802f1ed0  4000766       4c        System.UInt32  1 instance       1073741824 owners
    000007f8802fc7b8  4000767       53       System.Boolean  1 instance                0 fDisposed
    000007f88030ff60  4000762      408         System.Int64  1   static            17381 s_nextLockID
    000007f87a9399f0  4000764        8 ...ReaderWriterCount  0 TLstatic  t_rwc
        >> Thread:Value c18:0000000001917410 d18:00000000025a51c8 e54:000000000245d5f0 e90:0000000000000000 e20:00000000f90a6ce8 [>6000 more values]
The most valuable information is the owners field:
    0:000> ? 0n1073741824
    Evaluate expression: 1073741824 = 00000000`40000000
And here's what it means:
    //The uint, that contains info like if the writer lock is held, num of
    //readers etc.
    uint owners;

    //Various R/W masks
    //Note:
    //The Uint is divided as follows:
    //
    //Writer-Owned  Waiting-Writers   Waiting Upgraders     Num-REaders
    //    31          30                 29                 28.......0
    //
    //Dividing the uint, allows to vastly simplify logic for checking if a
    //reader should go in etc. Setting the writer bit, will automatically
    //make the value of the uint much larger than the max num of readers
    //allowed, thus causing the check for max_readers to fail.

    private const uint WRITER_HELD = 0x80000000;
    private const uint WAITING_WRITERS = 0x40000000;
    private const uint WAITING_UPGRADER = 0x20000000;
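So only the WAITING_WRITERS bit is set, and the readers are queued behind a waiting writer. Hold on, nobody owns the lock! The WRITER_HELD bit is clear and the reader count is zero, so the lock is not held and nothing will ever wake the waiters. Conclusion: the lock state is corrupted and can never recover. This is called an orphaned lock.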
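To make the decoding concrete, here is a minimal sketch that splits an owners value into its parts. The three flag constants are the ones quoted above; the OwnersDecoder class, the Describe method and the reader mask (everything below the three flag bits, following the layout comment) are my own assumptions for illustration and may not match the exact mask used inside the BCL.

```csharp
using System;

public static class OwnersDecoder
{
    // Flag masks as quoted from the ReaderWriterLockSlim source above.
    private const uint WRITER_HELD = 0x80000000;
    private const uint WAITING_WRITERS = 0x40000000;
    private const uint WAITING_UPGRADER = 0x20000000;

    // Assumption: the reader count lives in all bits below the three flags,
    // as the layout comment suggests. The real implementation may use a
    // narrower mask.
    private const uint READER_BITS = ~(WRITER_HELD | WAITING_WRITERS | WAITING_UPGRADER);

    // Returns a human-readable breakdown of an 'owners' value read from a dump.
    public static string Describe(uint owners)
    {
        bool writerHeld = (owners & WRITER_HELD) != 0;
        bool waitingWriters = (owners & WAITING_WRITERS) != 0;
        bool waitingUpgrader = (owners & WAITING_UPGRADER) != 0;
        uint readers = owners & READER_BITS;

        return string.Format(
            "writerHeld={0}, waitingWriters={1}, waitingUpgrader={2}, readers={3}",
            writerHeld, waitingWriters, waitingUpgrader, readers);
    }
}
```

Calling OwnersDecoder.Describe(1073741824) with the value from the dump reports a waiting writer, no writer holding the lock and zero readers, which is exactly the inconsistent state described above.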
The only thing I am aware of that could have caused the orphan is an asynchronous thread abort. If a thread is aborted while taking a lock via a [Try]EnterXXXLock method, we can end up in the state described above, since those methods are not atomic. In my case the thread aborts are triggered by the WCF runtime (or perhaps the HTTP runtime, it doesn't matter).
Here's some simple code to simulate the situation:
    using System;
    using System.Threading;

    namespace CLRInv
    {
        internal class Program
        {
            private static readonly ReaderWriterLockSlim rwl =
                new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);

            // Note: Thread.Abort is only supported on .NET Framework.
            private static void Main(string[] args)
            {
                rwl.EnterReadLock();
                do
                {
                    rwl.ExitReadLock();
                    var reader = new Thread(UseLockForRead);
                    var writer = new Thread(UseLockForWrite);
                    reader.Start();
                    writer.Start();
                    Thread.Sleep(TimeSpan.FromSeconds(2));
                    // Abort both threads at an arbitrary point, possibly inside Enter/Exit.
                    writer.Abort();
                    reader.Abort();
                    reader.Join();
                    writer.Join();
                    // Both threads are gone now; if the read lock times out, the lock is orphaned.
                } while (rwl.TryEnterReadLock(TimeSpan.FromSeconds(10)));

                Console.WriteLine("Gotcha!");
                // Forever young
                rwl.EnterWriteLock();
            }

            private static void UseLockForRead()
            {
                try
                {
                    for (;;)
                    {
                        rwl.EnterReadLock();
                        try { }
                        finally { rwl.ExitReadLock(); }
                    }
                }
                catch (ThreadAbortException) { }
            }

            private static void UseLockForWrite()
            {
                try
                {
                    for (;;)
                    {
                        rwl.EnterWriteLock();
                        try { }
                        finally { rwl.ExitWriteLock(); }
                    }
                }
                catch (ThreadAbortException) { }
            }
        }
    }
The conclusion is not very optimistic: you can't use slim locks the way you normally use them if your application experiences timeouts and the resulting thread aborts. Does this mean slim locks should be banned? Well, no. You just need to make sure special constructions are used to take and release locks.
First of all we need to prevent async aborts while executing [Try]EnterXXXLock. How to do that? You must take the lock inside a so-called protected region. The Thread.Abort documentation mentions a protected region of code, such as a catch block, finally block, or constrained execution region. This basically means ThreadAbortException can't be thrown asynchronously while executing the catch and finally blocks of a try statement. So our [Try]EnterXXXLock should be wrapped like this:
    try { }
    finally
    {
        rw.EnterXXXLock();
    }
Weird? Not if you have read the .NET BCL source code. There are tons of empty try blocks with comments like this one:
    // prevent ThreadAbort while updating state
    try { }
    finally
    {
        ...
    }
Proper slim lock usage turns out to be the following construction:
    var lockIsHeld = false;
    try
    {
        try { }
        finally
        {
            rwl.EnterReadLock();
            lockIsHeld = true;
        }
        // Do work here
    }
    finally
    {
        if (lockIsHeld)
        {
            rwl.ExitReadLock();
        }
    }
An asynchronous ThreadAbortException is thrown either before the lock is taken or after it is held, so the outer finally releases the lock only if it was actually acquired.
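Repeating that construction at every call site is easy to get wrong, so it can be wrapped once. The following is only a sketch: SlimLockExtensions, WithReadLock and WithWriteLock are hypothetical names of my own, not part of the BCL, and the code simply packages the pattern shown above.

```csharp
using System;
using System.Threading;

// Hypothetical helpers (not part of the BCL) wrapping the pattern above.
public static class SlimLockExtensions
{
    // Runs 'body' under a read lock that is acquired inside a finally block.
    public static void WithReadLock(this ReaderWriterLockSlim rwl, Action body)
    {
        var lockIsHeld = false;
        try
        {
            try { }
            finally
            {
                // Acquisition and flag update happen inside a protected region.
                rwl.EnterReadLock();
                lockIsHeld = true;
            }
            body();
        }
        finally
        {
            // Release only if the lock was really taken.
            if (lockIsHeld)
            {
                rwl.ExitReadLock();
            }
        }
    }

    // Same construction for the write lock.
    public static void WithWriteLock(this ReaderWriterLockSlim rwl, Action body)
    {
        var lockIsHeld = false;
        try
        {
            try { }
            finally
            {
                rwl.EnterWriteLock();
                lockIsHeld = true;
            }
            body();
        }
        finally
        {
            if (lockIsHeld)
            {
                rwl.ExitWriteLock();
            }
        }
    }
}
```

A call site then shrinks to something like rwl.WithReadLock(() => { /* read shared state */ }); and an exception thrown by the body still releases the lock through the outer finally.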
There are two things I haven't studied yet. The first: is it possible to observe the following situation?
    try
    {
        // <-- Could it happen here, before finally block is run but after try has opened fault clause region?
        try { }
        finally
        {
            // Lock
        }
        // Use resource
    }
    finally
    {
        // Unlock
    }
That's why I used the flag: it makes sure the lock is actually held before it is released.
And the second one:
    try { }
    finally
    {
        // Lock
    }
    try
    {
        // Use resource
    }
    finally
    {
        // Unlock
    }
Is this one safe? Probably yes.
The bottom line: know your runtime environment, and don't use new features just because they are cool or because Mr. Jeff has fresh stuff in his brand new book you love so much. Or hire a professional like me [:-D].