July 2013

Recently I got two dumps of a resource-intensive process. The customer complained about hangs in the web UI, so the application had been killed and restarted numerous times. A quick WinDbg analysis spotted thousands of worker threads in the pool:

0:000> !ThreadPool
CPU utilization: 6%
Worker Thread: Total: 6304 Running: 6303 Idle: 1 MaxLimit: 12000 MinLimit: 24
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 2 Free: 1 MaxFree: 48 CurrentLimit: 1 MaxLimit: 12000 MinLimit: 24

Most of the threads were waiting for a ReaderWriterLockSlim read lock, blocked on the lock's ManualResetEvent instance:

System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterReadLock(TimeoutTracker)

One thread was waiting for the write lock on the same object. No other stack was executing while holding the lock, and all lock usages seemed proper:

s.EnterXXXLock();
try
{
   // Do the job
}
finally
{
   s.ExitXXXLock();
}

Yet the process is fucked up. What the hell is wrong here? Well, sometimes things get very complicated...

Let's take a look at the reader-writer lock instance:

0:3444> !do 0x0000000001affe60
Name:        System.Threading.ReaderWriterLockSlim
MethodTable: 000007f87a91c1a8
EEClass:     000007f87a639448
Size:        96(0x60) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007f8802fc7b8  4000755       50       System.Boolean  1 instance                1 fIsReentrant
000007f8802fdc90  4000756       30         System.Int32  1 instance                0 myLock
000007f8802f1ed0  4000757       34        System.UInt32  1 instance                1 numWriteWaiters
000007f8802f1ed0  4000758       38        System.UInt32  1 instance             6293 numReadWaiters
000007f8802f1ed0  4000759       3c        System.UInt32  1 instance                0 numWriteUpgradeWaiters
000007f8802f1ed0  400075a       40        System.UInt32  1 instance                0 numUpgradeWaiters
000007f8802fc7b8  400075b       51       System.Boolean  1 instance                0 fNoWaiters
000007f8802fdc90  400075c       44         System.Int32  1 instance               -1 upgradeLockOwnerId
000007f8802fdc90  400075d       48         System.Int32  1 instance               -1 writeLockOwnerId
000007f8802f8d00  400075e        8 ...g.EventWaitHandle  0 instance 00000000f8e8f9c0 writeEvent
000007f8802f8d00  400075f       10 ...g.EventWaitHandle  0 instance 00000000fa23f040 readEvent
000007f8802f8d00  4000760       18 ...g.EventWaitHandle  0 instance 0000000000000000 upgradeEvent
000007f8802f8d00  4000761       20 ...g.EventWaitHandle  0 instance 0000000000000000 waitUpgradeEvent
000007f88030ff60  4000763       28         System.Int64  1 instance                9 lockID
000007f8802fc7b8  4000765       52       System.Boolean  1 instance                0 fUpgradeThreadHoldingRead
000007f8802f1ed0  4000766       4c        System.UInt32  1 instance       1073741824 owners
000007f8802fc7b8  4000767       53       System.Boolean  1 instance                0 fDisposed
000007f88030ff60  4000762      408         System.Int64  1   static            17381 s_nextLockID
000007f87a9399f0  4000764        8 ...ReaderWriterCount  0 TLstatic  t_rwc
    >> Thread:Value c18:0000000001917410 d18:00000000025a51c8 e54:000000000245d5f0 e90:0000000000000000 e20:00000000f90a6ce8 [>6000 more values]

The most valuable information is the owners field:

0:000> ? 0n1073741824
Evaluate expression: 1073741824 = 00000000`40000000

And here's what it means:

//The uint, that contains info like if the writer lock is held, num of
//readers etc. 
uint owners; 

//Various R/W masks 
//Note:
//The Uint is divided as follows:
//
//Writer-Owned  Waiting-Writers   Waiting Upgraders     Num-REaders 
//    31          30                 29                 28.......0
// 
//Dividing the uint, allows to vastly simplify logic for checking if a 
//reader should go in etc. Setting the writer bit, will automatically
//make the value of the uint much larger than the max num of readers 
//allowed, thus causing the check for max_readers to fail.

private const uint WRITER_HELD = 0x80000000;
private const uint WAITING_WRITERS = 0x40000000; 
private const uint WAITING_UPGRADER = 0x20000000;

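So the WAITING_WRITERS bit is set, and the readers are queued behind a waiting writer. Hold on, nobody holds the lock! WRITER_HELD is clear and the reader count is zero, so no thread will ever exit the lock and wake the waiters. Conclusion: the lock state is corrupted and can never recover. This is called an orphaned lock.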
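To double-check the reading of the owners value, here is a minimal sketch that decodes it with the masks quoted above. OwnersDecoder, Describe and READER_BITS are hypothetical names used only for illustration; the reader-count mask is an assumption derived from the layout comment (bits 28..0), not copied from the BCL source.

```csharp
using System;

internal static class OwnersDecoder
{
   // Masks quoted above from the ReaderWriterLockSlim source.
   private const uint WRITER_HELD = 0x80000000;
   private const uint WAITING_WRITERS = 0x40000000;
   private const uint WAITING_UPGRADER = 0x20000000;

   // Assumption: everything below the three flag bits is the reader count,
   // as the layout comment (bits 28..0) suggests.
   private const uint READER_BITS = 0x1FFFFFFF;

   public static string Describe(uint owners)
   {
      return string.Format(
         "writerHeld={0} waitingWriters={1} waitingUpgrader={2} readers={3}",
         (owners & WRITER_HELD) != 0,
         (owners & WAITING_WRITERS) != 0,
         (owners & WAITING_UPGRADER) != 0,
         owners & READER_BITS);
   }

   private static void Main()
   {
      // Value of the owners field taken from the dump above.
      Console.WriteLine(Describe(1073741824));
   }
}
```

For the value from the dump this prints writerHeld=False waitingWriters=True waitingUpgrader=False readers=0: a writer is waiting, but there is no owner left to release the lock.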

The only thing I am aware of that could have caused the orphan is an asynchronous thread abort. If a thread is aborted while it is taking a lock via a [Try]EnterXXXLock method, we may end up in the state described above, because those methods are not atomic: they update the lock state in several steps. In my case the thread aborts are triggered by the WCF runtime (or perhaps the HTTP runtime; it doesn't matter).

Here's some simple code to simulate the situation:

using System;
using System.Threading;
 
namespace CLRInv
{
   internal class Program
   {
      private static readonly ReaderWriterLockSlim rwl = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);
 
      private static void Main(string[] args)
      {
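         // Take the read lock once so that the first ExitReadLock below is balanced.
         rwl.EnterReadLock();
         do {
            rwl.ExitReadLock();
 
            // Hammer the lock from one reader thread and one writer thread.
            var reader = new Thread(UseLockForRead);
            var writer = new Thread(UseLockForWrite);
            reader.Start();
            writer.Start();
 
            // Abort both threads asynchronously, hoping to catch one of them
            // in the middle of taking the lock.
            Thread.Sleep(TimeSpan.FromSeconds(2));
            writer.Abort();
            reader.Abort();
 
            reader.Join();
            writer.Join();
         }
         // No other thread is alive here, so a timeout means the lock is orphaned.
         while (rwl.TryEnterReadLock(TimeSpan.FromSeconds(10)));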
 
         Console.WriteLine("Gotcha!");
 
         // Forever young
         rwl.EnterWriteLock();
      }
 
      private static void UseLockForRead()
      {
         try {
            for (;;) {
               rwl.EnterReadLock();
               try {
               }
               finally {
                  rwl.ExitReadLock();
               }
            }
         }
         catch (ThreadAbortException) {
         }
      }
 
      private static void UseLockForWrite()
      {
         try {
            for (;;) {
               rwl.EnterWriteLock();
               try {
               }
               finally {
                  rwl.ExitWriteLock();
               }
            }
         }
         catch (ThreadAbortException) {
         }
      }
   }
}

The conclusion is not very optimistic: you can't use slim locks the way you normally would if your application experiences timeouts and consequent thread aborts. Does this mean slim locks should be banned? Well, no. You just need to make sure special constructions are used to take and release the locks.

First of all, we need to prevent asynchronous aborts while [Try]EnterXXXLock is executing. How do we do that? You must take the lock inside a so-called protected region. The Thread.Abort documentation describes a protected region of code as a catch block, finally block, or constrained execution region. This basically means that ThreadAbortException can't be thrown asynchronously while the catch and finally blocks of a try statement are executing. So our [Try]EnterXXXLock should be wrapped like this:

try {} finally { rw.EnterXXXLock(); }

Weird? Not if you have read the .NET BCL source code. There are tons of empty try blocks there, accompanied by explanatory comments:

// prevent ThreadAbort while updating state
try { } 
finally
{
...
}

Proper slim lock usage turns out to be the following construction:

var lockIsHeld = false;
try {
   try {
   }
   finally {
      rwl.EnterReadLock();
      lockIsHeld = true;
   }
 
   // Do work here
}
finally {
   if (lockIsHeld) {
      rwl.ExitReadLock();
   }
}

An asynchronous ThreadAbortException is now thrown either before the lock is taken or after it is held. In the latter case the outer finally block unlocks the object; in the former the flag is still false and nothing is released.
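To avoid repeating this construction at every call site, it can be packed into a helper. The following is only a sketch under my own assumptions: SlimLockExtensions, WithReadLock and WithWriteLock are hypothetical names, not part of the BCL, and the code simply repeats the pattern above for read and write locks.

```csharp
using System;
using System.Threading;

internal static class SlimLockExtensions
{
   // Hypothetical helper: runs work under the read lock using the pattern above.
   public static void WithReadLock(this ReaderWriterLockSlim rwl, Action work)
   {
      var lockIsHeld = false;
      try {
         try {
         }
         finally {
            // Inside a finally block: an asynchronous abort is deferred
            // until the lock is taken and the flag is set.
            rwl.EnterReadLock();
            lockIsHeld = true;
         }

         work();
      }
      finally {
         if (lockIsHeld) {
            rwl.ExitReadLock();
         }
      }
   }

   // Same construction for the write lock.
   public static void WithWriteLock(this ReaderWriterLockSlim rwl, Action work)
   {
      var lockIsHeld = false;
      try {
         try {
         }
         finally {
            rwl.EnterWriteLock();
            lockIsHeld = true;
         }

         work();
      }
      finally {
         if (lockIsHeld) {
            rwl.ExitWriteLock();
         }
      }
   }
}

internal static class Demo
{
   private static void Main()
   {
      var rwl = new ReaderWriterLockSlim();
      var counter = 0;

      rwl.WithWriteLock(() => counter++);
      rwl.WithReadLock(() => Console.WriteLine(counter)); // prints 1
      Console.WriteLine(rwl.IsReadLockHeld);              // prints False
   }
}
```

One trade-off to keep in mind: since the lock is taken inside a finally block, a thread blocked in EnterReadLock or EnterWriteLock can't be aborted until it gets the lock, so the caller of Abort may have to wait for it.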

There are two things I haven't studied yet. First, is it possible to observe the following situation:

try {
   // <-- Could it happen here, before finally block is run but after try has opened fault clause region?
   try {
   }
   finally {
      // Lock
   }
 
   // Use resource
}
finally {
   // Unlock
}

That's why I used the lockIsHeld flag: the outer finally releases the lock only if it has actually been taken.

And the second one:

try {
}
finally {
   // Lock
}
try {
   // Use resource
}
finally {
   // Unlock
}

Is this one safe? Probably yes.

The bottom line: know your runtime environment, and don't use new features just because they are cool or because Mr. Jeff has fresh stuff in the brand-new book you love so much. Or hire a professional like me [:-D].
