The first time I was ever introduced to the problem of thread pool starvation was by the great Sjøkki Gummi Gau, a close colleague of mine and an amazing developer. He called me up and asked me to guess how long a program would take to run. It was a very simple program: it just spawned 10 tasks, each of which simply waited for a second. I laughed and told him it would of course take only about a second, since they would be running in parallel. Imagine my surprise when it took a whole three seconds to complete.
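Here is a rough sketch of the kind of program he described. This is my own reconstruction rather than his actual code, but it shows the same behaviour: ten tasks that each block a thread pool thread for a second.

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class GuessTheRuntime
{
    static void Main()
    {
        var stopwatch = Stopwatch.StartNew();

        // Queue ten pieces of work on the thread pool; each one blocks
        // its thread for a second instead of awaiting the delay.
        var tasks = Enumerable.Range(0, 10)
            .Select(_ => Task.Run(() => Task.Delay(TimeSpan.FromSeconds(1)).Wait()))
            .ToArray();

        Task.WaitAll(tasks);

        // On a machine with fewer than ten cores this takes noticeably longer
        // than one second, because the pool only injects extra threads slowly
        // once the initial ones are all blocked.
        Console.WriteLine($"Took {stopwatch.Elapsed.TotalSeconds:F1} seconds");
    }
}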
What I experienced there is called thread pool starvation. The CLR does not immediately hand the application a new thread every time it wants to do some concurrent work. Instead, the thread pool starts out with roughly one worker thread per processor and, once those are busy, injects additional threads only gradually, at a rate that appears to be one or two threads per second. This is usually a massive problem in older applications that don't make use of the async/await pattern. I will show you why a bit later.
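You can check the numbers the pool starts out with on your own machine. This is just a quick sketch, and the exact defaults vary between runtime versions:

using System;
using System.Threading;

class ThreadPoolDefaults
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);

        // The minimum worker thread count typically matches the processor
        // count; beyond that, new threads are only injected gradually.
        Console.WriteLine($"Processors:         {Environment.ProcessorCount}");
        Console.WriteLine($"Min worker threads: {minWorker}, min IO threads: {minIo}");
        Console.WriteLine($"Max worker threads: {maxWorker}, max IO threads: {maxIo}");
    }
}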
Our service was simply disconnecting people
A year later, one of our services started behaving very strangely. During peak hours it would simply kick people off after some timeout. After some digging, we noticed that this particular service had a massive number of threads compared to all our other services. Not only that, it kept allocating more and more threads at a steady rate. I remembered what Sjøk had told me way back then, and I tried to increase the number of threads available to the application from the get-go. I did this by calling ThreadPool.SetMinThreads with some crazy high number. Bam. The steady incline changed into a massive spike, and the thread count skyrocketed. While I was not happy with the solution, it bought us some time to figure out what was going on and come up with a more scalable, long-term solution.
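For reference, the stop-gap looked roughly like the sketch below. The value 500 is purely illustrative (the real number was just "some crazy high number"), and the IO completion-port minimum is left at whatever it already was.

using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Run once at application startup, before the first burst of load.
        // 500 is illustrative; pick a number based on your own measurements.
        ThreadPool.GetMinThreads(out _, out int minIoThreads);
        if (!ThreadPool.SetMinThreads(500, minIoThreads))
        {
            Console.WriteLine("Failed to raise the thread pool minimum");
        }

        // ... rest of application startup ...
    }
}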
What the hell was going on?
We started digging, and one of the first things we did was run a tool called Concurrency Visualizer. We were flabbergasted by what we saw.
98% of all the “work” that was being performed by our application was to block the current thread. Thread upon thread upon thread in a blocked state, where absolutely nothing was happening. No wonder the application kept spawning more and more threads: it couldn't use any of the existing ones, because they were all occupied! And while none of those threads were using any CPU resources, they were still prevented from returning to the thread pool because they were blocked, causing thread pool starvation.
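You can reproduce the pattern we saw in production with a small sketch like the one below (my own illustration, not our service code). It queues a pile of blocking work and then samples ThreadPool.ThreadCount, which is available from .NET Core 3.0 onwards, once a second to watch the pool grow.

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class WatchTheThreadsGrow
{
    static void Main()
    {
        // 200 tasks that each block a pool thread for ten seconds.
        var blockers = Enumerable.Range(0, 200)
            .Select(_ => Task.Run(() => Task.Delay(TimeSpan.FromSeconds(10)).Wait()))
            .ToArray();

        // Sample the pool size once a second from the main thread; the count
        // climbs steadily as the pool injects new threads to replace the
        // blocked ones. (The demo simply exits after sampling.)
        for (int i = 0; i < 15; i++)
        {
            Console.WriteLine($"{i}s: {ThreadPool.ThreadCount} pool threads");
            Thread.Sleep(1000);
        }
    }
}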
Async/await to the rescue!
I knew from my other conversations with Sjøk and Peter, my old team leader, that we could solve this problem with the async/await pattern. With async and await, the thread is released while you wait for the work to finish, so it can be used for something else in the meantime, like taking care of another client. To demonstrate the magnitude of the problem, I would like to show you a slightly modified version of the program that introduced me to all this.
public async Task StarveThreadPool()
{
    int taskCount = 1000;

    // Async version: each task awaits the delay, releasing its thread.
    await RunTasks(() => Task.Run(async () =>
        await Task.Delay(TimeSpan.FromSeconds(1))), taskCount, "Async");

    // Non-async version: each task blocks its thread for the full second.
    await RunTasks(() => Task.Run(() =>
        Task.Delay(TimeSpan.FromSeconds(1)).Wait()), taskCount, "Non-async");
}

private async Task RunTasks(Func<Task> createTask, int taskNumber, string method)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();

    // Start all the tasks at once and wait for every one of them to finish.
    var tasks = Enumerable
        .Range(0, taskNumber)
        .Select(_ => createTask())
        .ToArray();
    await Task.WhenAll(tasks);

    stopwatch.Stop();
    Console.WriteLine($"{method} run took {stopwatch.Elapsed.TotalSeconds} seconds");
}
This is a very, very simple program. Can you guess how long each of the two RunTasks calls will take to execute?
On my machine, the first one takes more or less a second. The second one takes more or less 50 seconds. That is an unbelievable difference. It happens because the thread pool manager cannot reuse any of the threads while they are blocked for the duration of the delay: it slowly makes more threads available, while also waiting for the current ones to finish executing so it can reuse them. In the async version, a thread is handed back to the thread pool as soon as its delay is awaited. This means that all 1000 tasks can delay concurrently, and they all finish at almost the same time.
On a small note, the program did not behave at all as I expected when I used Task.Factory.StartNew instead of Task.Run. Can you guess why? Read on here for the answer.
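As a hint, and as a sketch of my own rather than the code from the linked answer: Task.Factory.StartNew does not understand async delegates the way Task.Run does. With an async lambda it returns a Task<Task>, and awaiting only the outer task completes almost immediately, long before the inner delay has finished.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class StartNewVersusRun
{
    static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        // StartNew does not unwrap async delegates: the async lambda produces
        // a Task<Task>, and the outer task completes as soon as the lambda
        // has returned the inner task.
        Task<Task> outer = Task.Factory.StartNew(async () =>
            await Task.Delay(TimeSpan.FromSeconds(1)));

        await outer;
        Console.WriteLine($"Outer task done after {sw.Elapsed.TotalSeconds:F2}s"); // ~0s

        // Unwrap() (or Task.Run, which unwraps for you) gives a task that
        // completes only when the inner delay has actually finished.
        await outer.Unwrap();
        Console.WriteLine($"Inner task done after {sw.Elapsed.TotalSeconds:F2}s"); // ~1s
    }
}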
Conclusion
More concurrency is not always a good thing. If every task you add blocks a thread while it waits, adding more and more concurrent tasks will most likely just slow your application down.
In such cases, async and await are powerful allies in your fight against badly optimized (or de-optimized) code. The pattern is not always easy to adopt, since asynchronous APIs are not always available, but when they are, you should definitely use them.
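To make that concrete, here is a hypothetical before/after of the kind of change this usually boils down to. FakeBackend and GetCustomerAsync are placeholders for whatever asynchronous API your real dependency exposes, not anything from our actual service.

using System.Threading.Tasks;

// Stand-in for a dependency that offers an asynchronous API.
class FakeBackend
{
    public async Task<string> GetCustomerAsync(int id)
    {
        await Task.Delay(100); // pretend this is a network call
        return $"customer {id}";
    }
}

class CustomerService
{
    private readonly FakeBackend _backend = new FakeBackend();

    // Before: .Result blocks a thread pool thread for the whole call,
    // which is exactly the pattern that starves the pool under load.
    public string GetCustomerBlocking(int id)
        => _backend.GetCustomerAsync(id).Result;

    // After: the caller awaits, and the thread goes back to the pool
    // while the call is in flight.
    public Task<string> GetCustomer(int id)
        => _backend.GetCustomerAsync(id);
}

Note that you only get the benefit if you stay async all the way up the call chain; a single .Result or .Wait() higher up reintroduces the blocking.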
I tried to use Concurrency Visualizer: I see the CPU percentages, but I don't see how you were able to conclude that ‘98% of all the “work” that was being performed by our application was to block the current thread.’
If your Synchronization % was high, that does not represent a percentage of total work, but rather time spent waiting, which does not use CPU, according to https://stackoverflow.com/questions/8316984/is-thread-time-spent-in-synchronization-too-high. Or perhaps the Stack Overflow post is wrong, and CPU is wasted whenever threads are synchronized, in proportion to the “Synchronization %” row.
You are absolutely correct, Markus! There is no actual work being performed by the CPU – that is why I used quotation marks around that word. I could have been a bit clearer. The blocking still prevented those threads from being released back into the thread pool, causing thread pool starvation.
I have corrected the article to better reflect this; thank you for your feedback.