Thread pool starvation and the power of async/await

This graph perfectly illustrates a symptom of thread pool starvation, something we recently had to deal with at work. If you have ever fought thread pool starvation yourself, the shape of that curve probably looks familiar.

The first time I was ever introduced to the problem of thread pool starvation was by the great Sjøkki Gummi Gau, a close colleague of mine and an amazing developer. He called me up and asked me to guess how long a program would take to run. It was a very simple program: it just spawned 10 tasks, each of which waited for a second. I laughed at him and told him it would of course take only about one second, since the tasks would run in parallel. Imagine my surprise when it took a whole three seconds to complete.
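The exact code is lost to memory, but a minimal sketch of that program could look something like this (each task blocks its thread with a synchronous wait rather than awaiting):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var stopwatch = Stopwatch.StartNew();

        // Spawn 10 tasks that each block a thread pool thread for one second.
        var tasks = Enumerable
            .Range(0, 10)
            .Select(_ => Task.Run(() => Thread.Sleep(TimeSpan.FromSeconds(1))))
            .ToArray();

        Task.WaitAll(tasks);

        // Intuition says ~1 second; on a machine with fewer than 10 cores,
        // throttled thread injection stretches it out considerably.
        Console.WriteLine($"Took {stopwatch.Elapsed.TotalSeconds:F1} seconds");
    }
}
```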

What I experienced there is something called thread pool starvation. This is where the CLR does not immediately hand the application a new thread every time it wants to do some concurrent work. Past a certain count, the thread pool deliberately throttles thread injection: it starts with roughly one thread per processor, and after that the injection rate seems to be only one or two new threads per second. This is usually a massive problem in older applications that don't make use of the async/await pattern. I will show you why a bit later.
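You can inspect that starting point yourself through the standard `System.Threading.ThreadPool` API:

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // The minimum number of worker threads the pool will create on demand
        // before throttling kicks in; by default it matches the processor count.
        ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);

        Console.WriteLine($"Processors:             {Environment.ProcessorCount}");
        Console.WriteLine($"Min worker threads:     {workerThreads}");
        Console.WriteLine($"Min completion threads: {completionPortThreads}");
    }
}
```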

Our service was simply disconnecting people

A year later, one of our services started behaving very strangely. During peak hours it would simply kick people off after some timeout. After some digging around, we noticed that this particular service had a massive number of threads compared to all the other services. Not only that, it was continuously allocating more and more threads at a consistent rate. I remembered what Sjøk had told me way back then, so I tried increasing the number of threads available to the application from the get-go by calling ThreadPool.SetMinThreads with some crazy high number. Bam. The steady incline changed to a massive spike, and the thread count skyrocketed. While I was not happy with the solution, it bought us some time to figure out what was going on and come up with a more scalable, long-term fix.
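The stopgap looked roughly like this; the actual number we used escapes me, so 1000 below is just a placeholder:

```csharp
using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Raise the floor so the pool hands out threads immediately instead of
        // injecting them one or two per second. A blunt instrument: it trades
        // memory and scheduling overhead for time, and the real fix is to stop
        // blocking threads in the first place.
        ThreadPool.GetMinThreads(out _, out int completionPortThreads);
        ThreadPool.SetMinThreads(1000, completionPortThreads);

        Console.WriteLine("Minimum worker threads raised to 1000");
    }
}
```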

What the hell was going on?

We started digging and one of the first things we did was use a tool called Concurrency Visualizer. We were flabbergasted with what we saw.

98% of all the “work” being performed by our application was blocking the current thread. Thread upon thread upon thread in a blocked state, where absolutely nothing was happening. No wonder the application kept spawning more and more threads: it couldn’t reuse any of the existing ones, because they were all occupied!

Async/await to the rescue!

I knew from my other conversations with Sjøk and Peter, my old team leader, that we could solve this problem with the async/await pattern. With async and await, the thread is released while you wait for the work to finish, freeing it to do something else in the meantime, like taking care of another client. To demonstrate the magnitude of the problem, let me show you a slightly modified version of the program that introduced me to all this.

public async Task StarveThreadPool()
{
    int taskCount = 1000;

    // Each task awaits the delay, releasing its thread back to the pool.
    await RunTasks(() => Task.Run(async () => 
        await Task.Delay(TimeSpan.FromSeconds(1))), taskCount, "Async");

    // Each task blocks its thread for the full duration of the delay.
    await RunTasks(() => Task.Run(() => 
        Task.Delay(TimeSpan.FromSeconds(1)).Wait()), taskCount, "Non-async");
}

private async Task RunTasks(Func<Task> createTask, int taskNumber, string method)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();

    var tasks = Enumerable
        .Range(0, taskNumber)
        .Select(_ => createTask())
        .ToArray();

    await Task.WhenAll(tasks);

    stopwatch.Stop();

    Console.WriteLine($"{method} run took {stopwatch.Elapsed.TotalSeconds} seconds");
}

This is a very, very simple program. Can you guess how long each of the two RunTasks calls will take to execute?

On my machine, the first one takes more or less a second. The second one takes more or less 50 seconds. That is an unbelievable difference. It happens because the thread pool manager cannot reuse any of the threads while they block for the duration of the delay. It slowly makes more threads available, while also waiting for the current ones to finish executing so it can reuse them. In the async version, threads are returned to the pool as soon as they hit the await, which means all 1000 tasks can delay concurrently and finish at almost the same time.

On a small note, the program did not behave at all how I expected when I used Task.Factory.StartNew instead of Task.Run. Can you guess why? Read on here for the answer.

Conclusion

More concurrency is not always a good thing. By piling on more and more concurrent tasks that block their threads, you are most likely just slowing your application down.

In such cases, async and await are powerful allies in your fight against badly optimized (or de-optimized) code. They are not always easy to adopt, since asynchronous APIs are not always available, but they should definitely be used when they are.
