
Multi threading C# application with SQL Server database calls


Here's my take on the problem:

  • When using multiple threads to insert/update/query data in SQL Server, or any database, deadlocks are a fact of life. You have to assume they will occur and handle them appropriately.

  • That's not to say we shouldn't attempt to limit the occurrence of deadlocks. It's easy to read up on the basic causes of deadlocks and take steps to prevent them, but SQL Server will always surprise you :-)

Some reasons for deadlocks:

  • Too many threads - try to limit the number of threads to a minimum, but of course we want more threads for maximum performance.

  • Not enough indexes. If selects and updates aren't selective enough SQL will take out larger range locks than is healthy. Try to specify appropriate indexes.

  • Too many indexes. Every index an update has to maintain is another resource to lock, which increases the chance of deadlocks, so try to reduce indexes to the minimum required.

  • Transaction isolation level too high. The default isolation level when using a .NET TransactionScope is 'Serializable', whereas SQL Server's default is 'Read Committed'. Reducing the isolation level can help a lot (if appropriate, of course) - a short sketch follows this list.

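
As a quick illustration of that last point, here is a minimal sketch (purely illustrative) of explicitly requesting 'Read Committed' from a TransactionScope instead of the Serializable default:

    using (var scope = new TransactionScope(
        TransactionScopeOption.Required,
        new TransactionOptions {IsolationLevel = IsolationLevel.ReadCommitted}))
    {
        // database work goes here; it now runs at Read Committed
        // rather than the TransactionScope default of Serializable
        scope.Complete();
    }
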
This is how I might tackle your problem:

  • I wouldn't roll my own threading solution; I would use the Task Parallel Library. My main method would look something like this:

    using (var dc = new TestDataContext())
    {
        // Get all the ids of interest.
        // I assume you mark successfully updated rows in some way
        // in the update transaction.
        List<int> ids = dc.TestItems.Where(...).Select(item => item.Id).ToList();

        // ConcurrentBag (System.Collections.Concurrent) is used because
        // multiple threads may add to this collection at the same time.
        var problematicIds = new ConcurrentBag<ErrorType>();

        // Either allow the TaskParallel library to select what it considers
        // as the optimum degree of parallelism by omitting the
        // ParallelOptions parameter, or specify what you want.
        Parallel.ForEach(ids, new ParallelOptions {MaxDegreeOfParallelism = 8},
                         id => CalculateDetails(id, problematicIds));
    }
  • Execute the CalculateDetails method with retries for deadlock failures

    private static void CalculateDetails(int id, ConcurrentBag<ErrorType> problematicIds)
    {
        try
        {
            // Handle deadlocks
            DeadlockRetryHelper.Execute(() => CalculateDetails(id));
        }
        catch (Exception e)
        {
            // Too many deadlock retries (or other exception).
            // Record so we can diagnose the problem or retry later.
            // problematicIds is a thread-safe collection, so Add is safe here.
            problematicIds.Add(new ErrorType(id, e));
        }
    }
  • The core CalculateDetails method

    private static void CalculateDetails(int id)
    {
        // Creating a new DataContext is not expensive.
        // No need to create outside of this method.
        using (var dc = new TestDataContext())
        {
            // TODO: adjust IsolationLevel to minimize deadlocks.
            // If you don't need to change the isolation level
            // then you can remove the TransactionScope altogether.
            using (var scope = new TransactionScope(
                TransactionScopeOption.Required,
                new TransactionOptions {IsolationLevel = IsolationLevel.Serializable}))
            {
                TestItem item = dc.TestItems.Single(i => i.Id == id);

                // work done here

                dc.SubmitChanges();
                scope.Complete();
            }
        }
    }
  • And of course my implementation of a deadlock retry helper

    // Requires references to System.Transactions and System.Data.SqlClient.
    public static class DeadlockRetryHelper
    {
        private const int MaxRetries = 4;
        private const int SqlDeadlock = 1205;

        public static void Execute(Action action, int maxRetries = MaxRetries)
        {
            if (HasAmbientTransaction())
            {
                // Deadlock blows out the containing transaction,
                // so there is no point retrying if already in a tx.
                action();
                return;
            }

            int retries = 0;

            while (retries < maxRetries)
            {
                try
                {
                    action();
                    return;
                }
                catch (Exception e)
                {
                    if (IsSqlDeadlock(e))
                    {
                        retries++;
                        // Delay subsequent retries - not sure if this helps or not
                        Thread.Sleep(100 * retries);
                    }
                    else
                    {
                        throw;
                    }
                }
            }

            // Final attempt - let any exception propagate to the caller.
            action();
        }

        private static bool HasAmbientTransaction()
        {
            return Transaction.Current != null;
        }

        private static bool IsSqlDeadlock(Exception exception)
        {
            if (exception == null)
            {
                return false;
            }

            var sqlException = exception as SqlException;
            if (sqlException != null && sqlException.Number == SqlDeadlock)
            {
                return true;
            }

            if (exception.InnerException != null)
            {
                return IsSqlDeadlock(exception.InnerException);
            }

            return false;
        }
    }
  • One further possibility is to use a partitioning strategy

If your tables can naturally be partitioned into several distinct sets of data, then you can either use SQL Server partitioned tables and indexes, or you could manually split your existing tables into several sets of tables. I would recommend using SQL Server's partitioning, since the second option would be messy. Note, however, that built-in partitioning is only available in SQL Server Enterprise Edition.

If partitioning is possible for you, you could choose a partition scheme that breaks your data into, let's say, 8 distinct sets. Now you could use your original single-threaded code, but have 8 threads, each targeting a separate partition. Now there won't be any (or at least a minimum number of) deadlocks.
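
To illustrate, a rough sketch of that approach - the PartitionKey column is hypothetical and would be whatever naturally splits your data:

    // Sketch only: assumes a hypothetical PartitionKey column with values 0..7
    // that splits the data into disjoint sets.
    Parallel.ForEach(Enumerable.Range(0, 8), partition =>
    {
        using (var dc = new TestDataContext())
        {
            // Each thread touches only the rows in its own partition,
            // so the threads should rarely, if ever, deadlock each other.
            foreach (TestItem item in dc.TestItems.Where(i => i.PartitionKey == partition))
            {
                // existing single-threaded processing logic goes here
            }

            dc.SubmitChanges();
        }
    });
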

I hope that makes sense.


Overview

The root of your problem is that the L2S DataContext, like the Entity Framework's ObjectContext, is not thread-safe. As explained in this MSDN forum exchange, support for asynchronous operations in the .NET ORM solutions is still pending as of .NET 4.0; you'll have to roll your own solution, which, as you've discovered, isn't always easy to do when your framework assumes single-threadedness.

I'll take this opportunity to note that L2S is built on top of ADO.NET, which itself fully supports asynchronous operation - personally, I would much prefer to deal directly with that lower layer and write the SQL myself, just to make sure that I fully understood what was transpiring over the network.
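
For what it's worth, here is a minimal sketch of what working at that layer looks like - the table, column, and variable names are placeholders, not taken from your code:

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "UPDATE Main SET Processed = 1 WHERE Id = @id", connection))
{
    // You write the SQL yourself, so there is no ORM layer deciding
    // what actually goes over the network.
    command.Parameters.AddWithValue("@id", mainRecordId);
    connection.Open();
    command.ExecuteNonQuery();
}
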

SQL Server Solution?

That being said, I have to ask - must this be a C# solution? If you can compose your solution out of a set of insert/update statements, you can just send over the SQL directly and your threading and performance problems vanish.* It seems to me that your problems are related not to the actual data transformations to be made, but center around making them performant from .NET. If .NET is removed from the equation, your task becomes simpler. After all, the best solution is often the one that has you writing the smallest amount of code, right? ;)

Even if your update/insert logic can't be expressed in a strictly set-relational manner, SQL Server does have a built-in mechanism for iterating over records and performing logic - while they are justly maligned for many use cases, cursors may in fact be appropriate for your task.

If this is a task that has to happen repeatedly, you could benefit greatly from coding it as a stored procedure.

*of course, long-running SQL brings its own problems like lock escalation and index usage that you'll have to contend with.

C# Solution

Of course, it may be that doing this in SQL is out of the question - maybe your code's decisions depend on data that comes from elsewhere, for example, or maybe your project has a strict 'no-SQL-allowed' convention. You mention some typical multithreading bugs, but without seeing your code I can't really be helpful with them specifically.

Doing this from C# is obviously viable, but you need to deal with the fact that a fixed amount of latency will exist for each and every call you make. You can mitigate the effects of network latency by using pooled connections, enabling multiple active result sets, and using the asynchronous Begin/End methods for executing your queries. Even with all of those, you will still have to accept that there is a cost to shipping data from SQL Server to your application.
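
As a rough sketch of those last points (the connection string values and table name are placeholders; prior to .NET 4.5 the Begin/End methods also require "Asynchronous Processing=true" in the connection string):

var connectionString =
    "Data Source=.;Initial Catalog=MyDb;Integrated Security=true;" +
    "MultipleActiveResultSets=True;Asynchronous Processing=true";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT Id FROM Main", connection))
{
    connection.Open();

    // Kick off the query asynchronously...
    IAsyncResult asyncResult = command.BeginExecuteReader();

    // ...do other useful work while the query runs...

    // ...then harvest the results.
    using (SqlDataReader reader = command.EndExecuteReader(asyncResult))
    {
        while (reader.Read())
        {
            // process each row
        }
    }
}
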

One of the best ways to keep your code from stepping all over itself is to avoid sharing mutable data between threads as much as possible. That would mean not sharing the same DataContext across multiple threads. The next best approach is to lock critical sections of code that touch the shared data - lock blocks around all DataContext access, from the first read to the final write. That approach might just obviate the benefits of multithreading entirely; you can likely make your locking more fine-grained, but be ye warned that this is a path of pain.

Far better is to keep your operations separate from each other entirely. If you can partition your logic across 'main' records, that's ideal - that is to say, as long as there aren't relationships between the various child tables, and as long as one record in 'main' doesn't have implications for another, you can split your operations across multiple threads like this:

private IList<int> GetMainIds()
{
    using (var context = new MyDataContext())
        return context.Main.Select(m => m.Id).ToList();
}

private void FixUpSingleRecord(int mainRecordId)
{
    using (var localContext = new MyDataContext())
    {
        var main = localContext.Main.FirstOrDefault(m => m.Id == mainRecordId);

        if (main == null)
            return;

        foreach (var childOneQuality in main.ChildOneQualities)
        {
            // If child one is not found, create it
            // Create the relationship if needed
        }

        // Repeat for ChildTwo and ChildThree

        localContext.SubmitChanges();
    }
}

public void FixUpMain()
{
    var ids = GetMainIds();
    foreach (var id in ids)
    {
        var localId = id; // Avoid closing over the iteration variable
        ThreadPool.QueueUserWorkItem(delegate { FixUpSingleRecord(localId); });
    }
}

Obviously this is as much a toy example as the pseudocode in your question, but hopefully it gets you thinking about how to scope your tasks such that there is no (or minimal) shared state between them. That, I think, will be the key to a correct C# solution.

EDIT: Responding to updates and comments

If you're seeing data consistency issues, I'd advise enforcing transaction semantics - you can do this by using a System.Transactions.TransactionScope (add a reference to System.Transactions). Alternately, you might be able to do this on an ADO.NET level by accessing the inner connection and calling BeginTransaction on it (or whatever the DataConnection method is called).
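
A minimal sketch of both options (MyDataContext is the same placeholder context as above):

// Option 1: ambient transaction via System.Transactions.TransactionScope
using (var scope = new TransactionScope())
using (var context = new MyDataContext())
{
    // all reads and writes in here share one transaction
    context.SubmitChanges();
    scope.Complete();
}

// Option 2: explicit ADO.NET transaction on the context's connection
using (var context = new MyDataContext())
{
    context.Connection.Open();
    using (var transaction = context.Connection.BeginTransaction())
    {
        context.Transaction = transaction;
        // reads and writes here
        context.SubmitChanges();
        transaction.Commit();
    }
}
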

You also mention deadlocks. That you're battling SQL Server deadlocks indicates that the actual SQL queries are stepping on each other's toes. Without knowing what is actually being sent over the wire, it's difficult to say in detail what's happening and how to fix it. Suffice to say that SQL deadlocks result from SQL queries, and not necessarily from C# threading constructs - you need to examine what exactly is going over the wire. My gut tells me that if each 'main' record is truly independent of the others, then there shouldn't be a need for row and table locks, and that Linq to SQL is likely the culprit here.

You can get a dump of the raw SQL emitted by L2S in your code by setting the DataContext.Log property to something, e.g. Console.Out. Though I've never personally used it, I understand that LINQPad offers L2S facilities and you may be able to get at the SQL there, too.
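
For example:

using (var context = new MyDataContext())
{
    // Any TextWriter will do; Console.Out just dumps the generated SQL
    // to the console as the queries execute.
    context.Log = Console.Out;

    var ids = context.Main.Select(m => m.Id).ToList();
}
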

SQL Server Management Studio will get you the rest of the way there - using the Activity Monitor, you can watch for lock escalation in real time. Using the query editor's execution plans, you can see exactly how SQL Server will execute your queries. With those, you should be able to get a good notion of what your code is doing server-side, and in turn how to go about fixing it.


I would recommend moving all the XML processing into SQL Server, too. Not only will all your deadlocks disappear, but you will see such a boost in performance that you will never want to go back.

It will be best explained by an example. In this example I assume that the XML blob is already going into your main table (I call it closet). I will assume the following schema:

CREATE TABLE closet (id int PRIMARY KEY, xmldoc ntext)

CREATE TABLE shoe (id int PRIMARY KEY IDENTITY, color nvarchar(20))

CREATE TABLE closet_shoe_relationship (
    closet_id int REFERENCES closet(id),
    shoe_id int REFERENCES shoe(id)
)

And I expect that your data (main table only) initially looks like this:

INSERT INTO closet(id, xmldoc) VALUES (1, '<ROOT><shoe><color>blue</color></shoe></ROOT>')
INSERT INTO closet(id, xmldoc) VALUES (2, '<ROOT><shoe><color>red</color></shoe></ROOT>')

Then your whole task is as simple as the following:

INSERT INTO shoe(color)
SELECT DISTINCT CAST(CAST(xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) AS color
FROM closet

INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
SELECT closet.id, shoe.id
FROM shoe JOIN closet
    ON CAST(CAST(closet.xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) = shoe.color

But given that you will do a lot of similar processing, you can make your life easier by declaring your main blob as XML type, and further simplifying to this:

INSERT INTO shoe(color)
    SELECT DISTINCT CAST(xmldoc.query('//shoe/color/text()') AS nvarchar)
    FROM closet

INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
    SELECT closet.id, shoe.id
    FROM shoe JOIN closet
        ON CAST(xmldoc.query('//shoe/color/text()') AS nvarchar) = shoe.color

There are additional performance optimizations possible, like pre-computing repeatedly invoked XPath results into a temporary or permanent table, or converting the initial population of the main table into a BULK INSERT, but I don't expect that you will really need those to succeed.