26 June 2014

Optimising bulk inserts with Entity Framework 6

In Entity Framework 6+ the database context implements the unit of work pattern so that any changes you make to the model are not persisted until you call the SaveChanges() method.

This is the most obvious way of implementing a bulk insert: push the changes into the context and persist them in a single operation, as shown below:

using (var context = new SqlDbContext())
{
    foreach (var book in bookList)
    {
        context.Books.Add(book);
    }
    context.SaveChanges();
}

The problem with this technique is that it is slow. Very slow. Record sets of 10k or more are measured in minutes rather than seconds.

Understand what’s happening under the hood

Before you can optimise any data access you should have some insight into the SQL statements Entity Framework is attempting to execute against the database. This is easier to do in version 6, as you can get the context to output all the database calls it makes with a single line of code:

context.Database.Log = Console.Write;

No matter what optimisations you make to Entity Framework, the SQL that goes to the database server stays the same. Each record is added by a separate insert statement, each involving its own round-trip between client and server.

You can make significant gains by changing entity framework settings, but these changes optimise the internal workings of the entity framework itself rather than the way it is calling a remote database. You cannot do anything about the basic bottleneck where inserts are going in row by row.

Switching off change detection

For large data volumes the biggest single saving you can make is by turning off the automated change detection on the database context.

This is Entity Framework’s internal process for checking the current property values of an entity against the original values. It is triggered by any method that updates data and gets particularly expensive when the context is tracking a large number of entities.

Note that if you plan to re-use the context for any other operation then you should re-enable change detection after the bulk insert operation. This can be done through a try/finally clause as shown below:

try
{
    context.Configuration.AutoDetectChangesEnabled = false;
 
    foreach (var book in bookList)
    {
        context.Books.Add(book);
    }
    
    context.SaveChanges();
}
finally 
{
    context.Configuration.AutoDetectChangesEnabled = true;
}

You can also consider switching off model validation so Entity Framework won’t validate each entity as it is submitted to the database. This provides a much smaller performance gain, one that is unlikely to be significant unless you are working with thousands of rows of data.
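Assuming the same context as in the earlier examples, this is a one-line setting that sits naturally alongside the change detection flag:

    // Skip per-entity validation when SaveChanges() is called.
    // Usually combined with disabling automatic change detection.
    context.Configuration.AutoDetectChangesEnabled = false;
    context.Configuration.ValidateOnSaveEnabled = false;

As with change detection, if the context is re-used afterwards you may want to restore the setting in a finally block.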

Batching and resetting the context

As you insert records the context’s object graph tends to get larger, more complex and potentially much slower. Some benchmarks suggest that disposing and re-creating the context after a certain number of records can improve overall performance.

// Assumes context, counter and threshold are already initialised
// and change detection has been disabled on the initial context.
foreach (var book in bookList)
{
    context.Books.Add(book);
    counter++;
    if (counter >= threshold)
    {
        context.SaveChanges();
        context.Dispose();
        context = new SqlDbContext();
        context.Configuration.AutoDetectChangesEnabled = false;
        counter = 0;
    }
}
 
// Persist any remaining records; the caller is still
// responsible for disposing the final context.
context.SaveChanges();

The potential gains here are not guaranteed and depend very much on the circumstances. The latency between your client and the server is particularly significant, as no amount of tweaking will overcome an IO-bound batch operation. If you are using a simple object model and connecting to a remote SQL database hosted in Azure, resetting the context regularly is unlikely to have much impact.

The optimal approach? Don’t use the entity framework!

There’s only so much to be gained by optimising Entity Framework for bulk inserts. It’s just not designed for this kind of scenario. When you are inserting data in bulk you want to push it to the server in a single operation without any chatty feedback or data tracking.

This is why, for large bulk inserts, it’s best to step outside the confines of Entity Framework. For SQL Server the SqlBulkCopy class was designed specifically for large bulk insert scenarios. You’ll normally find that most solutions that extend Entity Framework with dedicated bulk methods use SqlBulkCopy in the background.

It’s cumbersome to work with and does not produce the most elegant code, but it will upload data in a fraction of the time that can be achieved by optimised entity framework code.
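A minimal sketch of the SqlBulkCopy approach, assuming the Book entity maps to a dbo.Books table with Title and Author columns (the table, column and connection string names here are illustrative):

    // Stage the entities in a DataTable whose columns mirror
    // the destination table.
    var table = new DataTable();
    table.Columns.Add("Title", typeof(string));
    table.Columns.Add("Author", typeof(string));
    
    foreach (var book in bookList)
    {
        table.Rows.Add(book.Title, book.Author);
    }
    
    // Push the whole set to the server in one bulk operation.
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.Books";
        bulkCopy.ColumnMappings.Add("Title", "Title");
        bulkCopy.ColumnMappings.Add("Author", "Author");
        bulkCopy.WriteToServer(table);
    }

Without explicit entries in ColumnMappings the columns are matched by ordinal position, which silently breaks if the table and DataTable column orders differ, so mapping by name is the safer choice.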

Filed under C#, .NET Framework, SQL Server.