24 November 2012

Regular expressions and RegEx performance in .Net

If you run into performance problems with regular expressions in .Net then your options for tuning and improving are pretty limited. Your only real hope is to keep a close eye on exactly what you’re throwing into the RegEx black box.

The hammer for every string-related nail

When developers first discover regular expressions it can turn into a bit of a love affair. They’re powerful, flexible and save you the trouble of having to write cumbersome parsing code to manipulate strings. Suddenly regular expressions start to pop up everywhere as they start to seem like the solution for every string-related problem. This is where things can start to go sour.

The bottom line is that regular expressions may be convenient, but they are not necessarily fast. They can also give rise to some pretty unreadable code. Short expressions designed for simple use cases might be easy to understand, but expressions designed for more complex problems can quickly get out of control. If you find yourself using monster-length expressions then your code might have maintainability issues that warrant a change of approach.

For example, it may seem reasonable to validate an email address through a regular expression, but it gets a lot more complex as you start to chase down the edge cases. The following delightful expression will validate an email address according to RFC 2822:

(?:[a-z0-9!#$%&’*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&’*+/=?^_`{|}~-]+)*|”(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*”)@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This cryptic syntax can conceal a lot of complex string parsing logic that takes a fair amount of processing. Unless you have a clear understanding of what it going on under the hood then this can really start to rebound on you. A regular expression can contain subtle bugs that are difficult to unstitch and unexpected performance problems can lurk within them unless they are written very carefully.

Backtracking and runaway expressions

One example of an unexpected problem concerns the kind of runaway back-tracking that can occur when trying to find matches along longer input strings. Jeff Atwood published an example back in 2006 that applied the expression (x+x)+y to a growing string of characters. The code below shows this test implemented in C#:

string pattern = "(x+x+)+y";
Stopwatch watch = new Stopwatch();

for (int n = 1; n <= 30; n++)
{
    var input = new String('x', n);
    watch.Start();
    Regex.Match(input, pattern);
    watch.Stop();
    Console.WriteLine(string.Format("{0} characters takes {1} milliseconds.", n, watch.ElapsedMilliseconds));
    watch.Reset();
}

When the test starts it reports zero milliseconds for the first dozen or so expressions. By the time the input string reaches thirty characters you having to wait around up to five minutes for the expression to execute.

This expression looks for an instance of “x”, another instance of “x” and then tries to find one or more of the previous two matches combined. This has to be followed by an instance of “y”. This turns ugly when the “y” is missing from the string as the regular expression starts to backtrack through every combination of “x” trying to meet the condition. You won’t notice the processing impact on a small string, but the cost grows exponentially with the length of the string.

This may be a contrived example, but it does have real world equivalents. The example below shows a regular expression that checks a CSV file line to ensure that the tenth character starts with a “P”. if the input data is invalid then processing slows exponentially as the number of fields increases.

string pattern = "^(.*?,){10}P";
string input = "1,2,3,4,5,6,7,8,9,10";
Stopwatch watch = new Stopwatch();

for (int n = 11; n <= 30; n++)
{
    input = string.Concat(input, ",", n);
    watch.Start();
    Regex.Match(input, pattern);
    watch.Stop();
    Console.WriteLine(string.Format("{0} fields takes {1} milliseconds.", n, watch.ElapsedMilliseconds));
    watch.Reset();
}

This is caused by imprecise matching. The dot in the expression can also match the delimiter, so the regular expression engine will rifle through the input string finding every possible permutation of fields. It’s a bug in the expression, but one that is difficult to find and will not manifest itself until you have tested it against a particular data set.

Managing the black box

There are generally very few options for tuning the execution of regular expressions. They can be a bit of a black box in that once you have entered your inputs you just have to wait to see what comes out.  You are very much at the mercy of the framework that you happen to be using and .Net in particular offers relatively limited tuning options for regular expressions.

You can exchange faster execution for slower initialisation on frequently used expressions by specifying the RegexOptions.Compiled option when creating a RegEx. However, bear in mind that the engine caches regular expressions in memory anyway so you may not get the performance improvement you hoped for.

If you do have to use regular expressions on the fly then you should use the static matching methods in the RegEx class rather than creating a new instance every time. This will allow you to execute statements without the overhead of object creation and you will still get the benefit of expression caching if you are reusing statements.

If your statements are completely immutable then they can be compiled into separate assemblies using the Regex.CompileToAssembly() method. This is the fastest way of using regular expressions, although you may find the speed gains disappointing given the loss of flexibility.

Beyond that you just have to optimise your statements, watch your use of wildcards and consider your inputs. It’s always more difficult to work with unconstrained input so you might want to sanitise input a little before using a regular expression. This will help to cut down the edge cases that have to be catered for and encourage smaller, more manageable expressions.

Filed under C#, Net Framework.