3 October 2014

The problem with GUIDs…

GUIDs solve an important problem, i.e. how to guarantee that an identifier is globally unique. This is really useful and has all sorts of applications, particularly when you’re replicating or combining data sources.

The problem is that they have become a hammer for every nail as developers lazily reach for them to tag any piece of stray data with a unique identifier. Most identifiers just do not need to be globally unique. For example, do we really need a globally unique guarantee for an identifier on a look-up table or a message time stamp?

The problem with GUIDs is their readability. It’s almost impossible to tell one thirty six letter code from another. This problem becomes particularly acute when you’re trying to troubleshoot data with more than a handful of rows. This readability issue is often dismissed as mere inconvenience, but it’s a real problem for anybody who has to support applications or trouble-shoot data.

GUIDs are often a lazy solution selected by developers who will not have to deal with the support consequences. In many cases there are simpler solutions that make both code and data much clearer to work with. A sequential counter or simple code can make all the difference to somebody who is trying to investigate a fault.

GUIDs as surrogate keys

One of the more common over-uses of GUIDs is as primary keys on data tables. This is a similar problem to identity over-use, where developers automatically reach for an identity column without really understanding the difference between a natural key and a surrogate key.

A natural key is where one or more columns in the can be used to uniquely identify each row. One weakness of natural keys is that they are vulnerable to changes in an organisation’s structure. You’ll want your keys to be immutable so they do not change throughout the lifetime of your application.

Surrogate keys get around this problem by adding an artificial construct to the data that is both unique and immutable. This might seem like a neat way out, but it can actually complicate data design. With a surrogate there’s nothing to save you from adding duplicate data unless you use another construct to enforce uniqueness of the “natural” data such as a constraint. Your data is also less readable as it becomes littered with abstract identifiers.

There is room for both approaches within a sensible data design. I tend to use surrogates to define entities, but prefer natural keys when relating them. For example, I would use surrogate keys for tables that define entities such as “people” and “jobs”, but a natural key in the link table that describes the jobs that people are associated with. Not every table is improved by a GUID.

Performance problems

A GUID can be a bad choice of surrogate for performance reasons, particularly in SQL Server where clustered indexes based on GUIDs tend towards fragmentation. This can be improved by using sequential GUIDs, though these are more vulnerable to being guessed or brute forced. In most cases, the simplest solution is to use boring old incrementing integers.

If you’re going to replicate or combine disparate data sources then you really will need some globally unique identifiers. However, this is an implementation detail that does not have to be baked into data design. There’s nothing to stop you from adding separate identifiers onto your data rows in response to replication requirements.

The point is that most of the time you do not need to pay the readability tax that comes with GUIDs. It’s just inconsiderate to the poor sucker who has to work with a set of obtuse codes later on down the line.

Filed under Rants, SQL Server.