Eventual consistency and the trade-offs required by distributed development

4 May 2014

Developers who have been brought up on database clusters and ACID transactions often have a problem trusting eventual consistency. The idea that a platform should be allowed to “close the books” can be at odds with the long-cherished goal of controlling data concurrency.

Once you start distributing processing then you are forced into a trade-off between the availability of the system and consistency of its data. Given how important availability is to most commercial systems then some inconsistency over data has to be tolerated to ensure that requests can always be processed. However, more often than not, once you start exploring the requirements in more detail then this trade-off doesn't feel so bad.

Identifying the trade-offs

The key trade-offs in a distributed system are best expressed by Eric Brewer’s CAP theorem who suggested that you can only provide two of the following three guarantees at the same time:

Consistency - every node sees the same data at the same time
Availability – every request receives a response
Tolerance to network partitions– the system continues to operate despite the kind of arbitrary message loss or failure associated with separate networks.

Tightly-contained systems such as clustered databases can provide consistency and availability but they have to operate within the same network environment. Consistency implies a need for transaction locking which can be implemented in a more distributed environment, but this will come at the cost of availability. Distributed systems such as DNS or web caches that offer genuine guarantees over availability do so at the cost of some data consistency.

Given that network partitions are a fact of life for larger distributed systems, you have a choice between relaxing consistency or availability. The challenges of maintaining state are such that systems need to tolerate a degree of inconsistency if they are to ensure that requests can always be answered. Thus, eventual consistency becomes a necessary side-effect of ensuring availability.

ACID vs BASE

A key benefit of embracing eventual consistency is that it removes the need to distribute transactions. These have a significant impact on the scalability, performance and availability of systems at scale. They are also very complex to implement.

This can be difficult for developers to understand, particularly when they have been raised on the certainties of ACID transactions (Atomic, Consistent, Isolated, Durable) where clustered database servers take away all the pain of ensuring data consistency. The fact that two key systems can have a different view of the state of a customer account sets their heads spinning as they try to work out how to resolve the various paradoxes and race conditions this can create.

ACID’s scruffier cousin is BASE (Basic Availability, Soft-state and Eventually consistent) which suggests that data updates can be more relaxed. It’s fine to use stale data and not a problem to provide approximate answers. This does not necessarily mean that the hand-cart has been primed as it can be surprising how often you can live without immediate consistency.

Race conditions don’t always exist in the business world

Most of the time, a few seconds of inconsistency is unlikely to make a difference to the underlying process. Developers are often quick to accept them at face value and build in unnecessary cross checking that is not really required by the business requirements.

This can be illustrated by a simple ecommerce example with distributed billing and fulfilment:

If the order has been shipped then don’t allow the user to cancel
Don’t let the goods ship if an order has been cancelled.

On the face of it, a race condition happens when two different users submit orders to ship and cancel at the same time. However, if you drill deeper things become a little more open.

Distance selling regulations mean that the user should be able to cancel after the order has shipped.
A refund doesn’t have to be issued the moment an order is cancelled.
Most cancellations happen very soon after the order has been placed rather than a few days later.

The point is that it’s not always safe to assume that everything needs to be enforced and processed immediately. Long-running processes are actually quite common once you get under the skin of them and true race conditions are relatively rare in business processing.

Avoid getting bogged down in edge cases

Distributed systems are often created to meet complex, distributed business processes. Eventual consistency reflects real-world business processes where different actors collaborate on a system over a protracted period. You rarely get cascades of critical events happening in the short time it takes for a distributed platform to achieve consistency.

Developers using a distributed system have to be aware of which trade-offs have been made. If a system emphasises consistency at the expense of availability then a client will have to manage when to write data. In more available systems they can assume that a write will be expected but will have to be prepared for inconsistencies over the subsequent data reads.

Eventual consistency does inevitably carry a small risk of concurrency problems such as dirty reads. However, once you fully explore your business processes and understand the life cycle of your data these normally turn out to be very small risks that are associated with edge cases. You can usually allow some level of tolerance so the entire platform does not have to be designed for a tiny fraction of problem transactions that may never actually occur.

Filed under Architecture, Design patterns, Messaging, Distributed Applications and Event-driven systems.

Eventual consistency and the trade-offs required by distributed development

Identifying the trade-offs

ACID vs BASE

Race conditions don’t always exist in the business world

Avoid getting bogged down in edge cases

The problem with tiered or layered architecture

When should you write your own message endpoint library?

Messaging anti-patterns in event-driven architecture

Implementing complex workflows in AI Agents

What should architects focus on - and what should be delegated to teams?

Versioning doesn't make it any easier to manage change in APIs