12 October 2018
Messaging anti-patterns in event-driven architecture
Event-driven architecture allows services to collaborate by publishing and consuming events. In this context an event describes a simple change in state. A service can broadcast events to one or more consumers without needing to know who might be listening and how they might be responding.
This approach encourages loose coupling between services by enforcing an indirect style of collaboration where services don’t need to know about each other. The sender doesn’t care who is receiving the event, while the consumer doesn’t necessarily care who sent it.
Services that are integrated via asynchronous event streams tend to scale better than those that use direct API calls or shared data stores. Resilience is also improved as a service outage is less likely to give rise to cascading failure. Events are also highly versatile so can be used to represent pretty much any business process.
The catch is that these advantages of scalability, resilience and flexibility are very much dependent upon the design of events. There are plenty of traps for the unwary that can undermine the potential benefits of event-based integration.
Dependencies between events
Events should be autonomous and atomic. They should not have any dependencies on any other events. A consumer should be able to process an event in its entirety without having to wait for another event to turn up (the chances are that it never will!).
This implies that an event should contain all the information that a consumer needs to process the event. This doesn’t necessarily mean the event needs to be enormous and contain every piece of related data. The design challenge is to include “just enough” data to allow a downstream service to make sense of the event. Consumers only need to be aware of the information that directly relates to the change of state and can be tolerant of missing information if needs be.
An event should be an abstraction that represents a business process. It shouldn’t include the internal implementation details of a service. This kind of information leakage can cause knowledge coupling between services as they become a little too aware of each-other’s internal workings.
Designing events that reflect an underlying relational database model is another type of leaky event. The events are sent in response to changes in database entities and they represent CRUD actions rather than business processes.
This type of entity-based event lacks clarity. It’s not immediately obvious what business process is being represented by an event like "order updated" – has the order been placed, adjusted, picked or shipped? Events should not be used to replicate databases as this tends to leak implementation detail and couple services to a shared data model.
Entity-based events also tend to be very inefficient. Events are sent for every inconsequential change to the entity, creating very “chatty” integrations that put unreasonable pressure on consumers.
Events should be specific in that they model a single business process. It should be possible to understand what an event means from the title alone – e.g. order placed. This helps with clarity and makes it easier for a downstream service to decide whether it needs to process the event.
This clarity is undermined if you create more generic events that use switches or flags to clarify the intent. This is a similar problem to entity-based events where events are based more on an internal data model than an external business process. It tends to give rise to an unclear and inefficient integration.
Implementing a sequence
Given that messaging is asynchronous you cannot guarantee the order in which events will be sent, received and processed. Even “first-in-first-out” guarantees cannot necessarily ensure that messages will be processed in the order in which they were originally sent.
Trying to assert ordering on any event-based architecture tends to add complexity with very little benefit. It will also undermine many of the benefits of event-based messaging, i.e. decoupling, buffering and scalability.
The best solution for ordering is to design it out of your events. This isn’t as difficult as it sounds so long as you are modelling genuine business processes rather than monitoring changes to entities. If you are really stuck then the sender can implement a sequence number, though this is not trivial to implement particularly if you have multiple senders.
Events should be self-contained. Publishers should not make any assumptions around how events will be processed by a consumer. This kind of assumed knowledge couplies services together via events.
For example, a publisher may expect to receive a specific type of event back from a consumer to acknowledge that the original event has been processed. This is using asynchronous messaging to model a request\response exchange. The two services are coupled into a process and you may want to consider re-drawing service boundaries or refactoring your events.
Commands in disguise
Many event-based architectures also implement commands. There are requests from one service to another to do something. Commands often use the same messaging infrastructure as events, except with different semantics and delivery to a single consumer.
Commands can be useful, but they can undermine service autonomy so should be used sparingly. A service needs to have intimate knowledge of what another service can do before it issues a command. You are also allowing services to dictate to each other, which inevitably increases coupling.
My own preference is to do away with commands altogether as you can model pretty much any interaction as an event. Most commands have an equivalent event – e.g. instead of a “place order” command you can have an "order placed" event. The risk here is that you end up with events that are just commands in disguise, i.e. events with a single recipient where the sender expects a response through a related event (e.g. "order accepted").
Queries in disguise
You can’t model every single interaction with events. There will be times when you need to implement a synchronous, query-style interaction. For example, you may need to find a balance for an account or figure out the permissions for a user. Some interactions are real-time questions that demand real-time answers.
You can try to model this kind of exchange using events, but only if you’re happy for it to an asynchronous, long-lived operation. In the cases where only an immediate answer will do then you’re better off using some form of RPC, such as REST or gRPC.
This will create direct, temporal coupling between services. Sometimes this is necessary, bit you should be mindful of those cases where the need for a query interaction is a side-effect of poor design. Your service boundaries may need re-drawing to eliminate the need for any direct queries.
Events as method calls
Once you have an event infrastructure in place it can be easy to get carried away and publish too many events. Events should have some cost associated with them or the boundaries between services will start to feel meaningless.
Too many events can give rise to “chatty” integrations between services where events are used as commonly as a method call. With this style of event you will find over time that services are having to send and receive an escalating number of messages. This can place quite a burden of downstream systems as they need to work harder to keep up with the pace of the message flow.
Too few messages
The opposite problem to using events as method calls is environments where there aren’t enough events. This often happens in legacy environments when a shared database is lurking somewhere within the architecture. Services are accustomed to being able to read and write from a shared store so adding a new event seems like an overhead.
It can take some time for event-driven integration to take hold in a legacy environment, but it’s absolutely worth the investment. You will eventually build up a critical mass of events that allows you to break free from shared databases. This requires that you keep the discipline of implementing event messages in place more immediate database calls.
The choice of underlying messaging platform will have a significant impact on the way that messages are sent and consumed. A message broker such as RabbitMQ will track the status of individual messages, allowing features such as transactional semantics (i.e. retries) and duplicate detection to be handled by the infrastructure.
On the other hand, a high-volume streaming platform like Kafka exposes event logs, leaving it up to the client to track their position and implement any retry logic. The advantage is that it’s easier to scale delivery and provide an “event firehose” where services can consume very high volumes of events and replay streams more easily. The downside is that client implementations become significantly more involved.
It’s important that any messaging technology is “right sized” against likely throughput. Message brokers can become extremely expensive once you are processing hundreds of millions of messages per month. However, it doesn’t make sense to sacrifice a broker’s messaging features unless you really are going to scale beyond that point.