Events, sagas and workflows: managing long-running processes between services

An event-driven architecture can be great for providing loosely coupled and autonomous services, but problems can creep in once you try to manage long-running processes. This is particularly the case when numerous services need to collaborate to fulfil an order or produce a payslip.

These processes can be implemented by having each service issue events when it has completed its part of the processing. The problem is that this kind of diffuse process orchestration can be very hard to manage at scale.
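As a rough sketch of what this looks like (the services and events below are invented for illustration rather than taken from any real system), each service's view of the process is just a local reaction to the previous event:

```typescript
// A hypothetical choreographed fulfilment process: no single service owns it,
// each one simply reacts to the previous event and publishes its own.
type DomainEvent =
  | { type: "OrderPlaced"; orderId: string }
  | { type: "PaymentTaken"; orderId: string }
  | { type: "StockPicked"; orderId: string }
  | { type: "OrderShipped"; orderId: string };

// The payment service's entire view of the process: one step in a longer chain
// whose overall shape is not written down anywhere.
function handleOrderPlaced(event: Extract<DomainEvent, { type: "OrderPlaced" }>): DomainEvent {
  // ...take the payment, then announce the fact...
  return { type: "PaymentTaken", orderId: event.orderId };
}
```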

Things will go wrong along the chain of events, and it can be difficult to diagnose where the problem occurred and who should fix it. Changes to process flows can involve making simultaneous updates to multiple services, each run by a different team. This can be very difficult to achieve in a system that is handling thousands of concurrent processes.

In this sense, events can become a source of coupling that undermines your ability to make changes to processes. For example, making changes to an “order updated” event could have some unintended consequences if it is used in more than one process.

You can design away the problem. To an extent.

These problems can be mitigated by good design. If you have well-defined service boundaries then long-running processes can involve a straightforward hand-off between different services. This reduces the scramble to find somewhere to keep inconvenient logic when a new process is implemented.

The clarity of communication between services can also be improved by using commands. A command is an explicit request for processing sent from one service to another. Commands are very similar to events in that they are generally sent over the same messaging infrastructure, and, for the sake of loose coupling, a recipient can choose to ignore a command much as it can an event.

The key difference is that a command has only one recipient. This makes it a useful means of enabling clear and traceable service collaboration, as it communicates context where an event just communicates state. You know when a process has been kicked off and can trace its progress in terms of clear and explicit commands rather than hunting down related changes in state.
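As a rough illustration of this difference in intent (the message shapes below are hypothetical, not taken from any particular messaging library), an event records what has happened while a command asks a single named recipient to do something:

```typescript
// An event is a broadcast statement of fact: it records what has happened and
// the resulting state, with no opinion about who should react to it.
interface OrderUpdated {
  type: "OrderUpdated";
  orderId: string;
  status: "placed" | "paid" | "shipped";
  occurredAt: string;
}

// A command is an explicit request aimed at one recipient: it carries the
// context of what the sender wants done, which the recipient may still refuse.
interface ReserveStock {
  type: "ReserveStock";
  orderId: string;
  sku: string;
  quantity: number;
}
```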

Are workflow engines the answer at scale?

Once you start running lots of collaborative processes at scale, a host of extra concerns come into play that cannot necessarily be taken care of by individual service implementations. These include ensuring resilience, handling basic failure scenarios such as timeouts and retries, and providing visibility and reporting for active workflows.

In addressing these challenges it’s perfectly viable to offer workflow as a resilient, shared platform service, much in the same way you would for a database cluster or message bus. There are many of these out there, particularly in the Java space.

Netflix Conductor provides a highly available infrastructure for workflows that are defined by JSON files. Zeebe and Camunda provide similar resilience on top of workflows defined by BPMN diagrams, while Amazon Step Functions provide their own kind of “doodleware” to coordinate Lambda functions. In each case, the idea is to abstract the definition of a workflow away from the infrastructure required to run it, so that feature logic continues to reside in services.

The important thing is that workflow logic should be owned and controlled by services rather than being defined through a centralised repository. This can be a surprisingly fine line. Routing flows between services is one thing, but most workflows involve at least some element of decision logic. This may be trivial at first, but over time its scope can creep until the engine becomes a repository for "hidden" business logic.

Housing business logic in a centralised platform can give rise to many of the problems that affected the enterprise service buses of yore. All the knowledge and expertise for this style of platform tends to become concentrated in a small infrastructure team rather than the development teams who are familiar with the domain. This can result in a byzantine web of complexity that everybody is scared to change.

More importantly perhaps, you’ll never be able to get rid of it. Your entire infrastructure will be married to a gigantic single point of failure based on an arcane file format. The horror of having to decode masses of JSON or YAML files accumulated over time will become a formidable barrier to change.

Can sagas help?

One of the main problems with distributing processes across services is handling failure. If a process fails in one service, it can be difficult to unwind the half-completed actions spread across the other services.

The saga pattern provides a more decentralised approach to handling failure in a workflow. Each individual step is expected to provide a compensating action for when something goes wrong. This allows you to manage distributed transactions by providing a means of enforcing data consistency across services. If something goes wrong, an event can be used to ensure that each service takes a compensating action, such as a rollback, to deal with the failure.
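The following is a minimal sketch of the compensation idea, collapsed into a single process with invented step names; a real saga would run these steps in different services and trigger compensation through events rather than direct calls:

```typescript
// Each saga step pairs a forward action with a compensating action.
interface SagaStep {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Run the steps in order; if one fails, compensate the completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (error) {
      for (const done of completed.reverse()) {
        await done.compensate(); // e.g. release stock or refund a payment
      }
      throw error;
    }
  }
}

// A hypothetical order process: each step knows how to undo itself.
const orderSaga: SagaStep[] = [
  {
    name: "reserve-stock",
    execute: async () => { /* call the stock service */ },
    compensate: async () => { /* release the reservation */ },
  },
  {
    name: "take-payment",
    execute: async () => { /* call the payment service */ },
    compensate: async () => { /* refund the payment */ },
  },
];
```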

Unfortunately, sagas only partially solve the problem. They are a useful way of dealing with failure in a distributed system, but you still need to consider how to provide visibility, monitoring and remote management of workflows.

Many saga implementations involve some form of central co-ordinator to solve these problems. This brings us back to many of the problems of workflow engines – particularly the single point of failure – but a co-ordinator should at least be limited to providing co-ordination and visibility rather than implementing logic.

Choosing between choreography and orchestration

There are two approaches to managing long-running processes in event-based architectures. You can distribute your decision making so that processes are executed through decentralised choreography. Alternatively, you can centralise decision making to orchestrate the execution of your processes.

Distributed choreography is only possible if you have a well-evolved infrastructure. Any service-based collaboration should make heavy use of log collation and monitoring tools. If correlation identifiers are sent with commands and events then you can trace a long-running process as it works its way through a service infrastructure. Failures can be easily identified, retries automated or recovery strategies put into play.
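A rough sketch of the idea, using hypothetical field and function names: a correlation identifier minted when the process starts is copied onto every command and event the process produces, so collated logs from different services can be stitched back into a single timeline:

```typescript
// Every message carries the correlation id of the long-running process it belongs to.
interface Message {
  type: string;
  correlationId: string;
  payload: unknown;
}

// Start a new process by minting a correlation id for its first message.
function startProcess(type: string, payload: unknown): Message {
  return { type, correlationId: crypto.randomUUID(), payload };
}

// Each service copies the incoming correlation id onto whatever it publishes next,
// so the whole process can be traced across services in collated logs.
function publishNext(incoming: Message, type: string, payload: unknown): Message {
  const outgoing: Message = { type, correlationId: incoming.correlationId, payload };
  console.log(`[${outgoing.correlationId}] publishing ${type}`);
  return outgoing; // in a real system this would be handed to the message bus
}
```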

Alternatively, you can make co-ordination a specific responsibility. Individual services can be made responsible for owning processes, while the machinery used to manage them is provided as shared infrastructure. The important thing is that process management does not become a centralised service that sits outside all the services, away from the development teams who should be directly responsible for implementing the processes.