In this article, we discuss distributed transactions: why they are necessary in a microservice architecture and what implementation options we have. The explanation is aimed at those who are not familiar with the topic and who may wonder why anyone would add so much complexity to a simple transaction. After all, it prolongs development and increases the number of potential failure points. We’ll explain why it’s needed, provide examples of projects, and include a bit of reflection.
Transactions: The Classic Approach
In the simplest case of a monolithic service, the backend just interacts with a database. For example, we create a user, an order for that user, and an entity related to the order, all within a single transaction. Either everything is created or nothing is, thanks to the database’s ACID properties (https://en.wikipedia.org/wiki/ACID), which protect us from a wide range of potential issues.
Here’s a typical three-tier architecture: frontend, backend, and database. We can write a lot of code within a single transaction that allows us to do everything we need.
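As a minimal sketch of this all-or-nothing behavior, the snippet below uses Python’s built-in sqlite3 module. The table names, the helper functions, and the fail_midway flag are assumptions made up for illustration; they are not taken from any real project.

```python
import sqlite3


def make_db():
    """Create an in-memory database with three hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders(id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE order_items(id INTEGER PRIMARY KEY, order_id INTEGER, item TEXT);
    """)
    return conn


def create_user_with_order(conn, user_name, item, fail_midway=False):
    """Insert a user, an order and an order item in a single transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            user_id = conn.execute(
                "INSERT INTO users(name) VALUES (?)", (user_name,)).lastrowid
            order_id = conn.execute(
                "INSERT INTO orders(user_id) VALUES (?)", (user_id,)).lastrowid
            if fail_midway:
                raise RuntimeError("simulated failure before the last insert")
            conn.execute(
                "INSERT INTO order_items(order_id, item) VALUES (?, ?)",
                (order_id, item))
        return True
    except RuntimeError:
        return False


def count(conn, table):
    """Return the number of rows in one of the tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

When the simulated failure occurs, the user and order rows inserted earlier in the same block are rolled back, so the database never contains a user without the rest of the data.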
Unfortunately, this ideal world is now fading into the past. Systems are becoming distributed. We break up monoliths into separate services for users, orders, and notifications, each with its own database. In this new paradigm, it’s impossible to perform everything atomically within a single transaction. Distributed transactions help solve this problem.
Essentially, we introduce another component whose task is to control transactions: the transaction coordinator. Our action is divided into two phases: first, every participant prepares its changes, and then the commits are executed. In other words, the user service prepares its commit and reports to the coordinator. Then the order and notification services join the same transaction, identified by a shared transaction ID. In the second phase, upon the coordinator’s command, all services execute their commits. This procedure is called a two-phase commit (2PC).
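The two phases can be sketched in a few lines of Python. This is a toy in-memory model, not a real protocol implementation: the Participant class, its can_commit flag, and the state names are hypothetical, and there is no networking, locking, or crash recovery.

```python
class Participant:
    """A hypothetical service taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: lets us simulate a "no" vote
        self.state = "idle"

    def prepare(self, tx_id):
        # Phase 1: do the work, keep it uncommitted, and vote.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self, tx_id):
        # Phase 2, success path.
        self.state = "committed"

    def rollback(self, tx_id):
        # Phase 2, failure path.
        self.state = "aborted"


def two_phase_commit(tx_id, participants):
    """Coordinator: commit everywhere only if every participant voted yes."""
    votes = [p.prepare(tx_id) for p in participants]  # phase 1
    if all(votes):
        for p in participants:                        # phase 2: commit
            p.commit(tx_id)
        return True
    for p in participants:                            # phase 2: abort
        p.rollback(tx_id)
    return False
```

Even this toy version hints at the weak spot: between the two phases every participant sits in the "prepared" state and has to wait for the coordinator’s decision.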
At first glance, this seems like a reasonable approach — we don’t change the paradigm too much. We do the same thing, just with the help of a transaction coordinator. However, in practice, this scheme doesn’t work very reliably — transactions can get stuck, and it’s difficult to recover the system from such a state.
Why the Classic Approach Doesn’t Work
The key issue is that, in the real world, out of the three main properties of a distributed system — consistency, availability, and partition tolerance — you can prioritize no more than two during development. This is known as the CAP theorem, which is well explained here: https://en.wikipedia.org/wiki/CAP_theorem.
We can choose availability and consistency, but if there are network issues, we will suffer from a lack of partition tolerance. Alternatively, we can choose partition tolerance and consistency, but this sacrifices availability — this is the case with the two-phase commit. Nowadays, for businesses, downtime can lead to significant financial losses, so in microservice architectures, the focus is often on availability and partition tolerance. This leaves us with eventual consistency, which microservices must accept and work with.
While no one likes eventual consistency, the industry has more or less learned to live with it without causing significant harm to businesses. Consistency is achieved, but not immediately. For example, a user might be created, but their order isn’t yet complete. Or the order might exist, but the notifications haven’t been sent. This is considered normal. After a short wait, everything will likely become consistent.
Choreography vs Orchestration
There are two approaches to organizing transactions in distributed systems: the two-phase commit described earlier and its alternative — the SAGA pattern. SAGA can be implemented in two ways — choreography-based or orchestration-based. Let’s explore these using the same sequence of actions:
- Create a user
- Create an order for that user
- Send a notification for the order
Choreography is when all services “dance” together. There is no single leader; services interact with each other directly, gradually completing all the necessary work.
Here’s how our example would look with choreography: We create a user, and once this process completes successfully, the service simply sends out an event. Another service listens for this event and can perform its own action based on it. In our case, the order service listens for the event, creates an order, and similarly sends out another event, which triggers the notification service to act.
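The chain of events above could be sketched like this. The in-memory EventBus stands in for a real message broker, and the event names (UserCreated, OrderCreated, NotificationSent) and handler functions are assumptions chosen for the example.

```python
from collections import defaultdict


class EventBus:
    """A toy synchronous event bus standing in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published event type, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus):
    """Subscribe the hypothetical services and return the entry point."""

    def create_user(name):
        # User service: does its work, then announces it.
        bus.publish("UserCreated", {"user": name})

    def on_user_created(event):
        # Order service: reacts to the user event and emits its own event.
        bus.publish("OrderCreated", {"user": event["user"], "order_id": 1})

    def on_order_created(event):
        # Notification service: reacts to the order event.
        bus.publish("NotificationSent", {"order_id": event["order_id"]})

    bus.subscribe("UserCreated", on_user_created)
    bus.subscribe("OrderCreated", on_order_created)
    return create_user
```

Note that no function here describes the whole scenario: the sequence only emerges from who subscribes to what.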
Orchestration is the opposite approach, where a single leader — the orchestrator — manages all interactions. When an action needs to be performed, the orchestrator communicates with all services, commanding them and tracking their progress, knowing where the process stopped in case of failure.
The orchestrator approaches the user service and instructs it to create a user. The service responds, confirming that the user has been created. Then, the orchestrator moves to the order service to create an order. After that, the notification service sends a message, again under the orchestrator’s command.
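For comparison, here is a minimal sketch of the same scenario driven by an orchestrator. The Orchestrator class, the step functions, and the out_of_stock flag used to simulate a failure are all hypothetical; a real orchestrator would also persist its progress.

```python
class Orchestrator:
    """Runs steps in order and remembers how far the process got."""

    def __init__(self, steps):
        self.steps = steps  # ordered list of (step_name, callable)

    def run(self, context):
        completed = []
        for name, action in self.steps:
            try:
                action(context)  # "command" the service
            except Exception as exc:
                # The orchestrator knows exactly where the process stopped.
                return {"status": "failed", "failed_step": name,
                        "completed": completed, "error": str(exc)}
            completed.append(name)
        return {"status": "done", "failed_step": None,
                "completed": completed, "error": None}


def create_user(ctx):
    ctx["user_id"] = 1


def create_order(ctx):
    if ctx.get("out_of_stock"):  # assumption: a flag to simulate a failure
        raise RuntimeError("order rejected")
    ctx["order_id"] = 100


def send_notification(ctx):
    ctx["notified"] = True


STEPS = [
    ("create_user", create_user),
    ("create_order", create_order),
    ("send_notification", send_notification),
]
```

Unlike the choreography variant, the whole sequence is visible in one place (STEPS), and a failed run reports which step broke and which steps had already completed.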
For orchestration, it’s crucial to derive the business process from the actual business logic. When developing an application, it’s important to understand how the business operates and base all interactions on this understanding.
Cinema Example
Let’s say we are building an online booking system for a cinema. As part of the process, we need to perform two actions: reserve a seat for the customer and charge them through a payment service.
In a monolith, both actions could be placed within a single transaction, and everything would execute atomically — either the payment is processed, and the seat is reserved, or nothing happens because the transaction rolls back (for this example, let’s assume that charging is done by deducting from a “balance” field in the database). However, if one service handles payment and another service handles seat reservation, it’s no longer possible to perform this in a single transaction.
Of course, we could ignore the issue and leave it without a transaction. But if we deduct the money first and then reserve the seat, it may turn out that the seat was already taken by someone else while we were processing the payment. The customer would pay but not get their seat. Conversely, if we reserve the seat first and are then unable to charge the customer (due to an expired card or insufficient funds), the customer gets the ticket for free. In the literature, this is known as the dual-write problem (https://www.confluent.io/blog/dual-write-problem/).
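Both failure modes can be reproduced with two naive functions. This is a deliberately broken sketch: the function names and the plain balance and taken_seats values are assumptions that stand in for two separate services.

```python
def pay_then_reserve(balance, price, seat, taken_seats):
    """Naive order 1: charge first, then try to reserve the seat."""
    balance -= price             # write 1: payment service
    if seat in taken_seats:      # write 2 fails: someone took the seat meanwhile
        return balance, False    # the money is gone, but there is no seat
    taken_seats.add(seat)
    return balance, True


def reserve_then_pay(balance, price, seat, taken_seats):
    """Naive order 2: reserve the seat first, then try to charge."""
    if seat in taken_seats:
        return balance, False
    taken_seats.add(seat)        # write 1: seat service
    if balance < price:          # write 2 fails: insufficient funds
        return balance, True     # the seat is held, but nothing was charged
    return balance - price, True
```

Each function returns the remaining balance and whether the customer ended up with the seat; in both failure paths the two values contradict each other.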
All of these issues arise due to the lack of atomicity. In a microservice environment, achieving atomicity again (creating a transaction that covers all possible scenarios) is not feasible. At the very least, this is hindered by the fact that we are interacting with an external service to process the payment.
To solve this issue, we need to understand the nature of the business itself.
When we visit a physical box office at a cinema, paying and reserving a seat don’t happen simultaneously. The cashier essentially holds the seat for us, and if we fail to pay, the seat returns to the pool of available seats. This lack of atomicity is a characteristic of the business itself, and it just needs to be properly reflected in the system’s architecture.
In our case, we could introduce another entity: a “reservation” with a specific status model (e.g., reserved, paid, etc.). Using the SAGA pattern, instead of trying to perform one atomic action, we can execute several actions, updating the status each time. This business process can be modeled after real-world processes, which is much better than trying to cram everything into a single transaction. It also makes it easier to agree with the business on how the process should work, and it makes life easier for support teams, who may need to resolve stuck orders. Finally, it makes the system easier to evolve later, for example by adding a service that emails tickets.
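A reservation with a status model might look roughly like the sketch below. Only the reserved and paid statuses come from the text; the cancelled status, the transition table, and the book_seat function with its charge callback are assumptions added for illustration.

```python
from enum import Enum


class Status(Enum):
    RESERVED = "reserved"
    PAID = "paid"
    CANCELLED = "cancelled"  # assumption: not named in the text


# Which status changes are allowed (hypothetical business rules).
ALLOWED = {
    Status.RESERVED: {Status.PAID, Status.CANCELLED},
    Status.PAID: set(),
    Status.CANCELLED: set(),
}


class Reservation:
    def __init__(self, seat):
        self.seat = seat
        self.status = Status.RESERVED
        self.history = [Status.RESERVED]  # useful for support teams

    def transition(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"{self.status.value} -> {new_status.value} is not allowed")
        self.status = new_status
        self.history.append(new_status)


def book_seat(seat, taken_seats, charge):
    """Hold the seat, try to charge, then mark it paid or release it."""
    if seat in taken_seats:
        return None                      # nothing to hold
    taken_seats.add(seat)                # step 1: the cashier holds the seat
    reservation = Reservation(seat)
    if charge():                         # step 2: call the payment service
        reservation.transition(Status.PAID)
    else:
        reservation.transition(Status.CANCELLED)
        taken_seats.discard(seat)        # the seat returns to the pool
    return reservation
```

Each step is a small local change with an explicit status, so a stuck booking is simply a reservation that has stayed in the reserved status for too long.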
Example with a Travel Agency
The second example illustrates that sometimes it’s fundamentally impossible to address all concerns with a single transaction. If we are developing a service for a travel agency that offers composite services — booking flights and hotels, renting cars — we are compelled to interact with various external services.
We would like to handle everything in one transaction because either the customer pays for everything, or we have to cancel it all. However, we must proceed sequentially: first, purchase the ticket, then book the hotel, and if the hotel is unavailable, cancel the ticket, and so on. This complexity is inherent to the business, and we must reflect it in the service as well. We need to develop an approach to executing the chain of actions in which a failure at any step may force us to cancel the steps already completed.
Should we charge the customer first and then make all the bookings (and can we easily refund them in case of failure)? Or is it better to do the opposite? Which action is best to perform first?
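One common way to express such a chain is to pair every action with a compensating action and, on failure, run the compensations in reverse order. The sketch below assumes this approach; run_saga and the booking functions are hypothetical names, and the external services are replaced by stubs.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure."""
    log = []
    done = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # Undo what has already been done, most recent step first.
            for prev_name, prev_compensate in reversed(done):
                prev_compensate()
                log.append(f"{prev_name}: compensated")
            return False, log
        log.append(f"{name}: done")
        done.append((name, compensate))
    return True, log


def succeed():
    """Stub for an external call that works."""


def hotel_unavailable():
    """Stub for an external call that fails."""
    raise RuntimeError("hotel unavailable")


TRIP = [
    ("book_flight", succeed, succeed),           # compensation: cancel ticket
    ("book_hotel", hotel_unavailable, succeed),  # compensation: cancel hotel
    ("rent_car", succeed, succeed),              # compensation: cancel rental
]
```

With the TRIP definition above, the flight is booked, the hotel step fails, the flight is cancelled again, and the car is never rented. Reordering the list is how the question of which action to perform first gets answered in code.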
What Does Scale Have to Do With It?
The choice between choreography and orchestration usually depends on the scale of the system.
Choreography has a relatively narrow scope: each service “sees” itself and the neighbors it communicates with. None of them “see” the system as a whole. This is both a blessing and a curse. For small systems, everything is straightforward. But for larger systems, with 20 or more microservices, it becomes unclear what state the entire system is in and where exactly the problem lies that causes everything to come to a halt. As a result, choreography is not recommended when a sequence involves many services or complex interactions, because the business logic becomes spread thin across all services.
We send out events, pretending not to know who is subscribed to what, or even whether anyone is subscribed at all, which gives us loose coupling. However, we don’t send useless events just for the sake of it. We only send what we know another service will need, and we usually even know which one. This means that the scenario is never explicitly written down; it is spread throughout the system. That makes it harder to maintain, because there isn’t a single file where we can see the entire description, and no single service fully understands the scenario. If the system begins to grow, maintaining choreography becomes very challenging.
In the end, choreography is better suited for simpler cases.
In the orchestration model, we give the orchestrator more authority. Essentially, we define and store the business process on the orchestrator’s side, outlining the various stages and how they are interconnected. The orchestrator keeps track of what has been done during the process. Meanwhile, the services themselves do not know who they are connected to, and the entire process is edited centrally. For instance, if tomorrow we want notifications to be sent differently (through another service), this can easily be managed via the orchestrator.
However, orchestration is heavyweight in its own right. As soon as you adopt it, you have to consider various factors. If something goes wrong with the orchestrator, the entire distributed system may come to a standstill. In simple cases, that overhead isn’t justified.
It may seem that the orchestrator is essentially a monolith from which we’ve just pulled out individual services. But that’s not the case. The orchestrator is agnostic; it knows nothing about our business or its logic. The entire business process is described in a separate file — this could be XML or even Java code. This file is what remains of the monolith.
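To illustrate the split between a generic engine and a process description, here is a small sketch in which the process is plain data. The PROCESS format, the service names, and the run_process and reroute helpers are invented for this example and do not correspond to any particular workflow engine.

```python
# The process description: this is the part that "remains of the monolith".
PROCESS = [
    {"step": "create_user", "service": "users"},
    {"step": "create_order", "service": "orders"},
    {"step": "send_notification", "service": "email"},
]


def run_process(definition, services, context):
    """Generic engine: it knows nothing about users, orders or notifications."""
    trace = []
    for step in definition:
        call_service = services[step["service"]]
        call_service(step["step"], context)
        trace.append((step["step"], step["service"]))
    return trace


def reroute(definition, step_name, new_service):
    """Return a copy of the definition with one step sent to another service."""
    return [
        {**step, "service": new_service} if step["step"] == step_name else step
        for step in definition
    ]


def make_services(names):
    """Build stub services that only record the commands they receive."""
    def make(name):
        def service(command, context):
            context.setdefault("calls", []).append(f"{name}:{command}")
        return service
    return {name: make(name) for name in names}
```

Sending notifications through a different service then becomes a change to the definition, for example reroute(PROCESS, "send_notification", "sms"), while the engine and the other services stay untouched.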
In other words, the choice between choreography and orchestration represents yet another trade-off in the world of microservices. Overall, it seems to me that there are no good solutions in this realm. There are pains, and you simply choose which pain suits you better today.
Ready-Made Solutions
There are many ready-made frameworks to address tasks related to orchestration and choreography. There are fewer for choreography, as the lack of a single central control is harder to encapsulate in a separate library. However, my recommendation is that if such a task arises in a project, do not write your own orchestrator or workflow manager. Often, this is a Sisyphean task. Instead, look for existing solutions and try to use them.
Here’s a list of ready-made frameworks on GitHub: Awesome Workflow Engines.
My subjective impression is that among backend developers, not many have worked with orchestrators or choreography, as they have not been in high demand until recently. It seems that this will change in the near future, as distributed microservices systems are becoming more prevalent. To work with them, it’s important to know that orchestrators and workflow managers exist and to understand how they work (and to consciously choose not to use them if that’s the case).
This article was written in the wake of a training session on distributed transactions by Dmitry Litvin.