Kafka for Junior Developers, Analysts, and Testers

Maxilect
Nov 8, 2024


A few years ago, the Kafka hype began. Everyone wanted to use Kafka, often without fully understanding what they specifically needed it for. Even today, many still add Kafka to their projects, sometimes expecting its mere presence to magically improve everything.

On one hand, this trend can be positive, as it drives the industry forward. However, it’s best to understand what you’re doing; otherwise, it could end up making your project worse. In this article, I address developers, analysts, and testers who haven’t yet encountered Kafka in their work. I’ll help clarify why many choose Kafka over simply using REST in a microservices environment — what exactly Kafka does, and when it makes sense to use it.

The Two Generals Problem

This model perfectly illustrates the root of many issues in microservices — something that also relates, in part, to Kafka. Let’s start there.

A lot is said about this concept, but I recently read an article where the author seemed unaware that such a universal law exists and was trying to find a solution that doesn’t actually exist. So, let’s begin with the basics.

Imagine there are two generals on a battlefield, each with their own army. Between them lies the enemy. The generals want to attack from both sides. However, they must agree on the timing and coordinate the attack to strike simultaneously; otherwise, they’ll be defeated one by one.

Each general has their own messengers. The first general sends a messenger to the second to inform him that the attack will begin tomorrow at 8 a.m. But the enemy is vigilant and may intercept the messenger before he reaches the second general.

If the messenger reaches his destination safely, the second general receives the message. But now, he must send a messenger back to confirm that the message has been received and that the attack is on. After all, the first general, being aware of the risks, is unlikely to proceed without this confirmation. However, as before, the enemy might intercept this messenger, and then the first general would never know if his message was received.

The messenger could make it back to the first general, but for the attack to go ahead, the first general needs to confirm yet again (the second general doesn’t want to charge in alone). This means the messengers would have to travel back and forth continuously to ensure that each general knows the other received their message.

This dilemma has no definitive solution — it’s a fundamental limitation that cannot be avoided. If the messenger does not return, the only option is to keep sending more messengers until one finally makes it back to confirm that both sides are ready.

In microservices, the situation is much the same, except here, the generals are the microservices themselves, and the hostile environment between them is the network — whether the internet or a local network.

For instance, imagine we need to deduct $100 from a customer named John via a payment service. Our application communicates with the service over REST. We can be confident in the result only if we receive all confirmations (responses). The response could be “yes, the money has been deducted” or “no, it hasn’t been deducted because there are insufficient funds.” And that’s an ideal scenario.

There are also plenty of situations where we may not get a response at all.

First, our request might not reach the payment service (due to network issues or because the service on the other end is unavailable). The payment service could also receive our message but fail to deduct the money if it crashes while processing the transaction.

The service may have deducted the money from the client, but it might not respond to us due to network issues, or because our own application is down, meaning the deduction happened, but there is no one left to receive the confirmation.

Our own application could also crash after the deduction is completed but before a record of it is made.

In the last two situations, it’s as if we’ve deducted the money, but our application isn’t aware of it — meaning we still lack a definitive result. Generally, this situation is unsolvable. Just like with the two generals, we can only be certain of the outcome if we receive an explicit response that everything went okay.

A Complicated Path Beyond Simple REST

In all unsuccessful scenarios, we’re left with repeatedly contacting the payment service to find out what happened. But retrying requests over REST isn’t an ideal solution either. A typical case is when the network briefly glitches, and we end up accidentally making 10 deductions.

This issue can be addressed with idempotency, meaning we design a mechanism on the payment service side that prevents the same request from deducting money twice. For example, we could add a request ID inside the message, and if it repeats, simply ignore it. In this case, the server should be able to handle duplicate messages.
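As a rough sketch of the idea (the class and method names here are invented for illustration): the client generates a request ID once and reuses it on every retry, and the payment service remembers which IDs it has already handled, so the deduction happens at most once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical payment-service handler: the same requestId arrives with every retry,
// so a duplicate is recognized and the money is deducted only once.
// In a real service the processed IDs would live in a database, not in memory.
class PaymentHandler {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    String deduct(String requestId, String customerId, long amountCents) {
        // If this requestId was already handled, return the stored result instead of deducting again.
        return processed.computeIfAbsent(requestId, id -> doDeduct(customerId, amountCents));
    }

    private String doDeduct(String customerId, long amountCents) {
        // ... the real balance check and deduction would happen here ...
        return "OK";
    }
}
```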

The next problem is performance. Suppose the payment service is down, and we keep retrying. We would need to maintain a constant stream for this, essentially DDoSing the network with these retries.

We need to consider additional techniques — for example, if the initial retries fail, try again after 30 seconds, then 2 minutes, then 10 minutes, and so on. This increasing interval between retries is called backoff. We might even want to stop retries for a while when the service is unavailable (this is called a circuit breaker). Instead, we could ping the service once an hour to check if it’s back up, resuming the deduction attempts only after the payment service is restored.
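A minimal backoff sketch, assuming we hand-roll it (in practice teams often reach for a ready-made library such as resilience4j instead):

```java
import java.time.Duration;
import java.util.List;

// Each failed attempt waits longer than the previous one before retrying.
class RetryWithBackoff {
    private static final List<Duration> DELAYS = List.of(
            Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(10));

    boolean callWithBackoff(Runnable request) throws InterruptedException {
        for (Duration delay : DELAYS) {
            try {
                request.run();                  // e.g. the REST call to the payment service
                return true;                    // success, stop retrying
            } catch (RuntimeException e) {
                Thread.sleep(delay.toMillis()); // back off before the next attempt
            }
        }
        return false; // give up for now; a circuit breaker would pause all calls and only ping periodically
    }
}
```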

The next issue is thread starvation. While the retry requests are in memory, threads remain occupied. If new requests to the same service also get stuck, eventually we run out of threads and the memory they hold, which can end in an Out of Memory (OOM) error, meaning our own service may stop responding because of this.

If our service goes down at that moment and then comes back up, no one remembers what happened in all those suspended threads. It’s clear that some requests were being processed, but which ones… In other words, we lose all context. We cannot continue working; everything has to start over.

To tackle this problem, it makes sense not to hold onto threads, but to organize something like a queue: park the pending requests in an intermediate store (for example, a database table) and let a background thread retry these messages.
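A sketch of that idea, with an in-memory queue standing in for the intermediate database table (all names are invented): the caller parks the request and returns immediately, and a background task periodically tries to deliver whatever is pending.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Requests are parked in a store and retried in the background,
// so no request-handling thread is held hostage while the payment service is down.
class OutgoingRequestRetrier {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void enqueue(String request) {
        pending.add(request);           // the caller returns immediately
    }

    void start() {
        scheduler.scheduleWithFixedDelay(this::drain, 0, 30, TimeUnit.SECONDS);
    }

    private void drain() {
        String request;
        while ((request = pending.peek()) != null) {
            if (!trySend(request)) {
                return;                 // service still down, keep the message for the next round
            }
            pending.poll();             // delivered, remove it from the queue
        }
    }

    private boolean trySend(String request) {
        // ... the actual call to the payment service would go here ...
        return true;
    }
}
```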

So with REST we end up on a long journey: from a simple call to a system that delivers messages at least once (possibly more than once, which is not a problem once the receiver is idempotent) and is eventually consistent, meaning it is not consistent right now but will become consistent eventually.

Next, the question of load balancing arises. If the payment service goes down and then comes back up, everyone who previously tried to reach it will start bombarding it with requests that are queued up. However, it may only be able to handle one request per second. Therefore, we need to operate more quietly, balancing the load; otherwise, the payment service will have to start dropping some requests using a rate limiter.

Somewhere around this stage, the idea arises: since we already have some persistent queue, why not use Kafka? Essentially, it addresses the same problems, just positioned outside our service rather than inside. So we can simply replace the home-grown queue scheme with Kafka.

We will simply place messages from our application into Kafka, and the payment service will retrieve them from there. Unfortunately, Kafka does not qualitatively solve these problems, only quantitatively. After all, if the queue is unavailable, we find ourselves in a situation similar to the one with retries: the service is left shrugging its shoulders and has to suspend its work. That’s why very high availability is required for the queue itself. By ensuring high availability for Kafka, we increase the availability of our services, decoupling them from direct requests to each other.

Thus, we turn to Kafka when we realize that even with REST, everything is indeed complicated, and a multitude of issues arises that Kafka helps to solve. It simplifies interactions from our application’s side — it’s easier to put data into Kafka than to build complicated queue systems. Additionally, it provides features that REST cannot offer, such as client-side load balancing.

We also achieve proper decoupling. It doesn’t matter to us whether the service is available or not at the moment — we just push everything into Kafka, and the system becomes asynchronous and slightly more stable. We move away from the anti-pattern of fragile architecture.

Anti-pattern of Fragile Architecture

Let’s assume we have 50 microservices that synchronously call each other via REST. If one microservice goes down, any chain of calls that eventually reaches it fails at that point, so in effect every service in that chain fails along with it.

If we replace REST with Kafka, we won’t radically improve the situation — one of the microservices is still down. But we certainly won’t make it worse. The other services are available and can at least try to respond from the cache or in some other way (and at least avoid creating unnecessary traffic from retries).

The fragile architecture above is, of course, a deliberately simplified example. Replacing everything with Kafka is not always advisable, as queues complicate maintenance. You lose a message, and you can’t even figure out which microservice failed to send it to which queue. This often requires building distributed tracing to see how requests flowed, setting up monitoring, and so on.

Overall, the industry consensus seems to be: if you can avoid using microservices, don’t use them. The same goes for Kafka. If you don’t know why you need Kafka, it’s likely that you don’t need it. It appears that not everyone understands the complexity they bring when integrating Kafka into a project — they slip into a cargo cult, adopting Kafka merely for the sake of the tool.

It’s important to assess the complexity of your system and the cost of an error for the business. There can be very simple systems with low error costs — if a client comes in and you don’t serve them, you lose 10 cents. Such risks can be ignored. The key is to avoid finding yourself in the position of building yet another Kafka. If the cost of an error is $10,000 and a lawsuit, the situation takes on a more serious turn. In that case, it makes sense to protect against that error in every possible way, including by increasing complexity.

Why Kafka is So Popular

In fact, there are many message brokers available. However, it seems that even teams using RabbitMQ are gradually moving to Kafka, as it is a good open-source solution that can be flexibly configured for different use cases. It’s as if, at a certain point, everyone just believed in the concept underlying Kafka and started applying it everywhere.

Kafka also holds up well in terms of performance. For the same topic, you can add partitions, and throughput will grow almost linearly. Millions of messages can flow through it; the only question is whether you are willing to pay for the servers.

How Kafka Works

The workflow typically starts with a producer, which writes a message to Kafka over TCP/IP, making a blocking synchronous call to the cluster — specifically, to the server that is currently the leader for the target partition (in this respect, the situation is very similar to REST).

Inside Kafka, there are always several servers. One acts as the leader (strictly speaking, leadership is assigned per partition), while the others hold replicas. In fact, they duplicate each other’s data. We interact with the leader, while the replicas copy messages from it and are ready to take over if the leader goes down. This provides a good guarantee of availability.

Similar to the two generals’ problem, the producer waits for Kafka to confirm that the message has been received. Although the code makes it appear that each message is sent instantly, the producer actually sends them in batches, storing them in an intermediate buffer. The first setting to discuss is the method of sending messages:

  • Synchronous: The producer sends the message and waits for a response. With synchronous sending, we can guarantee the order of message delivery.
  • Asynchronous: The message is sent, but the producer does not wait for delivery confirmation and continues to send subsequent messages. This can disrupt the order of sending.

By default, many use asynchronous sending.
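Here is a minimal sketch of both modes with the standard Java client (kafka-clients); the broker address, topic name, and payload are made up for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // invented broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("payments", "john", "{\"amount\": 100}");

            // Synchronous: block until the broker confirms (or the send fails).
            producer.send(record).get();

            // Asynchronous: hand the message to the buffer and move on;
            // the callback fires when the batch is actually delivered.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // delivery failed: log it, retry, or park it somewhere for later
                }
            });
        }
    }
}
```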

Another producer setting is the requirement for confirmation, which balances speed and reliability:

  • 0: The Kafka cluster does not need to respond to us whether it received the message. This is the fastest option.
  • 1: The Leader — the server in the cluster we are contacting — must confirm receipt.
  • All: The Leader sends a confirmation only after all in-sync replicas have received the message. This is necessary in case the Leader suddenly stops. It gives the strongest guarantee that the message will be received and not lost, but at the same time, it is the slowest sending method.

Choosing between the method of confirmation (or the absence of it) is another trade-off.
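In the Java client this trade-off is literally one line of producer configuration; the helper method below is just an illustration of the three options.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AcksSettings {
    static Properties withAcks(String acks) {
        Properties props = new Properties();
        // "0"   - do not wait for the broker at all: fastest, messages can be lost
        // "1"   - wait until the leader has written the message
        // "all" - wait until the leader and the in-sync replicas have it: slowest, safest
        props.put(ProducerConfig.ACKS_CONFIG, acks);
        return props;
    }
}
```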

To send messages, the following data is needed (a configuration sketch follows the list):

  • Addresses of all the servers in the Kafka cluster, as they are specified in the configuration. If one replica goes down, you can connect to another and continue working with it.
  • The authentication scheme for the Kafka cluster (login/password, certificate, or something else). It is possible to configure it for anonymous sending to Kafka, but serious players usually implement authentication.
  • The name of the topic. There are often conventions for naming topics. Changing these names is complicated and painful, so it’s better to come up with good names right away to avoid renaming later.
  • The maximum size of messages, which is usually set to 1 MB. This can be increased, but it seems better not to play with it. For larger messages, you can use S3, sending only the link to the file in Kafka. In my current project, that’s how it works — we have huge gigabyte-sized JSON files compressed using a columnar format and stored in S3. A link to these compressed files is added to the Kafka message.
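Pulled together, a hypothetical producer configuration might look like this; the broker names, credentials, and size limit are invented for illustration, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerSettings {
    static Properties paymentProducerProps() {
        Properties props = new Properties();
        // All known brokers, so the client can fail over if one of them is down.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka-1:9092,kafka-2:9092,kafka-3:9092");
        // Authentication: SASL over TLS with a login/password pair (invented credentials).
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"payments-app\" password=\"secret\";");
        // Keep requests within the usual 1 MB message limit.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_048_576);
        return props;
    }
    // The topic name itself (e.g. "billing.payment-requests") lives in the sending code,
    // so choose it carefully: renaming later is painful.
}
```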

Inside the server, there is a topic that is divided into partitions. Essentially, partitions are our queues. Messages are stored in the form of a log, allowing for forward and backward navigation — reading sequentially or jumping to a specific point to read something again.

The number of partitions is configurable. You can create a topic with a single partition, but then consumers will only be able to read it in a single thread (in one instance). Different services can read the same partition in parallel only because they belong to different consumer groups; within one group, a partition is assigned to a single consumer, so multiple instances of the same service cannot read the same partition at once, which limits scalability. If we need load balancing on the consumer side, we need to create the required number of partitions.

For an existing topic, increasing the number of partitions is easy. Reducing the number, however, is not supported: if you suddenly need fewer partitions, it is almost simpler to drop the topic and create a new one.
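A sketch of creating a topic with a chosen partition count through the admin API; the topic name, partition count, and replica count are illustrative.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// The partition count chosen here caps how many instances of one consumer group
// can read the topic in parallel, so it is effectively a scalability decision.
public class CreatePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092"); // invented address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3); // 6 partitions, 3 replicas
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```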

Sometimes the order of message consumption is important — they must be read strictly in the same order they were sent. In this case, you should send them to one partition using the correct key. For the key, you can use the User ID. It is crucial to understand these nuances during the system design phase.
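For example (the topic name and payload are made up): using the user ID as the record key sends every message for that user to the same partition, so the per-user order of deductions is preserved.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedRecords {
    static ProducerRecord<String, String> deductionFor(String userId, String payloadJson) {
        // key = userId: all deductions for this user land in one partition, in send order
        return new ProducerRecord<>("payments", userId, payloadJson);
    }
}
```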

I mentioned earlier that Kafka writes messages to an infinite log. In theory, we can avoid deleting anything from this log. However, it will continue to grow, so it’s advisable to configure a message deletion policy for those messages that have already been processed. Unfortunately, Kafka’s logic does not allow it to “know” whether a message has been read or not (Kafka keeps track of offsets in the log, but this does not mean that all previous messages have been read, as offsets can be manually adjusted).

For each topic, you can set a maximum size or a time period after which messages can be deleted. This configuration can be a pain point because a message may have been recorded but not yet read, even if a considerable amount of time has passed. It’s important to understand what limitations we are setting and whether they will be sufficient. For example, if you set a volume limit but have a very intense message flow (millions per day), you will quickly reach the size limit of the topic, and all messages will start to be deleted before processing. It’s better to base this approach on business logic — how critical is it for the business that data older than a certain time is lost? Is the business willing to pay for the storage of old messages?
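Retention is configured per topic; extending the topic-creation sketch above, the limits might look like this (the numbers are illustrative, not recommendations).

```java
import java.util.Map;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicRetention {
    static NewTopic paymentsTopicWithRetention() {
        return new NewTopic("payments", 6, (short) 3)
                .configs(Map.of(
                        "retention.ms", "604800000",       // drop data older than 7 days
                        "retention.bytes", "1073741824")); // or once a partition grows past ~1 GB
    }
}
```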

Additionally, there is an interesting nuance in the message deletion settings related to European projects. They are subject to GDPR, which can lead to the opposite situation — where retaining messages containing personal data for a long time is problematic.

By default, consumers retrieve messages from Kafka over TCP/IP not continuously, but every 5 seconds. This interval is configurable. By default, messages are also transmitted in batches of 500. The consumer must regularly notify Kafka that it is alive. In one of my projects, both message retrieval and this health check were conducted in the same thread. We processed the received messages for so long that we didn’t have time to send the health check — Kafka would rebalance and disconnect what it thought was a non-functional consumer. There are many such nuances, including those related to committing offsets in the partition. I recommend understanding these.
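A consumer sketch along those lines, assuming the same invented cluster and topic: the batch size is reduced and offsets are committed manually only after a batch has actually been processed, to avoid exactly the kind of rebalance described above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092"); // invented address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-service");       // one group per service
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);       // smaller batches than the default 500
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // process the message; keep this fast enough not to exceed the poll interval
                }
                if (!records.isEmpty()) {
                    consumer.commitSync(); // advance the group's offset only after the batch is done
                }
            }
        }
    }
}
```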

Warnings

If Kafka has been introduced to a project, it’s wise to allocate time to study it in more detail. Many different things can go wrong with Kafka. If there’s something that can go wrong, sooner or later, it will. Therefore, it’s essential to understand what you are doing.

The advantage and at the same time the drawback of Kafka is its flexibility. You can configure a lot of custom scenarios. However, it’s important to realize that it’s like working in Linux — you will have to set it up yourself. You can’t just take it and use it for your purposes because there’s a high probability that something won’t work as expected, and initially, you may not even suspect it. You will find out later, through pain, errors, and losses.

It’s one thing to set up a log collection through Kafka. In this case, you’re collecting a lot of data that you’re not afraid to lose. It’s another case entirely when it comes to payments through Kafka in fintech. In this scenario, messages cannot be lost at all; you cannot charge someone twice or miss a specific timing. These are completely different cases, and you must approach Kafka differently in each.

Kafka has many settings that need to be adjusted. You must familiarize yourself with them. Even better, read a few books because a basic “Hello world” example can be whipped up in an hour. But to truly understand how it works, you’ll need to spend at least a couple of weeks on initial training.

This article was written shortly after a training session on distributed transactions by Dmitry Litvin.
