Many people think about how an application starts in the cloud but rarely pay attention to how it ends. At some point, we caught quite a few errors clearly related to pods stopping. For example, we saw that Kubernetes occasionally killed our application before it had released its resources, although that was not supposed to happen. The problem could not be reproduced on demand, and we wondered: what was happening under the hood?
During our research, we found several points that need a graceful shutdown in our service. With a couple of examples in this article, I will show why it is crucial to think about this and how you can approach this task.
We develop on Kotlin/Spring Boot. The project runs in the cloud, and Kubernetes manages its life cycle. We configure how our applications should live, and Kubernetes takes care of the rest without asking us for details.
We didn’t understand exactly how this works within the application life cycle — what signals Kubernetes sends to pods, when it sends them, and how the applications handle them. In this article, I will show two simple examples of how vital a graceful shutdown can be.
But I’ll start from afar — with how it happens in a general environment.
OS Application Life Cycle
To communicate something to an application, the OS sends it a signal with a specific code. The idea appeared in POSIX-compatible operating systems and is still in use. There are now about 30 different signals, but I will mention only those related to terminating an application:
- SIGINT is the interactive interrupt signal, which should terminate the running application “in normal mode” (without haste). This is what Ctrl+C sends in a terminal, and what a Java process receives when you click Stop in IDEA. The process is free to handle SIGINT as it sees fit: clean up and exit, ask the user for confirmation, or even ignore it.
- SIGTERM is the default process termination signal — the one the kill command in Linux sends unless told otherwise. The process still has a chance to stop its threads, release resources, or even ignore the signal.
- SIGKILL terminates the process immediately and unconditionally, with no chance to release resources or stop threads.
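On the JVM, the standard way to react to SIGTERM or SIGINT is a shutdown hook. A minimal sketch (the cleanup logic here is a placeholder — what to release is project-specific):

```java
public class ShutdownDemo {
    public static void main(String[] args) {
        // The JVM runs shutdown hooks on SIGTERM, SIGINT, or a normal exit —
        // but NOT on SIGKILL, which gives the process no chance to run any code.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // close connections, flush files, finish in-flight work here
            System.out.println("shutdown hook: releasing resources");
        }));
        System.out.println("working");
        // main returns; the JVM then runs the registered hook before exiting
    }
}
```

This is exactly why a process killed with SIGKILL can leave files half-written: the hook above simply never runs.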
It should alarm you if you constantly have to click the skull icon in IDEA (the equivalent of SIGKILL) to kill a process or test. When a service cannot be stopped with SIGTERM or SIGINT, SIGKILL has to be used — and then it is quite possible to lose a request or fail to write something valuable to a file, catching nasty bugs.
Signal Mess — Kubernetes and Spring Boot
When managing pods in a cloud environment, Kubernetes also uses signals, sending them according to its internal logic. It can restart a pod simply because it decided to move it to another node, without any command from our side: it starts a new pod and then kills the old one. The application seems to keep working, but it has already encountered something we did not expect.
Kubernetes does not require processes to terminate immediately. It sends SIGTERM to the container, waits for the termination grace period (configurable, 30 seconds by default), and sends SIGKILL only if the process has not exited by then.
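The timeout is set per pod via `terminationGracePeriodSeconds`. A sketch of where it lives in a pod spec (the names and image below are hypothetical; only the field itself matters):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                      # hypothetical name
spec:
  terminationGracePeriodSeconds: 30   # time between SIGTERM and SIGKILL (default: 30)
  containers:
    - name: app
      image: example/demo-app:latest  # hypothetical image
```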
What could possibly go wrong here?
A SIGTERM signal, but a reaction like SIGKILL
When stopping a pod with an embedded Apache Tomcat, Kubernetes sends SIGTERM, but instead of the expected “regular” process termination, Spring Boot by default stops the web server instantly and interrupts its threads. Processing of all in-flight requests stops — the server returns a 503 error.
Newer versions of Spring Boot (2.3 and later) have a special setting in the config:
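The setting is `server.shutdown`. In application.yml it looks like this (the timeout property is optional; the 20s value is just an example — keep it below the Kubernetes grace period):

```yaml
server:
  shutdown: graceful   # default is "immediate"

spring:
  lifecycle:
    # how long Spring waits for in-flight work before shutting down anyway
    timeout-per-shutdown-phase: 20s
```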
This setting makes the behavior more logical. Upon receiving SIGTERM, the server stops accepting new requests, tries to finish in-flight requests within a reasonable time, and completes all of this before SIGKILL arrives.
SIGKILL and hung jobs
Among other things, the project uses the Spring Batch framework for all recurring jobs (started via the @Scheduled annotation in Spring Boot). Under the hood, Spring Batch maintains its own tables in the database, which store what was launched, when, how it was processed, and what the result was.
If a Spring Batch application is killed mid-run with SIGKILL, a “hung” job remains in the launch history: it stays in the “started” status forever.
We do not allow several instances of the same batch to run simultaneously, so a hung job blocks re-running that batch. It has to be repaired manually, by deleting the hung job from the history.
We implemented a graceful shutdown for batches, following the same logic as in the previous example with the web server:
- upon receiving SIGTERM, stop launching new tasks;
- try to complete all running tasks;
- wait for some time (less than Kubernetes itself waits);
- forcibly stop any remaining long tasks, marking them accordingly in the database.
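The steps above can be sketched with a plain ExecutorService. This is a simplified model, not our actual implementation — a real Spring Batch setup would hook this into the job launcher and mark jobs as failed in the job repository instead of printing:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchShutdown {
    private final ExecutorService jobs = Executors.newFixedThreadPool(2);

    void submitJob(Runnable job) {
        jobs.submit(job);
    }

    // Called from a shutdown hook (i.e. on SIGTERM), before SIGKILL arrives.
    void gracefulShutdown(long timeoutMillis) throws InterruptedException {
        jobs.shutdown();                      // 1. stop accepting new tasks
        boolean done = jobs.awaitTermination( // 2-3. wait, but less than Kubernetes does
                timeoutMillis, TimeUnit.MILLISECONDS);
        if (!done) {
            // 4. force-stop the stragglers; a real implementation would
            // mark these jobs as failed in the database here
            List<Runnable> unfinished = jobs.shutdownNow();
            System.out.println("force-stopped, queued jobs dropped: " + unfinished.size());
        } else {
            System.out.println("all jobs finished cleanly");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BatchShutdown batch = new BatchShutdown();
        batch.submitJob(() -> {
            try {
                Thread.sleep(5_000); // a long-running job
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // interrupted by shutdownNow()
            }
        });
        batch.gracefulShutdown(200); // wait at most 200 ms, for demonstration
    }
}
```

`shutdownNow()` interrupts running tasks and returns the ones that never started, which is why cooperative jobs should check their interrupt status at safe points.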
The payoff: by the time SIGKILL comes from Kubernetes, all resources are already released.
These two examples show areas where the behavior of a cloud application can be improved by handling process termination more carefully.
Author: Dmitry Litvin, Maxilect.