Hi! This is Sergey Kalinets from Parimatch Tech, and today we’re talking about the resilience of our services.
A long time ago, at the start of my career, an interesting thing happened. Back then, we wrote desktop Delphi applications for Windows, and we had the “something is wrong” meme. As memes go it was funny, but the situation behind it was terrible. One developer discovered exceptions, or rather, the ability to catch them. To keep errors out of the application, he wrapped every possible block of code in a catch, so all the user ever saw was a modal window with the phrase “Something is wrong.” The details of what exactly was wrong remained a mystery, so the big guns (the developers) had to be called in to investigate. The error occurred on the user’s computer, and of course there were no logs, so the search for causes was like looking for the proverbial needle in a haystack. All in all, it was a masterclass in the importance of having information about your failures.
Later on, the focus of my work shifted to backend services. That’s when .NET, IIS, and Windows services appeared. Passing errors through dialog boxes no longer worked (although there were funny cases when a service on a remote host showed a message box and refused to continue working until someone clicked OK, and the most interesting thing was that someone had to do it physically, standing next to the server). You had to write logs, then read them and try to figure out what exactly went wrong. However, in the era of monoliths, this was still relatively simple. Back then, you had a single application that could take years to develop. It ran on a server with a cool name like PETROVICH that the development team knew well (and sometimes you could even go look at it in the server room). Errors were fixed by restarting the service, analysis was conducted over remote desktop, and deployment was carried out by copying files. In general, there was a cozy, homey atmosphere. And then microservices came along.
Things got more complicated with microservices. The number of things to keep an eye on increased significantly. We had not one but many services, and it wasn’t always clear where they ran. There might be no physical access to the server at all, and even when there was, keeping track of everything was much harder than it had been with monoliths. Things that had seemed like optional extras became a necessity: we needed mechanisms for ensuring the resilience of services, and it was highly desirable that they be automatic.
What is service resilience? In simple terms, it’s the predictability of a service’s behavior under adverse conditions; as a rule, that means the failure of its dependencies (the network, the database, the message bus, and everything else the service can’t control).
Predictability lies in the following two points:
- avoidance of data loss during a failure;
- fast and reliable recovery after a failure.
To ensure service resilience, a number of patterns and approaches have been invented (for a detailed study of the subject, I highly recommend the book Release It!). You can apply them in the service code itself, in the infrastructure, or in a combination of both.
To start solving a problem, the first thing you need to do is find it. How do you know whether or not the service is working properly? The easiest option is to conclude that as long as the process is running, everything must be OK. But it doesn’t always work like that. The process may be running but not doing anything useful; for example, there could be a deadlock in its main loop. It would be good if the service could somehow report its current state so that external agents (people and other services) could respond appropriately to any changes. In practice, this is done through so-called health checks (which you can think of as medical checkups ☺), and there are ready-made solutions for the main stacks that microservices are written on. For .NET, for example, you can find the details in the official documentation. The statuses of dependencies are often added to these checks, on the reasoning that if the database is unavailable, the service can’t be operational and should report the status “unhealthy.”
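To make the idea concrete, here is a minimal sketch of such a health-check endpoint in Python. Everything in it is illustrative: the `/healthz` path, the hypothetical `database` check, and the port are all made up, and a real service would normally use its framework’s ready-made health-check support rather than hand-rolling one.

```python
# Minimal health-check endpoint sketch. All names and paths are
# hypothetical; real stacks have ready-made health-check machinery.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks(checks):
    """Run named check functions; an exception or False means unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception:
            results[name] = "unhealthy"
    status = "healthy" if all(v == "healthy" for v in results.values()) else "unhealthy"
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical dependency checks; a real one would ping the database.
    checks = {"database": lambda: True}

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, results = run_checks(self.checks)
        body = json.dumps({"status": status, "checks": results}).encode()
        # 200 when healthy, 503 when any dependency check fails.
        self.send_response(200 if status == "healthy" else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


# To serve it: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Note that the aggregation rule here (any failed dependency makes the whole service “unhealthy”) is exactly the convention described above, and, as we’ll see later, it is also the source of trouble when fed directly into probes.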
So, the diagnosis is clear. But what about the treatment? Of course, a different treatment could be tailored for each service, but typically you can apply one of two actions:
- Service restart. You’d be surprised just how many problems you can solve by restarting. What’s more, the “restart first, figure out what happened later” approach is great for reducing downtime.
- Taking a service instance out of the pool. The main idea here is not to give traffic to problem services.
You can perform both these actions manually, or they can be automated. That’s exactly what orchestrators such as Kubernetes (as if there are any others ☺) do. Let’s take a look at how Kubernetes does it.
When describing a pod’s manifest, you can specify so-called probes. Briefly, these are checks performed on a specified schedule that return one of two statuses (working / not working), and they have three types of semantics:
- liveness (is our buddy alive?). If this probe fails, the orchestrator restarts the service instance;
- readiness (is it ready to work?). If not, the orchestrator removes the instance from the load balancer list;
- startup (has it finished initializing?). This probe is run only at startup; until it succeeds, the other probes are not run and the instance receives no traffic.
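Put together in a manifest, the three probes look something like the snippet below. The service name, image, paths, and timings are all made up for illustration; the comments mirror the semantics described above.

```yaml
containers:
  - name: my-service          # hypothetical service
    image: my-service:1.0
    ports:
      - containerPort: 8080
    startupProbe:             # runs only until it first succeeds;
      httpGet:                # liveness/readiness wait for it
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30    # allows up to 30 * 5s for a slow start
    readinessProbe:           # runs periodically for the whole lifetime;
      httpGet:                # failure removes the pod from the balancer
        path: /ready
        port: 8080
      periodSeconds: 10
    livenessProbe:            # failure here triggers a restart
      httpGet:
        path: /ping
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```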
There is an interesting misconception about the last two checks. They may seem to be one and the same, but, of course, they are not. Readiness is run not only at startup but also periodically throughout the life of the instance. This comes as a surprise to many developers, who assume it only works at the start. The check can be helpful when an instance is so “busy” with work that it cannot accept new requests. In this respect, readiness is similar to liveness: both are run at regular intervals throughout the lifetime of the application. The startup probe, on the other hand, runs only at startup, and as soon as it gives the green light, it stops. It solves the problem of a slow start, which some platforms (such as the JVM or .NET) are prone to: services take a certain amount of time to initialize, and during this period they are not very friendly to incoming requests. Under heavy traffic, if you open a new instance to incoming requests immediately after it starts, you can get a lot of errors.
By the way, another typical misconception concerns what these probes actually gate. In Kubernetes, readiness and startup probes only affect traffic that enters the instance through the load balancer (incoming HTTP, for example; I’m simplifying slightly, but it doesn’t change the essence). For so-called workers that process Kafka or RabbitMQ messages, neither of these probes restricts the flow of messages in any way. Yes, pods that fail these checks will be shown as “not ready,” but they keep working all the same.
So, we’ve learned (or refreshed our memories) about checking the health of services and the ability of orchestrators to respond to changes in the state of services. And then the question arises—what if we combine all this and get fully automated maintenance of the stability of the service? Sounds great, doesn’t it? But it’s not that simple (otherwise I wouldn’t be writing this article).
Let’s imagine that we have a web service that can scale and works with a database. How can we improve its resilience? For the sake of simplicity, let’s consider a scenario where the connection to the database is unstable. The classic approach is to add health checks and use liveness and readiness probes. Will this improve the resilience of the service? Oddly enough, the answer is no.
As we agreed earlier, our scenario is an unstable connection to the database. Sometimes queries succeed, sometimes they fail. When they fail, in addition to an error, we also see delays caused by network timeouts. Let’s say we’ve implemented a health check that queries the database and, depending on the result, determines the status of the service: if the database is available, the service is healthy, and vice versa. Let’s see what we get by adding probes that check this status.
Liveness. If the database is unavailable, the service gets restarted. Will this save the situation? Most probably not. A connection to the database is usually established per request. If we did nothing, we would get an error while the database is unavailable and the expected result when it responds. With the liveness probe, we get practically the same thing, plus service restarts. The restarts themselves bring potential problems: while the service is starting, it may return errors even when the database is fine (as discussed earlier for startup probes). Also, Kubernetes adds an interval between restarts, and this interval grows after repeated unsuccessful starts, leading to even longer delays. So, there is no benefit, only potential harm.
Okay, then maybe readiness can help us. Indeed, the algorithm is simple: the database is unavailable, so we don’t send traffic to the instance; the database comes back, and we open the gate. What could possibly go wrong? Well, firstly, probes run at a given interval, which means we can end up in a situation where traffic is open but the database is down, or vice versa: traffic is closed but the database is alive. Both options generate errors and don’t improve resilience. Secondly, the probes run on all instances and check the same resource, so a single database problem can take every instance out of the pool at once.
By the way, the last argument also applies to liveness probes. There is a recommendation not to use the status of external services in probes. Probes should only check local state. For example, liveness can simply check an endpoint with no logic behind it, like /ping or /info. If it stops responding, something has broken inside the service, and this particular instance can’t process incoming requests. Or such a probe can check a local file where the service periodically writes the current time. If, during the check, we find an outdated timestamp, it may indicate some kind of deadlock inside the service: at least one periodic process has broken, which means something else may have broken as well. Both problems are easily solved by restarting the problematic instance.
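The heartbeat-file idea fits in a few lines. In this sketch the file path, staleness threshold, and function names are all invented for illustration: the service’s periodic loop would call `beat()`, and the liveness handler would answer based on `is_alive()`.

```python
# "Heartbeat file" liveness sketch: a periodic background task writes the
# current time to a local file, and the liveness check only verifies that
# the file is fresh -- no external dependencies involved.
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/heartbeat")  # hypothetical location
MAX_AGE_SECONDS = 30                     # hypothetical staleness threshold


def beat(path=HEARTBEAT_FILE):
    """Called from the service's periodic loop; proves the loop is alive."""
    path.write_text(str(time.time()))


def is_alive(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """What a /liveness handler would return: a fresh heartbeat means alive."""
    now = now if now is not None else time.time()
    try:
        last = float(path.read_text())
    except (OSError, ValueError):
        return False  # no heartbeat yet, or a corrupted file
    return now - last <= max_age
```

If the periodic loop deadlocks, `beat()` stops being called, the file goes stale, `is_alive()` turns false, and the orchestrator restarts the instance, which is exactly the behavior described above.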
The question may arise: what if we put logic inside the service that answers the question “is something broken that a restart would fix?” Then you could hook a liveness probe up to it, and voilà! But in fact, if we have this logic, it doesn’t need any help from the orchestrator: the service can just terminate its own process by calling, say, exit(). The number of running instances then drops below the required count, and Kubernetes brings up a new one. The effect is exactly the same as with a liveness probe, but it requires fewer moving parts, and the restart happens faster (no need to wait for the probe to fire). This approach improves overall resilience because it minimizes the time the service spends in a broken state. There is an excellent guide from Red Hat about which probes you need and when you need them; the illustrations are really useful.
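The “just terminate yourself” approach can be sketched as a small watchdog thread. The check itself is hypothetical here (a real one would detect deadlocks, stuck consumers, and so on), and the exit call is injectable only so the logic can be exercised without killing the process.

```python
# Self-termination sketch: instead of exposing a broken state through a
# liveness probe, the service notices it internally and exits, letting
# the orchestrator start a fresh replacement.
import os
import threading
import time


def is_beyond_repair():
    """Hypothetical check: 'is something broken that only a restart fixes?'"""
    return False


def watchdog(check=is_beyond_repair, interval=10.0,
             terminate=lambda: os._exit(1)):
    """Runs in a daemon thread; terminates the process when the check fires."""
    while True:
        if check():
            terminate()  # process dies; the orchestrator replaces the pod
            return       # only reached with a non-exiting terminate (tests)
        time.sleep(interval)


# In the service's startup code:
# threading.Thread(target=watchdog, daemon=True).start()
```

`os._exit(1)` is used rather than `sys.exit()` so the process dies immediately even from a background thread; the nonzero exit code tells Kubernetes the container failed.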
This is all great, but where does it leave us? We have service-health metrics that test external dependencies, and we’ve just found out that using these metrics as probes is not a good idea. Moreover, the probes themselves don’t always improve resilience. There may be a situation where the service reports that it is unhealthy but keeps processing requests as usual. It all comes down to the frequency of checks: at the moment the database was checked, it was unavailable, then it came back, but the service status will only be updated after the next check, and thousands of requests can pass through the service between checks, some of them processed successfully. So what do we do?
The answer is to use the stability patterns that we mentioned at the beginning of this article. These patterns can help reduce the impact of external dependency problems on the stability of our services.
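To give a taste of what such a pattern looks like, here is a very condensed sketch of one of them, the circuit breaker described in Release It!. The thresholds and timings are arbitrary, and a production service would normally reach for a hardened library rather than this hand-rolled version.

```python
# Condensed circuit-breaker sketch: after repeated failures it "opens"
# and fails fast instead of hammering a broken dependency, then lets a
# trial call through after a cooldown.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.time):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable for testing
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        # While open, fail fast instead of waiting on network timeouts.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0          # any success closes the circuit again
        return result
```

In the unstable-database scenario above, the benefit is precisely the fast failure: while the circuit is open, requests get an immediate error instead of hanging on a network timeout, and the dependency gets a chance to recover.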
In the next article, we will walk through applying these patterns to an example and measure the impact of everything we’ve discussed on service resilience in real tests that you can try yourself.
In the meantime, we await your comments on what we’ve missed and where we were wrong ☺ Thank you for your kind attention, and resilience to everyone!