How We Migrated Servers to a New Data Center and Somehow Didn’t Lose Our Minds

Maxilect
10 min read · Nov 30, 2023


Today, we want to share our experience of moving our own project from one data center to another. You won’t find a recipe for an optimal migration here. This is a post about the pain of what migration looks like in the real world, where some issues can be caught during the testing phase while others inevitably surface in production, simply because synthetic load can never stress the hardware in exactly the same way real traffic does.

We deliberately refrain from naming the hosting provider — where possible, they cooperated. The essence of the post is not about the shortcomings of a specific company but rather about the problems one should anticipate when planning infrastructure modifications.

Introduction

In addition to participating in client teams, Maxilect develops its own product — the Mondiad advertising platform, an ad exchange with a client-facing interface for creating and placing advertising campaigns. We’ve written about this project several times before.

The project has been evolving for several years. We have dozens of servers that handle hundreds of thousands of requests per second (with low latency). The server load is quite specific — we don’t distribute static content; instead, we process incoming requests, so the processors, memory, and disks are fully utilized. All of this happens automatically without human intervention.

The peculiarity of the business model is that any interruption is not just about missed opportunities but actual financial losses. Consider this: it’s one thing to not display ads for about 10 minutes (that’s a missed opportunity), but it’s an entirely different situation when real user clicks are not processed. That results in direct losses.

The product operates in the European and American markets. Accordingly, the servers are located in Miami and Amsterdam, sharing the load roughly in half. The nuance is that our key systems are in the USA. If the American data center goes down, we lose not just half of the revenue, but 100%, as the system simply stops working. Today, we’ll talk about migrating precisely this “critical” part of the servers — we moved them from Miami to Detroit.

Why Move?

On the Miami platform, our hosting provider rented racks from the data center owner to house our servers. Initially, we opted for renting dedicated servers, since it was cheaper and avoided the overhead of virtualization (even though that overhead was minimal, every millisecond mattered to us). We started with a single rack, which proved cost-effective. As our system expanded, we added more servers, eventually filling two full racks. Additionally, we rented two switches separately, which let us run our own dedicated internet channel — we had full control over this subnet.

Recently, our hosting provider built their own data center in Detroit and proposed that we move. Clearly, it is cheaper for them to manage servers on their own premises. Detroit attracted us because the new data center featured more up-to-date hardware: previously we were on Intel Socket 2011 (v1–v3) machines, while the new servers run AMD Ryzen 9 5950X processors (launched on 11/5/2020). Moreover, they offered more powerful servers for roughly the same cost as our old setup.

For us, the primary goal of the move was not just a hardware upgrade. Our project had outgrown the startup stage, and we wanted to redesign the infrastructure to make it more reliable. We needed spare server capacity, redundant power, and a more fault-tolerant network scheme. While we could theoretically obtain spare capacity in Miami, everything else could only be addressed through relocation. Since our hosting provider was itself a tenant there, its options on the old platform were limited.

The decision to move was made at the beginning of 2023. Our plan was not to simply replicate servers in the new location but essentially to rebuild everything, aiming to complete the process by September 1. One might wonder, what could possibly go wrong?

Three Stages

Any relocation of critical business systems must start with testing. After all, we were transitioning to a different architecture — from Intel to AMD. For our software, this hardware was new, and there could be nuances.

The migration took place in three stages — first, we ran tests under synthetic load, then gradually switched traffic from Miami to Detroit, and finally, we caught the remaining bugs. Let’s discuss each stage separately.

Synthetic Tests and the NVMe Disk Issue

It’s worth noting that everything was fine at the machine acceptance stage — the Baseboard Management Controller (BMC) uptime of everything the hosting provider allocated was about a year. Still, with each new machine we ran our own tests, evaluating the performance of the processor, memory, and disks while trying to simulate production load.

The first problems appeared precisely at this stage. The performance of NVMe disks was significantly lower than expected, fluctuating from test to test and from machine to machine. This should not have happened since the machine configurations were identical.
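For illustration, here is a minimal sketch of the kind of disk check that makes this variance visible: a 4k random-read fio run whose results are compared across machines. The device path and job parameters are illustrative, not our actual test suite.

```python
import json
import subprocess

# Sketch: a 4k random-read benchmark via fio, reporting IOPS and bandwidth.
# Assumes fio is installed and the target device holds no data you need.
DEVICE = "/dev/nvme0n1"  # hypothetical device path

result = subprocess.run(
    [
        "fio",
        "--name=randread",
        f"--filename={DEVICE}",
        "--direct=1",           # bypass the page cache
        "--rw=randread",
        "--bs=4k",
        "--iodepth=32",
        "--numjobs=4",
        "--time_based",
        "--runtime=60",
        "--group_reporting",
        "--output-format=json",
    ],
    capture_output=True, text=True, check=True,
)

job = json.loads(result.stdout)["jobs"][0]
read = job["read"]
print(f"IOPS: {read['iops']:.0f}, bandwidth: {read['bw'] / 1024:.1f} MiB/s")
# Comparing these numbers across identically configured machines (and across
# repeated runs on the same machine) is what makes this kind of variance visible.
```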

Dealing with the hosting provider over hardware issues is much like dealing with ordinary consumer support. The standard response to any question is, “We don’t know; everything is working on our end.” For each new episode, we had to spend a long time proving that the problem was indeed in their equipment or settings, granting access to our systems so they could see the metrics. And when they finally acknowledged that the “ball was in their court,” they took quite some time to think and find a solution.

Through this back-and-forth, we resolved the disk issue. As later analysis revealed, the drop in performance was due to overheating, and the situation varied significantly from server to server.

Since the problem was with the hardware, there was nothing we could do, and we waited a couple of months for the hosting provider to address it. They approached it in several iterations — updated the disk firmware to a newer version, replaced risers, and, towards the end, added fans. Normally, the machines had four coolers each, but to address the disk issue, they had to install five or even six coolers in some cases — there was no other option due to the hardware layout.

Primary Traffic Switching

After completing synthetic tests, we gradually began to redirect “live” traffic. Knowing that synthetic tests cannot replicate real-world situations, we proceeded in several steps.

First, we directed 33% of production traffic to a third of the servers. At this point, despite the preliminary performance tests, we noticed that some machines handled noticeably fewer requests per second (QPS) than the other machines in the cluster. Upon studying the metrics from these machines, we found that the cause was processor throttling — as the CPUs approached a critical temperature, they dropped their frequency. This was observed on only one of our services: it puts a specific simultaneous load on the processor, memory, and network cards that synthetic tests could not reproduce.
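A rough way to spot this kind of throttling on a Linux host is to compare current core frequencies against the advertised maximum. The sketch below reads standard sysfs paths; availability and exact paths vary by kernel and cpufreq driver, and the 70% threshold is arbitrary.

```python
from pathlib import Path

# Sketch: flag CPU cores running well below their maximum frequency, one
# common symptom of thermal throttling. Paths are standard Linux sysfs,
# but may be absent depending on the cpufreq driver in use.
def read_khz(path: Path) -> int:
    return int(path.read_text().strip())

suspects = []
for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    freq_dir = cpu / "cpufreq"
    if not freq_dir.is_dir():
        continue
    cur = read_khz(freq_dir / "scaling_cur_freq")
    top = read_khz(freq_dir / "cpuinfo_max_freq")
    if cur < 0.7 * top:  # arbitrary threshold, tune for your hardware
        suspects.append((cpu.name, cur // 1000, top // 1000))

for name, cur_mhz, max_mhz in suspects:
    print(f"{name}: running at {cur_mhz} MHz of {max_mhz} MHz max")
```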

We resolved the problem fairly quickly. Initially, in response to our complaints, the hosting provider replaced the thermal paste. When several replacements did not help, they went back to improving the forced cooling — they ordered high-performance coolers (22,000 RPM) from the vendor and swapped out the stock ones (12,000 RPM).

Then, we added another 33% of production traffic to the new data center, waited again, and on the last iteration, shifted the remaining traffic. It was at this point that we encountered a whole set of problems that had not manifested themselves before (all for the same reason — synthetic testing simply could not construct a similar situation).

Apart from minor difficulties, we faced three major problems.

Overheating, again…

Despite the measures taken earlier, overheating problems reappeared once we connected 100% of the traffic. In production, the disks overheated again, triggering a flood of alerts — temperatures reached 50°C, a level at which the disks degrade quickly. Following the familiar routine, the hosting provider ordered more powerful coolers and replaced the stock ones a couple of weeks later.
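A simple watchdog for this can poll drive temperatures via smartctl’s JSON output and complain above a threshold. The sketch below assumes smartmontools is installed; the device paths are hypothetical, and field names can vary between smartctl versions and drive models.

```python
import json
import subprocess

# Sketch: warn when a drive reports a temperature at or above the threshold.
# Relies on smartctl's JSON output; adjust device paths for your hosts.
THRESHOLD_C = 50
DEVICES = ["/dev/nvme0", "/dev/nvme1"]  # hypothetical device list

for dev in DEVICES:
    out = subprocess.run(
        ["smartctl", "-a", "-j", dev],
        capture_output=True, text=True, check=True,
    ).stdout
    temp = json.loads(out).get("temperature", {}).get("current")
    if temp is None:
        print(f"{dev}: no temperature reported")
    elif temp >= THRESHOLD_C:
        print(f"WARNING: {dev} is at {temp}C (threshold {THRESHOLD_C}C)")
    else:
        print(f"{dev}: {temp}C, OK")
```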

Why didn’t all this manifest itself earlier? Apparently, it depended on which rack a server was in and where it sat within the rack. It’s known that the “coolest” spots are at the bottom. The difference is small — 1–3°C — but it was enough to matter for us.

Spontaneous Reboots

The next major challenge was the spontaneous rebooting of servers. They worked without issues, passed preliminary testing, but after being deployed in production, they randomly shut down and then restarted. Moreover, this could happen even without any significant load.

We think this was simply the classic reliability (“bathtub”) curve in action: according to reliability theory, there is an initial period when some hardware fails early due to defects in motherboards, processors, and so on.

Over the entire problem period, this affected only 12% of the servers. However, in our case the shutdown of even individual servers means we miss requests from external systems and lose money. The more of these skips, the worse the situation.

Since there was nothing we could do about spontaneous reboots on our end, we returned such machines to the hosting provider. They provided replacements with a similar configuration.

Network Issues

At random times, on random network connections (some external, some internal), we saw losses — a business metric went beyond its critical values. Investigating the issue, we found that it had been there from the very beginning; it had simply gone unnoticed while traffic was low.

Our product has an extensive set of metrics, configured dashboards, and so on. However, for understandable reasons, it is impossible to monitor everything (our resources are not unlimited). In normal operation, we watch key system and business indicators; if they go outside the “normal” range, we start digging deeper. For example, we have an important business metric for skips — interrupted connections with external systems. As long as only individual connections are lost, we consider the situation normal. Once skips become significantly more frequent, we check the logs. Unfortunately, this approach means some problems are identified with a delay, but it lets us keep maintenance effort reasonable.

As we discovered while analyzing the growing number of skips, a switch in one of “our” racks turned out to be faulty — there were losses of 2–3% on small packets. Perhaps, for other workloads, this is not critical, but for us, it resulted in serious losses.
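A crude way to catch this kind of fault is to probe the path with small packets and watch the loss rate. The sketch below uses plain ping; the target address and payload sizes are illustrative, and on a healthy link the loss should be close to 0%.

```python
import re
import subprocess

# Sketch: measure packet loss at several payload sizes with plain ping.
# The peer address is hypothetical -- point it at a host behind the
# suspect switch. Small payloads are the interesting case here.
TARGET = "10.0.0.2"

for size in (16, 64, 512, 1400):
    out = subprocess.run(
        ["ping", "-q", "-c", "200", "-i", "0.2", "-s", str(size), TARGET],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    loss = match.group(1) if match else "?"
    print(f"payload {size:4d} bytes: {loss}% loss")
```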

The switch was replaced (again, on the hosting provider’s side) — the problem was resolved.

While we were sorting all this out, we began to suspect that we were “lucky” to be the first guinea pigs in the new data center. According to the hosting provider, that is not the case — one major client had already moved into the DC before us. However, if they use those machines for typical tasks like serving content or running websites, they might simply not have noticed these issues. Then again, not many people conduct thorough acceptance testing.

Moving the Deployment Pipeline

Toward the end of the relocation, we started moving release builds and the automated tests of basic functionality to Detroit, and ran into the issue that some automated tests began to fail. We solved this using a local machine with an AMD Ryzen processor: it turned out that one of the developers had a computer with the same processor as the new servers, so he could easily reproduce all the test failures locally and fix them relatively quickly.

Emergency Acceleration

The date when our servers in Miami would be disconnected had been discussed from the very beginning. However, a classic misunderstanding of timelines occurred: we said we “plan to move by a certain date,” but the hosting provider heard “we will definitely finish by this date” and coordinated infrastructure work with the data center owner for those dates. In hindsight, this was one of our relocation mistakes — we should have pinned down the timing from the very beginning. Later, it came back to bite us.

From September 1st, the hosting provider planned to consolidate the remaining machines into fewer racks and carry out some power-related work. The list of machines that would be turned off for an unknown period was known in advance, and we used it to prioritize the relocation. At some point, we decided to clarify whether only individual machines or entire racks would be powered down. It turned out that the work would affect not only the servers on the list but also the switches, since they sit in racks that would be completely disconnected from power. As a result, we would be left without the Miami segment for some (completely unpredictable) period.

All of this forced us to speed up. We prepared a “plan B” — we agreed with the hosting provider in advance that the disconnection dates could be pushed back by a couple of days. But in practice, we managed to meet the September 1st deadline after all.

After the Move

Although we had found solutions to some of the problems, at the time of the move things were still not very stable under heavy load — we observed some traffic skips. We simply brute-forced the problem with more servers. This was not the fastest solution either — ordering servers takes time, and they need to be tested before being put into operation in any case. But once we connected the additional hardware, skips dropped to an acceptable level, which let us operate for some time.

Of course, launching additional servers is not an ideal solution, and we continued to experience problems. By now we’ve caught almost all of them — our skips are practically zero. However, we still operate twice as many servers as needed to handle the traffic, which gives us a buffer for a doubling of the load.

Over the last month, we dealt with post-migration tasks, tied up loose ends, and solved minor issues that surfaced after the move: in some places we hadn’t yet set up backups, in others we added or removed metrics, and elsewhere we adapted to the new hardware. Aerospike, for example, has its quirks on NVMe — it’s preferable to have 4+ partitions on each disk; otherwise, all disk I/O goes through a single PCI-Express lane. Previously, we kept its data only on SATA SSDs, where we didn’t run into this due to the different architecture.
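For illustration, here is a sketch of splitting one NVMe drive into four partitions and listing them in Aerospike’s configuration. The device path, partition names, and namespace are hypothetical, and the parted commands are destructive, so treat this as an outline rather than something to run as-is.

```python
import subprocess

# Sketch: split one NVMe drive into four equal partitions with parted.
# Device and namespace names are hypothetical; run against a blank disk only.
DEVICE = "/dev/nvme0n1"
PARTS = 4

subprocess.run(["parted", "-s", DEVICE, "mklabel", "gpt"], check=True)
step = 100 // PARTS
for i in range(PARTS):
    start, end = i * step, (i + 1) * step
    subprocess.run(
        ["parted", "-s", DEVICE, "mkpart", f"as{i + 1}", f"{start}%", f"{end}%"],
        check=True,
    )

# The resulting partitions can then be listed as separate devices in
# aerospike.conf, e.g.:
#
#   namespace mynamespace {
#       storage-engine device {
#           device /dev/nvme0n1p1
#           device /dev/nvme0n1p2
#           device /dev/nvme0n1p3
#           device /dev/nvme0n1p4
#       }
#   }
print("Partitioned", DEVICE, "into", PARTS, "partitions")
```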

Conclusions

As planned, we have built a more resilient infrastructure. We secured a performance reserve of twice the required capacity (previously, this reserve was no more than 15%), got redundant power, and restructured the network — finding a better balance with the load balancers and implementing a more stable scheme overall. None of this was available in Miami.

The new infrastructure costs us no more than the old one. However, we certainly experienced firsthand that “moving is worse than a fire.” We were prepared for the fact that some problems would only appear in production — you can’t catch everything with synthetic tests alone. Hence, we devised a scheme with a gradual traffic switch and extra capacity on the new platform.

The only serious mistake we made was not initially working out a contingency plan in case we didn’t meet the deadlines. Thanks to reserves built in other areas, we did manage to succeed eventually. But it was a close call!

Thanks to Igor Ivanov, Nikolay Eremin, Alexander Metelkin, Denis Palaguta, and Andrey Burov for their help in preparing this article.

P.S. Follow us on Twitter, Telegram, and Facebook to keep up with our publications and Maxilect news.
