Disaster Recovery Plan: How to Properly Brew Tea When the Server Room is Ablaze

It's high time to open the red envelope

In the life of any project, disaster is inevitable. We can't predict exactly what it will be — a short circuit in the server room, an engineer dropping the central database, or a beaver invasion. But rest assured, it will happen, and for the most absurd reason imaginable.
About the beavers, by the way, I wasn't kidding. In Canada, they chewed through a cable and left the entire Tumbler Ridge area without fiber optic communication. It seems like animals are on a mission to suddenly cut off your access to resources:

Macaques chewing on wires
Cicadas mistaking cables for branches and digging into them to lay eggs
Sharks biting into Google's undersea cables
And for a major telecom company, Level 3 Communications, squirrels were actually at the top of their problem list.

Sooner or later, someone or something will definitely break or drop something, or load the wrong config at the worst possible moment. The difference between companies that get through a catastrophic failure and those that run around in circles patching up their crumbling infrastructure is the DRP. That's what I'm going to talk about today: how to write a Disaster Recovery Plan properly.

When to Start Writing Your DRP

Every hour an employee spends costs money. An hour from an expert who knows every makeshift fix and understands what could break costs even more. On top of that, the expert and the person who can write in plain language are often not the same person. Bottom line: a good document isn't cheap, and you need to know exactly when it's worth the investment.
You definitely need to get on it if:

You have a clear failure point, but no straightforward instructions for its recovery.
A server, node group, or database crash could bring your business to a halt and start a fast-spinning loss counter.

Really, this all boils down to money first and foremost. Your office's Telegram bot for ordering coffee has crashed, but the source code is lying around somewhere? It can wait; we'll fix it when we have a moment.
Your HashiCorp Vault cluster has become unresponsive because connectivity between data centers was lost? Your billing system won't be comfortable without its passwords. Here you definitely need a DRP, and you can combine writing it with an architecture refactor.

Write Documentation

People rarely enjoy writing documentation. I, for one, actually like it because it lets me step away from routine tasks and focus on making the architecture transparent. Either way, it's critically important work: a few dozen minutes spent on a note could later save you thousands of dollars in a serious crash.
Right off the bat, I want to highlight an important point: troubleshooting documentation is crucial, but it's not yet a DRP. We stick to the following approach:

The employee who understands the system best writes the information system passport. It includes the basic description, why this thing is even needed, which systems are connected to it, and who to run to if something breaks.
The on-call engineer finds a problem in the logs or detects an outage. If fixing the issue requires non-obvious actions, they document every detail, down to the list of console commands, and add it to the Troubleshooting section.
If the outage recurs, we create a ticket to eliminate its cause.
If the cause can't be fixed for architectural or financial reasons, we add a link to the internal documentation describing the fix to the alert description.

Just like that, we've got a basic process allowing any employee to quickly solve a problem.
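
The process above can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the SystemPassport record, its field names, and the record_outage helper are all hypothetical, and the "recurs" rule is assumed to mean "seen at least twice".

```python
from dataclasses import dataclass, field


@dataclass
class SystemPassport:
    """Hypothetical 'information system passport' record."""
    name: str
    purpose: str                  # why this thing is even needed
    connected_systems: list[str]  # which systems are connected to it
    owner_contact: str            # who to run to if something breaks
    troubleshooting: dict[str, list[str]] = field(default_factory=dict)
    occurrences: dict[str, int] = field(default_factory=dict)


def record_outage(passport, symptom, commands, cause_fixable=True):
    """Log an outage and return the next step of the process as a string."""
    # Document the non-obvious fix, down to the list of console commands.
    passport.troubleshooting[symptom] = commands
    passport.occurrences[symptom] = passport.occurrences.get(symptom, 0) + 1

    # First occurrence: the Troubleshooting entry is enough.
    if passport.occurrences[symptom] < 2:
        return "documented"
    # The outage recurred: eliminate the cause if we can.
    if cause_fixable:
        return "create ticket to eliminate the cause"
    # Otherwise point the alert at the documented fix.
    return "link troubleshooting doc in the alert description"
```

In practice the same fields live in a wiki page and the "return values" are steps a human takes; the point is that the escalation rule is simple enough to state unambiguously.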

Turning On Our Inner Paranoid

In general, we've got documentation, we understand how to fix the main problems, but we still don't have a DRP. For a real Disaster Recovery Plan, we first need to decide on a threat model.
There's a relatively limited list of what can break, usually boiling down to a small number of categories like this:

Application failure
OS services failure
Network failure
Hardware failure
Virtualization failure
Orchestration system failure

At this stage, we usually gather the whole team, pour some coffee or tea, and start thinking about how we can most effectively break our system. Bonus points and a pizza delivery go to whoever finds a way to bring it down for the longest time with the least effort.
As a result of such work, our backlog usually grows sharply due to tasks like "If script N runs before script M, it will irreversibly destroy the central DB. We should fix that." What remains after clearing the technical debt is your threat model, including those very recidivist beavers, the cleaner in the server room, and other natural disasters beyond your control.
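
As a rough sketch of that sorting step, assuming each brainstormed scenario is tagged with one of the categories above and a flag saying whether we can remove its cause ourselves (the Scenario class and the fixable flag are hypothetical names for illustration):

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    APPLICATION = "application"
    OS_SERVICES = "os services"
    NETWORK = "network"
    HARDWARE = "hardware"
    VIRTUALIZATION = "virtualization"
    ORCHESTRATION = "orchestration"


@dataclass(frozen=True)
class Scenario:
    description: str
    category: Category
    fixable: bool  # True if we can eliminate the cause ourselves


def split_scenarios(scenarios):
    """Return (backlog, threat_model).

    Fixable scenarios become technical-debt tickets; whatever we cannot
    eliminate stays in the threat model that the DRP has to cover.
    """
    backlog = [s for s in scenarios if s.fixable]
    threat_model = [s for s in scenarios if not s.fixable]
    return backlog, threat_model
```

For example, "script N runs before script M and destroys the central DB" would be marked fixable and land in the backlog, while "beavers chew through the uplink" stays in the threat model.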

Accept the Inevitable and Go Make Tea

So, you've come to the realization that everything will eventually break down. You've considered all possible factors, including a hot-air balloon festival blocking your radio relay link and yet another fire at one of your providers.
Now, it's time to write the DRP itself. No, it's not that extensive troubleshooting section you wrote earlier. The DRP, in essence, is like a red envelope containing instructions such as:

Pick up the blue phone receiver.
Dictate to the operator: "Wish - Rusty - Seventeen - Dawn - Furnace - Nine - Kind-hearted - Homecoming - One - Cargo Car."
Wait for the reply tone.

Even if your team consists exclusively of top experts in their field, the document must be written to be understandable to someone with an IQ no higher than 80. Real incidents have shown that during a severe outage, an engineer under stress often not only fails to fix the situation but can turn a barely alive system into a definitively dead one.
Therefore, our documents almost always start the same way:
"A brief high-level guide for the default scenario. What follows will be a lot of text and detailed variants. Let's start with the basic sequence:

Brew some tea and stop panicking.
Notify the responsible parties."

And yes, tea is mandatory. The dead system won't get any worse in five minutes, while the risk of an engineer frantically pulling every lever they know drops significantly. The last item reads: "Pull a cold Guinness out of the fridge (to be kept in emergency supplies for disasters). DRP complete."

DRP Structure

Here's the approximate structure we use as a template. Of course, you can adapt it as needed.

System Architecture
Information Systems
Notification of Responsible Parties
Key Questions Before DRP
What should I do by default?
A brief high-level guide for the default scenario.
When is this DRP applicable?
Principles for choosing a deployment site.
How to understand where the deployment will go?
Decision Making and Situation Analysis
The Order of Service Deployment and Timing
RPO and RTO
Instructions for Deploying DRP Infrastructure
Additional Materials

Thus, an engineer opening the document for the first time during an emergency should immediately understand how to conduct the initial diagnosis and begin recovery. The general structure should answer a few simple questions:

Who needs to be notified immediately about the start of an emergency?
How do I properly conduct an initial diagnosis before touching anything? Include ready-to-run console commands, links to dashboards, and other useful aids.
How much time do I have? Can I try to fix the system, or do I need to redeploy everything from scratch?
How do I know that the emergency has ended and we've returned to normal operation?
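
The "how much time do I have?" question is simple arithmetic, and it helps to spell it out in the document. Here is a minimal sketch, assuming the DRP states an RTO and an estimated clean-redeploy time in minutes; the function names, the safety margin, and all the numbers are hypothetical:

```python
def repair_budget_minutes(rto_min, elapsed_min, redeploy_min, margin_min=10):
    """Minutes left for repair attempts before a clean redeploy must start.

    rto_min      - recovery time objective from the DRP
    elapsed_min  - time since the outage began
    redeploy_min - estimated duration of a from-scratch deployment
    margin_min   - safety buffer (an assumption, tune it to your system)
    """
    return rto_min - elapsed_min - redeploy_min - margin_min


def next_action(rto_min, elapsed_min, redeploy_min, margin_min=10):
    """Pick between repairing in place and redeploying from scratch."""
    budget = repair_budget_minutes(rto_min, elapsed_min, redeploy_min, margin_min)
    return "try to repair" if budget > 0 else "start clean redeploy now"
```

With an RTO of 240 minutes and a 90-minute redeploy, an engineer 30 minutes into the outage still has 110 minutes to attempt a repair; 150 minutes in, the budget is negative and the redeploy should already be running.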

It's crucial not to confuse the DRP with the system passport or its detailed description. Don't overload the DRP with excessive information: the engineer should be able to simply follow the instructions, copying command after command.
A good practice is to add an "Additional Materials" section at the end of the document. It can be referenced as needed from the main brief instruction and other documents. The Troubleshooting we described earlier fits well into this block. Any other additional information should also be moved to the end, to not disrupt the minimalist style of the main instruction.

Gantt Charts Help Greatly in Complex Systems

If the service is complex and individual elements are deployed in parallel, I strongly recommend adding a Gantt chart, which visually describes the order of recovery and approximate timelines for each stage. Such information is harder to grasp in text form.
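
If you don't have a charting tool at hand, even a text rendering is better than nothing. The sketch below assumes each recovery stage has an estimated duration and a list of stages it depends on, and that the dependencies contain no cycles. The stage names and durations in STAGES are made up for illustration.

```python
def schedule(stages):
    """Map each stage to (start, end) in minutes, starting as early as possible.

    stages: dict of name -> (duration_min, [names of prerequisite stages]).
    Assumes the dependency graph has no cycles.
    """
    plan = {}

    def finish(name):
        if name not in plan:
            duration, deps = stages[name]
            # A stage starts once its slowest prerequisite has finished.
            start = max((finish(dep) for dep in deps), default=0)
            plan[name] = (start, start + duration)
        return plan[name][1]

    for name in stages:
        finish(name)
    return plan


def render(plan, minutes_per_char=5):
    """Draw one text bar per stage, ordered by start time."""
    width = max(len(name) for name in plan)
    lines = []
    for name, (start, end) in sorted(plan.items(), key=lambda kv: kv[1]):
        bar = " " * (start // minutes_per_char) + "#" * ((end - start) // minutes_per_char)
        lines.append(f"{name.ljust(width)} |{bar} {start}-{end} min")
    return "\n".join(lines)


# Hypothetical stages: duration in minutes and prerequisites.
STAGES = {
    "terraform apply": (10, []),
    "mysql restore": (60, ["terraform apply"]),
    "clickhouse restore": (40, ["terraform apply"]),
    "indexator": (15, ["mysql restore"]),
    "statistics": (10, ["clickhouse restore", "indexator"]),
}

if __name__ == "__main__":
    print(render(schedule(STAGES)))
```

The end time of the last stage is also a sanity check for your RTO: if the chart is longer than the RTO, no amount of heroics during the incident will fix that.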
At the very end of your document, there should be a clear description of the conditions for canceling the emergency mode, so it’s clear when to switch the load back and return to normal operations.

The Importance of IaC

Overall, the concept of Infrastructure-as-Code (IaC) isn’t formally required for implementing a Disaster Recovery Plan (DRP). However, in most large information systems, engineers won’t meet the recovery timelines if they have to run from server to server, making emergency fixes in the configuration and changing DNS records on the fly.
It's much more effective to describe all the primary and backup infrastructure in Terraform, and its configuration in Ansible. Optionally, you can even bake ready-made images with HashiCorp Packer if you adhere to the concept of immutable infrastructure.
In this case, you can achieve near-zero costs for keeping the DRP in a ready state. Structurally, it would look something like this:

Describe your test and production infrastructure in Terraform.
Describe your DRP infrastructure in Terraform, but add variables for deployment activation.

In variables.tf, add something like this:


variable "enable_mysql_drp" {
  description = "Condition for creation of DRP MySQL droplet. Droplet is created only if variable is true"
  type        = bool
  default     = false
}

In main.tf, describe the necessary parameters for your temporary infrastructure and tie them to conditions:


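# DRP MySQL droplet: it exists only while enable_mysql_drp is true.
resource "digitalocean_droplet" "rover3-mysql-central-nl-1" {
  # count = 0 skips creation. With count set, reference the droplet by index, e.g. [0].
  count      = var.enable_mysql_drp ? 1 : 0
  image      = var.ol7_base_image
  name       = "mysql-central-nl-1.example.com"
  region     = var.region
  size       = "c2-32vcpu-64gb"
  tags       = ["enigma", "enigma-central", "enigma-mysql-central"]
  monitoring = true
}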

Now, an engineer can deploy the entire temporary infrastructure with a few commands like these from your instructions:


cd ~/enigma/terraform/DO
terraform apply \
-var='enable_mysql_drp=true' \
-var='enable_indexator_drp=true' \
-var='enable_clickhouse_drp=true' \
-var='enable_statistics_drp=true'
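
If you would rather not have a stressed engineer type a long command by hand, the same call can be wrapped in a small script. This is a sketch, not part of our actual tooling: it assumes every DRP flag follows the enable_<component>_drp naming used above, and the helper names are hypothetical.

```python
import os
import subprocess

# Components that have an enable_<name>_drp variable in variables.tf.
KNOWN_COMPONENTS = ("mysql", "indexator", "clickhouse", "statistics")


def build_apply_command(components):
    """Build the `terraform apply` argument list for the given components."""
    unknown = [c for c in components if c not in KNOWN_COMPONENTS]
    if unknown:
        # Fail before touching the infrastructure if a name is mistyped.
        raise ValueError(f"unknown DRP components: {unknown}")
    return ["terraform", "apply"] + [f"-var=enable_{c}_drp=true" for c in components]


def deploy(components, workdir="~/enigma/terraform/DO"):
    """Run terraform in the project directory; raises if terraform fails."""
    cmd = build_apply_command(components)
    subprocess.run(cmd, cwd=os.path.expanduser(workdir), check=True)
    return cmd
```

Calling deploy(["mysql", "clickhouse"]) still shows the Terraform plan and asks for confirmation, so the engineer gets one last look before anything is created.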

If you chose to use images already baked with Packer as the source for deploying emergency infrastructure, you’d get almost ready infrastructure within minutes. The overhead costs for storing the images are usually small, but they require constant updating.
Another option is the direct use of Ansible, which will configure the new infrastructure according to your configuration requirements. Don’t forget about the time for loading and restoring the database from a cold backup. It can be costly and time-consuming, so consider this in your planning.
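
A back-of-the-envelope estimate of the cold restore time is worth putting into the DRP next to the RTO. The sketch below assumes the backup is first downloaded and then imported, one after the other; the function name and all throughput figures are hypothetical, so measure your own during a test restore.

```python
def restore_minutes(backup_gb, download_mb_s, import_mb_s):
    """Rough restore time: download the backup, then import it.

    backup_gb     - size of the cold backup in GiB
    download_mb_s - sustained download speed from backup storage, MiB/s
    import_mb_s   - sustained import speed into the database, MiB/s
    """
    size_mb = backup_gb * 1024
    seconds = size_mb / download_mb_s + size_mb / import_mb_s
    return seconds / 60


def fits_rto(backup_gb, download_mb_s, import_mb_s, rto_min):
    """True if the estimated restore alone fits inside the RTO."""
    return restore_minutes(backup_gb, download_mb_s, import_mb_s) <= rto_min
```

For a 500 GiB backup at 200 MiB/s download and 50 MiB/s import, the estimate is a little over three and a half hours, which already rules out an RTO of two hours regardless of how fast Terraform creates the servers.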

A Brief Checklist

Here are a few key points to consider:

Murphy's laws haven't been repealed. If something can fall, it will eventually fall, with a crash and side effects in unexpected places.
Before starting, conduct a full analysis and show the business how much a potential disaster might cost and how much the standby infrastructure for the DRP would cost. Resources are often allocated after that. Sometimes the business realizes that letting everything go down for a day would be cheaper than keeping a full reserve on standby.
Make sure your backups are being performed.
Ensure that backups are not only performed but also can be restored.
Engage your creative paranoia and describe the threat model.
Write the DRP. Then take it and throw out as much text as you can without losing the meaning. Move the excess to the end of the document. If needed, an engineer will look there.
Include in the document phone numbers, Telegram accounts, and other contacts of all key people.
The DRP needs to be tested. Seriously, it's a must. Businesses never have time for these tasks, but it's a very important process. The infrastructure of any live project can change beyond recognition in a year: tokens expire, access gets revoked, and configurations become invalid. Therefore, build at least an annual limited simulation into your business processes. Otherwise, in the event of a disaster, an engineer will enter commands from an outdated instruction and finish off your infrastructure.
Give the DRP to interns in a limited environment and watch as they struggle with the document (evil laugh). Your plan is good if even the most inexperienced team member can figure it out.
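
The cost analysis from the checklist can start as a very simple comparison. The sketch below assumes you can estimate the hourly loss, the outage duration with and without a DRP, and how often such a disaster happens per year; every name and number here is hypothetical.

```python
def expected_annual_outage_cost(loss_per_hour, outage_hours, incidents_per_year):
    """Expected yearly loss from outages of the given length."""
    return loss_per_hour * outage_hours * incidents_per_year


def standby_pays_off(loss_per_hour, hours_without_drp, hours_with_drp,
                     incidents_per_year, standby_cost_per_year):
    """True if the standby saves more per year than it costs to keep."""
    without_drp = expected_annual_outage_cost(
        loss_per_hour, hours_without_drp, incidents_per_year)
    with_drp = expected_annual_outage_cost(
        loss_per_hour, hours_with_drp, incidents_per_year)
    return (without_drp - with_drp) > standby_cost_per_year
```

With a loss of 5,000 per hour, a day of downtime cut to two hours, one disaster every two years, and a standby costing 30,000 a year, the standby pays off. At 500 per hour it does not, and accepting a day of downtime is the cheaper choice.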

By the way, if your company's corporate policy allows it, I'd strongly recommend looking into LLMs as an assistant for internal documentation. Yes, you should understand that, say, GPT-4 is not the ultimate truth. Nonetheless, it's much more convenient to deal with a disaster in the middle of the night if an "expert" "keeping in mind" 120 pages of documentation on the unique workarounds in your system is "sitting" next to you. We've just started implementing this approach, and it's already proving its worth.
A neural network can quickly analyze raw logs and build hypotheses about the causes of a disaster. This is invaluable in cases where the on-call engineer doesn't know all the nuances of the fallen system. I'll talk more about this next time.
And if you need someone to go through your infrastructure with a fine-tooth comb, write a DRP, and help test it, come to us at WiseOps. We'll help.

Gumeniuk Ivan

DevOps Engineer