Data folks talk a lot about uptime and “five nines”, a goal that means the system must be up 99.999% of the time. The problem is that many companies never define what downtime means to them, or which threats could cause it.
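To see how little room that target leaves, the arithmetic can be sketched as below. This is a minimal illustration that assumes a 365-day year; the function name `downtime_budget_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes per year, so what you choose to count as downtime has a large effect on whether you hit the number.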
Define downtime
Before we can address uptime, we have to define downtime. Downtime could mean:
- A specific database is offline
- A specific server is offline
- System performance is slow enough to miss service level agreements (SLAs)
- Scheduled maintenance (which may or may not count as downtime for you)
Many shops with strict SLAs treat poor performance as downtime, and rightfully so. A good example is the drug interaction database I used to manage. When a doctor issues a prescription at a hospital, code checks that the new medication doesn’t interact badly with current medications. If the code runs too slowly to trigger an alert, someone could die.
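One way to express "slow counts as down" is to measure availability from health-check samples and treat any sample that misses the latency SLA as downtime. The sketch below is only an illustration: the `Probe` type, its fields, and the 500 ms threshold are assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    """One health-check sample (hypothetical structure)."""
    succeeded: bool    # did the check get an answer at all?
    latency_ms: float  # how long the answer took


def is_down(probe: Probe, sla_latency_ms: float) -> bool:
    """A sample counts as downtime if it failed or missed the latency SLA."""
    return (not probe.succeeded) or probe.latency_ms > sla_latency_ms


def availability_percent(probes: Sequence[Probe], sla_latency_ms: float) -> float:
    """Percentage of samples that were both successful and fast enough."""
    if not probes:
        raise ValueError("need at least one probe")
    up = sum(1 for p in probes if not is_down(p, sla_latency_ms))
    return 100 * up / len(probes)


# Made-up samples: two fast answers, one slow answer, one outright failure.
samples = [Probe(True, 120.0), Probe(True, 950.0), Probe(False, 0.0), Probe(True, 80.0)]
print(availability_percent(samples, sla_latency_ms=500.0))
```

With this definition the slow answer hurts availability just as much as the failed one, which matches how a strict SLA shop would report it.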
Define threats
Next, define what you need to protect against. Almost everyone overlooks this critical step, and instead jumps to a popular solution.
Do you want to protect against earthquakes, floods, tornadoes, or hardware failures? How about software failure, and what does software failure mean to you? And don’t overlook internal sabotage, whether intentional or not.
What does internal sabotage look like? Common causes from innocent sources include:
- an accidental denial of service from a user running a huge query, which uses up all the resources on the server
- a massive un-batched delete
- an in-house application that doesn’t close connections and fills up the memory
- someone dropping an object or truncating an important table by accident
- a DBA who over-tunes the backups, maxing out server resources
You get the idea. Which of these are you going to protect against?
Define duration
The next big question is, “How long are you planning your downtime to be?” After all, you could run off the secondary server for weeks at a time, or just for a couple of hours while you fix the issue.
The answer will be different for each issue you’re protecting against. You can easily plan for a two-hour downtime to recover from dropping a table. It’s much harder to have a two-hour downtime for a motherboard failure.
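Because the tolerable outage differs by threat, it can help to record one limit per threat and compare real incidents against it. The sketch below is a hypothetical illustration: only the two-hour figure for a dropped table comes from the text, and the other durations, names, and the `breaches_target` helper are placeholders.

```python
from datetime import timedelta

# Agreed maximum outage per threat. Only the dropped-table value is from the
# article; the rest are placeholder values to be replaced by your own.
max_outage = {
    "dropped table": timedelta(hours=2),
    "motherboard failure": timedelta(hours=8),  # placeholder
    "runaway query": timedelta(minutes=15),     # placeholder
}


def breaches_target(threat: str, actual: timedelta) -> bool:
    """Return True if an outage lasted longer than the agreed limit for that threat."""
    return actual > max_outage[threat]


print(breaches_target("dropped table", timedelta(hours=3)))
```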
You should have a quick way to recover from dropping non-table objects, such as stored procedures and views. Done right, you shouldn’t need to fail over to recover those objects quickly.
Put it in writing!
Next, meet with all the stakeholders and put all requirements in writing. You want no misunderstandings and no ambiguities.
Document limitations, so everyone knows what types of failures would be catastrophic. If the group decides against something, document the reasoning behind that decision. Things will change in the future, and the team may want to add in an element that they rejected before.
It is crucial that you get everyone’s agreement on all this before you go any further.
The bottom line
- What does “downtime” mean to you, and to your company?
- What are the threats you need to protect against?
- How long can an outage be, for each type of downtime?
- Now, write all that down.
In the next article, we cover the various types of HA solutions available, and what they’re actually for.