Maybe I’ve never mentioned it, but backups are one of my big things. Today I’d like to talk about two topics that get overlooked quite often, the “backups” to the backup, so to speak. First up: proper backup alerting. And second, missing backup recovery.
Traditional alerting falls short
Well, let’s begin with a story from my days as a senior DBA. Years ago, one of the application groups messed something up in their database, and they needed a restore. “Sure thing,” I said. No problem. So I went to the backup drive, and there wasn’t anything that could even be vaguely considered a fresh backup. The last backup file on the drive was from about three months ago.
OOPS. Oh crap…so what do I tell the app team?
First, a little investigation. I had to find out why the backup alert didn’t kick off. Every box was set up to alert us when a backup job failed. I found the problem right away. The SQL Agent was turned off. And from the looks of things, it had been turned off for quite some time. And as you may realize, there’s just no way for a server to alert on anything if the Agent is off and can’t fire the alert.
But that was just the first part of the problem. The SQL Agent couldn’t send the email, of course. But the job never actually failed, because it didn’t start in the first place.
This is the crux of the issue: jobs that don’t start can’t fail. Alerting on failed backup jobs isn’t the way to go.
“But it’s okay, we have…”
Hold on, I know what you’re thinking. You have service alerts through some other monitoring tool, so that could never happen to you! To a degree, you’re right. But let’s see what else can go wrong along those same lines:
- The database in question isn’t included in the backup job.
- The network monitor agent was turned off, or not deployed to that server.
- SMTP on the server has stopped working.
- The backup job has the actual backup step commented out.
- Someone deleted that backup job.
- Someone disabled the backup job, or just disabled the job’s schedule.
Service alerts won’t help you in any of these circumstances.
Proper backup alerting
I’ve run into every one of those scenarios, many times. And there are only two ways to mitigate every one of them (and any other situation you come across) with proper backup alerting.
Number one: Move to a centralized alerting system. You can’t put alerts on each of your servers. When you do that, you’re at the mercy of the conditions on that box, and those conditions can be whimsical at best.
Move the backup alerts from the server level to the enterprise level. Then, when there’s an issue with SMTP or something, you only have one place to check. It’s much easier to keep track of whether an enterprise-level alerting system isn’t working than to keep track of dozens, hundreds, or even thousands of servers. After all, if you haven’t heard from a server in a long time, how do you know whether it’s because there’s nothing to hear, or if the alerting mechanism is down?
Number two: Stop alerting on failed backups. Alert on missing backups. When you alert on missing backups, it doesn’t matter if the job didn’t kick off, if the database wasn’t part of the job, or if the job was deleted. The only thing that matters is that it’s been 24 hours since your last backup. Then when you get the alert, you can look into what the problem is. The important point is that the backup may or may not have failed, but your enterprise alert will fire no matter what. This is a very effective method for alerting on backups, because it’s incredibly resilient to all types of issues…not only in the backups, but also in the alerting process. If you do it right, it’s just about foolproof.
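To make the idea concrete, here’s a minimal sketch of a missing-backup check as it might run on a central server. The function name, the inventory list, and the sample data are all made up for illustration; on SQL Server, the last-backup times would typically be collected from the backup history in msdb. The key design choice is that the check walks the list of databases that should be backed up, not the list of backups that happened, so a database with no backup on record still gets flagged.

```python
from datetime import datetime, timedelta


def find_missing_backups(inventory, last_backups, now, max_age=timedelta(hours=24)):
    """Return (database, last_backup_or_None) for every overdue database.

    inventory    -- every database that should be backed up (hypothetical central list)
    last_backups -- dict of database name -> datetime of its last successful backup
    """
    overdue = []
    for db in inventory:
        last = last_backups.get(db)  # None means no backup on record at all
        if last is None or now - last > max_age:
            overdue.append((db, last))
    return overdue


# Illustrative data only.
now = datetime(2024, 1, 10, 8, 0)
inventory = ["Sales", "HR", "Archive"]
last_backups = {
    "Sales": datetime(2024, 1, 10, 0, 30),  # backed up last night
    "HR": datetime(2023, 10, 9, 0, 30),     # Agent has been off for months
    # "Archive" was never added to the backup job, so it has no entry
}

for db, last in find_missing_backups(inventory, last_backups, now):
    print(db, "last backup:", last)
```

Notice that nothing here asks whether a job failed, was disabled, or was deleted. The only question is how old the newest backup is.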
Part 2: Missing Backups
Handling missed backups is not the same as alerting on missing backups (like we talked about above). What we want to do is avoid the need for the alert to begin with.
Minion Backup (which is free, so we get to talk about it all we want, ha!) includes a feature called “Missing Backups”, which allows you to run any backups that failed during the last run.
Here’s what this looks like: You set your backups to run at midnight, and they’re usually done by around 2:00 AM. However, occasionally they fail for one reason or another. Then you get an alert in the middle of the night, and you have to get up to deal with it.
Missing Backups lets you set Minion Backup to run again at, say, 2:30 or 3:00 AM with the @Include = ‘Missing’ parameter. This will look at the last run and see if there were any backups that failed; if there were, then MB will retry them. This will prevent the need for alerts in the first place.
We use this feature in many shops we consult in because we see databases that fail from time to time for weird reasons, but they always pass the second time. So Minion Backup helps improve your backups simply by giving you a second chance at your backups.
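The second-pass idea can be sketched in a few lines. To be clear, this is not Minion Backup’s code, just an illustration of the concept; the function name, the shape of the last-run results, and the fake backup routine are all assumptions made up for the example.

```python
def retry_missing(last_run, run_backup):
    """Re-run only the backups that failed in the previous run.

    last_run   -- dict of database name -> True (succeeded) or False (failed)
    run_backup -- callable that backs up one database and returns True on success
    Returns the databases that are still failing after the second attempt.
    """
    still_failing = []
    for db, succeeded in last_run.items():
        if succeeded:
            continue  # already backed up in the main run, nothing to do
        if not run_backup(db):
            still_failing.append(db)
    return still_failing


# Hypothetical midnight run: one database failed for a transient reason.
midnight_run = {"Sales": True, "HR": False, "Archive": True}
attempted = []


def fake_backup(db):
    """Stand-in for a real backup call; records the attempt and succeeds."""
    attempted.append(db)
    return True


print(retry_missing(midnight_run, fake_backup))
print(attempted)
```

Only the database that failed gets a second attempt. Anything still failing after the retry is what should actually wake somebody up.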
Now we mention Minion Enterprise
We’ve got you covered for enterprise-level alerting, too. Our flagship product, Minion Enterprise, was made for just that purpose and it comes with many enterprise-level features; not just backup alerting. I invite you to take a look at it if you like.
But if you don’t, then by all means, write yourself an enterprise-level alerting system and stop relying on alerts that only fire on failed backups.
And, improve your situation in general by switching to the free Minion Backup.