You’ve heard me talk about this many times, in so many different ways, but it’s worth repeating: SQL maintenance lifecycles are important.
People who disagree do so because they spend too much time firefighting and don’t have time to really think about it. Don’t know anything about SQL maintenance lifecycles? Let’s touch on a couple of points…today, we’ll cover SQL backup.
SQL maintenance lifecycles: Backups
Backups have the biggest lifecycles of all, because the uses for backup files are so varied.
They’re used for restoring databases to the original server, sure. But here’s a short list of the other kinds of things you need to do with backup files:
- Restore to development, QA, etc.
- Send to vendor
- Initialize replication
- Initialize Availability Groups
- Restore to a secondary server to run CheckDB
A backup is not a single event
A database backup isn’t a single event. But unfortunately, nearly every vendor out there – and most DBAs – treat backups as a single event that has a start and a finish. “Look, I took a SQL backup and produced a file, and now I’m finished!” But that’s not remotely the case. We use and shuffle SQL backup files around all day and all week long on our various servers.
This is why I’m so against those tools that I call “network backup tools”. You know the ones: Backup Exec, AppAssure, etc. Your network admin loves those tools, because he doesn’t understand SQL and he wants the same backup utility for the whole shop.
But network backup tools simply aren’t going to cut it for SQL maintenance lifecycles, because they take snapshots of the data and log files. And quite often those two files aren’t fully synchronized when the snapshot takes place, so you wind up creating a new log file and losing whatever data was in it. Not only that, but restoring means attaching those files, usually back on the same server they came from.
Without the understanding that your maintenance has a lifecycle, you wind up with a lot of piecemeal solutions. You write something to restore a database on a development box, then another DBA does the same thing, then another, then another. Soon you have a huge mix of solutions that need support, all created by a mix of DBAs over the years…it’s an unmanageable web of solutions, each one with its own quirks and limitations. That’s no way to run a shop, and it’s no way to support your data.
SQL Maintenance Lifecycles: SQL backup done right
This is why we’ve made Minion Backup the way we have. Minion Backup is the only backup tool on the market that treats your environment holistically. It understands your SQL maintenance lifecycle and provides built-in solutions that you simply deploy. There’s no need to write your own processes or to manage something separate, because it’s all built into Minion Backup by default.
Let’s say, for instance, that you need to back up your SQL certificates, because, you know, who doesn’t? With Minion Backup you can easily back up every certificate, even to multiple locations. In fact, you can back up your certificates to as many locations as you need. This functionality is built into Minion Backup, and setting it up is a simple table setting.
Not only that, but since we store the private key password encrypted, and give it to you in the log, you can even rotate the private key passwords on any schedule you like to increase your security. This way if you change the private key password every month, then you’ve always got access to the encrypted value in the log so you can restore that certificate with ease. This is what we mean by SQL maintenance lifecycle management. All of our SQL maintenance products have this level of lifecycle management. And we’re improving them all the time. We work closely with our users to give them built-in solutions that matter.
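For context, here’s what backing up a certificate looks like in native T-SQL — the operation Minion Backup automates and repeats for every certificate and every location. The certificate name, paths, and password below are hypothetical examples, not anything from the product:

```sql
-- Native T-SQL certificate backup: the certificate plus its private key,
-- protected by a password you must keep track of to restore it later.
-- Names, paths, and the password here are illustrative only.
USE master;

BACKUP CERTIFICATE TDECert
    TO FILE = 'E:\CertBackups\TDECert.cer'
    WITH PRIVATE KEY (
        FILE = 'E:\CertBackups\TDECert.pvk',
        ENCRYPTION BY PASSWORD = 'Str0ng!PrivateKeyPassword'
    );
```

Do this by hand for every certificate, to every location, on every rotation of the private key password, and you can see why having it as a table setting matters.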
All of our SQL maintenance products are completely FREE, so download them and try them now. And if you like what you see, then try our enterprise products and see the power of bringing your SQL maintenance lifecycles into the enterprise architecture.
Download our FREE maintenance modules below:
- Minion Backup: Free SQL Server Backup utility with world-class enterprise features and full lifecycle management.
- Minion Reindex: Free SQL Server Reindex utility with unmatched configurability.
- Minion CheckDB: Free SQL Server CHECKDB that solves all of your biggest CHECKDB problems.
Integrity checks have become more and more important over the years, as companies store more and more data. This data is absolutely critical, so when corruption happens, it can break a company.
But in all but the smallest shops, running any kind of integrity check is close to useless.*
Integrity checks without management
DBCC CHECKDB was originally written without consideration of management. For a very long time, you could only output CHECKDB results to the client – meaning you had to run it by hand – or to the error log. Seriously, what good is that? Later, Microsoft added the ability to output CHECKDB results to a table, but that remained undocumented for years so very few people knew it even existed.
That’s not the only problem. A lot of DBAs don’t run CHECKDB at all, because large databases take a long time to process. CHECKDB takes an internal snapshot for its operations. For large databases, that snapshot can fill up the drive, killing the integrity check process entirely. So, you have nothing to show for your efforts. Some DBAs tried to get around this issue by moving to CHECKTABLE instead, but that process comes with the same issues, and is even harder to manage because there are so many more tables than databases.
Integrity checks without reporting
Another reason that running DBCC CHECKDB is useless is this: reporting is still very difficult. Let’s say you have 50 servers – yes, I know a lot of you may have many more than that – and CHECKDB is set to run through a maintenance plan, outputting results to the SQL error log. In this scenario, you have to mine the log for CHECKDB results. That means you not only have to know when the CHECKDB completes on each and every server, but you also have to know what the log entry will look like, and how to parse out the results so they’re readable…again, on every single one of 50 servers. There is simply no easy way to consume CHECKDB results on any kind of scale.
Minion makes integrity checks useful again
The whole goal of Minion CHECKDB was to transform these issues surrounding integrity check processing into simple settings. Now DBAs of every skill level can have a stable, standardized integrity check process for even their largest databases.
Minion CHECKDB has a huge number of groundbreaking features. MC solves the reporting problem by capturing integrity check results to our own log table, along with a huge amount of additional information. Now it just takes a simple query to get the results. Instead of the native raw CHECKDB results, the MC log table tells you how many corruption errors were encountered. Then you can drill deeper from there if you like.
We’ve also solved the problem of the large database that can’t be processed because it runs out of snapshot space. In fact, we’ve solved that a few ways, and I’ll discuss a couple of them here.
- You can create a custom snapshot on a different drive that has more space. MC manages those snapshots for you, so it’s literally a set and forget operation.
- Or, you can restore a recent backup of your database to a remote system and run CHECKDB against that instead. Remote CHECKDB even brings the results back to the primary server, so you don’t have to go hunting for them. Again, this is a setting; you configure it, and then MC manages the whole process for you.
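The first option rests on a plain SQL Server feature: CHECKDB run against a user-created database snapshot checks that snapshot directly instead of creating its own hidden one, so you control where the snapshot’s sparse file lives. A minimal sketch of what Minion CheckDB is managing for you (database and path names are hypothetical, and a real snapshot needs one file entry per data file in the source database):

```sql
-- Create a snapshot on a drive with plenty of free space...
CREATE DATABASE MyDatabase_CheckSnap
ON ( NAME = MyDatabase_Data,   -- logical name of the source data file
     FILENAME = 'S:\Snapshots\MyDatabase_CheckSnap.ss' )
AS SNAPSHOT OF MyDatabase;

-- ...run the integrity check against the snapshot...
DBCC CHECKDB ('MyDatabase_CheckSnap') WITH NO_INFOMSGS;

-- ...and clean up.
DROP DATABASE MyDatabase_CheckSnap;
```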
Of course, there’s more
There are many more features in Minion CHECKDB that I didn’t mention. But since MC is completely free, you can download it and install it on as many instances as you want. Go explore!
Put MC on all your systems and start doing integrity checks again. We’ve taken away all the excuses. And when coupled with Minion Backup and Minion Reindex (both also free), you’ve got a maintenance suite that meets all your needs from planning to reporting, and everything in between.
*Without proper management and reporting!
A user asked me the other day how to go about managing backup issues in Availability Groups. Well, Minion Backup is fully Availability Group aware, and has innovative features that make it a complete solution. This is one of the big reasons we coded MB: any solution for dealing with the issues listed below has to be manually coded. We wanted to stop re-inventing the wheel over and over.
Let’s talk about some of the challenges that DBAs face when managing backups with Availability Groups. Then we’ll talk about solving these issues with Minion Backup.
Managing Backup Issues in Availability Groups – The Issues
1. Back up from…wherever
Availability Groups don’t let you pick a specific server to run backups on. You can only give weight to a specific node, but you can’t say that you want logs to run on Server1 and Fulls to run on Server2, etc. You can code those choices in, but if you decide to make a change, you have to make it on all the nodes manually.
2. Mixed file locations
Pretty much every backup solution on the market appends the server and/or instance name to the backup path. Usually, this is a good idea: if you have a centralized backup drive, you’ll want your backups separated by server name. Availability Groups introduce a problem though, because you can back up on different nodes. So of course, each backup will be under a different path (based on the node name). Full backups could be going to \\NAS1\SQLBackups\Server1, while the log backups for the same Availability Group would be going to \\NAS1\SQLBackups\Server3 and even \\NAS1\SQLBackups\Server2. The backups can move around as nodes go up and down. Restoring in this scenario can be difficult if all the backups aren’t in one place.
3. COPY_ONLY woes
Backups have to be run with COPY_ONLY on non-primary nodes. This isn’t an issue on its own, but it is something that you must code into your solution. And if you remove a database from an Availability Group, then you have to remember to manage that in your code (or preferably, to have coded it properly in the first place). Either way, it’s just one more thing to manage.
4. Work multiplier
Any change to a schedule or to a backup setting must be made on all nodes. Nothing in the Availability Group setup lets you manage your multiple backup locations, and relying on memory to update all the nodes is doomed to fail.
5. Backup file backlog
If your backup node fails, you lose your record of where its backups were. Without that log, the backup files won’t delete on your defined schedule. Sure, you could probably get that log back by restoring a copy of the management database from that server, but how much time has gone by since then? That is, of course, if the management database was being backed up properly. And if all that happens, it’s just one extra step you don’t need when you’ve got a node down. You shouldn’t have to worry about those tasks that should be taken care of for you.
These are the biggest issues with managing backups in Availability Groups. Now let’s look at Minion Backup. It solves each one of them with no coding required.
Managing Backup Issues in Availability Groups – The Solutions
1. Back up from the right server
Backing up from a specific server is easy in Minion Backup. Minion Backup knows if your database is in an Availability Group, and you can set the “PreferredServer” column in the Minion.BackupSettingsDB table to tell it where you want each backup type taken. You have the option to use the backup preference in the Availability Group settings by configuring PreferredServer to ‘AGPreferred’, or you can use the name of a specific server. That means you can set a specific server for fulls, and a specific server for logs if you want. If you want Minion Backup to determine your PreferredServer dynamically, you can code your own logic and use it as batch precode. While it’s an easy setting, you can make it as flexible as you need it to be.
2. Files in one place
We know that backups coming from different servers means that all of your different backups can be in different locations. However, Minion Backup solves this problem easily using server labels. The ServerLabel field (in Minion.BackupSettingsPath) overrides the ServerName in the backup path, so that all of your backups go to the same location. You can set ServerLabel to the Availability Group or listener name, or even the application name if you want. So, instead of sending backups to \\NAS1\SQLBackups\Server1, the backups would go to \\NAS1\SQLBackups\MyAGName. As long as you set all the nodes to the same ServerLabel, then no matter which node they’re backed up on, they’ll all go to the same location. Now you don’t have to search around for where the different backups are, and your restores and file deletes are easy.
3. COPY_ONLY automatically
If you want to take a full backup on a non-primary node of an Availability Group, then you have to use COPY_ONLY. And of course, you do have to code that into your own solution. However, Minion Backup is fully aware of the Availability Group, so it does this for you automatically. There’s nothing you need to code, and nothing you need to manage. It just does it. As databases come and go out of your Availability Groups, Minion Backup makes the right choice and does what it needs to do.
4. Sync settings between nodes
Manually changing settings on each node of your Availability Group is just asking for trouble. You could have a new node that you haven’t heard about, or you could just forget about one, or a node could be down and then you get busy and forget to push the changes when it comes back up. So what you need is a way to automatically push all of your settings changes to all the nodes. This is where Minion Backup comes in. All you need to do is turn on the SyncSettings column in the BackupSettingsServer table and Minion Backup will keep all your node settings in sync. This also means that as nodes come and go, they’ll get the new settings, and if a node is offline, Minion Backup will push them when it comes back online.
5. Backup files, handled
Keeping your file deletion schedules intact when a node goes down, and being able to easily find your file locations, can keep you from having a bad day. Minion Backup spares you from worrying about where your files are, or whether they’ll still be deleted on schedule. Again, Minion Backup makes this very easy with the SyncLogs column in the BackupSettingsServer table. Your backup logs will automatically be synchronized to all nodes, so locations and delete times are always in place.
That’s a good roundup of the Availability Group features in Minion Backup. We love these features, because we’ve seen firsthand how much easier it is to manage your Availability Group backups with Minion.
Go on, solve your problems: download Minion Backup!
When I saw the topic for T-SQL Tuesday this time I just had to get in. Maybe I’ve never mentioned it, but backups is one of my big things. Today I’d like to talk about two topics that get overlooked quite often, the “backups” to the backup, so to speak. First up: proper backup alerting. And second, missing backup recovery.
Traditional alerting falls short
Well, let’s begin with a story from my days as senior DBA. Years ago, one of the application groups messed something up in their database, and they needed a restore. “Sure thing,” I said. No problem. So I went to the backup drive, and there wasn’t anything that could even be vaguely considered a fresh backup. The last backup file on the drive was from about three months ago.
OOPS. Oh crap…so what do I tell the app team?
First, a little investigation. I had to find out why the backup alert didn’t kick off. Every box was set up to alert us when a backup job failed. I found the problem right away. The SQL Agent was turned off. And from the looks of things, it had been turned off for quite some time. And as you may realize, there’s just no way to alert on missing backups if the Agent is off and can’t fire the alert.
But that was just the first part of the problem. The SQL Agent couldn’t send the email, of course. But the job never actually failed, because it didn’t start in the first place.
This is the crux of the issue: jobs that don’t start can’t fail. Alerting on failed backup jobs isn’t the way to go.
“But it’s okay, we have…”
Hold on, I know what you’re thinking. You have service alerts through some other monitoring tool, so that could never happen to you! To a degree, you’re right. But let’s see what else can go wrong along those same lines:
- The database in question isn’t included in the backup job.
- The network monitor agent was turned off, or not deployed to that server.
- SMTP on the server has stopped working.
- The backup job has the actual backup step commented out.
- Someone deleted that backup job.
- Someone disabled the backup job, or just disabled the job’s schedule.
Service alerts won’t help you in any of these circumstances.
Proper backup alerting
I’ve run into every one of those scenarios, many times. And there are only two ways to mitigate every one of them (and any other situation you come across) with proper backup alerting.
Number one: Move to a centralized alerting system. You can’t put alerts on each of your servers. When you do that, you’re at the mercy of the conditions on that box, and those conditions can be whimsical at best.
Move the backup alerts from the server level to the enterprise level. Then, when there’s an issue with SMTP or something, you only have one place to check. It’s much easier to keep track of whether an enterprise level alerting system isn’t working than to keep track of dozens, hundreds, or even thousands of servers. After all, if you haven’t heard from a server in a long time, how do you know whether it’s because there’s nothing to hear, or if the alerting mechanism is down?
Number two: Stop alerting on failed backups. Alert on missing backups. When you alert on missing backups, it doesn’t matter if the job didn’t kick off, if the database wasn’t part of the job, or if the job was deleted. The only thing that matters is that it’s been 24 hours since your last backup. Then when you get the alert, you can look into what the problem is.
The important point is that the backup may or may not have failed, but your enterprise alert will fire no matter what. This is a very effective method for alerting on backups, because it’s incredibly resilient to all types of issues…not only in the backups, but also in the alerting process. If you do it right, it’s just about foolproof.
Part 2: Missing Backups
Handling missed backups is not the same as alerting on missing backups (like we talked about above). What we want to do is avoid the need for the alert to begin with.
Minion Backup (which is free, so we get to talk about it all we want, ha!) includes a feature called “Missing Backups”, which allows you to run any backups that failed during the last run.
Here’s what this looks like: You set your backups to run at midnight, and they’re usually done by around 2:00 AM. However, occasionally they fail for one reason or another. Then you get an alert in the middle of the night, and you have to get up to deal with it.
Missing Backups lets you set Minion Backup to run again at, say, 2:30 or 3:00 AM with the @Include = ‘Missing’ parameter. This will look at the last run and see if there were any backups that failed; if there were, then MB will retry them. This will prevent the need for alerts in the first place.
We use this feature in many shops we consult in, because we see backups that fail from time to time for weird reasons, but always succeed the second time. So Minion Backup improves your backups simply by giving them a second chance.
Now we mention Minion Enterprise
We’ve got you covered for enterprise-level alerting, too. Our flagship product, Minion Enterprise, was made for just that purpose and it comes with many enterprise-level features; not just backup alerting. I invite you to take a look at it if you like.
But if you don’t, then by all means, write yourself an enterprise-level alerting system, and stop relying on alerts that only fire on failed backup jobs.
And, improve your situation in general by switching to the free Minion Backup.
Every IT shop has its problems with performance: some localized, and some that span a server, or even multiple servers. Technologists tend to treat these problems as isolated incidents – solving one, then another, and then another. This happens especially when a problem is recurring but intermittent. When a slowdown or error happens every so often, it’s far too easy to lose the big picture.
Some shops suffer from these issues for years without ever getting to the bottom of it all. So, how can you determine what really causes performance problems?
First, a story
A developer in your shop creates an SSIS package to move data from one server to another. He makes the decision to pull the data from production using SELECT * FROM dbo.CustomerOrders. This works just fine in his development environment, and it works fine in QA, and it works fine when he pushes it into production. The package runs on an hourly schedule, and all is well.
What he doesn’t realize is that there’s a VARCHAR(MAX) column in that table that holds 2GB of data in almost every row…in production.
Things run just fine for a couple months. Then without warning, one day things in production start to slow down. It’s subtle at first, but then it gets worse and worse. The team opens a downtime bridge, and a dozen IT guys get on to look at the problem. And they find it! An important query is getting the wrong execution plan from time to time. They naturally conclude that they need to manage statistics, or put in a plan guide, or whatever other avenue they decide to take to solve the problem. All is well again.
A couple of days later, it happens again. And then again and then again. Then it stops. And a couple weeks later they start seeing a lot of blocking. They put together another bridge, and diagnose and fix the issue. Then they start seeing performance issues on another server that’s completely unrelated to that production server. There’s another bridge line, and another run through the process again.
What’s missing here?
The team has been finding and fixing individual problems, but they haven’t gotten to the root of the issue: the SSIS package data pull is very expensive. It ran fine for a while, but once the data grew (or more processes or more users came onto the server), the system was no longer able to keep up with demand. The symptoms manifested differently every time. While they’re busy blaming conditions on the server, or blaming the way the app was written, the real cause of the issues is that original data pull.
Now multiply this situation times several dozen, and you’ll get a true representation of what happens in IT shops all over the world, all the time.
What nobody saw is that the original developer should never have had access to pull that much data from production to begin with. He didn’t need to pull all of the columns in that table, especially the VARCHAR(MAX) column. By giving him access to prod – by not limiting his data access in any way – they allowed this situation to occur.
What Really Causes Performance Problems?
Just as too many cooks spoil the broth, too many people with access to production will cause instability. Instability is probably the biggest performance killer. But IT shops are now in the habit of letting almost anyone make changes as needed, and then treating the resulting chaos one CPU spike at a time.
This is why performance issues go undiagnosed in so many shops. The people in the trenches need the ability to stand back and see the real root cause of issues past the singular event they’re in the middle of, and it’s not an easy skill to develop. It takes a lot of experience and it takes wisdom, and not everyone has both. So, these issues can be very difficult to ferret out.
Even when someone does have this experience, they’re likely only one person in a company of others who aren’t able to make the leap. Management quite often doesn’t understand enough about IT to see how these issues can build on each other and cause problems, so they’ll often refuse to make the necessary changes to policy.
So really, the problem is environmental, from a people point of view:
- Too many people in production makes for an unstable shop.
- It takes someone with vision to see that this is the problem, as opposed to troubleshooting the symptoms.
- Most of the time, they’re overridden by others who only see the one issue.
What’s the ultimate solution?
In short: seriously limit the access people have in production. It’s absolutely critical to keep your production environments free from extra processes.
Security is one of those areas that must be constantly managed and audited, because it’s quite easy to inadvertently escalate permissions without realizing it. This is where Minion Enterprise comes in: I spent 20 years in different shops, working out the best way to manage these permissions, and even harder, working out how to make sure permissions didn’t get out of control.
Minion Enterprise gives you a complete view of your entire shop to make it effortless to audit management conditions on all your servers.
That’s the difference between performance monitoring and management monitoring. The entire industry thinks of performance as a single event, when in reality, performance is multi-layered. It’s composed of many events: management-level events where important decisions have been ignored or pushed aside. And these decisions build on each other. One bad decision – giving developers full access to production – can have drastic consequences that nobody will realize for a long time.
Sign up for your trial of Minion Enterprise today.