We all hope disasters will never happen, and although we certainly don’t want to expect something to go wrong, IT organizations worldwide need to be prepared just in case. We all need to have disaster recovery plans in place that meet our individual needs, and organizations should take specific steps to ensure a backup strategy is in place that is right for them.
1. Am I Backing Up Everything I Should Be?
Historically, backup storage was expensive, space was limited, backup windows were tight, and the process took much more time to execute. There was a school of thought, as a result, that you didn’t have to back up static operating systems or applications (you could always just reinstall those) and that only actual user data truly needed to be backed up.
But storage prices have fallen considerably, disk performance has improved, and capacity has increased exponentially. The biggest issue now is this: after a disaster, how can I most quickly and easily recover everything I need? And the answer is that you should really be backing up entire workloads—operating systems, applications and data—together. Yes, it’s true you could reinstall the operating systems and applications, but in the post-outage pressure of trying to get everything back up and running, do you really want to?
2. Creating Service-Level Agreement Tiers
Long gone are the days when everything was backed up to tape overnight. Although tape can still be suitable in some situations, some workloads need zero or near-zero downtime, while others may have more tolerance but still need better performance than tape can offer. It’s clear that as technology and cost structures have evolved, one size no longer fits all. You must have different recovery tiers for different workloads in the organization. So how do you determine those tiers?
The best metrics for categorizing your disaster recovery needs and defining your service level tiers are Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
RPO defines how much data an organization can afford to lose, in terms of time: a four-hour RPO dictates that the most recent backup can’t be more than four hours old, because you’re only willing to lose up to four hours of data.
By contrast, RTO defines the maximum time you can afford to take to restore service after an incident. A four-hour RTO means that systems must be back up and running within four hours of an outage.
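To make the two metrics concrete, here is a minimal sketch of how they could be checked against timestamps from an incident. The function names and the example dates are illustrative assumptions, not part of any particular backup product.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, incident: datetime, rpo: timedelta) -> bool:
    """Data lost equals the time since the last backup; it must not exceed the RPO."""
    return incident - last_backup <= rpo


def meets_rto(incident: datetime, restored: datetime, rto: timedelta) -> bool:
    """Downtime equals the time from incident to restored service; it must not exceed the RTO."""
    return restored - incident <= rto


# Hypothetical incident at noon with a four-hour RPO and a four-hour RTO.
incident = datetime(2024, 1, 1, 12, 0)
four_hours = timedelta(hours=4)

# Last backup at 09:00: three hours of data lost, inside the RPO.
rpo_ok = meets_rpo(datetime(2024, 1, 1, 9, 0), incident, four_hours)

# Service restored at 17:00: five hours of downtime, outside the RTO.
rto_ok = meets_rto(incident, datetime(2024, 1, 1, 17, 0), four_hours)

print(rpo_ok, rto_ok)
```

Note that the two checks are independent: a workload can meet its RPO and still miss its RTO, which is why each tier needs both numbers.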
Ultimately, companies should use these standardized metrics to define their service-level tiers according to organizational needs. The tolerance for data loss and downtime in each area of your organization should define your service-level tiers, and with them the cost of the disaster recovery mechanisms for each tier.
3. Make Sure Your Backup Strategy Aligns with Your Business Requirements
In an ideal scenario, all pieces of the infrastructure would be considered critical, but that is not possible from a cost or resource standpoint for most organizations. Tier one represents business-critical infrastructure such as an online storefront that must be maintained to conduct business and so must be operating 24/7. Email is often considered second tier, and other systems such as a print server could be third tier, although it all depends on business priorities.
Make sure to look at your business and determine which areas truly need high availability. Many organizations still have tape and optical drives, along with some disk-to-disk backup; it’s important to match the value of your data and workloads to the characteristics of each backup technology that you are using.
For example, does the print server that no one uses need high availability with disk-to-disk backup? Or can you use that mechanism for tier one or tier two priorities such as an online storefront and/or email? These are things you will need to decide. Make sure to match the cost and recovery needs of each piece of infrastructure to the value of your workloads and the data you are protecting.
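The matching exercise above can be sketched as a small lookup: each tier promises a certain RPO and RTO, and each workload goes into the least strict (and presumably cheapest) tier that still meets its tolerances. The tier thresholds, mechanisms and workload tolerances below are hypothetical values chosen for illustration; your own numbers will depend on business priorities.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rpo: timedelta   # worst-case data loss this tier delivers
    rto: timedelta   # worst-case downtime this tier delivers
    mechanism: str   # illustrative backup technology for the tier


# Hypothetical tiers, listed from strictest to least strict.
TIERS = [
    Tier("tier 1", timedelta(minutes=15), timedelta(hours=1), "disk-to-disk"),
    Tier("tier 2", timedelta(hours=4), timedelta(hours=8), "disk-to-disk"),
    Tier("tier 3", timedelta(hours=24), timedelta(hours=72), "tape"),
]


def assign_tier(tolerated_rpo: timedelta, tolerated_rto: timedelta) -> Tier:
    """Return the least strict tier whose RPO and RTO both fit the workload's tolerances."""
    for tier in reversed(TIERS):
        if tier.rpo <= tolerated_rpo and tier.rto <= tolerated_rto:
            return tier
    raise ValueError("no tier is strict enough for this workload")


# Assumed tolerances for the three example workloads from the text.
storefront = assign_tier(timedelta(minutes=15), timedelta(hours=1))
email = assign_tier(timedelta(hours=4), timedelta(hours=8))
print_server = assign_tier(timedelta(hours=48), timedelta(days=7))

print(storefront.name, email.name, print_server.name)
```

Searching from the least strict tier upward keeps expensive mechanisms reserved for the workloads that actually need them.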
4. Prepare for the Recovery Phase
The recovery phase is what we typically refer to as the time spent getting things back up and running immediately after an event. The best analogy for this phase is replacing a blown tire with a spare. It is not necessarily a permanent solution, and performance may not be optimal, but it enables you to get moving in a short amount of time.
The same is true for the recovery phase. After a data center flood, for example, there is no way you can immediately repair or replace all of the damaged servers, but you need to determine the bare minimum you can do to at least get things up and running.
What is your strategy here, and how can you quickly get back to business? Will this meet the requirements of the organization? You need to break this down to determine exactly which systems are the priority, and how long you can afford for them to be down. This should all be established and you should be able to communicate anticipated uptime and performance expectations to internal stakeholders in the event of a disaster.
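One way to turn these questions into numbers you can share with stakeholders is a simple recovery schedule. The sketch below assumes that workloads are restored one at a time, tightest RTO first, and that you have an estimated restore time for each; the workload names and durations are made up for illustration.

```python
from datetime import timedelta


def recovery_schedule(workloads):
    """Plan a sequential recovery, tightest RTO first.

    workloads: list of (name, rto, estimated_restore_time) tuples.
    Returns a list of (name, time_until_running, within_rto) tuples.
    """
    schedule = []
    elapsed = timedelta(0)
    for name, rto, restore_time in sorted(workloads, key=lambda w: w[1]):
        elapsed += restore_time  # each restore starts when the previous one ends
        schedule.append((name, elapsed, elapsed <= rto))
    return schedule


# Hypothetical workloads: (name, RTO, estimated restore time).
workloads = [
    ("print server", timedelta(hours=72), timedelta(hours=2)),
    ("online storefront", timedelta(hours=1), timedelta(minutes=45)),
    ("email", timedelta(hours=8), timedelta(hours=3)),
]

plan = recovery_schedule(workloads)
for name, running_after, within_rto in plan:
    status = "within RTO" if within_rto else "misses RTO"
    print(f"{name}: running after {running_after} ({status})")
```

If any workload in the plan misses its RTO, that is a signal to restore workloads in parallel, move the workload to a faster tier, or renegotiate the expectation before a disaster happens.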
5. Prepare for the Restoration Phase
The restoration phase is about putting things back the way they were. To continue with our car example, this would be comparable to taking the spare tire off and replacing it with a full-size, working tire so that your car is back to normal.
Just as you don't want to keep driving on a spare tire, you don't want your server workloads running in a recovery environment indefinitely. So how can you plan to get back to pre-disaster levels?
As we have discussed, a number of backup technologies are available on the market. Some of the most popular disk-to-disk backup solutions let you back up to a virtual machine (VM), and these typically offer the fastest recovery of the options available.
If your original environment was virtual, restoration is typically quite simple—you can easily move the recovery VM back to production. If the original source was a physical server, however, many solutions don’t let you restore back onto a physical server from the virtual storage environment. This creates a challenge, and you may be faced with operating continually on your “spare tire” if you can’t restore to a physical server.
This is very important to note when evaluating backup technologies. If you don’t operate physical servers in your environment, this typically will not pose any issues. Some organizations, however, need to run physical servers either for support or performance reasons, so this shouldn’t be ignored when working through your restoration strategy.
Ultimately, you need the ability to get all of your workloads back to their original state after an event. A number of tools can help you prepare for, recover from and restore after a disaster, and each of the topics we touched on should certainly be part of your thought process when determining your own backup strategy. Each organization is different and will likely require a custom approach, so make sure you know which pieces of your infrastructure are critical, create service-level tiers to align with business requirements, and have a recovery and restore process in place. We all hope bad things never happen, but in the event they do, it’s best to be prepared for the worst in order to have the best chance of emerging unscathed.
About the Author
Mike Robinson is Senior Manager, Product Marketing at NetIQ.