Disaster recovery testing

Disaster recovery testing is an IT best practice designed to ensure that an organization’s disaster recovery plan works across the entire chain of its backup-and-recovery processes. It is a way of verifying that you’re backing up the right data safely and securely. Most importantly, it builds confidence that your data and applications are properly stored and backed up, and that they can be recovered smoothly and relied on to maintain business continuity.

Disaster recovery testing not only demonstrates your ability to recover data and systems after an outage, but also refines your company’s plans to inform customers and partners in the event of a disaster. Overall, the goal is to ensure that you can recover from whatever disaster may strike, and that you’re in the best position possible to resume business as usual. 

In this article we’ll explore the main aspects of disaster recovery testing and offer some insights to help you build the business case for prioritizing rigorous disaster recovery testing in your own enterprise.

What is disaster recovery testing, and why is it important?

Disaster recovery testing is the act of verifying that an organization’s disaster recovery plan will function as expected in the event of an emergency.  

Periodic disaster recovery testing is important because it helps identify gaps in recovery processes that could delay an organization’s return to regular operations.

Though it’s easy to think of disaster recovery as a one-and-done process, experienced IT teams view data protection as a collection of activities and practices: 

  1. The design and architecture of the system and processes in place for protecting data 
  2. Backup and restore operations that are dependent on each other  
  3. Disaster recovery testing 

Each of these components is a necessary ingredient in any well-thought-out disaster recovery plan. Viewing testing as an integral part of the disaster recovery process ensures that your data protection practices are working as they should and gives you confidence that you will be able to recover in the way you intend, should the time come. 

In short, a data protection plan without disaster recovery testing is incomplete. 

Potential disaster recovery testing scenarios to check 

Ideally, disaster recovery testing extends to a wide range of possible scenarios, subject to time, resources and practicality. 

The usual problem is that it’s not practical to test all disruptive disaster scenarios, which include but are not limited to: 

  • Natural disasters such as earthquakes, floods and weather damage
  • Fire 
  • Equipment failure 
  • Human error 
  • Ransomware 
  • Vandalism 
  • Disgruntled staff and insider attacks
  • Theft of valuable servers and expensive storage

Naturally, a company with its headquarters and data center in Arizona is less worried about flooding than a Florida-based company that is accustomed to hurricanes. Similarly, recovering from human error is different from recovering from massive equipment failure or destruction. 

You can’t know in advance which scenarios you’ll someday have to recover from. The sensible approach, then, is to do some research and list the five or 10 disaster scenarios most likely to befall your particular company. Based on that list, you can develop a recovery plan with a high probability of preserving your business in the event of disaster.

Overall, a disaster recovery plan should take into account the relative difficulty of recovering from different kinds of disasters. It should pose and answer probing questions including the following: 

  • If our hardware is ruined or unavailable, where will we host our company’s data? In a secondary data center? In a cloud service we can spin up?
  • How long will it take to provision the secondary infrastructure, or spin it up in the cloud? 
  • How much does each option cost? 
  • To execute the plan properly, which people and resources will we need? 
  • If our company spans multiple regions, do locale-specific regulations apply to backup and recovery? 

How to get started with disaster recovery testing

In any disaster recovery plan, the cardinal rule is to make sure that your backups are actually running and protecting your prioritized applications and data. Once you’ve verified that, focus on the following steps.

Prioritize system recovery 

An important part of a disaster recovery plan is figuring out which systems in an organization are the most important to protect and recover. It’s not possible to treat all the systems in your business the same, so you’ll need to classify your systems by importance. The result is a hierarchy of systems, starting with the must-haves necessary for business continuity after a disaster. Then, you organize your testing accordingly.  

For an online store, that might be the ecommerce site, shopping cart and payment systems. For a company with an elaborate sales organization, it could be the customer relationship management (CRM) application and related databases. Hospitals and health care providers depend heavily on electronic medical records (EMR) and prescription automation.  

As part of disaster recovery planning, backup administrators will need to establish what to back up, how often to back it up and how many copies they need (or can afford) to store. All of that is driven by one question: “Which systems in my business are the most important to protect and recover?”
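The hierarchy, and the backup decisions that follow from it, can be captured as data. The minimal Python sketch below assumes a hypothetical three-tier policy in which each tier sets a backup interval and a copy count. The tier values, the `TIER_POLICY` table and the system names other than those from the online-store example are illustrative placeholders, not recommendations.

```python
from dataclasses import dataclass

# Hypothetical policy: tier -> (backup interval in hours, copies to keep).
# Tier 1 holds the must-haves for business continuity; higher tiers matter less.
TIER_POLICY = {
    1: (1, 3),
    2: (24, 2),
    3: (168, 1),
}


@dataclass
class System:
    name: str
    tier: int

    @property
    def backup_interval_hours(self) -> int:
        return TIER_POLICY[self.tier][0]

    @property
    def copies(self) -> int:
        return TIER_POLICY[self.tier][1]


def by_priority(systems: list[System]) -> list[System]:
    """Most critical tier first; sorted() is stable, so order within a tier is kept."""
    return sorted(systems, key=lambda s: s.tier)


if __name__ == "__main__":
    inventory = [  # hypothetical inventory
        System("internal wiki", 3),
        System("ecommerce site", 1),
        System("reporting warehouse", 2),
        System("payment system", 1),
    ]
    for s in by_priority(inventory):
        print(f"{s.name}: tier {s.tier}, back up every {s.backup_interval_hours} h, "
              f"copies kept: {s.copies}")
```

The same ordering can then decide which systems are tested first and most often.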

Identify system dependencies 

To verify their backups, many administrators test the backups of individual servers, databases and applications. However, that’s not true disaster recovery testing. After a real disaster, you’ll be scrambling to restore entire systems composed of assets that need to work together, and testing components in isolation ignores the interactions and dependencies among them. It’s better to test scenarios in which you restore all the systems together and confirm that they function as they did before.

For example, your DHCP server needs to come up and assign IP addresses for dependent devices. Your Active Directory needs the DHCP server, and your front-end servers need your Oracle database.  
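Dependencies like these form a directed graph, and a workable restore order is a topological sort of that graph. The minimal sketch below uses the `graphlib` module from Python’s standard library (3.9+) and groups systems into "waves" that can be restored in parallel once every earlier wave is up. The graph contains only the dependencies named in the example above; a real environment would need its own, much larger map.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the systems in its set (taken from the example above).
DEPENDENCIES = {
    "active_directory": {"dhcp_server"},
    "front_end_servers": {"oracle_database"},
    "dhcp_server": set(),
    "oracle_database": set(),
}


def restore_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Return restore waves in order; every system in a wave depends only on
    systems in earlier waves. Raises CycleError if dependencies are circular."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # detects cycles before any work starts
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave restored so dependents become ready
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(restore_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Verifying each wave before starting the next turns the restore order itself into a test: a failure points at a specific layer rather than at the whole environment.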

You can restore individual components separately, but if they then fail to work together, you risk drawing the wrong conclusion. The cause is probably not a corrupted backup-restore cycle; it’s more likely a missing dependency.

Suppose you restore a database, start it up and confirm that you can log in and see the data. You can conclude only so much from that unless you also restore and launch your front-end application, then run some transactions that depend on the database. 
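The difference between "I can log in" and "the application works" can be expressed as a smoke test that writes through the application’s own code path and reads the result back. In this minimal sketch, an in-memory SQLite database stands in for the restored database and `place_order` is a hypothetical stand-in for a front-end transaction; both are assumptions for illustration. Checks like this write data, so run them only against an isolated test restore.

```python
import sqlite3


def place_order(conn: sqlite3.Connection, customer: str, amount: float) -> int:
    """Hypothetical front-end transaction that writes an order through the app's path."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", (customer, amount)
        )
    return cur.lastrowid


def smoke_test(conn: sqlite3.Connection) -> bool:
    """Pass only if a transaction written through the app can be read back."""
    order_id = place_order(conn, "dr-test", 1.0)
    row = conn.execute(
        "SELECT customer, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row == ("dr-test", 1.0)


if __name__ == "__main__":
    # Stand-in for a restored database with the schema the application expects.
    restored = sqlite3.connect(":memory:")
    restored.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    print("smoke test passed" if smoke_test(restored) else "smoke test FAILED")
```

If the restored schema is missing a table, the check fails loudly with an `sqlite3.OperationalError` instead of passing a login-only test.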

Testing yields many such takeaways about the required order of operations in a disaster scenario. Designing your disaster recovery testing with purpose and intent makes validating your plan much easier. 

Develop a testing schedule 

A regular disaster recovery testing schedule is crucial to ensure that the pieces of your disaster recovery plan are working as expected. When developing the schedule, list all the things you want to test, establish a cadence for how often to test each one and then follow through on it. Align the cadence with your hierarchy of importance by specifying testing every year, half-year, quarter, month and so on. Additionally, spell out the testing method to be used for each item. Whether it’s a walkthrough, tabletop exercise, parallel test or full interruption test, each type of exercise gives teams a way to look for potential gaps.
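A schedule like this is easy to let slip, so it helps to compute what is due. The minimal sketch below records each test’s cadence in months, its method and the date it was last run, then lists the tests that are due. The `ScheduledTest` type and the plan entries are hypothetical; only the four method names come from the paragraph above.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class ScheduledTest:
    name: str
    cadence_months: int  # 1 = monthly, 3 = quarterly, 6 = half-yearly, 12 = yearly
    method: str          # "walkthrough", "tabletop", "parallel" or "full interruption"
    last_tested: date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    years, month_index = divmod(d.month - 1 + months, 12)
    year, month = d.year + years, month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(test: ScheduledTest) -> date:
    return add_months(test.last_tested, test.cadence_months)


def due_tests(plan: list[ScheduledTest], today: date) -> list[ScheduledTest]:
    """Tests whose next due date is today or earlier, most overdue first."""
    return sorted((t for t in plan if next_due(t) <= today), key=next_due)


if __name__ == "__main__":
    plan = [  # hypothetical plan entries
        ScheduledTest("payment system failover", 1, "parallel", date(2024, 1, 15)),
        ScheduledTest("CRM and database restore", 3, "tabletop", date(2023, 12, 1)),
        ScheduledTest("full site failover", 12, "full interruption", date(2023, 6, 1)),
    ]
    for t in due_tests(plan, date(2024, 3, 1)):
        print(f"{t.name}: {t.method} test was due {next_due(t)}")
```

Clamping the day (January 31 plus one month becomes the last day of February) keeps monthly tests from silently skipping short months.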

Important takeaways for every organization 

First, no organization with an IT environment of any complexity will nail its disaster recovery testing on the first try. The process is iterative and punctuated with takeaways: things to change and improve that will shape future disaster recovery testing.

One takeaway that disaster recovery testing might uncover is that some systems and devices are so critical to the business that they are almost never restarted. That means that if something goes wrong on such a system, months or years can elapse before the issue comes to light.

Suppose, for instance, that you restore a database during disaster recovery testing and are unable to start it. You discover that it hasn’t been restarted in the past two years and that it will no longer start because it needs a system update. If that comes to your attention through disaster recovery testing, rather than during a disaster, it’s a valuable takeaway you can incorporate into your everyday data protection activities.
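One way to act on this takeaway is to track when critical systems last restarted and flag those that have run longer than a chosen threshold, so they get a controlled restart or a restore test before a disaster forces one. In this minimal sketch, where the restart timestamps come from (monitoring, uptime reports, an asset inventory) is left out as an assumption, and the 180-day threshold and system names are arbitrary placeholders.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; choose one that fits your own maintenance policy.
MAX_UPTIME = timedelta(days=180)


def restart_overdue(last_restart: dict[str, datetime], now: datetime,
                    max_uptime: timedelta = MAX_UPTIME) -> list[str]:
    """Return systems running longer than max_uptime, longest-running first."""
    stale = [(now - ts, name) for name, ts in last_restart.items()
             if now - ts > max_uptime]
    return [name for _, name in sorted(stale, reverse=True)]


if __name__ == "__main__":
    inventory = {  # hypothetical systems and their last restart times
        "orders-db": datetime(2022, 5, 20),
        "web-frontend": datetime(2024, 5, 28),
        "file-server": datetime(2023, 9, 1),
    }
    print(restart_overdue(inventory, datetime(2024, 6, 1)))
```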

The biggest mistakes organizations make with disaster recovery testing

The biggest mistake, of course, is not having a disaster recovery plan at all. The second biggest is failing to test it regularly: not sticking to a schedule and not making the effort to secure the disaster recovery plan you’ve already determined is necessary for business continuity.

Another mistake is to overlook less likely scenarios involving infrastructure. For instance, after a flood or other natural disaster, you may not have any data center hardware available for a restore operation.

Testing, then, means ensuring you have adequate capacity to fail over to another data center or to spin up cloud services to take the load. And, even if you have enough capacity, will the restore be time-prohibitive? How long will your business be offline while you’re restoring your 6-terabyte database to the cloud? 

Short of having to survive a real-life outage, only thorough disaster recovery testing can give you the answers to those questions. 
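Before testing, a back-of-the-envelope estimate of the transfer portion of a restore shows whether the question is hours or days. A 6-terabyte database is 6 × 10^12 bytes, or 48 × 10^12 bits; at a full 1 Gbit/s that is 48,000 seconds, about 13.3 hours, before any overhead. The minimal sketch below generalizes that arithmetic. The `efficiency` factor is a hypothetical allowance for protocol overhead and contention, and the estimate ignores compression, deduplication and post-restore verification, so treat it as a lower bound that only a real test can confirm.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move size_tb decimal terabytes over a link_gbps link.

    efficiency is the assumed fraction of the nominal link rate actually
    sustained; 0.7 is a placeholder, not a benchmark.
    """
    bits = size_tb * 1e12 * 8                   # decimal TB -> bits
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits / bits_per_second / 3600        # seconds -> hours


if __name__ == "__main__":
    for gbps in (1, 10):
        print(f"6 TB over {gbps} Gbit/s: about {transfer_hours(6, gbps):.1f} h")
```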

Conclusion 

Having a disaster recovery plan alone does not solve every IT problem, and organizations often ignore the importance of disaster recovery testing at their own peril. Disaster recovery testing is a valuable tool for building confidence in the quality and availability of backups.

By committing to regular disaster recovery testing according to your company’s hierarchy of systems, you arm yourself for the day when you need to restore for the sake of business continuity. Smart organizations and IT teams follow through on their commitment to data protection by also committing resources to testing their disaster recovery plans.

About the Author

Aaron Newsome

Aaron Newsome has over three decades of experience in developing, implementing and supporting enterprise storage solutions. His specialties are data protection, enterprise monitoring and data analytics. He is currently a product manager at Quest, where he manages the development and delivery of the company’s portfolio of data protection solutions.
