Every organization needs a solid Active Directory disaster recovery strategy. The reason is simple: Every second that your Active Directory (AD) is down, your business is dead in the water.
Why? Exactly what happens if Active Directory fails? Well, Active Directory provides the core services that employees need to do just about anything in your IT ecosystem, from logging on to their workstations to accessing data and running applications to serving customers. Therefore, when a disaster takes Active Directory down — whether it’s a cyberattack, a natural disaster or even a critical mistake by a hurried admin — everything comes to a screeching halt and stays that way until Active Directory is back up and running.
That gets expensive very quickly! In fact, 40% of enterprises say that a single hour of downtime costs $1 million to over $5 million. In a worst-case scenario, losses can reach millions of dollars per minute.
The best way to minimize this downtime — and the inevitable damage to your business — is to have a comprehensive Active Directory disaster recovery strategy. But what is Active Directory disaster recovery and how does it work? How do you actually recover a domain controller from backup? And — critically — how can you make the recovery process as quick and seamless as possible? Let’s dive in.
“The restore process from many well-documented ransomware attacks has been hindered by not having an intact Active Directory restore process.”
– Gartner, Inc., “How to Recover from a Ransomware Attack Using Modern Backup Infrastructure,” Fintan Quinn, June 4, 2021
What does the process of Active Directory disaster recovery entail?
Active Directory disaster recovery is all about getting your domain controllers (DCs) working again. DCs are servers that run the Microsoft Active Directory Domain Services role, enabling them to provide essential services such as authentication and authorization. Without at least one operational DC, your on-premises or hybrid Microsoft ecosystem cannot function.
Recovery of DCs requires meticulous coordination of numerous steps across multiple phases: preparing for the process, actually performing the restore, syncing the DC with its replication partners and making it available again, and more. That’s complicated enough if the disaster has taken down all the domain controllers in one Active Directory domain; if your entire Active Directory forest is affected, the Active Directory disaster recovery process is even lengthier and more complex.
The phases in an Active Directory recovery
The quickest way to get your business on its feet again after a disaster — and the best practice for Active Directory disaster recovery recommended by both Microsoft and industry experts — is a phased approach. At a high level, the strategy is to get one key DC in each domain back online quickly and then turn your attention to restoring the remaining DCs. Here’s a deeper dive into these two phases:
Phase 1: Restore one DC in each domain. By restoring one DC in each domain, you can quickly get your organization back up and limping, if not fully running.
The first step is to restore the selected DCs from backup. Your options here depend on what tools you have. With a less comprehensive solution, you might be limited to image-level recovery like bare metal recovery (BMR). An enterprise-quality solution may also offer clean operating system (OS) recovery, which is not possible using native tools. If you have both options, there are two main factors to consider:
- Restoring to bare metal has multiple requirements. Although the target machine doesn’t need an operating system pre-installed, it does need the same physical disk layout. Plus, those disks must be at least as large as the original DC so the same partitions can be laid down just as they were on the original computer. You also have to worry about injecting boot-critical drivers, manual (command-line) network configuration and more. Restoring to a clean OS doesn’t involve these concerns; you start with a Windows server that already has all its drivers and is easily configured on the network using a GUI.
- BMR restores entire volumes (disk partitions), which includes files that are not part of AD, such as the boot sector, the program files directory, and the Windows and WinSXS directories. You may need this data if your domain controllers are being used for non-AD-related services like hosting DNS zones that are not AD-integrated, running a Certificate Authority, or running file and print services. However, these extra files provide a lot of places where malware can hide. Therefore, clean OS recovery is often the recommended approach, especially if the attack may have used a zero-day exploit.
After restoring from backup, you need to get the DCs communicating and functioning as a forest again. Microsoft’s Active Directory Forest Recovery Guide outlines 12 configuration procedures comprising 40+ steps that must be performed on every DC you’ve restored from backup. Failure to properly complete these steps can cause AD to break or leave lingering security vulnerabilities.
Phase 2: Promote the rest of your DCs. You have several options here as well, including a straight DCPromo, but Microsoft recommends promoting with “install from media” (IFM). Because IFM slashes the traffic sent across your network in half, it speeds the DC promotion process significantly.
How long does Active Directory disaster recovery take?
How quickly you can recover depends upon the details of your Active Directory environment, what tools and processes you use, how well your recovery plan is documented, and how much you’ve practiced. Every organization is different, so this question is akin to asking, “How long is a piece of string?”
When performed manually with Microsoft tools, AD forest recovery is a difficult, time-consuming and error-prone process. Moreover, if the process includes Azure AD recovery, your organization may run into synchronization and data consistency issues. For example, users may be missing cloud-based attributes like their Office 365 licenses and application role assignments, which are critical for them to be able to do their work. Since most of these attributes don’t go to the Azure AD Recycle Bin, restoring (aka “rebuilding”) them in a timely manner is nearly impossible with this approach.
Speeding the AD recovery process
A purpose-built Active Directory disaster recovery solution speeds recovery by automating many of the manual tasks in the process. This automation empowers organizations to move through the recovery process faster and with fewer IT staff, while ensuring steps are completed in the correct order and without errors. For instance, using native tools, DC promotion must be performed on each DC one by one, and each server will take several minutes to hours to promote. A third-party solution can automate the DCPromo with IFM process and promote several DCs in parallel, significantly shortening Phase 2 of the recovery. Bottom line: Automated solutions help your organization dramatically reduce Active Directory downtime — accelerating the return to normal operations and minimizing the costs of the disaster.
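To see why promoting DCs in parallel shortens Phase 2, it can help to model the wall-clock time. The following is a rough Python sketch under stated assumptions: the per-DC promotion times are made-up numbers, `workers` is a hypothetical parameter for how many promotions run at once, and the model ignores network bandwidth and replication load, which would slow real parallel promotions.

```python
import heapq
from typing import Sequence


def sequential_minutes(durations: Sequence[int]) -> int:
    """One-by-one promotion: total time is the sum of all promotions."""
    return sum(durations)


def parallel_minutes(durations: Sequence[int], workers: int) -> int:
    """Promote several DCs at once, always giving the next DC to the
    worker that frees up first. Returns the wall-clock time in minutes."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0] * workers          # minutes of work assigned to each worker
    heapq.heapify(loads)
    for minutes in sorted(durations, reverse=True):  # longest promotions first
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + minutes)
    return max(loads)


# Hypothetical promotion times, in minutes, for six remaining DCs.
promotion_times = [30, 45, 60, 20, 90, 40]

print(sequential_minutes(promotion_times))        # 285
print(parallel_minutes(promotion_times, 3))       # 105
```

In this toy example, three parallel promotions cut Phase 2 from 285 minutes to 105. The real gain depends on your network, your hardware and the size of the AD database.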
Even better, the best Active Directory disaster recovery solutions cover both Active Directory and Azure Active Directory recovery in one toolset. This unified functionality is critical to avoiding synchronization issues and ensuring the availability and integrity of both on-premises AD and Azure AD. With a single recovery dashboard, the IT team can easily differentiate hybrid and cloud-only objects, compare between production and real-time backups, and perform restore operations quickly.
Ideally, the tool will also provide not just disaster recovery but also granular recovery. This enables IT teams to restore specific objects (including Group Policy objects) and attributes (such as attributes accidentally modified by a migration task or metadata directory process gone awry) without going through the time-consuming process of restarting the domain controller.
Top tips for creating an airtight Active Directory disaster recovery plan
Even if you have the best tools, you still need a solid strategy to guide the recovery. For an effective and comprehensive Active Directory disaster recovery plan, follow these best practices:
Take regular AD backups and store them in a safe place.
Remember the infamous NotPetya attack in 2017? Within hours, this malware brought companies around the world to a standstill, including shipping giant Maersk. Although Maersk had backups of much of its data, nobody could locate a single Active Directory backup. While the IT team scrambled, the company had to reroute ships, was unable to unload cargo in dozens of ports and could not process new orders. In the end, Maersk was saved only by a stroke of luck: One DC at a remote office had been offline during the attack. The company painstakingly shuttled that precious machine to its headquarters to enable the AD recovery process.
Clearly, luck is not a strategy. Take regular AD backups, test them to ensure they’re valid, and store them on an isolated network, third-party cloud platform or other safe place! Microsoft recommends following the 3-2-1 rule: Keep 3 backups of your data on 2 different storage types, and keep at least 1 backup offsite. Quest recommends at least some of your backups be air-gapped (that is, not reachable from the computer network).
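The 3-2-1 rule and the air-gap recommendation are easy to check against a backup inventory. Here is a minimal Python sketch, assuming you keep such an inventory somewhere. The `BackupCopy` class, its fields and the sample entries are hypothetical and do not come from any backup product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    storage_type: str         # for example "disk", "tape" or "cloud"
    offsite: bool             # stored away from the primary site
    air_gapped: bool = False  # not reachable from the computer network


def check_3_2_1(copies: List[BackupCopy]) -> List[str]:
    """Return a list of gaps. An empty list means every check passed."""
    gaps = []
    if len(copies) < 3:
        gaps.append(f"only {len(copies)} backup copies; need at least 3")
    storage_types = {copy.storage_type for copy in copies}
    if len(storage_types) < 2:
        gaps.append("all copies are on one storage type; need at least 2")
    if not any(copy.offsite for copy in copies):
        gaps.append("no offsite copy")
    # Not part of 3-2-1 itself: the additional air-gap recommendation.
    if not any(copy.air_gapped for copy in copies):
        gaps.append("no air-gapped copy")
    return gaps


# Hypothetical inventory of AD backups.
inventory = [
    BackupCopy("nightly-local", "disk", offsite=False),
    BackupCopy("weekly-tape", "tape", offsite=False),
    BackupCopy("nightly-cloud", "cloud", offsite=True),
]

for gap in check_3_2_1(inventory):
    print("GAP:", gap)  # prints "GAP: no air-gapped copy"
```

A check like this only confirms that the copies exist where you think they do. It does not replace test restores, which are the only way to confirm that a backup is valid.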
Types of backups
There are actually several types of backups to know about as you design your Active Directory disaster recovery strategy. Native options include:
- System State backups — These backups include almost the entire operating system, not just the Active Directory pieces. Whereas they contain all that’s needed for AD, they also contain much more, so for a cyber-disaster, they may not be ideal.
- Bare metal recovery backups — BMR enables you to restore your Active Directory DCs to different hardware instances. This option is particularly valuable in the case of physical corruption of DCs and related Active Directory disaster recovery scenarios.
Some third-party disaster recovery solutions provide additional backup options:
- Active Directory backups — These backups include only AD-specific components: the NTDS directory, SYSVOL (which contains Group Policy and logon scripts) and aspects of the registry that have to do with AD. By excluding the many other components in a System State or BMR backup, AD backups dramatically reduce the risk of reinfection by malware after the recovery process.
- Azure AD backups — In hybrid AD environments, you also need a backup strategy for cloud-only objects and attributes. Examples include not just the Microsoft 365 licenses and application role assignments mentioned earlier, but also Office 365 and Azure AD groups, cloud-only users like Azure B2B and B2C accounts, and Azure AD MFA settings and Conditional Access policies. These objects and attributes are not adequately protected by native Microsoft tools nor covered by any on-prem-only Active Directory backup solution. Therefore, without Azure AD backups, a recovery often leads to both business and security issues. Employees may not be able to access the resources they need to do their jobs, users may be able to access sensitive resources without completing multifactor authentication, and partner and customer accounts may be lost forever.
What about VM snapshots?
Note that I did not include VM snapshots — images of a virtual machine (VM) at a given point in time. That’s because VM snapshots are not adequate Active Directory backups. Using them for forest recovery will almost always result in data consistency problems that are difficult to resolve. Examples include lingering objects (objects that are present on one DC but that were fully deleted from other DCs) and Update Serial Number issues that will break replication. Plus, like BMR backups, snapshots can include malware, which will be restored onto the DC with everything else.
Finally, control over VM snapshots usually resides with the virtualization operations team, which complicates the AD team’s job during AD recovery. Moreover, the virtualization team might not even know that the snapshots are critical for disaster recovery and therefore not protect them appropriately. For example, storing them on a normal file share leaves them very vulnerable to corruption by attackers or malware.
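The Update Serial Number problem mentioned above is easier to grasp with a toy model. The Python sketch below is a deliberate simplification and not the actual AD replication protocol: `ToyDC` and `ToyPartner` are hypothetical classes, each change gets the next USN on the DC where it is written, and a replication partner only asks for changes with a USN higher than the last one it has seen.

```python
class ToyDC:
    """Very simplified DC: a local USN counter plus a log of changes."""

    def __init__(self, name):
        self.name = name
        self.usn = 0
        self.changes = {}  # USN -> description of the change

    def write(self, description):
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn

    def take_snapshot(self):
        return self.usn, dict(self.changes)

    def revert_to(self, snapshot):
        usn, changes = snapshot
        self.usn = usn
        self.changes = dict(changes)


class ToyPartner:
    """Replication partner that remembers the highest USN it has seen."""

    def __init__(self):
        self.high_watermark = 0
        self.received = []

    def pull_from(self, source):
        for usn in sorted(source.changes):
            if usn > self.high_watermark:
                self.received.append(source.changes[usn])
        self.high_watermark = max(self.high_watermark, source.usn)


dc = ToyDC("DC1")
partner = ToyPartner()

dc.write("create alice")       # USN 1
dc.write("create bob")         # USN 2
snapshot = dc.take_snapshot()  # VM snapshot taken at USN 2
dc.write("create carol")       # USN 3
dc.write("create dave")        # USN 4
partner.pull_from(dc)          # partner has now seen USNs 1 to 4

dc.revert_to(snapshot)         # snapshot rolled back: USN is 2 again
dc.write("disable bob")        # reuses USN 3
dc.write("create erin")        # reuses USN 4
dc.write("create frank")       # USN 5
partner.pull_from(dc)          # partner only asks for USNs above 4

print(partner.received)
# ['create alice', 'create bob', 'create carol', 'create dave', 'create frank']
```

In this model, "disable bob" and "create erin" never reach the partner because their USNs look like changes it already has, while "create carol" and "create dave" survive on the partner but no longer exist on the reverted DC. The two DCs now disagree, and nothing in normal replication will reconcile them.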
Have emergency communications mechanisms that don’t rely on AD.
You need to ensure that business, IT and recovery functions can communicate with one another even if AD is down. Therefore, don’t rely on IT systems that might be unavailable, such as email or Microsoft Teams. Here at Quest, we use a secure, platform-agnostic messaging solution. We also share the mobile phone numbers of all critical recovery team members and configure our smartphones to allow notifications from each other even if the phone is set to “do not disturb.”
While you’re at it, make sure that you store your Active Directory disaster recovery plan itself somewhere you can access it even if Active Directory is down. One option is to go old-school and print it out; another is to store it in a separate cloud storage service such as Dropbox. Whatever you choose, make sure all stakeholders know how to access it and can do so quickly.
Identify the escalation path and the key decision-makers at every level of the path.
Plan for all the contingencies you can imagine. Also determine who will make decisions at every juncture and know how to contact them anywhere, anytime. Remember that when your IT ecosystem is down, every second counts. During the recovery effort, you simply can’t afford to be dithering about who’s authorized to make the decision to start recovery or who’s responsible for what.
Test the recovery plan with IT professionals who didn’t develop the plan.
It’s vital to ferret out any invalid assumptions, incomplete processes and missing information in your Active Directory disaster recovery plan and revise it accordingly. Repeat the testing and revision cycle until you get a rock-solid strategy that any qualified IT team member can execute. Otherwise, you’ll have a faulty or incomplete plan that might delay a real recovery or even send it in the wrong direction altogether.
Therefore, it’s critical to have people who didn’t write the plan perform the testing; they provide a vital new perspective. That said, AD is a little too complex for a layman to troubleshoot when something goes wrong. Testers should be AD architects with at least a semi-intimate understanding of your organization’s AD deployment.
Practice the plan at least twice a year.
The best way to speed your Active Directory disaster recovery is to practice the procedures in your plan until they become automatic. Everyone on the team needs to be completely versed in their responsibilities in different recovery scenarios.
Update your Active Directory disaster recovery plan regularly.
IT ecosystems are dynamic places, so update your Active Directory disaster recovery plan regularly. Account for changes in systems and processes, as well as changes to team composition or contact information. Every test run and every practice session should result in plan updates and clarifications.
Also be sure to check for new or updated compliance mandates and business requirements. The applications that were labeled most important to the organization yesterday might not be the ones that are considered most critical today, which can affect how you prioritize AD recovery operations.
How to get the budget needed for a solid Active Directory disaster recovery strategy
If you’re an IT pro, you probably don’t need any more convincing that you need to be able to restore Active Directory quickly after a disaster. You know all too well who gets blamed for IT system downtime, and you don’t need the stress of having your phone ringing off the hook — and worrying about whether you’ll need to be updating your resume.
But you might need help convincing your management team to invest in the tools you need for effective Active Directory disaster recovery! Here’s the best strategy I know for getting them on board. It’s inspired by Tom Paxton’s tongue-in-cheek ballad “I’m Changing My Name to Chrysler,” which includes this sage line about humans: “We were hardly up and walking before money started talking.”
Calculating the cost of Active Directory downtime
So, talk money! Show how much every extra moment of Active Directory downtime will cost your organization. Here are the key components to include:
- Lost productivity — Without Active Directory to authenticate users and provide access to data and services, virtually no one will be able to do their job. It’s easy to calculate the cost of this lost productivity. For example, let’s assume average employee compensation is $87,000/year and 1,000 employees are idled. In that case, lost productivity alone will cost the organization more than a third of a million dollars per day.
- Lost revenue — How much revenue does your organization generate in a day? Multiply that by the number of days of AD downtime. (Spoiler: This is likely also a huge number.)
- Lost business — Active Directory downtime is likely to affect your ability to take and process orders and otherwise fulfill client needs. So, tack on the lost future revenue stream from frustrated customers choosing to take their business to your competitors.
- Compliance penalties — If you’re subject to compliance regulations, you’ll need to add in the costs of fines and additional audits.
- Legal fees and remediation costs — If customer data is exposed, you might also face legal fees and costs for remediation measures like free credit monitoring.
- Cold, hard Bitcoin — If the downtime is due to a ransomware attack and you choose to pay the ransom, you’ll need to add in that cost as well. (Note that this money is probably wasted, since only 8 percent of organizations that pay the ransom actually get back all of their data.)
- Lasting damage to your organization’s reputation — The longer the outage continues, the more your company will be featured in the headlines. So, in addition to the customers who turn away immediately, you’ll lose revenue from all the prospects who will avoid you in the future. Indeed, you may well struggle for years to rebuild your brand. Although this damage can be hard to quantify, it could be the largest component of the total cost of AD downtime.
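The first two components in the list are simple arithmetic, and a short script keeps the assumptions visible when you present the numbers. The following is a minimal Python sketch and not a complete cost model: the 260 working days per year, the `DowntimeInputs` fields and the $2 million daily revenue figure are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumption: 52 weeks x 5 working days. Adjust for your organization.
WORKING_DAYS_PER_YEAR = 260


@dataclass
class DowntimeInputs:
    avg_annual_compensation: float  # average compensation per employee, in dollars
    idled_employees: int            # employees who cannot work while AD is down
    daily_revenue: float            # revenue generated per day, in dollars
    days_down: float                # length of the outage, in days


def lost_productivity(inputs: DowntimeInputs) -> float:
    """Compensation paid to idled employees over the outage."""
    per_employee_per_day = inputs.avg_annual_compensation / WORKING_DAYS_PER_YEAR
    return per_employee_per_day * inputs.idled_employees * inputs.days_down


def lost_revenue(inputs: DowntimeInputs) -> float:
    """Daily revenue multiplied by the number of days of downtime."""
    return inputs.daily_revenue * inputs.days_down


def quantifiable_cost(inputs: DowntimeInputs) -> float:
    """Sum of the two components that are easiest to quantify up front."""
    return lost_productivity(inputs) + lost_revenue(inputs)


# The example from the list: $87,000 average compensation, 1,000 idled employees.
# The daily revenue figure is a made-up placeholder.
example = DowntimeInputs(
    avg_annual_compensation=87_000,
    idled_employees=1_000,
    daily_revenue=2_000_000,
    days_down=1,
)
print(f"Lost productivity: ${lost_productivity(example):,.0f}")  # $334,615
print(f"Lost revenue:      ${lost_revenue(example):,.0f}")       # $2,000,000
print(f"Total (partial):   ${quantifiable_cost(example):,.0f}")  # $2,334,615
```

Fines, legal fees, any ransom and reputational damage would be added as separate line items, each with its own stated assumption.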
With a quantified estimate of the staggering cost of just one disaster, you can build a solid business case for creating a comprehensive Active Directory disaster recovery plan and investing in the tools required to speed the recovery process and ensure business continuity.
Summary
Active Directory is the beating heart of your IT ecosystem. The longer it’s down due to an attack, natural disaster or mistake, the longer your business will be down, and the more it will cost. Therefore, every organization today needs a comprehensive Active Directory disaster recovery plan and the right tools. Take it from Gartner:
“Accelerate recovery from attacks by adding a dedicated tool for backup and recovery of Microsoft Active Directory.”
– Gartner, Inc., “How to Protect Backup Systems From Ransomware Attacks,” Nik Simpson, September 21, 2021.