One of the most challenging aspects of Active Directory disaster recovery involves recovering an entire forest. While I suspect that there have been few situations where a complete forest recovery was necessary, most IT disaster recovery plans require a documented recovery procedure should the entire forest go toes-up.
The steps and practices required for a complete forest recovery are very similar for small and large environments, although the details will depend on whether you have Windows Server 2003 or 2008 Active Directory installed.
It is important to understand the technology and method used for a forest recovery. Consider that the forest is made up of smaller components (namely objects, domain controllers, sites and domains) that are more likely to fail than the entire forest. A forest disaster recovery plan must therefore cover the recovery of these smaller components as well.
So let's take a look at what drives a good disaster recovery plan for Active Directory. The best recovery plans start with a good disaster prevention strategy, so let's begin there.
Keeping a current, tested backup isn't rocket science, but you'd be surprised how many administrators either have no current backup or no idea if the backup they have is usable. I recently worked with one company that had only one DC in the forest root domain of a multiple domain forest, and the backup was 10 months old. The DC crashed and could not be recovered (a future article will detail how we eventually recovered the forest).
At any rate, it is important to have a current backup and to understand that a backup is only good for tombstone lifetime (TSL) days. A DC restored from an older backup can reintroduce objects that the rest of the forest has already deleted and purged. TSL, typically recommended to be 120-180 days, is stored in the tombstoneLifetime attribute of the object cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=corp,dc=com, where "dc=corp,dc=com" is the DN of the forest root domain.
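The tombstone lifetime rule comes down to simple date arithmetic, and it is worth automating as part of backup monitoring. The following is a minimal sketch, not a finished tool: the function names and the 180-day default are assumptions for illustration, and in practice you would read the real TSL value from the directory rather than hard-code it.

```python
from datetime import date


def backup_age_days(backup_date: date, today: date) -> int:
    """Age of the backup in whole days."""
    return (today - backup_date).days


def backup_is_usable(backup_date: date, today: date, tsl_days: int = 180) -> bool:
    """True if the backup is younger than the tombstone lifetime.

    tsl_days defaults to 180 only as an illustrative assumption; use the
    value configured in your own forest. The comparison is deliberately
    conservative: a backup exactly tsl_days old is treated as expired.
    """
    age = backup_age_days(backup_date, today)
    return 0 <= age < tsl_days


def days_until_expiry(backup_date: date, today: date, tsl_days: int = 180) -> int:
    """Days left before the backup passes the tombstone lifetime (negative if expired)."""
    return tsl_days - backup_age_days(backup_date, today)


if __name__ == "__main__":
    # A backup roughly 10 months old, like the one in the story above.
    old_backup = date(2010, 1, 1)
    today = date(2010, 11, 1)
    print(backup_age_days(old_backup, today), backup_is_usable(old_backup, today))
```

Run against the 10-month-old backup described earlier, a check like this would have flagged the problem long before the DC crashed.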
The Windows Server Backup utility with Windows 2008 has some definite improvements over the previous NTBackup program. Windows Server Backup uses volume shadow copy for a faster, much more efficient backup method. It backs up to a single file, which can easily be stored and managed on a remote server (a NAS server, for example).
Note that Windows Server Backup does not support tape backup at this time, but the tape APIs are available if you have a third-party backup solution to back up to tape. It is recommended that you have the backups stored on disk for fast recovery, but also stored on tape for long-term storage.
The important thing is to back up your DCs regularly and monitor for errors. The minimum requirement is to back up at least two DCs in each domain in case one fails. If you have remote sites, it's probably a good idea to back up the DCs there as well, provided you have IT staff on site to do it and a safe place to store the backups.
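The "two DCs per domain" minimum is easy to check mechanically once you have a list of backup results. Here is a small sketch under stated assumptions: the input format (domain, DC name, success flag) and the function name are hypothetical, and you would feed it from whatever your backup product actually reports.

```python
from collections import defaultdict

MIN_BACKED_UP_DCS = 2  # the minimum suggested in the text above


def domains_below_minimum(backup_results, all_domains):
    """Return the domains that have fewer than two DCs with a good backup.

    backup_results: iterable of (domain, dc_name, succeeded) tuples.
    all_domains: every domain in the forest, so that a domain with no
    backup records at all is still reported.
    """
    good = defaultdict(set)
    for domain, dc_name, succeeded in backup_results:
        if succeeded:
            good[domain].add(dc_name)
    return sorted(d for d in all_domains if len(good[d]) < MIN_BACKED_UP_DCS)


if __name__ == "__main__":
    results = [
        ("corp.com", "DC1", True),
        ("corp.com", "DC2", True),
        ("emea.corp.com", "DC3", True),
        ("emea.corp.com", "DC4", False),  # backup job reported an error
    ]
    print(domains_below_minimum(results, ["corp.com", "emea.corp.com"]))
```

Passing the full list of domains separately matters: a domain that is never backed up produces no records, and would otherwise go unnoticed.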
The alternative to remote site backups is to rebuild the DC over the WAN or ship backup media to the site and use the Install From Media method to quickly restore a DC. The method you choose will largely depend on your organization and circumstances.
The lag site
The lag site is an oldie but a goodie. It is simply a site, usually created at the same physical location as the corporate hub site, with one or two DCs on a separate subnet from the hub site. The domain controllers are set on a schedule to allow replication only every seven days (one DC) or three-and-a-half days (two DCs). The purpose is to recover objects such as organizational units (OUs), users, groups, or computers that administrators have deleted by mistake, which happens more often than you might think.
If an OU with 5,000 users were deleted on Monday and the lag site DC replicated only on Sundays, that DC would not yet have seen the deletion of these user accounts. Therefore, you could perform an authoritative restore on it before it replicates again and recover the user accounts. Having a live (or virtual) machine to do the authoritative restore is much faster and easier than restoring from a backup first.
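The timing in that example is what makes or breaks a lag site recovery, so it helps to see it spelled out. The sketch below is illustrative only: the function name and parameters are assumptions, and it models nothing more than a single lag site DC that pulls changes on a fixed interval.

```python
from datetime import datetime, timedelta


def lag_restore_window(last_replication, deleted_at, now, interval=timedelta(days=7)):
    """Return the time left to run an authoritative restore on the lag DC.

    Returns None if the lag DC can no longer help, either because the
    deletion happened before its last inbound replication (so it already
    has the deletion) or because its next replication is already due.
    """
    next_replication = last_replication + interval
    if deleted_at <= last_replication:
        return None  # the lag DC already replicated the deletion
    if now >= next_replication:
        return None  # the scheduled replication window has arrived
    return next_replication - now


if __name__ == "__main__":
    sunday = datetime(2010, 6, 6, 0, 0)           # lag DC last replicated
    monday_delete = datetime(2010, 6, 7, 9, 0)    # OU deleted by mistake
    monday_noticed = datetime(2010, 6, 7, 12, 0)  # help desk calls start
    print(lag_restore_window(sunday, monday_delete, monday_noticed))
```

With a weekly schedule, a Monday deletion noticed the same day leaves several days to act. A deletion made shortly before the scheduled replication leaves almost none, which is the argument for running two lag DCs on offset schedules.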
With Windows Server 2008 R2, the Active Directory Recycle Bin and the use of virtual machines have somewhat reduced the popularity of and need for the lag site method, but it is still a valid alternative.
Note that you must take precautions to prevent these lag site DCs from authenticating clients. I went over this in more detail in a previous article on implementing the lag site.
Virtual machines are widely used for domain controllers in many Active Directory environments. It is easy to have a virtual DC saved offline and then bring it back, either to recover objects via an authoritative restore (as you would with a lag site DC) or to restore a domain.
Consider a case where you have four DCs on two physical hosts. Each day the VMs are taken offline and the virtual machine image for each one is copied to the other host. Thus Host A would run virtual DC1 and 2, and have saved images for DC3 and 4. Similarly, Host B would run DC3 and 4 and store images for DC1 and 2. Each host would be capable of running all four DCs if needed. The images could also be stored on a NAS server or a SAN disk.
In this scenario, the recovery of any DC would be as fast as booting up the image on the other host, or copying the image from a remote disk to the virtual machine host. This is a faster restore than a lag site would be. Tape backups of the images each day would further guard against failure.
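The cross-host arrangement can be written down as a simple rule: each host stores the images of every DC it does not run. The following sketch shows that rule and what it means when a host fails. The host and DC names come from the example above; the function names are hypothetical.

```python
def image_placement(running):
    """Given host -> list of DCs it runs, return host -> images it should store.

    Each host keeps a saved image of every DC that runs on some other host.
    """
    all_dcs = {dc for dcs in running.values() for dc in dcs}
    return {host: sorted(all_dcs - set(dcs)) for host, dcs in running.items()}


def recover_from_host_failure(running, failed_host):
    """Return surviving host -> DCs it can boot from its stored images."""
    lost = set(running[failed_host])
    placement = image_placement(running)
    return {
        host: [dc for dc in images if dc in lost]
        for host, images in placement.items()
        if host != failed_host
    }


if __name__ == "__main__":
    running = {"HostA": ["DC1", "DC2"], "HostB": ["DC3", "DC4"]}
    print(image_placement(running))
    print(recover_from_host_failure(running, "HostA"))
```

Keep in mind that each image is only as current as the last daily copy, so a DC booted from a stored image still has to replicate the changes it missed.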
Read-only domain controllers
The use of read-only domain controllers (RODC) can enhance the ability to get Active Directory services back online quickly. If your recovery plan requires DCs at remote sites to be re-promoted either over the WAN or with Install From Media, you need someone with proper permissions to perform the promotion. With RODCs, users who are not domain admins can be delegated the right to promote domain controllers, which makes recovery at small, less secure sites go much faster.
The recovery process
There are several things to consider with a recovery process that should be determined in advance – before the panic starts.
Define what constitutes a scenario that requires a recovery. Just because a DC is down for a day doesn't necessarily imply that it should be recovered. Likewise, don't wait for several days for the situation to get worse. This should be defined in the "emergency" change control procedure.
The recovery priority should be set. While this will depend on your domain and site configuration, you should generally:
Restore DCs in the corporate hub site first. Many organizations with a multiple domain forest locate a DC from each domain in the corporate hub to make disaster recovery easier. While it may not be sufficient to handle the full client load, it will at least keep the domain alive amidst a complete failure.
Restore DCs and global catalogs at Exchange Server sites. This is critical to keeping email alive, which will reduce the help desk calls for email failure.
Restore DCs and global catalogs at remote sites in order of importance. You should use criteria such as number (and/or importance) of users, critical applications that require DC access for authentication, and so on.
Restore DCs and global catalogs at poorly connected sites. These may be pushed up in priority to provide DC services for users that may have difficulty finding a remote DC due to poor network services. One philosophy is to let users at well connected sites authenticate across the WAN while DCs at poorly connected sites are restored first.
The priority should be defined in the recovery plan, with input from business units, help desk organizations, the network team, and so forth. Remember, the recovery needs to get the business restored as soon as possible.
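Once the priorities are agreed on, they can be captured as a sort order so that nobody has to argue about sequence in the middle of an outage. This is a minimal sketch of the ordering described above. The Site fields, the use of user count as the measure of "importance", and the promote_poorly_connected switch are all assumptions for illustration; your own criteria will differ.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    is_hub: bool = False
    hosts_exchange: bool = False
    users: int = 0
    poorly_connected: bool = False


def recovery_order(sites, promote_poorly_connected=False):
    """Return site names in the order their DCs should be restored.

    Default order: hub site, Exchange sites, remote sites by user count,
    then poorly connected sites. With promote_poorly_connected=True, the
    poorly connected sites move ahead of the other remote sites.
    """
    def key(site):
        if site.is_hub:
            tier = 0
        elif site.hosts_exchange:
            tier = 1
        elif site.poorly_connected:
            tier = 2 if promote_poorly_connected else 3
        else:
            tier = 3 if promote_poorly_connected else 2
        # Within a tier, larger sites first; name breaks ties predictably.
        return (tier, -site.users, site.name)

    return [site.name for site in sorted(sites, key=key)]


if __name__ == "__main__":
    sites = [
        Site("Branch-Small", users=50),
        Site("Hub", is_hub=True, users=2000),
        Site("Remote-Island", users=80, poorly_connected=True),
        Site("Mail-Site", hosts_exchange=True, users=300),
        Site("Branch-Large", users=900),
    ]
    print(recovery_order(sites))
    print(recovery_order(sites, promote_poorly_connected=True))
```

The switch reflects the two philosophies mentioned above: restore by importance, or restore poorly connected sites early because their users cannot easily reach a DC across the WAN.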
While the need to recover an entire forest is highly unlikely, a good forest recovery plan will ensure that the underlying components can be recovered quickly and successfully. It is important to review the features and changes in future versions of Windows Server to determine how they will affect the recovery plan. Each new version of Windows Server seems to include improvements to make recovery more efficient, and may even provide justification for an upgrade.
ABOUT THE AUTHOR
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.