I recently worked on a case in which an entire forest root domain had to be recovered. The structure itself was relatively simple. It consisted of two domains: an empty forest root and a child domain that held all the users, computers, and other objects. It had only about 4,000 users.
But there were two (almost fatal) problems. First, the organization had built only one domain controller (DC) in the root domain. Second, to make matters worse, that DC had not been backed up in more than 10 months. Although the root DC had a RAID-5 disk configuration, disaster struck and two of the drives failed on the same day.
This type of configuration likely resulted from a best practice Microsoft espoused in the early days of Windows 2000. The recommendation at that time was to create an empty root domain so that child domains could be added and removed if the names changed (a domain could not be renamed once it was defined).
This philosophy is no longer followed, however, because multiple-domain forests bring complexities of their own: restoring back links between groups and users in cross-domain groups, lingering objects held in the read-only partitions of Global Catalog servers, and other related issues. To avoid these issues, some organizations collapse their multiple-domain structure into a single domain.
In this example, the two domains are Corp.com and EMEA.Corp.com, with Corp-DC1 the domain controller in the root domain and EMEA-DC1 and EMEA-DC2 in the child domain.
Note that all clients -- including users, workstations, and servers -- were unaffected by the issue, giving us time to develop and enact an action plan.
The challenge
This situation presented several questions and challenges, including:
Still, there were some positive factors in this disaster as well:
The recovery plan
An initial idea was to roll the EMEA DCs back to their January backups, restore the Corp DC, roll the child DCs forward, and then let replication catch up. This 20-step process would have required several days of downtime, and it was rejected because of its complexity and destructive nature.
We ended up using the following, simpler plan:
The process took about three weeks and most of the time was spent studying logs, doing the restore, etc. It was deliberate and methodical, as we wanted to make sure everything was done properly. In addition, users experienced no downtime. This means that although the forest seemed precarious without a root domain, it functioned well for user authentication and so forth while we worked on the restore. When we did the production restore, we did it during business hours without affecting the users.
The recovery process
The recovery process consisted of the following steps:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Parameters
ValueName = Strict Replication Consistency
Data Type = REG_DWORD
Value Data = 1
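As an illustration only, the registry value above could be captured in a .reg file so that the same setting can be reviewed and applied consistently on each DC. The following is a minimal Python sketch that builds the file text; the function name `build_reg_file` is hypothetical and not part of the original procedure, and the resulting file would still have to be imported by an administrator on the DC itself.

```python
# Sketch: generate .reg file text for the Strict Replication Consistency
# value described above. build_reg_file is a hypothetical helper name.

KEY_PATH = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Strict Replication Consistency"


def build_reg_file(enabled: bool = True) -> str:
    """Return .reg file text that sets the REG_DWORD value to 1 or 0."""
    dword = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # .reg files express DWORD data as eight hexadecimal digits.
        f'"{VALUE_NAME}"=dword:{dword:08x}',
        "",
    ]
    # Registry editor files conventionally use Windows line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file())
```

Generating the text rather than writing to the registry directly keeps the sketch safe to run anywhere and leaves the actual change as a deliberate manual step.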
C:\>netdom trust Corp /domain:EMEA.corp.com /verify
The trust between Corp and EMEA.corp.com has been successfully verified.
Results
The initial results showed a number of errors and warnings in the event logs and some errors in the Repadmin /showrepl reports. Many of these errors occurred because the system was still settling, and after it ran overnight, most of them cleared on their own. We then worked on the remaining events until all of them were resolved. The test and production environments yielded similar results.
We saw a number of 1869 and 1865 events related to difficulty finding a global catalog. In spite of all the events, replication worked, which we confirmed by running Repadmin /replsum /bysrc /bydest /sort:delta:
Source DC | largest delta | fails/total | %%error |
Emea-DC2 | 32m:33s | 0/9 | 0 |
Corp-DC1 | 30m:16s | 0/4 | 0 |
EMEA-DC2 | 15m:16s | 0/6 | 0 |
Destination DC | largest delta | fails/total | %%error |
Corp-DC1 | 32m:33s | 0/3 | 0 |
EMEA-DC2 | 29m:23s | 0/10 | 0 |
EMEA-DC1 | 06m:32s | 0/6 | 0 |
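A summary like the one above is easy to check by eye for three DCs, but a small script can help when it has to be reviewed repeatedly during a restore. The following Python sketch parses rows in the layout shown in this article and flags any DC with failures or with a largest delta above a threshold. The function names and the 30-minute threshold are assumptions made for illustration. Actual Repadmin output is space-delimited and appends error text to failing rows, so the pattern would need adjusting for those cases.

```python
import re

# Matches rows such as "Corp-DC1 | 32m:33s | 0/3 | 0 |" once the pipes
# are replaced with spaces. Header lines do not match and are skipped.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\s*/\s*(\d+)\s+(\d+)\s*$")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_delta(delta: str) -> int:
    """Convert a delta such as '32m:33s' into seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", delta))


def parse_rows(text: str) -> list:
    """Return one dict per data row found in the summary text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.replace("|", " ").strip())
        if match:
            dc, delta, fails, total, _pct = match.groups()
            rows.append({
                "dc": dc,
                "delta_seconds": parse_delta(delta),
                "fails": int(fails),
                "total": int(total),
            })
    return rows


def flag_rows(rows: list, max_delta_seconds: int = 1800) -> list:
    """Names of DCs with any failures or a delta above the threshold."""
    return [r["dc"] for r in rows
            if r["fails"] > 0 or r["delta_seconds"] > max_delta_seconds]


SAMPLE = """\
Destination DC | largest delta | fails/total | %%error |
Corp-DC1 | 32m:33s | 0/3 | 0 |
EMEA-DC2 | 29m:23s | 0/10 | 0 |
EMEA-DC1 | 06m:32s | 0/6 | 0 |
"""

if __name__ == "__main__":
    print(flag_rows(parse_rows(SAMPLE)))
```

The threshold is a judgment call: a delta slightly over the replication interval is not necessarily a problem, so flagged DCs are candidates for a closer look with Repadmin /showrepl rather than confirmed failures.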
Overall, the restoration worked extremely well and was relatively error-free. It was accomplished with no downtime and very little risk to the environment. An authoritative restore was not needed, and the trust did not have to be repaired. We had the confidence to put this plan into production because we had tested it in a test environment. Still, this is one of those situations where you think "this should work" -- but you don't really know until you try it.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.
15 Feb 2010