Whether you have a forest with many remote domain controllers (DCs) spread over sites all over the globe or just a few DCs in remote sites, you know how hard it is to monitor them for replication errors. Of course, there are many expensive third-party products that let you customize reporting in a polished GUI and send you e-mail or page you. But if your company, like most, is trying to trim IT costs, you probably can't afford one of these tools, especially if you are a small or medium-sized company.
Well, while third-party companies have been developing sophisticated, flashy tools, Microsoft has been busy enhancing the command-line tools available natively in the OS, as well as in the Windows Support Tools and Resource Kit tools. One of the most powerful yet overlooked tools is Repadmin. It seems that Microsoft's support engineers provide feedback to developers about the tools and features they need, so these tools, especially the Windows Support Tools, are constantly getting upgraded.
Repadmin has had significant features added in both Windows 2003 and Windows 2003 SP1. Note that applying SP1 will not upgrade the Windows Support Tools. You need to either install them from the SP1 CD or download them from the Microsoft Web site at: https://premier.microsoft.com/default.aspx?scid=kb;en-us;892777 . It is important that you install these at least on every DC. When you are troubleshooting a problem, you don't want to start scanning the Web to find tools to download and install.
Monitoring replication is critical to maintaining AD health, yet many administrators don't seem to be concerned, or perhaps just don't have time. Working in third-level customer support, I see cases every day of administrators who suddenly notice that a DC hasn't replicated for several months. In one case last year, we got logs from one administrator that showed that a certain DC hadn't replicated since October of 2001!
It seems, then, that we could really use a tool that gives us an end-to-end report on the status of replication without combing through event logs. Here's where Repadmin comes to the rescue! Windows 2003 Support Tools provided a new option in Repadmin called /ReplSummary, or simply /ReplSum. This option has a number of arguments, but I've only used the following three:
- /BySrc -- List the DCs that are replication "sources" and their replication failures. This lists DCs and their outbound replication status, that is, the last time outbound replication occurred.
- /ByDest -- List the DCs that are "destinations" and their replication failures. This lists DCs and notes their inbound replication status, or the last time inbound replication occurred.
- /Sort:Delta -- This is just a formatting switch indicating that we want to see the DCs sorted by the time since their last successful replication, longest first (the worst are first).
I executed the following Repadmin command on our test forest, which contains a root domain (Qtest.cpqcorp.net), two child domains (Qamericas.Qtest.cpqcorp.net and QEMEA.Qtest.cpqcorp.net) and about 15 DCs around the globe. Qtest is a native Windows 2003 domain, Qamericas is a native Windows 2000 domain and QEMEA is a native Windows 2000 domain with a mixture of 2000 and 2003 DCs. This shows that the command works well in a very diverse environment.
Repadmin /replsum /bysrc /bydest /sort:delta
The resultant report shows some important information about the health of this forest. The first column is the DC name. The second column, Largest Delta, is the time since the last successful replication. The Fails/Total column indicates how many of the replication attempts in the sample failed, and the fourth column, %%, gives the failure percentage. So Qtest-DC5 failed 20 times in 20 tries for a failure rate of 100%. The final column is the replication error causing the failure.
|Source DC||Largest Delta||Fails/Total||%%||Error|
|QTEST-DC5||>60 days||20/20||100||(1753) There are no more endpoints available from the endpoint mapper.|
|QTEST-DC22||>60 days||31/31||100||(8524) The DSA operation is unable to proceed because of a DNS lookup failure.|
|KPARKHURST4||>60 days||11/11||100||(8524) The DSA operation is unable to proceed because of a DNS lookup failure.|
|QEMEA-DC32||>60 days||16/16||100||(1722) The RPC server is unavailable.|
|QEMEA-DC4||01d.12h:35m:15s||11/11||100||(1722) The RPC server is unavailable.|
|Destination DC||Largest Delta||Fails/Total||%%||Error|
|QEMEA-DC3||>60 days||25/65||38||(1753) There are no more endpoints available from the endpoint mapper.|
|QTEST-DC9||>60 days||16/35||45||(1753) There are no more endpoints available from the endpoint mapper.|
|QEMEA-DC7||>60 days||15/64||23||(1753) There are no more endpoints available from the endpoint mapper.|
|QAMERICAS-MDC1||>60 days||5/27||18||(8524) The DSA operation is unable to proceed because of a DNS lookup failure.|
|QAMERICAS-DC39||>60 days||6/26||23||(8524) The DSA operation is unable to proceed because of a DNS lookup failure.|
|QEMEA-MDC1||>60 days||17/58||29||(1722) The RPC server is unavailable.|
|QEMEA-MDC2||01d.12h:32m:19s||3/17||17||(1722) The RPC server is unavailable.|
|QTEST-DC7||01d.12h:26m:32s||2/15||13||(1722) The RPC server is unavailable.|
I experienced the following operational errors trying to retrieve replication information:
58 - qtest-dc22.Qtest.cpqcorp.net
58 - qemea-dc32.Qemea.Qtest.CPQcorp.net
58 - qtest-dc5.Qtest.cpqcorp.net
58 - BEDROCKDC4.jc.qamericas.qtest.cpqcorp.net
58 - kparkhurst4.qamericas.qtest.cpqcorp.net
58 - Bedrockdc5.jc.qamericas.qtest.cpqcorp.net
58 - qemea-dc4.Qemea.Qtest.CPQcorp.net
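The report rows above can also be read by a script. The following is a minimal Python sketch, not part of Repadmin or the Support Tools. It assumes rows in the double-pipe layout used in this article (the raw console output of Repadmin is laid out differently), and the names `ReplRow`, `parse_row`, `failure_percent` and `group_by_error` are hypothetical. The truncating percentage is inferred from the sample rows (for example, 16/35 is shown as 45), not from any documented behavior.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ReplRow(NamedTuple):
    """One DC row from the report as rendered in this article (hypothetical type)."""
    dc: str
    largest_delta: str
    fails: int
    total: int
    percent: int
    error_code: int
    error_text: str


def parse_row(line: str) -> ReplRow:
    """Parse a row such as |DC||delta||fails/total||pct||(code) message|."""
    cells = line.strip().strip("|").split("||")
    if len(cells) != 5:
        raise ValueError(f"expected 5 columns, got {len(cells)}: {line!r}")
    dc, delta, ratio, pct, error = cells
    fails, total = (int(part) for part in ratio.split("/"))
    match = re.match(r"\((\d+)\)\s*(.*)", error)
    if match is None:
        raise ValueError(f"no error code found in {error!r}")
    return ReplRow(dc, delta, fails, total, int(pct),
                   int(match.group(1)), match.group(2))


def failure_percent(fails: int, total: int) -> int:
    """Failure rate as a whole number, truncated (assumption based on the sample)."""
    return fails * 100 // total


def group_by_error(rows: Iterable[ReplRow]) -> Dict[int, List[str]]:
    """Map each error code to the DCs reporting it, so one fix can be applied per cause."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for row in rows:
        groups[row.error_code].append(row.dc)
    return dict(groups)
```

Grouping by error code mirrors how the report is read in practice: DCs that share error 8524 point to one DNS problem, while those with 1722 point to connectivity.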
Let's take a look at the report and see what we can learn from it. The report covers DCs from the whole forest, that is, from all three domains. The first section is labeled "Source DC" and lists the DCs in all domains that are sources for replication (outbound). Note the following points:
- There are four DCs that have not replicated for more than 60 days: Qtest-DC5, Qtest-DC22, Kparkhurst4 and Qemea-DC32. This isn't a good thing, but it is not necessarily a problem. Any object changes (create, modify, delete) made on these DCs have not been replicated to the other DCs (or to each other) in the past 60 days. If replication is restored, these DCs will try to replicate those changes out, and depending on what has happened to the same objects on other DCs in the meantime, the changes may or may not take effect. While the gap may be of no consequence, it's best to remove such a DC from the domain.
- Note that in the far right column we see the error causing the failure. Two of them are DNS lookup failures (error 8524)! No need to scan event logs. We have what we need here.
- We can see that Qemea-dc4 hasn't replicated for 1.5 days due to "RPC Server Unavailable", so we should investigate and resolve that error.
- The others seem healthy.
The second section is labeled "Destination DC" and lists each DC as a destination of replication, that is, its inbound replication status.
- We see Qemea-DC3, Qtest-DC9, Qemea-DC7, Qamericas-MDC1, Qamericas-DC39 and Qemea-MDC1 have not had inbound replication in more than 60 days. This is a problem. Objects that were deleted on other DCs during that time may already have been purged there, but since the deletions were never replicated to these machines, the objects are still alive on them. If replication is restored, these objects will either be injected back into the AD (if "loose behavior" is enabled) or cause replication to stop (if "tight behavior" is enabled).
- We see Qemea-MDC2 and Qtest-DC7 have failed replication for 1.5 days due to "RPC Server is unavailable" so we need to go check connectivity on those DCs.
The last section simply lists the DCs that this command couldn't run on due to connectivity or other failures.
Note that some DCs have inbound failures, some have outbound failures and some have both. It is important to distinguish between these failure types when troubleshooting.
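The triage applied above (more than 60 days is serious, a day and a half needs investigating) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the one-day warning threshold are hypothetical, the parser handles the two Largest Delta forms that appear in this report (`>60 days` and `01d.12h:35m:15s`), and support for shorter forms without the day or hour part is a guess rather than documented Repadmin output.

```python
import re
from datetime import timedelta
from typing import Tuple

# Matches values like "01d.12h:35m:15s"; the day and hour parts are treated
# as optional, which is an assumption about shorter deltas.
_DELTA_RE = re.compile(r"^(?:(\d+)d\.)?(?:(\d+)h:)?(\d+)m:(\d+)s$")


def parse_delta(text: str) -> Tuple[timedelta, bool]:
    """Return (delta, is_lower_bound) for a Largest Delta cell.

    ">60 days" carries no exact value, so it is returned as 60 days with
    is_lower_bound set to True.
    """
    text = text.strip()
    if text == ">60 days":
        return timedelta(days=60), True
    match = _DELTA_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized delta: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds), False


def triage(delta_text: str, warn_after: timedelta = timedelta(days=1)) -> str:
    """Classify a DC as 'critical', 'warning' or 'ok' (hypothetical thresholds)."""
    delta, lower_bound = parse_delta(delta_text)
    if lower_bound:
        return "critical"   # no successful replication in more than 60 days
    if delta >= warn_after:
        return "warning"    # e.g. the 1.5-day RPC failures in this report
    return "ok"
```

Running the same check separately on the Source DC and Destination DC sections keeps outbound and inbound failures apart, which matters for troubleshooting.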
You should be able to see what a powerful command this is. It is a simple command that produces a comprehensive report on the replication status of all DCs in the forest, in all domains. Repadmin has many other very powerful options that we will discuss in future articles.
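Because the whole report comes from one command, it is easy to capture on a schedule. Here is a minimal Python sketch of that idea. It assumes Repadmin is installed and on the PATH of the machine running the script; the function names and the output path are hypothetical, and only the switches discussed in this article are used.

```python
import subprocess
from pathlib import Path
from typing import List


def build_replsum_command(by_source: bool = True,
                          by_dest: bool = True,
                          sort_by_delta: bool = True) -> List[str]:
    """Assemble the Repadmin command line used in this article."""
    command = ["repadmin", "/replsum"]
    if by_source:
        command.append("/bysrc")
    if by_dest:
        command.append("/bydest")
    if sort_by_delta:
        command.append("/sort:delta")
    return command


def save_report(path: str) -> int:
    """Run the command, write its output to `path`, and return the exit code.

    Assumes repadmin is available on this machine; the call raises
    FileNotFoundError if it is not.
    """
    result = subprocess.run(build_replsum_command(),
                            capture_output=True, text=True, check=False)
    Path(path).write_text(result.stdout)
    return result.returncode
```

A scheduled task could call `save_report` with a dated file name so that successive reports can be compared.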
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers.