Anyone who has been around Active Directory for any length of time in the past seven years is aware that the lion's share of AD problems are to some degree caused by replication failures. One of the most notorious replication errors is the Event ID 1311, whose description says:
In an attempt to help resolve these errors, Microsoft published KB 307593, "How to troubleshoot Event ID 1311 messages on a Windows 2000 domain." While it is a decent article, it is woefully outdated, even though its last review is dated October 2006. It is stuck in the Windows 2000 era, with a lot of space devoted to a bug, fixed long ago, in one of the repadmin functions (called "showism").
Here, I'll sift the wheat from the chaff and give you a concise list of causes and solutions for this event. In a nutshell, this event has one of four general causes: loss of physical connectivity, a gap in logical connectivity (the AD site topology), orphaned objects or DNS errors.
The first step in resolving this replication error is to determine its scope. The easiest way to do this is with the Repadmin /Replsum command, as I explained in a previous article, "Use command line tools to monitor Active Directory." It produces a complete replication summary for all the DCs in the forest, including the relevant error code for any DC that is in an error state. The general form of the command is this:
Here is a sample output of this command. Note that there are four domain controllers failing replication. While the 1311 may not show up here, it is common for it to be paired up with the 1722 event (which basically means no physical connectivity). This command will allow us to focus on the four servers logging errors rather than all DCs in the domain or forest.
Replication Summary Start Time: 2005-10-21 00:02:56
Beginning data collection for replication summary, this may take awhile:
...................
...

Source DC        largest delta     fails/total  %%   error
QTEST-DC5        10d.10h:16m:46s    31 /  31    100  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
QEMEA-DC4        09d.00h:16m:45s     3 /   3    100  (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
BEDROCKDC5       07d.10h:06m:22s     5 /   5    100  (1722) The RPC server is unavailable.
BEDROCKDC4       07d.10h:06m:20s     5 /   5    100  (1722) The RPC server is unavailable.
QAMERICAS-MDC1   07d.06h:17m:40s    22 /  22    100  (1722) The RPC server is unavailable.
KPARKHURST4      03d.11h:13m:49s    12 /  12    100  (1722) The RPC server is unavailable.
QAMERICAS-DC39           17m:55s     0 /  21      0
QTEST-DC9                17m:55s     0 /  25      0
QTEST-DC22               17m:55s     0 /  20      0
QEMEA-MDC1               17m:01s     0 /  47      0
QAMERICAS-DC2            15m:59s     0 /  16      0
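In a large forest, it can help to filter the summary rows above with a script rather than by eye. The following is a minimal Python sketch, not part of Repadmin itself. The names ReplRow, parse_replsum and failing are my own, and the pattern only recognizes rows shaped like the ones in this sample (a delta such as 10d.10h:16m:46s or 17m:55s, then fails / total, a percentage and an optional error code); any other line is skipped.

```python
import re
from typing import List, NamedTuple, Optional


class ReplRow(NamedTuple):
    dc: str
    largest_delta: str
    fails: int
    total: int
    percent: int
    error_code: Optional[int]
    error_text: str


# One summary row: name, delta, "fails / total", percent, optional "(code) text".
# Assumption: only the row shapes seen in the sample output are handled.
ROW_RE = re.compile(
    r"^\s*(?P<dc>\S+)\s+"
    r"(?P<delta>(?:\d+d\.)?(?:\d+h:)?\d+m:\d+s)\s+"
    r"(?P<fails>\d+)\s*/\s*(?P<total>\d+)\s+"
    r"(?P<pct>\d+)"
    r"(?:\s+\((?P<code>\d+)\)\s*(?P<text>.*))?\s*$"
)


def parse_replsum(output: str) -> List[ReplRow]:
    """Turn summary text into rows; headers and other lines are ignored."""
    rows = []
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue
        rows.append(ReplRow(
            dc=m["dc"],
            largest_delta=m["delta"],
            fails=int(m["fails"]),
            total=int(m["total"]),
            percent=int(m["pct"]),
            error_code=int(m["code"]) if m["code"] else None,
            error_text=(m["text"] or "").strip(),
        ))
    return rows


def failing(rows: List[ReplRow]) -> List[ReplRow]:
    """Keep only the DCs that report at least one failed replication."""
    return [r for r in rows if r.fails > 0]


# Three rows copied from the sample output in this article.
SAMPLE = """\
Source DC largest delta fails/total %% error
QTEST-DC5 10d.10h:16m:46s 31 / 31 100 (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
BEDROCKDC5 07d.10h:06m:22s 5 / 5 100 (1722) The RPC server is unavailable.
QAMERICAS-DC39 17m:55s 0 / 21 0
"""

for row in failing(parse_replsum(SAMPLE)):
    print(row.dc, row.error_code, row.error_text)
```

Grouping the failing rows by error code is a quick way to see whether you are chasing a DNS problem (8524) or a connectivity problem (1722).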
Physical connectivity
Obviously, if there is a network failure, replication isn't going to happen. The first thing to do is to check the general health of the domain using the Repadmin /replsum command just described. You can also ping broken DCs by address and FQDN, and you can run NetDiag and DCDiag commands from the command line (with the /v switch on each). This will give you more details about the errors and perhaps related ones.
The network connecting all the sites should be fully routed. Don't create a site link unless an underlying network path actually connects the sites in that site link.
Logical connectivity
This is a bit more difficult to diagnose. The bottom line is that something in the AD site topology configuration is wrong, leaving a hole in the topology. This could be solved by one of the following actions:
Figure 1
[IMAGE]
Seattle, Denver and Dallas are connected, but nothing connects them with Chicago-Atlanta. We could fix this with something like an Atlanta-Dallas site link, or simply by putting all five sites in a single site link. Typically, this is not a problem because most topologies are some form of hub and spoke. But you could have a situation, as seen in Figure 2, with a couple of hubs that each have remote sites hanging off them, where someone forgot to build a site link between the two hub sites. While it looks fairly simple in these examples, if you have hundreds of sites, it's easy to miss one.
Figure 2
[IMAGE]
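With hundreds of sites, a hole like the ones in Figures 1 and 2 is easier to find with a script than by staring at a diagram. Here is a minimal Python sketch, assuming you have already exported the site names and the member sites of each site link by some other means. The function name site_islands and the three site link names are hypothetical; I made up the links to match the Figure 1 description. The sketch also treats site links as transitive, so it only answers the question "is there any path at all between these sites?"

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set


def site_islands(sites: Iterable[str],
                 site_links: Dict[str, Iterable[str]]) -> List[Set[str]]:
    """Group sites into islands; more than one island means a topology hole."""
    # Every site in a site link can reach every other site in that link.
    neighbors = defaultdict(set)
    for members in site_links.values():
        members = list(members)
        for site in members:
            neighbors[site].update(m for m in members if m != site)

    islands, seen = [], set()
    for start in sites:
        if start in seen:
            continue
        # Breadth-first walk over everything reachable from this site.
        island, queue = {start}, deque([start])
        while queue:
            for nxt in neighbors[queue.popleft()]:
                if nxt not in island:
                    island.add(nxt)
                    queue.append(nxt)
        seen |= island
        islands.append(island)
    return islands


# Hypothetical links matching the Figure 1 description.
sites = ["Seattle", "Denver", "Dallas", "Chicago", "Atlanta"]
links = {
    "Seattle-Denver": ["Seattle", "Denver"],
    "Denver-Dallas": ["Denver", "Dallas"],
    "Chicago-Atlanta": ["Chicago", "Atlanta"],
}
print(site_islands(sites, links))       # two separate islands

# Adding an Atlanta-Dallas site link closes the hole.
fixed = {**links, "Atlanta-Dallas": ["Atlanta", "Dallas"]}
print(len(site_islands(sites, fixed)))  # one island
```

A site that belongs to no site link at all shows up as an island of one, which is worth knowing about too.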
Orphaned objects
In one case, a global catalog (GC) server was demoted, but an impatient administrator wanted to "clean up" Active Directory, so he shortened the tombstoneLifetime value and then forced garbage collection. Unfortunately, he did that before the deletion of the GC server had replicated to all DCs and GCs in the forest. We saw 1311 events along with a host of others stating that Active Directory was trying to replicate an object that had no parent, without identifying the parent. The deletion process had removed the parent object but not the child. We turned on verbose logging and finally identified the GUID of the problem object. Using the LDP.exe tool, we were able to delete that object and stop the 1311 events.
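The lesson from that case is that a deletion has to reach every DC before its tombstone is garbage collected. One rough way to sanity check a shorter tombstone lifetime is to compare it with the "largest delta" values that Repadmin /Replsum reports. The Python sketch below is only an illustration of that comparison, not a Microsoft-documented procedure. The names parse_delta and lagging_dcs are my own, the 7-day value is a hypothetical setting, and the deltas are borrowed from the sample output earlier in this article.

```python
import re
from datetime import timedelta
from typing import Dict, List

# Matches deltas shaped like "10d.10h:16m:46s" or "17m:55s" (days and hours optional).
DELTA_RE = re.compile(r"^(?:(\d+)d\.)?(?:(\d+)h:)?(\d+)m:(\d+)s$")


def parse_delta(text: str) -> timedelta:
    """Convert a 'largest delta' string into a timedelta."""
    m = DELTA_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized delta: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def lagging_dcs(deltas: Dict[str, str], tombstone_days: int) -> List[str]:
    """DCs whose replication lag is at least the proposed tombstone lifetime."""
    limit = timedelta(days=tombstone_days)
    return sorted(dc for dc, d in deltas.items() if parse_delta(d) >= limit)


# Deltas borrowed from the sample output; 7 days is a hypothetical setting.
deltas = {
    "QTEST-DC5": "10d.10h:16m:46s",
    "QEMEA-DC4": "09d.00h:16m:45s",
    "QAMERICAS-DC39": "17m:55s",
}
print(lagging_dcs(deltas, tombstone_days=7))
```

Any DC this returns has not replicated for longer than the proposed lifetime, so forcing garbage collection at that point risks exactly the kind of orphaned object described above.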
DNS errors
Since AD replication relies on DNS name resolution to find DCs to replicate with, broken DNS can cause 1311 events. The helpful thing here is that if DNS is the culprit, the 1311 event will include the phrase "DNS Lookup Failure" in its description. If you see this phrase, then you absolutely, positively have a DNS problem that must be fixed. I've never seen this error turn out to be bogus. Note that a DNS failure will not necessarily log an event in the DNS event log, and that you will see the same phrase in other events as well. Remember that the absence of significant errors in the DNS event log doesn't mean DNS is healthy.
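Because the DNS event log can't be trusted to tell you DNS is healthy, a direct lookup test is a reasonable extra check. This minimal Python sketch asks the local resolver for each DC's FQDN and reports the names that return nothing. The function names and the example.com host names are hypothetical, and the demonstration uses a stand-in resolver so it runs without a network. Run it on a DC that is logging the errors, since it tests only the resolver of the machine it runs on.

```python
import socket
from typing import Callable, Iterable, List


def resolve(fqdn: str) -> List[str]:
    """Return the addresses the local resolver gives for fqdn (empty on failure)."""
    try:
        infos = socket.getaddrinfo(fqdn, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the address string.
    return sorted({info[4][0] for info in infos})


def unresolved(fqdns: Iterable[str],
               resolver: Callable[[str], List[str]] = resolve) -> List[str]:
    """List the names for which the resolver returns no address."""
    return [name for name in fqdns if not resolver(name)]


# Stand-in resolver with hypothetical names, so the demo needs no network.
fake_dns = {"dc1.corp.example.com": ["10.0.0.10"]}
names = ["dc1.corp.example.com", "dc2.corp.example.com"]
print(unresolved(names, resolver=lambda n: fake_dns.get(n, [])))
```

To test real name resolution, call unresolved(names) with your own DC names and leave the resolver argument at its default.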
When debugging 1311 events, start by scoping the entire forest to see which DCs are not replicating. You can do this easily with the Repadmin /Replsum command described in this article. The usual causes are loss of physical connectivity, an incomplete AD site topology or a DNS failure, with an outside chance of an orphaned object. Other events, such as the 1722 (RPC Server Unavailable), will usually accompany the 1311, or the event will contain a descriptive statement such as "DNS Lookup Failure." This is a critical event that must be resolved for Active Directory replication to function properly on all DCs.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers.