When Active Directory replication fails: Debugging Event ID 1311

Perhaps the most infamous Active Directory replication error is Event ID 1311. But how do you resolve it? Expert Gary Olsen breaks down all of the factors you must consider when troubleshooting the error.

Anyone who has been around Active Directory for any length of time in the past seven years is aware that the lion's share of AD problems are to some degree caused by replication failures. One of the most notorious replication errors is Event ID 1311, whose description says:

The Directory Service consistency checker has determined that either (a) there is not enough physical connectivity published via the Active Directory Sites and Services Manager to create a spanning tree connecting all the sites containing the Partition CN= ,DC= ,DC=com, or (b) replication cannot be performed with one or more critical servers in order for changes to propagate across all sites (most often because of the servers being unreachable).

In an attempt to help resolve these errors, Microsoft created KB 307593, How to troubleshoot Event ID 1311 messages on a Windows 2000 domain. While it is a decent article, it is woefully outdated, even though the last review is dated October 2006. It is definitely stuck in the Windows 2000 era, with a lot of space devoted to a previously fixed error in one of the repadmin functions (called "showism").

Here, I'll sift the wheat from the chaff for you and give you a concise list of causes and solutions for this event. In a nutshell, this event is caused by one of the following general reasons:

  • Loss of physical connectivity
            * Network failure between DCs
            * DC offline (powered off, NIC/cable failure, etc.)
  • Loss of logical connectivity
            * Incomplete mesh of sites in site links (hole in the topology)
            * Preferred bridgehead server configured and offline
            * Server overloaded
            * Links disjointed
  • Orphaned objects
  • DNS name resolution failure
  • Another, bigger problem -- Event 1311 can simply be a side effect of that problem and will go away once it is resolved.

The first step in resolving this replication error is to determine the scope of the error. The easiest way to do this is with the Repadmin /Replsum command, as I explained in a previous article, Use command line tools to monitor Active Directory. This will give you a complete summary of all the DCs in the forest, including the relevant error code for any DC that is in an error state. The general form of the command is this:

Repadmin /Replsum /bysrc /bydest /sort:delta

Here is a sample output of this command. Note that there are four domain controllers failing replication. While the 1311 may not show up here, it is common for it to be paired up with the 1722 event (which basically means no physical connectivity). This command will allow us to focus on the four servers logging errors rather than all DCs in the domain or forest.

Replication Summary Start Time: 2005-10-21 00:02:56
Beginning data collection for replication summary, this may take awhile:

Source DC largest delta fails/total %% error

QTEST-DC5 10d.10h:16m:46s 31 / 31 100 (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
QEMEA-DC4 09d.00h:16m:45s 3 / 3 100 (8524) The DSA operation is unable to proceed because of a DNS lookup failure.
BEDROCKDC5 07d.10h:06m:22s 5 / 5 100 (1722) The RPC server is unavailable.
BEDROCKDC4 07d.10h:06m:20s 5 / 5 100 (1722) The RPC server is unavailable.
QAMERICAS-MDC1 07d.06h:17m:40s 22 / 22 100 (1722) The RPC server is unavailable.
KPARKHURST4 03d.11h:13m:49s 12 / 12 100 (1722) The RPC server is unavailable.
QAMERICAS-DC39 17m:55s 0 / 21 0
QTEST-DC9 17m:55s 0 / 25 0
QTEST-DC22 17m:55s 0 / 20 0
QEMEA-MDC1 17m:01s 0 / 47 0
QAMERICAS-DC2 15m:59s 0 / 16 0 
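
In a large forest, this summary can run to hundreds of rows, so it can help to filter it down to the DCs that are actually failing. Here is a minimal Python sketch of that idea. It assumes the output has been saved to a text file and that each row follows the column layout of the sample above (DC name, largest delta, fails / total, percent, optional error code and message); the function name failing_dcs is my own and is not part of any Microsoft tool.

```python
import re

# One summary row, as laid out in the sample above (an assumption):
# DC name, largest delta, "fails / total", percent, optional "(code) message".
ROW = re.compile(
    r"^\s*(?P<dc>\S+)\s+"
    r"(?P<delta>\S+)\s+"
    r"(?P<fails>\d+)\s*/\s*(?P<total>\d+)\s+"
    r"(?P<pct>\d+)"
    r"(?:\s+\((?P<code>\d+)\)\s*(?P<msg>.*))?\s*$"
)


def failing_dcs(replsum_text):
    """Return one dict per summary row that reports at least one failure."""
    rows = []
    for line in replsum_text.splitlines():
        m = ROW.match(line)
        if not m or int(m.group("fails")) == 0:
            continue  # header, banner, blank line or healthy DC
        rows.append({
            "dc": m.group("dc"),
            "delta": m.group("delta"),
            "fails": int(m.group("fails")),
            "total": int(m.group("total")),
            "code": int(m.group("code")) if m.group("code") else None,
            "message": (m.group("msg") or "").strip(),
        })
    return rows
```

Run against the sample above, this should return only the six DCs reporting errors 8524 and 1722, which are the ones to focus on.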

Physical connectivity

Obviously, if there is a network failure, replication isn't going to happen. The first thing to do is to check the general health of the domain using the Repadmin /replsum command just described. You can also ping broken DCs by address and FQDN, and you can run NetDiag and DCDiag commands from the command line (with the /v switch on each). This will give you more details about the errors and perhaps related ones.

The network connecting all the sites should be fully routed. Don't create a site link unless an underlying network path actually connects the sites it contains.

Logical connectivity

This is a bit more difficult to diagnose. It means, bottom line, that something in the AD site topology configuration is wrong, creating a hole in the topology. This could be solved by one of the following actions:

  • Check for a configured preferred bridgehead server -- This is an old Windows 2000 hack that really is not needed anymore. In Windows 2003, all DCs are randomized as bridgehead servers instead of having a single one as required with Windows 2000. Setting this forces one DC to be the bridgehead server -- and if you only set one, and it fails, the Knowledge Consistency Checker (KCC) will not find another. My advice? Just say no on this. If you have any of these hanging around, undo them by unchecking the box in the Sites and Services Snap-in.
  • Make sure all sites are defined in site links -- This might seem obvious, but you'd be surprised at how often this is the problem. In one case, an administrator reported that one region containing several AD sites was not replicating at all. Upon examination, I found that the hub site in that region had no site links defined for any of the sites. I was amazed that they hadn't discovered this sooner, since there was no replication to any other DC at all.
  • Make sure there is a complete mesh of sites in site links -- This is best illustrated graphically. In Figure 1, there are five sites -- Chicago, Atlanta, Dallas, Denver and Seattle. We have the following site links:

    Figure 1

    Seattle, Denver and Dallas are connected, but there is nothing connecting them with Chicago-Atlanta. We could fix this with something like an Atlanta-Dallas site link or simply put them all in a single site link. Typically, this is not a problem because most topologies are some form of hub and spoke. But you could have a situation, as seen in Figure 2, with a couple of hubs with remote sites off of each one, by forgetting to build a site link between the two hub sites. While it looks fairly simple in these examples, if you have hundreds of sites, it's easy to miss one.

    Figure 2
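
With hundreds of sites, a hole like the one in the five-city example is easier to find mechanically than by eye. The following Python sketch groups sites into "islands" that are connected to each other through site links; more than one island means the KCC cannot build a spanning tree. It is only an illustration under stated assumptions: the site and link lists are typed in by hand (in practice you would export them from Sites and Services), every site named in a link also appears in the site list, site links are treated as transitive (as they are when site link bridging is left at its default), and the three links used in the example are my guess at what the missing figure showed.

```python
def site_islands(sites, site_links):
    """Group sites into sets that can reach each other through site links.

    sites      -- iterable of site names
    site_links -- iterable of site links, each a collection of site names
    A healthy topology yields exactly one island.
    """
    parent = {s: s for s in sites}

    def find(s):
        # Walk up to the island's representative, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for link in site_links:
        members = list(link)
        for other in members[1:]:
            # All sites in one site link belong to the same island.
            parent[find(other)] = find(members[0])

    islands = {}
    for s in parent:
        islands.setdefault(find(s), set()).add(s)
    return sorted(islands.values(), key=sorted)


# Hypothetical reconstruction of the five-site example.
sites = ["Chicago", "Atlanta", "Dallas", "Denver", "Seattle"]
links = [("Chicago", "Atlanta"), ("Seattle", "Denver"), ("Denver", "Dallas")]
for island in site_islands(sites, links):
    print(sorted(island))
```

With those links, the sketch reports two islands (Atlanta and Chicago in one, Dallas, Denver and Seattle in the other). Adding an Atlanta-Dallas link to the list collapses them into one.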

Orphaned objects

In one case, a global catalog (GC) server was demoted, but an impatient administrator wanted to "clean up" the Active Directory, so he shortened the tombstoneLifetime value and then forced garbage collection. Unfortunately, he did that before the deletion of the GC server had replicated to all DCs and GCs in the forest. We saw 1311 events along with a host of others stating that Active Directory was trying to replicate an object that had no parent, but the events didn't identify the parent. The deletion process had deleted the parent object but not the child. We turned on verbose logging and finally identified the GUID of the problem object. Using the LDP.exe tool, we were able to delete that object and stop the 1311 events.

DNS errors

Since AD replication relies on DNS name resolution to find DCs to replicate with, if DNS is broken, it could cause the 1311 events to occur. The helpful thing here is that if DNS is the culprit, the 1311 event will have the phrase "DNS Lookup Failure" included in the description. If you see this phrase, then you absolutely, positively have a DNS problem that must be fixed. I've never seen this error turn out to be bogus. Note that this will not necessarily log an event in the DNS log, and you will see it in other events as well. Remember that just because there are no significant errors in the DNS event log, it doesn't mean DNS is healthy.
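
Because a clean DNS event log proves nothing, it is worth testing name resolution directly for the names replication depends on. Each DC registers a GUID-based alias of the form <DSA GUID>._msdcs.<forest root domain>, and replication partners use that alias to locate it. The Python sketch below builds that name and tries to resolve a list of names from the machine it runs on. Treat it as a rough illustration: the function names are mine, the default resolver handles IPv4 only, and the GUID and forest name in the example are made up.

```python
import socket


def guid_cname(dsa_guid, forest_root):
    """Build the GUID-based alias a DC registers under _msdcs."""
    return f"{dsa_guid.strip('{}').lower()}._msdcs.{forest_root.rstrip('.')}"


def check_dc_names(names, resolve=socket.gethostbyname):
    """Map each name to its resolved address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical GUID and forest root, for illustration only.
    alias = guid_cname("{1A2B3C4D-0000-0000-0000-000000000000}", "corp.example.com")
    for name, address in check_dc_names([alias]).items():
        print(name, "->", address or "DNS lookup failed")
```

A name that comes back as a failed lookup here, while the DC itself answers a ping by address, points at DNS registration rather than the network.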

When debugging 1311 events, you should first scope the problem across the entire forest to see which DCs are not replicating. You can do this easily using the Repadmin /Replsum command as described in this article. Note that a loss of physical connectivity, an incomplete AD site topology or a DNS failure usually causes these events, with an outside chance that the cause is an orphaned object. Usually, other errors will accompany them, such as the 1722 (RPC Server Unavailable), or the event will contain a descriptive statement such as "DNS Lookup Failure." This is a critical event that must be resolved in order for Active Directory replication to function properly to all DCs.

Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers.
