In a previous article, Best practices for Active Directory replication topology design, Gary Olsen discussed some basic principles for designing an Active Directory replication topology. This week, he recalls a case study in which a poor design eventually caused serious replication problems and forced the topology to be redesigned and reimplemented.
The company in this week's case complained that after a domain controller failed and became unavailable, the replication didn't flow as the designers had planned. We were able to work with them to figure out what went wrong and how to fix it -- all on the fly.
In the sidebar, you'll find the best practices identified in last week's article. We will use them here to analyze this company's AD replication topology and help resolve the problem.
Identifying where things went wrong
Figure 1 shows a graphical illustration of the replication topology at this corporation.
We examined the topology, and it was apparent that the designers intended to create a multi-tier replication topology, forcing the sites at the lower-bandwidth locations (tier 3) to replicate to regional hubs (tier 2), which, in turn, replicate to two "core" hubs (tier 1). Unfortunately, when they implemented this design, they failed to observe best practice number four, and if they had diagrammed the flow (best practice number five), they most likely would have seen that the design just didn't make sense.
Figure 1 only shows a part of the whole topology. The company in this case collected the Northwest sites -- Calgary, Sacramento, Portland, Boise and Seattle -- and put them into one site link. The company repeated this for other regional groupings of sites -- defining each group into a site link, so there was a Southwest link, Southeast link and a Northeast link as well.
The core link contained the company's two hub sites in Boston (East) and Chicago (West). The problem with this topology is that it violates best practice number three: There was no topology defined to connect the three tiers together. That is, sites could replicate with each other inside the Northwest link, inside the Southwest link, and even inside the core link, but nothing could replicate from, say, Seattle to Los Angeles, because no site link joined those groups. This is a perfect example of what would cause the infamous Event 1311 in the Directory Services event log.
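To make the gap concrete, here is a minimal Python sketch that models site links as plain sets of site names and asks whether two sites can reach each other. It is only a model, not anything that talks to a real directory: the function names are made up for illustration, the Southwest membership is assumed beyond Los Angeles and Las Vegas, and it assumes site links are transitive (bridged), so two links are joined whenever they share a site.

```python
from collections import defaultdict


def connected_groups(site_links):
    """Group sites that can reach each other through shared site links."""
    neighbors = defaultdict(set)
    for members in site_links.values():
        for site in members:
            neighbors[site] |= set(members) - {site}

    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this site.
        group, stack = set(), [start]
        while stack:
            site = stack.pop()
            if site in group:
                continue
            group.add(site)
            stack.extend(neighbors[site] - group)
        seen |= group
        groups.append(group)
    return groups


def can_replicate(site_links, site_a, site_b):
    """True if some chain of site links joins the two sites."""
    return any(site_a in g and site_b in g for g in connected_groups(site_links))


# The original design as described: regional links plus a core link,
# with no site appearing in more than one link.
original_links = {
    "Northwest": {"Calgary", "Sacramento", "Portland", "Boise", "Seattle"},
    "Southwest": {"Los Angeles", "Las Vegas"},  # membership partly assumed
    "Core": {"Boston", "Chicago"},
}

print(can_replicate(original_links, "Seattle", "Portland"))
print(can_replicate(original_links, "Seattle", "Los Angeles"))
```

In this model the three links form three separate islands, which is the condition that leaves Seattle with no path to Los Angeles.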
Fixing the design can be trial and error
To make this work, the designers decided to connect the links by putting regional hub sites that were in a third-tier link (such as the Northwest link) in the second-tier (West) link as well. They also added the appropriate core site to the second-tier site link (see Figure 2). Note that Sacramento, the regional hub, is now part of two links -- the West link and the Northwest link. The first and second tiers are connected by putting the Chicago site in the West link as well as in the core link. That satisfies best practice number three and connects the topology together.
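A short Python sketch of the revised layout shows which sites now hold the tiers together. Again this is only a model built from the description here: the helper name is hypothetical, and the assumption that Los Angeles was added to the West link alongside Sacramento and Chicago is ours, based on its role as the Southwest regional hub.

```python
from collections import defaultdict


def shared_sites(site_links):
    """Return each site that belongs to more than one site link."""
    membership = defaultdict(list)
    for link_name, members in site_links.items():
        for site in members:
            membership[site].append(link_name)
    return {
        site: sorted(links)
        for site, links in membership.items()
        if len(links) > 1
    }


# Revised design: regional hubs also sit in the West link, and
# Chicago sits in both the West link and the core link.
revised_links = {
    "Northwest": {"Calgary", "Sacramento", "Portland", "Boise", "Seattle"},
    "Southwest": {"Los Angeles", "Las Vegas"},          # membership partly assumed
    "West": {"Sacramento", "Los Angeles", "Chicago"},  # Los Angeles assumed
    "Core": {"Boston", "Chicago"},
}

for site, links in sorted(shared_sites(revised_links).items()):
    print(site, "joins", " and ".join(links))
```

Every path between tiers runs through one of these shared sites, so each of them is a place where the topology depends on a single hub staying available.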
Although it still violates best practice number four, it did work. It worked until the DC in the Sacramento site had a disk drive fail, which caused them to rebuild the Sacramento DC. For reasons we still don't fully understand, since the KCC had used Sacramento to connect to the West link, and since we had full bridging turned on, the KCC decided to elect another site to replace Sacramento in the topology. For whatever reason, it picked Portland, and replication worked.
Meanwhile, the Sacramento DC came back online, but the KCC still insisted on using Portland. Portland isn't a tier 2 site and thus doesn't have the bandwidth that Sacramento has, so performance suffered. They wanted to get it back the way it was. After trying unsuccessfully to repair it, they rebuilt the Sacramento DC, and the KCC recreated the routing to funnel through Sacramento. All was well until the DC in Los Angeles went out. Los Angeles was the regional hub for the Southwest link. The same thing happened -- with LA gone, the KCC elected the Las Vegas site as the Southwest link "hub." And when LA came back online, the routing didn't change and performance issues arose. Just as they had for the Northwest link, they rebuilt the Los Angeles DC, and replication was routed successfully through Los Angeles again. In addition, they found some third-tier sites that had connections created to them from all other sites in the AD. This was indeed strange behavior.
Do I have to rebuild the DC every time?
Now they were wondering whether they would have to rebuild a hub-site DC every time one went down. This was not acceptable.
After diagramming the topology (Figure 2), we thought it should work. However, I'd never seen site links created with many sites in a single site link like they had done. I reasoned that we had violated best practice number eight. That is, by putting several sites in a site link, when a failure occurred and the KCC had to rebuild the topology, it picked another site to hook to the next tier. I asked an expert on replication at Microsoft if perhaps it was picking the DC based on the GUID. He was as baffled about this as I was but thought that was a good guess anyway.
The solution, of course, was to obey best practice number four and not allow any site links to have more than two sites. That makes the topology look like Figure 3, with the red and blue lines representing site links and the numbers representing the site link cost. I advised the administrator to delete all site links except the core and to create new site links, each with only two sites, then connect the regional hubs. That is, connect tier 3 hubs to their corresponding tier 2 hubs, etc.
The administrator asked me if that would affect his production AD environment. I told him this action would definitely affect the AD … it would fix the problem! He actually just reconstructed the links in the Northwest site link, creating three new site links, each with Sacramento in it. So that gave him Sacramento-Portland, Sacramento-Calgary and Sacramento-Seattle links. It indeed fixed the problem. They tested it by taking the Sacramento DC out and putting it back in, and it routed replication properly. He then rebuilt the rest of the topology.
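The rebuild of the Northwest link can be sketched in a few lines of Python. This is a model under stated assumptions, not a script for a live directory: both helper functions are hypothetical names, site links are represented as plain sets, and only the three two-site links named above are generated.

```python
def oversized_links(site_links):
    """Names of site links that contain more than two sites."""
    return sorted(name for name, members in site_links.items() if len(members) > 2)


def hub_and_spoke(hub, spokes):
    """Build one two-site link per spoke, each anchored on the hub."""
    return {f"{hub}-{spoke}": {hub, spoke} for spoke in spokes}


# Before: one link holding every Northwest site.
before = {
    "Northwest": {"Calgary", "Sacramento", "Portland", "Boise", "Seattle"},
    "Core": {"Boston", "Chicago"},
}

# After: the multi-site link is replaced by two-site links through Sacramento.
after = {
    "Core": {"Boston", "Chicago"},
    **hub_and_spoke("Sacramento", ["Portland", "Calgary", "Seattle"]),
}

print("Links to split before:", oversized_links(before))
print("Links to split after:", oversized_links(after))
for name in sorted(after):
    print(name, sorted(after[name]))
```

With every regional link reduced to a hub plus one other site, the only site common to those links is the hub the administrator chose, so in this model there is no alternative site left to stand in for it.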
The lessons to be learned here are:
Follow all of the best practice rules. Note that there are a lot of ways to force replication to work in a topology, but if it isn't designed well, it will probably break. Rule #8 is critical. Defining the site links with only two sites in them prevented the KCC from choosing a hub site other than the one the administrator wanted. This has application in every design. Don't let the KCC make any decisions. Force it to funnel replication the way you want it to go.
We were able to fix a bad design literally on the fly without affecting our production AD. Of course we used common sense and rebuilt it over the weekend, but it was not a significant outage and the administrative work was minimal.
AD replication topologies are not set in stone. It is better to fix them and eliminate the problems than to live with the problems.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers.