
How Dynamic Quorum keeps Exchange 2013 clusters running during failure

Introduced in Windows Server 2012, Dynamic Quorum helps keep Exchange clusters up and running during a data center failure. Here's how it works.

Smaller Exchange Server clusters often struggle when it comes time for Windows updates or data center shutdowns.

If you are running a Database Availability Group (DAG) split across two sites and need to shut down the primary data center, such as one that hosts both Mailbox servers and the File Share Witness (FSW), you traditionally need to move the FSW to the secondary site first or risk the DAG losing quorum and shutting down. An unplanned power loss in a data center is more problematic still, because there is no chance to move the FSW beforehand. Dynamic Quorum, however, can ease such situations.

What does Dynamic Quorum do?

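Dynamic Quorum is a new feature in Windows Server 2012 that automatically adjusts your cluster quorum so the cluster can keep running with less than half of its nodes online, as long as the other nodes shut down cleanly. In previous versions of Windows (and Exchange), a majority node set cluster needed more than half of its votes online to keep running: one vote per node, plus one for the FSW when a DAG has an even number of members. Dynamic Quorum relaxes this restriction by removing the vote of each node that leaves the cluster cleanly and recalculating the majority from the votes that remain.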
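The two rules can be contrasted in a toy vote-counting model. This Python sketch is a simplification, not the Failover Clustering algorithm: it assumes one vote per member, models Dynamic Quorum as dropping a member's vote when that member shuts down cleanly while the cluster still has quorum, and uses hypothetical node names N1 to N5.

```python
def has_majority(online_votes, total_votes):
    # Majority rule: strictly more than half of the counted votes must be online.
    return 2 * online_votes > total_votes


class Cluster:
    """Toy vote-counting model: one vote per member, names are hypothetical."""

    def __init__(self, members, dynamic=False):
        self.voters = set(members)   # members whose vote counts toward the total
        self.online = set(members)   # members currently running
        self.dynamic = dynamic       # True models Dynamic Quorum as described above

    def is_up(self):
        return has_majority(len(self.online & self.voters), len(self.voters))

    def shut_down_cleanly(self, member):
        # With Dynamic Quorum, a clean shutdown while the cluster still has
        # quorum takes the member's vote out of the total as it leaves.
        if self.dynamic and self.is_up():
            self.voters.discard(member)
        self.online.discard(member)


static = Cluster(["N1", "N2", "N3", "N4", "N5"])
dynamic = Cluster(["N1", "N2", "N3", "N4", "N5"], dynamic=True)
for node in ["N1", "N2", "N3"]:  # three of five nodes leave, one at a time
    static.shut_down_cleanly(node)
    dynamic.shut_down_cleanly(node)

print(static.is_up(), dynamic.is_up())  # False True
```

With only two of the original five nodes left, the static cluster has lost its majority, while the dynamic one is judged against the two votes that are still counted.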

To illustrate this, we'll use a simple environment with a two-node Exchange DAG spread across two data centers (Figure 1). The first data center is the primary location for Exchange and hosts the FSW along with the first Exchange node. The second data center hosts the second Exchange node.

In this type of environment, we're not necessarily looking at a disaster recovery site, but at two well-connected data centers that may be close together and can act as part of the same site. We'll also assume a pair of hardware load balancers provides automatic failover between the data centers, and that resilient Internet connectivity and communications links to the rest of the organization are in place. Together, these ensure Exchange Server has everything else it needs to stay online in the event of a data center failure.

Figure 1: Two-node Exchange DAG with FSW.

DAG resilience without Dynamic Quorum

Let's look at how a typical DAG for Exchange 2010 running on Windows Server 2008 R2, without Dynamic Quorum, behaves during a node failure. In the first example, we lose the second node or the entire second data center (Figure 2). A DAG like this is designed to survive a single-node failure: the first node and the FSW still hold two of the three quorum votes, so the cluster and the DAG stay online.

Figure 2: Loss of the second node or data center.

In the second scenario, the cluster loses quorum and the DAG shuts down (Figure 3). If we lose the first node and the FSW, only one of the three votes remains, so the DAG goes offline, even if both shut down cleanly.

Figure 3: Loss of the first node and FSW.
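The vote arithmetic behind Figures 2 and 3 can be worked through directly. This assumes the usual configuration for a DAG with an even number of members, the Node and File Share Majority model, with one vote each for the two Exchange nodes (called EX1 and EX2 here purely for illustration) and one for the FSW. The cluster keeps quorum only while strictly more than half of the V counted votes are online.

```latex
\begin{aligned}
V &= 3 \quad (\text{EX1},\ \text{EX2},\ \text{FSW}) \\
\text{Figure 2: } v_{\text{online}} &= 2 \ (\text{EX1},\ \text{FSW}),\quad 2 > \tfrac{3}{2} \Rightarrow \text{quorum, DAG stays online} \\
\text{Figure 3: } v_{\text{online}} &= 1 \ (\text{EX2}),\quad 1 < \tfrac{3}{2} \Rightarrow \text{no quorum, DAG goes offline}
\end{aligned}
```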

In a planned data center shutdown, where power to the data center and its hosts is typically cut in an orderly way, we have the opportunity to move the FSW to a host in the secondary data center. This allows for maintenance, but it does not help in the inevitable event where an air conditioning unit fails one weekend, servers overheat and begin to shut down, email stops working, and somebody has to get everything up and running again.
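Moving the witness changes which site holds the majority. In Exchange, the witness is set with the `Set-DatabaseAvailabilityGroup` cmdlet's `WitnessServer` and `WitnessDirectory` parameters; the Python below does not touch Exchange at all. It only sketches the static-majority vote arithmetic, with hypothetical server and site names (EX1, EX2, DC1, DC2).

```python
def surviving_votes(voters, site_of, failed_site):
    # Count the votes that stay online when an entire site goes dark.
    return sum(1 for v in voters if site_of[v] != failed_site)


def has_majority(online_votes, total_votes):
    # Static majority rule: strictly more than half of all votes.
    return 2 * online_votes > total_votes


voters = ["EX1", "EX2", "FSW"]  # hypothetical: two DAG members plus the witness

before = {"EX1": "DC1", "EX2": "DC2", "FSW": "DC1"}  # witness in the primary site
after = {**before, "FSW": "DC2"}                     # witness moved ahead of maintenance

for label, placement in (("FSW in DC1", before), ("FSW in DC2", after)):
    online = surviving_votes(voters, placement, failed_site="DC1")
    print(label, "->", "online" if has_majority(online, len(voters)) else "offline")
# FSW in DC1 -> offline
# FSW in DC2 -> online
```

The relocation only helps if it happens before DC1 goes down, which is why it works for planned maintenance but not for a sudden failure.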

Dynamic Quorum with Windows Server 2012 and Exchange 2013 protects not only against the scenario above, but also against scenarios where most of the cluster goes offline. Figure 4 shows the same failure as Figure 3: in the primary site, we've lost both an Exchange node and the FSW.

In our example, Dynamic Quorum lets the Exchange DAG remain online through a failure of the primary data center. When the circumstances are right (more on that in a moment), Exchange can keep running through a power failure in your primary data center. This works even for smaller environments, without the need to place the FSW in a third site.

Figure 4: Loss of the first node and FSW with Dynamic Quorum.
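Assuming one vote each for the two Exchange nodes (hypothetically EX1 and EX2) and the FSW, and that the first node and then the FSW go offline cleanly one after the other, a simplified view of Dynamic Quorum removes each departing vote from the total V before the majority is checked again. This is a sketch of the idea rather than the exact Failover Clustering algorithm, which has its own rules for the witness vote.

```latex
\begin{aligned}
\text{Start: } & V = 3,\ v_{\text{online}} = 3 \ (\text{EX1},\ \text{EX2},\ \text{FSW}) \\
\text{EX1 shuts down cleanly: } & V = 2,\ v_{\text{online}} = 2,\ 2 > \tfrac{2}{2} \Rightarrow \text{quorum} \\
\text{FSW goes offline cleanly: } & V = 1,\ v_{\text{online}} = 1,\ 1 > \tfrac{1}{2} \Rightarrow \text{quorum, DAG stays online} \\
\text{Without Dynamic Quorum: } & V = 3,\ v_{\text{online}} = 1,\ 1 < \tfrac{3}{2} \Rightarrow \text{no quorum}
\end{aligned}
```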

The key caveat is that the failing nodes must shut down cleanly. In the previous example, where the first data center failed, we relied on a mechanism to coordinate an orderly shutdown of the data center. This doesn't need to be complicated, and a well-designed data center will often have one built in.

Dynamic Quorum also protects a three-node Exchange DAG in a similar configuration, with two Exchange nodes in the first data center and a single node in the second. As the two nodes in the first data center shut down cleanly, Dynamic Quorum ensures the remaining node keeps the DAG online.
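A three-node DAG has an odd number of members, so, assuming Exchange's usual Node Majority model, the FSW carries no vote and V = 3. Using a simplified view of Dynamic Quorum in which each clean departure removes that node's vote from V before the majority is rechecked, and hypothetical names EX1 and EX2 for the first data center's nodes and EX3 for the second, the arithmetic runs:

```latex
\begin{aligned}
\text{Start: } & V = 3,\ v_{\text{online}} = 3 \ (\text{EX1},\ \text{EX2},\ \text{EX3}) \\
\text{EX1 shuts down cleanly: } & V = 2,\ v_{\text{online}} = 2,\ 2 > \tfrac{2}{2} \Rightarrow \text{quorum} \\
\text{EX2 shuts down cleanly: } & V = 1,\ v_{\text{online}} = 1,\ 1 > \tfrac{1}{2} \Rightarrow \text{quorum, EX3 keeps the DAG online} \\
\text{Without Dynamic Quorum: } & V = 3,\ v_{\text{online}} = 1,\ 1 < \tfrac{3}{2} \Rightarrow \text{no quorum}
\end{aligned}
```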

Dynamic Quorum can't protect against everything

Dynamic Quorum isn't a silver bullet for every data center failover scenario, so you still need to consider availability technologies for major disasters. Options include Datacenter Activation Coordination (DAC) mode, multiple DAGs for active-active sites and approaches newly supported in Exchange 2013, such as placing the FSW in a third site.

Dynamic Quorum won't protect against a simultaneous failure of mains power and the backing uninterruptible power supply (UPS) at the same site, because the servers go down at once rather than shutting down cleanly. Nor will it protect against a failure of the link between the two data centers. The cluster service must be shut down cleanly for the cluster and the DAG to remain online.
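The difference between an orderly shutdown and a sudden dual power failure can be sketched in Python. The model is a simplification, not the Failover Clustering implementation: a member that shuts down cleanly has its vote removed as it leaves, while members that crash together still count toward the total when quorum is checked, and their votes are dropped only if the cluster survives. Server names are hypothetical.

```python
def has_majority(online_votes, total_votes):
    # Quorum needs strictly more than half of the votes that currently count.
    return 2 * online_votes > total_votes


def dag_stays_online(members, events):
    """Replay outage events against a simplified Dynamic Quorum model.

    Each event is ("clean", {member}) for a member that shuts down cleanly,
    or ("crash", {...}) for members lost together without warning.
    """
    voters, online = set(members), set(members)
    for kind, lost in events:
        if kind == "clean":
            voters -= lost  # clean exit: the vote is removed as the member leaves
        online -= lost
        if not has_majority(len(online & voters), len(voters)):
            return False  # quorum lost: the cluster and the DAG stop
        voters -= lost  # in this model, crashed votes are dropped only after surviving
    return True


members = {"EX1", "EX2", "FSW"}  # hypothetical two-node DAG plus witness
planned = [("clean", {"EX1"}), ("clean", {"FSW"})]
power_and_ups_failure = [("crash", {"EX1", "FSW"})]

print(dag_stays_online(members, planned))                # True
print(dag_stays_online(members, power_and_ups_failure))  # False
```

Both runs lose the same two members; only the orderly, one-at-a-time shutdown gives the cluster a chance to shrink its vote total before the majority is checked.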

Dynamic Quorum is a fantastic new feature in Windows Server 2012 that can benefit Exchange 2013. Although it doesn't protect against all failures, it's an important part of your tool set when planning Exchange 2013 deployments, and can protect against a common cause of DAG failures.

About the author:
Steve Goodman is an Exchange MVP and works as a technical architect for one of the UK's leading Microsoft Gold partners, Phoenix IT Group. Goodman has worked in the IT industry for 14 years and has worked extensively with Microsoft Exchange since version 5.5.
