Building a simple failover cluster with Windows Server 2008 isn't for the faint of heart, even considering the new validation features Microsoft has built right into its management interface. Creating a multi-site, geographically dispersed cluster, often called a GeoCluster, involves a whole new level of challenge.
Adding clustering to a critical service significantly reduces the loss of service associated with the failure of an individual server. It does not, however, protect the environment from the loss of an entire site. The reason lies in the length of the cables that connect each cluster node to its shared storage. Traditional forms of Windows failover clustering require each node to connect to shared storage in a single location. Lose the location, and you've lost the cluster.
If you want disaster recovery for your clustered services, your solution will be to "stretch" them across two different sites. Doing this involves, at a minimum, creating two separate but replicated shared storage locations, one at each site. The biggest problem here is that Microsoft does not provide the mechanism to do this replication right out of the box. Doing so requires third-party support; Microsoft publishes a list of supported vendors.
Depending on the characteristics of your connection between sites, there are options for how replication works. Storage replication between nodes of a multi-site cluster can leverage block-level replication using storage hardware or file system-level replication using software. The replication can occur either synchronously, where each change must be acknowledged by the remote site before the next change proceeds, or asynchronously, which speeds data transfer but adds the risk of data loss in the case of a failure. Each third-party replication tool vendor on Microsoft's website (Double-Take Software, Neverfail and SteelEye Technology) offers a different mechanism to accomplish this data replication, and Microsoft discusses them all.
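The trade-off between the two replication modes can be illustrated with a toy model. This is a minimal sketch and not any vendor's implementation; the ReplicatedVolume class and its methods are hypothetical names invented for illustration. It shows only that asynchronous replication leaves a window of locally acknowledged writes that have not yet reached the remote site.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume replicating writes to a remote site."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []          # writes committed at the primary site
        self.replica = []          # writes committed at the remote site
        self.in_flight = deque()   # acknowledged locally, not yet replicated

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the write is not acknowledged until the replica has it.
            self.replica.append(block)
        else:
            # Asynchronous: acknowledge immediately and ship the block later.
            self.in_flight.append(block)

    def drain(self, count):
        """Ship up to `count` queued blocks to the replica (async mode only)."""
        for _ in range(min(count, len(self.in_flight))):
            self.replica.append(self.in_flight.popleft())

    def lost_on_site_failure(self):
        """Blocks that exist only at the primary if that site fails right now."""
        return list(self.in_flight)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["a", "b", "c"]:
    sync_vol.write(block)
    async_vol.write(block)
async_vol.drain(1)  # only one block crosses the WAN before the failure

print(sync_vol.lost_on_site_failure())   # []
print(async_vol.lost_on_site_failure())  # ['b', 'c']
```

In a real deployment the synchronous path pays for its safety with a WAN round trip on every write, which is why link latency tends to drive the choice.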
Dealing with the vagaries of storage replication is only your first architectural decision. Connecting clusters across long distances requires a hard look at the network as well. In this space, Microsoft has improved the process somewhat through the addition of two new capabilities:
1. Elimination of broadcast for cluster heartbeat communications. In previous versions of Windows clustering, cluster nodes kept in communication with each other using broadcast communication. This type of communication required nodes to be connected with crossover cables or to sit within the same broadcast domain in the network infrastructure. Spanning that broadcast domain across long distances, typically done using a VLAN, was exceptionally challenging considering the configuration of most corporate networks. So with Windows Server 2008, Microsoft reconfigured this communication to operate using unicast UDP rather than broadcasts. The result is that cluster heartbeat communication can now span network routers.
Even with this change, if you plan to stretch your cluster across your intranet, implementing it will likely require some tailoring of tolerance for cluster heartbeats. Using the Cluster.exe command-line tool, you may need to adjust the communication thresholds between nodes. Your mileage will vary depending on the bandwidth and latency characteristics of your network, so testing is critical. Look for and tweak the cluster properties SameSubnetDelay, CrossSubnetDelay, SameSubnetThreshold and CrossSubnetThreshold to get the results you need, but be aware that any adjustments here can impact the amount of time needed to complete a failover.
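To see why the heartbeat settings affect failover time, it helps to treat the delay as the interval between heartbeats in milliseconds and the threshold as the number of missed heartbeats tolerated before a node is considered down. The sketch below rests on that reading and is only an approximation. The cluster name GEOCLUSTER1 and the values are hypothetical, and the command-line form it prints is an assumption that should be checked against the help output of Cluster.exe on your own system.

```python
def detection_time_ms(delay_ms, threshold):
    """Approximate time before a silent node is declared down."""
    return delay_ms * threshold


def cluster_prop_command(cluster_name, prop, value):
    """Build a Cluster.exe command line that sets one cluster property.

    The /cluster: and /prop switches are assumed syntax; verify before use.
    """
    return f"cluster /cluster:{cluster_name} /prop {prop}={value}"


# Hypothetical values for a high-latency WAN link; not recommendations.
settings = {"CrossSubnetDelay": 2000, "CrossSubnetThreshold": 10}

print(detection_time_ms(settings["CrossSubnetDelay"],
                        settings["CrossSubnetThreshold"]))  # 20000
for prop, value in settings.items():
    print(cluster_prop_command("GEOCLUSTER1", prop, value))
```

Loosening either value makes the cluster more tolerant of a slow link, but it also lengthens the time a genuinely failed node goes unnoticed.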
2. Addition of the Node and File Share Majority quorum model. Using the Node Majority model adds certain complexities when multiple nodes are present at each site. If one site has two nodes and another hosts three, any loss of connectivity between sites can result in the site with fewer nodes losing its quorum entirely.
To resolve this problem, Microsoft created the Node and File Share Majority quorum model. With this model, quorum no longer depends on the count of nodes alone; a share on a separate file server acts as a witness that contributes an additional vote. This file share can be hosted at the same site as one of the cluster nodes or in a separate site all its own. Even better, that file share is all that's required; no shared quorum drive is needed with this model. When you use this configuration for a stretched two-node cluster, the loss of the WAN connection between the nodes will not by itself result in a loss of quorum, because a node that can still reach the file share retains a majority.
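The vote arithmetic behind both quorum scenarios can be sketched in a few lines. This is a simplified model that assumes one vote per node and one vote for the file share; has_quorum is a hypothetical helper, not part of any Microsoft tool.

```python
def has_quorum(reachable_votes, total_votes):
    """A partition keeps quorum only with a strict majority of all votes."""
    return reachable_votes > total_votes // 2


# Node Majority: two nodes at site A, three at site B, WAN link down.
print(has_quorum(2, 5))  # False: site A loses quorum
print(has_quorum(3, 5))  # True: site B keeps running

# Two-node stretched cluster plus a file share witness (3 votes in total).
# The node that can still reach the witness holds 2 of 3 votes.
print(has_quorum(1 + 1, 3))  # True
print(has_quorum(1, 3))      # False: the isolated node stands down
```

Without the witness, a two-node cluster split by a WAN outage would leave each node with one of two votes, and neither side would hold a majority.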
No matter which quorum model you choose for your cross-site cluster, be careful how you set up dependencies between cluster resources. Because the cluster's IP addresses are likely to differ from site to site, you will need to modify your resource dependencies so that they account for both sets of addresses.
Last are the problems facing clients that attempt to reach cluster resources. When services are hosted atop a cluster, the cluster itself assumes the naming and addressing required to get to those services over the network. When a client attempts to reach a service, it connects to a clustered network name and IP address. Because that name and IP can be logically located on any of the cluster nodes at any point, the cluster works with DNS to ensure that records are always pointing to the right location.
When considering multi-site clusters, you must also be aware of the impact of DNS resolution at the client. When a cluster resource fails over to an alternate node in a new physical location, DNS must be updated with that new location. If DNS does not quickly replicate the change between the two sites and out to all clients, clients will take longer to "find" the relocated service.
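A rough way to reason about that delay is to add the time DNS takes to replicate the updated record to the time a client may keep using a cached copy of the old one. The sketch below is a back-of-the-envelope estimate under that assumption. The function name and the sample figures are hypothetical, and it ignores failover time and client retry behavior.

```python
def worst_case_client_delay_s(dns_replication_s, record_ttl_s):
    """Upper bound on how long a client may keep resolving the old address.

    Assumes the client cached the old record just before the failover and
    that its DNS server receives the update only after full replication.
    """
    return dns_replication_s + record_ttl_s


# Hypothetical figures: 15 minutes of DNS replication, 20-minute record TTL.
delay = worst_case_client_delay_s(dns_replication_s=15 * 60,
                                  record_ttl_s=20 * 60)
print(delay / 60)  # 35.0 minutes
```

Either term can dominate, so both the replication schedule between sites and the TTL on the cluster's host record are worth reviewing during testing.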
There are surprisingly few resources available on the Internet today that provide specific step-by-step instructions for setting up GeoClusters. If you plan to implement one, you'll have to dig deep to find the information you need. The best place to start may actually not be on Microsoft's website, but instead at the sites of third-party vendors that build the necessary replication software.
ABOUT THE AUTHOR
Greg Shields, MVP, is an independent author and consultant based in Denver with many years of IT architecture and enterprise administration experience. He is an IT trainer and speaker on such IT topics as Microsoft administration, systems management and monitoring, and virtualization. His recent book Windows Server 2008: What's New/What's Changed is available from Sapien Press.
This was first published in October 2008