When (and when not) to use Windows server failover clustering

Failover clustering offers several benefits to those running Windows Server 2008, including simpler patching processes and increased business continuity. Still, IT pros should also be aware of the potential disadvantages of clusters before implementing them in their environments.

Confused about the benefits of adding Windows Server 2008 failover clustering to your environment? You're not alone.

There's a lot of confusion in IT today about when and where clustering fits as a solution for improving service reliability. Server clusters are implemented all the time in IT organizations, but sometimes they're not added to the environment for the right reasons.

First and foremost, adding Microsoft clustering to an existing service can significantly increase the cost of supporting that solution. This is obvious when you consider factors such as clustering's shared storage requirements, added cabling and networks, and more expensive editions of Microsoft Windows. Above all, the extra switches, dials and knobs that clustering adds to managing a hosted service create a more complex environment.

Yet there are still some specific situations in which clustering can improve your uptime. In making any clustering decision, consider what the right reasons are to take advantage of its improved availability features. Remember, the benefits won't necessarily outweigh the added complexity.

Reason #1: Clustering reduces the impact of hardware outages

When a server motherboard fails, that server usually goes down hard. Such hardware failures often result in a long outage of the hosted service because of the delay in acquiring and installing a replacement part. If maintenance agreements are in place with traditional server-class vendors, it could mean a half to full day of downtime. If no agreements are in place, that time could be significantly longer. For highly critical services, long delays like these are unacceptable.

Failover clustering provides a location for a service to automatically re-host itself when a failure occurs, which takes away the urgency of obtaining and installing the replacement part. A clustered service incurs an outage of only a few seconds or minutes rather than hours or days.
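To put rough numbers on that difference, here is a minimal Python sketch of the arithmetic. The failure rate (one hardware failure per year), the 12-hour wait for a part and the two-minute failover are assumptions chosen for illustration, not measurements.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability(outage_hours_per_year: float) -> float:
    """Return the fraction of the year the service is up."""
    return 1 - outage_hours_per_year / HOURS_PER_YEAR


# Assumption for illustration: one hardware failure per year.
standalone_outage_h = 12.0    # somewhere in the "half to full day" range
clustered_outage_h = 2 / 60   # a failover lasting about two minutes

standalone = availability(standalone_outage_h)
clustered = availability(clustered_outage_h)

print(f"standalone server: {standalone:.4%} available")
print(f"clustered service: {clustered:.4%} available")
```

Under these assumptions the standalone server lands a little below 99.9% availability for the year, while the clustered service stays above 99.999%. Plug in your own failure rate and repair time before drawing conclusions for your environment.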

Still, there's reason for caution when implementing server clustering for this purpose alone. These days, server-class hardware comes equipped with multiple levels of redundancy. Hard drives are RAIDed and network cards are teamed; some servers even incorporate redundancy within their internal components. All of these features reduce the likelihood that a component failure will lead to the catastrophic loss of an entire system, which means you may already have the redundancy you need built into your server hardware.

Reason #2: Clustering takes the pain out of software problems

Using Microsoft Windows to host a service involves more than just processing the needs of the service. Windows alone has all kinds of moving parts, and most environments add more software to servers for things like backups, systems management, monitoring and remote control. All of these software packages at some point can conflict in a way that causes the server to stop processing your critical service.

When this occurs, server clustering can relocate the service to another node where the problem does not exist. Relocation gives the administrator precious time to fix that software conflict without the added strain of a critical service failure. The result is better fixes and fewer "band-aids."

And yet this reason only holds true for situations where "other" software is causing the problem. In situations where your critical service is the problem, clustering's added machinery can in some cases make the troubleshooting and resolution process more difficult.

Reason #3: Clustering makes OS patching less painful

Every month Microsoft releases yet another round of patches for its products. Ranging from low priority to exceptionally critical, these patches need to be installed on host machines as soon as operationally possible. The problem with patches is that many require a reboot of the system to be fully installed. That reboot impacts the uptime of the hosted service.

Adding clustering to the mix enables an IT environment to relocate the service to another cluster node prior to patching, allowing the patch install and subsequent reboot to occur without affecting the service. Once you've finished with the first node, you can relocate the service back to it and patch the remaining nodes without impact.
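The ordering of that procedure can be sketched as a small Python model. Everything here is hypothetical: the `Cluster` class, its methods and the node names are stand-ins for illustration, not a real clustering API, and the model only tracks which node hosts the service while each node is patched.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a failover cluster hosting a single service."""
    nodes: list[str]
    owner: str  # node currently hosting the service
    patched: set[str] = field(default_factory=set)
    log: list[str] = field(default_factory=list)

    def move_service(self, target: str) -> None:
        self.log.append(f"move service {self.owner} -> {target}")
        self.owner = target

    def patch_and_reboot(self, node: str) -> None:
        # Rebooting the node that hosts the service would cause
        # exactly the outage the procedure is meant to avoid.
        if node == self.owner:
            raise RuntimeError(f"{node} still hosts the service")
        self.log.append(f"patch and reboot {node}")
        self.patched.add(node)


def rolling_patch(cluster: Cluster) -> None:
    """Patch every node, moving the service off each one first."""
    for node in cluster.nodes:
        if node == cluster.owner:
            # Any other node will do as a temporary host in this model.
            target = next(n for n in cluster.nodes if n != node)
            cluster.move_service(target)
        cluster.patch_and_reboot(node)


cluster = Cluster(nodes=["NODE1", "NODE2"], owner="NODE1")
rolling_patch(cluster)
print("\n".join(cluster.log))
```

In a two-node cluster the service moves twice, once off each node before it is patched, so those moves are the only points where the service is interrupted at all. In practice you would also confirm that the service is healthy on its new host before rebooting the node it just left.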

However, once again this reason may not be enough. One of Microsoft's improvements with the release of Windows Server 2008 is a reconfiguration of patches themselves; fewer of them actually require a reboot to complete. Also, at times the hosted service itself requires patching, and patching a hosted service often requires a reboot, which means downtime anyway. Your mileage will vary.

Reason #4: Clustering can be a form of disaster recovery

Using traditional failover clustering, cluster nodes must be directly attached to some form of shared storage. This storage is used for quorum information as well as the storage of data that is processed by nodes of the cluster. As such, the physical positioning of each cluster node is limited by the length of the cabling that separates the node from its storage.

Traditional clusters require this direct connection to centralized shared storage for all cluster hosts, which means a disaster that impacts one node is likely to impact the others. As an alternative, Windows Server 2008 includes enhanced support for geographically dispersed clusters, also called stretch clusters or geo-clusters.

These special clusters enable the "stretching" of cluster nodes across great distances. However, they also involve extra cost in network connectivity, added storage and usually third-party data replication between sites. In addition, they can add a significant level of complexity to existing services, which means they're best reserved for only the most critical of services.

So the moral of today's story is to be conscious of both pros and cons when considering whether to add failover clustering to an existing Windows service. While failover clustering is often (and incorrectly) assigned "magic bullet" status for preventing large swaths of possible outages, its design is tailored to protect against only a specific few.

Greg Shields, MVP, is an independent author and consultant based in Denver with many years of IT architecture and enterprise administration experience. He is an IT trainer and speaker on such IT topics as Microsoft administration, systems management and monitoring, and virtualization. His recent book Windows Server 2008: What's New/What's Changed is available from Sapien Press.