But why is the communication between cluster nodes so important? After all, the nodes do not share a central data store, and each node performs its own independent calculations to determine which server will handle an inbound request.
Even so, cluster nodes do communicate with each other. The algorithm used to determine which cluster node will handle the next inbound request is based on the number of nodes in the cluster. As such, the nodes must have a mechanism for communicating the cluster size. As nodes are added to or removed from the cluster, the algorithm must change to reflect the new number of nodes. So it is important to communicate these changes to all nodes in the cluster.
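To make the dependence on cluster size concrete, here is a minimal sketch in Python. It is not the actual Network Load Balancing algorithm, which this article does not spell out. It assumes a simple hash-and-modulo scheme, and the names `owner_index` and `node_accepts` are hypothetical. Each node runs the same calculation on its own and accepts a request only when the result matches its own position in the cluster.

```python
import hashlib


def owner_index(client_ip: str, node_count: int) -> int:
    """Map a client address to a node position in [0, node_count).

    Illustrative assumption: a hash of the client IP modulo the cluster size.
    """
    if node_count < 1:
        raise ValueError("cluster must contain at least one node")
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def node_accepts(my_index: int, client_ip: str, node_count: int) -> bool:
    """Each node evaluates this independently; exactly one node returns True."""
    return owner_index(client_ip, node_count) == my_index
```

Because the result depends on `node_count`, every node must agree on the cluster size. If one node believed there were three nodes while another believed there were four, a request could be accepted by two nodes or by none.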
The communications that occur between cluster nodes take the form of heartbeat messages. Each cluster node transmits a heartbeat message to the other nodes once per second. If a cluster node were to fail, it would obviously stop transmitting heartbeat messages to the other nodes. The remaining nodes would then detect the absence of a heartbeat from the failed node and take the appropriate action.
The heartbeats have a minimal effect on network performance. After all, cluster nodes with two NICs use one of those NICs solely for heartbeat messages, so heartbeat traffic is kept apart from inbound client requests.
Even if the cluster nodes have only a single NIC, heartbeats should not greatly impact performance. Typical heartbeat messages are smaller than 1,500 bytes, small enough to fit into a single Ethernet frame. At one message per node per second, this does not usually translate into enough traffic to cause any noticeable performance degradation. Even so, it's still better to install two NICs in each cluster node and use a dedicated network segment for heartbeat traffic.
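A rough calculation shows why the load is small. The sketch below takes the article's figures (messages of at most 1,500 bytes, one per node per second) and assumes that all heartbeats share one segment. The function name and the 100 Mbps link speed are assumptions chosen only for illustration.

```python
def heartbeat_bits_per_second(node_count: int,
                              message_bytes: int = 1500,
                              interval_seconds: float = 1.0) -> float:
    """Upper bound on heartbeat traffic placed on a shared segment.

    Assumes every node sends one message of at most `message_bytes`
    per interval, and that all of it crosses the same segment.
    """
    return node_count * message_bytes * 8 / interval_seconds


# Hypothetical example: an eight-node cluster on a 100 Mbps segment.
LINK_BITS_PER_SECOND = 100_000_000  # assumed link speed
load = heartbeat_bits_per_second(8)
share = load / LINK_BITS_PER_SECOND
print(f"{load:.0f} bits/s, {share:.3%} of the link")
```

Under these assumptions, eight nodes generate at most 96,000 bits per second, which is about one tenth of one percent of a 100 Mbps link.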
A missed heartbeat does not necessarily mean that a node has failed. Even under perfectly normal conditions, a cluster node could skip a heartbeat or two. For example, if a server is too busy to process the instruction that causes a heartbeat to be generated, the heartbeat might be skipped. Since occasionally skipping a heartbeat is normal, cluster nodes do not take action based on the absence of a single heartbeat message. A cluster node is not considered to have failed until five consecutive heartbeat messages have been missed.
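The five-in-a-row rule can be sketched as a per-peer counter that resets whenever a heartbeat arrives. This is an illustrative model, not the product's implementation. The class name `HeartbeatMonitor` and its methods are hypothetical, and the code assumes it is called once per one-second heartbeat interval.

```python
class HeartbeatMonitor:
    """Tracks consecutive missed heartbeats for each peer node."""

    MISSED_LIMIT = 5  # consecutive misses before a peer is treated as failed

    def __init__(self, peers):
        self.missed = {peer: 0 for peer in peers}

    def record_interval(self, heard_from):
        """Process one heartbeat interval.

        `heard_from` is the set of peers whose heartbeat arrived during the
        interval. Returns the peers that reached the miss limit just now.
        """
        newly_failed = []
        for peer in self.missed:
            if peer in heard_from:
                self.missed[peer] = 0  # any heartbeat resets the count
            else:
                self.missed[peer] += 1
                if self.missed[peer] == self.MISSED_LIMIT:
                    newly_failed.append(peer)
        return newly_failed
```

A peer that misses four heartbeats and then sends one starts again from zero, so a briefly overloaded server is not mistaken for a failed one.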
When a cluster node skips five consecutive heartbeats, the other nodes in the cluster initiate a process called convergence. Convergence involves recalculating the traffic distribution algorithm so that traffic will no longer be distributed to the failed node. When the failed server is brought back online (or any time that additional servers are added to the cluster), the convergence process happens again. But in this case, the algorithm is adjusted to reflect an increase in the number of nodes in the cluster.
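Convergence can be pictured as every surviving node updating its view of the membership and then distributing traffic over the new node count. The sketch below is a simplified model under stated assumptions: `ClusterView`, `converge`, and `owner` are hypothetical names, and the hash-and-modulo distribution stands in for the real algorithm, which the article does not describe.

```python
import hashlib


class ClusterView:
    """One node's view of cluster membership (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes)

    def converge(self, failed=(), joined=()):
        """Recompute membership after nodes fail or come online."""
        self.nodes = sorted((set(self.nodes) - set(failed)) | set(joined))

    def owner(self, client_ip: str) -> str:
        """Pick the node that handles this client, based on current size."""
        if not self.nodes:
            raise RuntimeError("no nodes available")
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]


view = ClusterView(["node-a", "node-b", "node-c"])
view.converge(failed=["node-b"])   # node-b missed five heartbeats
view.converge(joined=["node-b"])   # node-b is brought back online
```

After the first call, no request can map to `node-b`. After the second, the distribution again covers three nodes.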
Without communications between cluster nodes, it would be impossible to adjust the traffic distribution algorithm based on the cluster size. In the next article, you will learn how to use information derived from heartbeat messages to monitor the status of a Network Load Balancing cluster.
About the author: Brien M. Posey, MCSE, is a Microsoft Most Valuable Professional for his work with Windows 2000 Server, Exchange Server and IIS. He has served as CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. He writes regularly for SearchWinComputing.com and other TechTarget sites.
This was first published in December 2006