Although cluster continuous replication (CCR) tends to be very reliable, the replication process can occasionally break for reasons that aren’t immediately obvious. Several of Microsoft’s Exchange Management Shell cmdlets are useful tools for troubleshooting CCR failures.
The most useful cmdlet for troubleshooting CCR-related replication failures is Get-StorageGroupCopyStatus. It reports on the health of the storage group, along with the current copy queue length, the replay queue length and the last inspected log time (Figure 1).
Figure 1 shows the output of the Get-StorageGroupCopyStatus cmdlet when it runs on a node within a healthy cluster. The most common problem I’ve seen when using this command is a LastInspectedLogTime field showing that log file inspections aren’t current. A few things can cause this:
The passive node isn’t running the Microsoft Exchange Replication service -- This is by far the most common cause for outdated log inspection times.
The mailbox database is not mounted -- You will probably already know whether or not this is the case.
Slow mail flow -- In my organization, I have only a couple of production mailboxes on my clustered mailbox server. Because spam is filtered before it reaches the mailbox server, there are times -- especially on weekends -- when the date stamp in the LastInspectedLogTime field is old. This doesn’t mean there is a problem; an out-of-date time stamp can simply mean that there wasn’t enough mail flow to trigger the log shipping process.
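The three causes above amount to a short checklist, worked through from most to least likely. The Python sketch below is only an illustration of that reasoning: the function name, its parameters and the 24-hour threshold are assumptions made for this example, not part of Exchange, and the inputs would have to be gathered by hand or from the cmdlet output.

```python
from datetime import datetime, timedelta


def diagnose_stale_inspection(last_inspected, now,
                              replication_service_running,
                              database_mounted,
                              max_age=timedelta(hours=24)):
    """Suggest a likely reason for an old LastInspectedLogTime value.

    All names and the max_age default are hypothetical; the order of the
    checks simply mirrors the order of the causes listed in the article.
    """
    if now - last_inspected <= max_age:
        return "current"
    # Most common cause: the replication service on the passive node.
    if not replication_service_running:
        return "replication service not running on passive node"
    if not database_mounted:
        return "mailbox database not mounted"
    # Nothing is wrong with the services; there may just be little mail.
    return "possibly slow mail flow"
```

A result of "possibly slow mail flow" is not a failure as such; as noted above, a quiet weekend can produce an old time stamp on a perfectly healthy cluster.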
Common CCR failure messages
Sometimes the Get-StorageGroupCopyStatus cmdlet will report that the database has failed. When this happens, the cmdlet’s output will contain a field called FailedMessage, which explains the reason for the failure.
The two most common failure messages are Not Seeded and Storage Group Copy Has Diverged. The Not Seeded message means that the passive copy of the database does not have a usable baseline. Although this problem can be related to the database configuration, a failure to initially seed the passive node is the most common reason. In most cases, using the Update-StorageGroupCopy cmdlet to seed the storage group copy can remedy the problem.
If you receive the Storage Group Copy Has Diverged message, it signals that something has caused the data on the passive node to differ from data on the active node, halting synchronization. This failure is almost always the result of a failover.
When a failover occurs, multiple log files can be lost. If this happens, the failed server cannot be resynchronized with the previously passive node (which is now the active node) when it comes back online. The solution is to treat the storage group copy as if it had never been seeded and re-seed the database using the Update-StorageGroupCopy cmdlet.
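Both failure messages discussed above lead to the same remedy, which can be summarized as a simple lookup. The following Python sketch is illustrative only: the function name is hypothetical, and matching on the phrases "not seeded" and "diverged" assumes that the FailedMessage text contains the wording quoted in this article.

```python
def suggest_remedy(failed_message):
    """Map a FailedMessage value to the remedy described in the article.

    Hypothetical helper: the substring checks assume the message contains
    the phrases quoted in the article text.
    """
    if not failed_message:
        return "no failure reported"
    text = failed_message.lower()
    if "not seeded" in text or "diverged" in text:
        # Both conditions are resolved by re-seeding the passive copy.
        return "re-seed with Update-StorageGroupCopy"
    return "check Get-ClusteredMailboxServerStatus for failed resources"
```

Any message the sketch does not recognize falls through to the clustered mailbox server check covered next.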
If the Get-StorageGroupCopyStatus cmdlet does not provide enough information about a cluster continuous replication failure, you can also use the Get-ClusteredMailboxServerStatus cmdlet.
This cmdlet provides basic information on the status of the clustered mailbox server (Figure 2). In the event of a failure, the command will state which specific resources have failed.
Note: If a clustered mailbox server is in a transitional state, this cmdlet may incorrectly report a failure.
In addition to the Get-StorageGroupCopyStatus and Get-ClusteredMailboxServerStatus cmdlets, Microsoft provides a comprehensive list of best practices for troubleshooting problems with clustered mailbox servers.
ABOUT THE AUTHOR
Brien M. Posey, MCSE, is a seven-time Microsoft MVP for his work with Windows 2000 Server, Exchange Server and IIS. He has served as CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. For more information visit www.brienposey.com.