I wanted to share with you an event that I experienced that was a near catastrophe involving an Exchange 2000 Server clustered solution on my network.
Before I get into the actual event, here is some information about the cluster and network configuration.
- Two Dell PowerEdge 1650 servers were identically configured with Windows 2000 Advanced Server SP4, Exchange 2000 Enterprise Server SP3 and the Microsoft Clustering Service.
- The servers each had two onboard Gigabit copper NICs: one connected to the public network and one connected to the private network (via a dedicated switch for that purpose).
- Each PowerEdge 1650 had 2.5 GB of installed RAM, two 36 GB SCSI drives mirrored for the operating system, and additional SCSI interfaces installed for the shared storage device and tape drives.
- A Dell PowerVault 220S SCSI disk array was connected to both servers. The array contained nine 36 GB SCSI drives: two configured as a RAID-1 array for volume T (transaction logs), two as a RAID-1 array for volume Q (quorum logs), and the remaining five as a RAID-5 array for volume M (Exchange databases).
- The individual servers were named MAIL1 and MAIL2, with MAIL being the virtual server name addressed by all clients.
- Backups were performed on a modified GFS (grandfather, father, son) rotation using ARCServe and two DLT tape drives.
- The network itself was a Windows 2000 Active Directory network operating in native mode. A BlackBerry Enterprise Server (version 3.5) was also in use, providing service to about 25 users.
Here is a step-by-step account of the event. To put things into context, note that the initial discovery of the failure occurred at approximately 4:45 p.m. I didn't solve the problem on the first try; only after working through some initial missteps, slowing down, and really taking the time to think about the problem and its solution was I successful.
Actions and reactions
I was at my desk, preparing to wrap things up for the day, when I noticed that Outlook had timed out on its attempt to connect to the Exchange server. In the past, this by itself would not have been a big concern, as the core network (cabling and switches) had been 10 Mbps until only recently being upgraded to Gigabit copper.
My first action was to attempt to create a Remote Desktop connection to one of my two mail servers, MAIL2. Once connected to MAIL2, I attempted to open the Cluster Administrator console. The attempt to start the Cluster Administrator console was unsuccessful, using both "MAIL" and "." as the connection objects.
Looking through the event logs, still in the RDP session, I noticed a string of errors relating to a corrupt NTFS file system on volume Q. (Unfortunately, I did not preserve the log entries and they've since been overwritten.)
At this time, I headed down to the server room. My attempts to log in to either MAIL1 or MAIL2 at the console failed, with the login process never proceeding past the entry of my credentials.
I tried to log in to both MAIL1 and MAIL2 again after restarting both servers, with the same results.
Assuming that the Cluster Service was preventing successful logins, I next restarted both MAIL1 and MAIL2 into Safe Mode. Once successfully logged into each server in Safe Mode, I disabled the Cluster Service and then stopped it.
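The same disable-and-stop steps can be performed from a command prompt instead of the Services console. A minimal sketch, assuming the default service key name ClusSvc and that sc.exe is available (on Windows 2000 it ships with the Resource Kit):

```shell
REM Set the Cluster Service startup type to Disabled so it
REM does not come back on the next reboot (note the space after "start=")
sc config clussvc start= disabled

REM Stop the service if it is currently running
net stop clussvc
```

Doing it this way is handy when the machine is only reachable over a flaky session and the MMC consoles are slow to load.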
Once I restarted both servers, I was able to log in to them successfully. However, I could not gain access to the M, Q or T volumes because they were still clustered resources and no node owned them. The volumes themselves showed up as "Unavailable" within the Dell OpenManage application that we used to manage the disk array.
At this time, I knew the only way out of this situation was to regain access to the clustered volumes on a single server, rebuild the quorum log and then bring the second node back into the cluster. To that end, I shut down MAIL2 at this time.
The next step I performed was disabling the Cluster Disk Service device from the Device Manager. This is one of those relatively unknown and unseen devices…and for good reason. To enable me to access the Cluster Disk Service, I had to first enable viewing of hidden devices from the View menu of the Device Manager.
Once the Cluster Disk Service was visible in the list, I disabled it just as you would any other hardware device.
On the subsequent restart and login, I was now able to directly address (and more importantly, navigate using Windows Explorer and ARCServe) volumes M, Q and T on the disk array. All data on volumes M and T appeared to be intact and salvageable. In fact, I found that the last transaction log was actually time stamped about 2 minutes before the time in the event log about the corrupt NTFS file system on volume Q.
With things now looking up, I proceeded to perform a full backup of volumes M and T. Knowing this would take a few hours, I headed home to get some dinner and a nap.
The day after
The next morning, after verifying that the backup completed properly (and more importantly could be used for a restoration if required), I reenabled the Cluster Disk Service in the Device Manager.
After the following restart and login, I next configured the Cluster Service for an Automatic startup.
In the command interpreter shell, I next issued the following command from the c:\winnt\cluster directory: clussvc -debug -resetquorumlog.
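Laid out as the commands actually run, the quorum-reset sequence looks roughly like this (paths assume a default Windows 2000 install; per KB 245762, -debug runs the service in the foreground and -resetquorumlog rebuilds the quorum log from the cluster database on the local node):

```shell
REM Change to the Cluster Service directory (default install path)
cd /d c:\winnt\cluster

REM Run the Cluster Service in the foreground and recreate the
REM quorum log from this node's local cluster database
clussvc -debug -resetquorumlog
```

Because -debug keeps the service attached to the console, the window stays busy while the service runs; this explains the behavior I describe next.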
At this point, I restarted MAIL2 and configured its Cluster Service for Automatic startup. The Cluster Service then came up on both MAIL1 and MAIL2, and both joined the cluster as nodes (yeah!), with MAIL1 owning all resources associated with the cluster group.
An oddity, though one that apparently had no ill effect, was that the clussvc command never terminated. After waiting about 30 minutes with no further output in the command shell, I manually closed the window without error.
I ensured that the cluster was in fact operating at 100% again by failing the resource group from MAIL1 to MAIL2 and back again, leaving MAIL1 as the owner.
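Cluster Administrator handles this failover test from the GUI, and cluster.exe offers the same from the command line. A sketch, assuming the cluster name MAIL and a resource group named "Cluster Group" (your group names may differ):

```shell
REM Fail the group over to MAIL2, then back to MAIL1
cluster MAIL group "Cluster Group" /moveto:MAIL2
cluster MAIL group "Cluster Group" /moveto:MAIL1

REM List the groups with their status and current owner to confirm
cluster MAIL group
```

Failing over in both directions, rather than just once, confirms that each node can actually take and release ownership of the shared disks.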
A happy ending
In the end, the problem was, ironically, caused by the very fact that we were using a clustered configuration. Figure the odds! The quorum log (on volume Q) had somehow become corrupt. I never did figure out how or why…but, as is usually the case, the time for a long-term investigation into the root cause of the event just wasn't there.
The users were fairly happy (a tough crowd) because they only lost mail during off hours. I was happy because of the valuable lesson learned: no matter how redundant a system is, there will always be a single point of failure somewhere. In the end, the Exchange cluster continued humming along with nothing out of the ordinary from that point forward.
A few resources that you may find useful:
- Recovering from a Lost or Corrupted Quorum Log: http://support.microsoft.com/?kbid=245762
- Cluster Service Startup Options: http://support.microsoft.com/?kbid=258078
Will Schmied, BSET, MCSE, MCSA, is a messaging engineer for a Fortune 500 manufacturing company. He has written for Microsoft, Pearson, Sybex, Syngress, TechTarget and several other organizations. He has also worked with Microsoft in the MCSE exam-development process.
Do you have a useful Exchange tip to share? Submit it to our monthly tip contest and you could win a prize and a spot in our Hall of Fame.