Published: 05 Oct 2011
Just like physical Exchange servers, virtualized Exchange servers go down. So, what do you do when your virtualized Exchange infrastructure breaks? The answer is simple: Be ready for the failure before it happens.
A little preparation can make all the difference when a problem occurs. With Exchange Server virtualization, there are five crucial steps you should take to make sure you’re ready for anything.
1. Diagram your infrastructure.
It’s quite possible that your company’s entire virtual infrastructure (network, storage, servers and virtualization software) was designed by someone else. In many cases, though, you are the person who has to troubleshoot it when something goes wrong.
By creating a diagram of your virtual infrastructure, you'll know what’s connected to what and where potential single points of failure lurk. (Here’s a hint: Check your storage area network or core switch for impending failure points.)
If you’re running Exchange Server 2010 on VMware vSphere, you can use vSphere’s mapping tool to diagram your virtual infrastructure. Mapping capabilities are included in vCenter and a Maps tab is available on most objects of the virtual infrastructure (Figure 1).
Figure 1. VMware vSphere’s mapping tool can help you diagram your virtual infrastructure to use as a reference.
VMware vCenter mapping shows which ESX/ESXi hosts are connected to which storage data stores and virtual networks. You can also view what virtual machines (VMs) are running on a particular host.
Additionally, you can create a storage map showing individual ESX/ESXi servers or data stores (Figure 2). A storage map also lets you view back-end storage, including host bus adapters (HBAs) or iSCSI storage area networks (SANs).
Figure 2. Create a VMware vCenter storage map to view host bus adapters or iSCSI storage area networks.
Another advantage of using the map functionality is that infrastructure maps can be exported into various formats, printed or saved offline in case a virtual infrastructure goes down. When a problem occurs, you’ll need to reference infrastructure maps as well as other documentation, including:
- Physical networks and port maps.
- IP address lists.
- Storage LUN inventory and configuration.
- Information on how to restore server backups.
- Access to server backup data.
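Alongside exported maps, the same information can be kept in a simple machine-readable form. The following is a minimal Python sketch, not a vSphere feature: it assumes a hand-maintained inventory (the host and switch names are hypothetical) and flags any host that depends on a single switch or data store.

```python
# Hypothetical, hand-maintained inventory: each host lists the switches
# and data stores it is connected to. All names are illustrative.
inventory = {
    "esxi-01": {"switches": ["core-sw1"], "datastores": ["EMC-CEL1-LUN12-20TB"]},
    "esxi-02": {"switches": ["core-sw1", "core-sw2"], "datastores": ["EMC-CEL1-LUN12-20TB"]},
}

def single_points_of_failure(inv):
    """Return (host, kind, component) for every host that has only one
    component of a given kind, so losing that component isolates the host."""
    findings = []
    for host, links in sorted(inv.items()):
        for kind, components in sorted(links.items()):
            if len(components) == 1:
                findings.append((host, kind, components[0]))
    return findings

def shared_dependencies(inv, kind):
    """Return the components of the given kind that every host depends on."""
    sets = [set(links[kind]) for links in inv.values()]
    return sorted(set.intersection(*sets)) if sets else []

for finding in single_points_of_failure(inventory):
    print("single path:", finding)
print("shared by all hosts:", shared_dependencies(inventory, "switches"))
```

A component that shows up in both lists, such as a core switch that every host relies on, is a good place to start adding redundancy.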
2. Know your performance baseline.
Aside from knowing how your infrastructure has been set up, you need to know what normal performance is, so that you can detect any abnormalities. Do your servers usually run at 80% memory utilization? Or is 80% utilization low? If you don't know the difference between normal and abnormal, it will take much longer to sniff out a performance problem.
The vCenter performance trending charts are useful for documenting a baseline for your virtual infrastructure -- and you won’t have to spend money on a third-party benchmarking tool. If you had a vSphere Distributed Resource Scheduler (DRS)/High Availability (HA) cluster in place, for example, you could navigate to the cluster level, click on the Performance tab and then select the Advanced view. This would allow you to create two custom charts showing average CPU and memory utilization over the past quarter or year.
As you can see from the graph in Figure 3, I usually run 7 GB to 11 GB of memory utilization in a two-host lab cluster. If I hit 20 GB utilization, I would know immediately that there was a problem.
Figure 3. A VMware vCenter memory utilization graph illustrates performance benchmarks and can hint at problems.
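Once you have baseline numbers, the comparison itself is simple arithmetic. The sketch below is one possible approach, not something vCenter does for you: it assumes you have copied a handful of memory readings out of the charts by hand (the sample values are illustrative) and treats anything more than three standard deviations from the average as abnormal. The threshold of three is an assumption you should tune.

```python
from statistics import mean, stdev

def is_abnormal(baseline_gb, current_gb, tolerance=3.0):
    """Flag a reading more than `tolerance` standard deviations
    away from the baseline average."""
    avg = mean(baseline_gb)
    spread = stdev(baseline_gb)
    return abs(current_gb - avg) > tolerance * spread

# Illustrative samples in the 7 GB to 11 GB range described above.
baseline = [7.0, 8.5, 9.0, 10.0, 11.0, 9.5, 8.0]

print(is_abnormal(baseline, 10.5))  # within the normal range -> False
print(is_abnormal(baseline, 20.0))  # far outside the baseline -> True
```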
3. Know which Exchange server roles reside on virtual hosts and storage LUNs.
Keeping track of which Exchange server roles are running on specific virtual hosts and what storage logical unit numbers (LUNs) those hosts reside on is a matter of simple documentation.
I recommend assigning descriptive names to VMs -- EXCH-CAS, EXCH-HUB, EXCH-MBX1, EXCH-EDGE -- so you can quickly identify which role resides on which VM. Give data stores descriptive names too (e.g., EMC-CEL1-LUN12-20TB), so you can look at those VMs and also see which data stores are in use.
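A consistent naming scheme also makes the documentation easy to query. As a rough illustration, the Python sketch below assumes VM names follow the EXCH-ROLE pattern shown above with an optional trailing number. The regular expression and the placement table are hypothetical; they are not exported from vCenter.

```python
import re

# Role codes taken from the example VM names; the pattern itself is an assumption.
ROLES = {
    "CAS": "Client Access",
    "HUB": "Hub Transport",
    "MBX": "Mailbox",
    "EDGE": "Edge Transport",
}
VM_NAME = re.compile(r"^EXCH-(CAS|HUB|MBX|EDGE)(\d*)$")

def exchange_role(vm_name):
    """Return the Exchange role encoded in a VM name, or None if the
    name does not follow the convention."""
    match = VM_NAME.match(vm_name)
    return ROLES[match.group(1)] if match else None

# Hypothetical record of which VM lives on which data store.
placement = {
    "EXCH-CAS": "EMC-CEL1-LUN12-20TB",
    "EXCH-MBX1": "EMC-CEL1-LUN12-20TB",
    "WEB01": "EMC-CEL1-LUN12-20TB",
}

def roles_on_datastore(placement, datastore):
    """List the Exchange roles that would be affected if a data store failed."""
    roles = []
    for vm, ds in placement.items():
        role = exchange_role(vm)
        if ds == datastore and role is not None:
            roles.append(role)
    return sorted(roles)

print(roles_on_datastore(placement, "EMC-CEL1-LUN12-20TB"))
```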
4. Implement load balancing.
DRS and vMotion are two of the most beneficial features of VMware vSphere. DRS uses vMotion to move running VMs from one host to another without downtime if the VM isn’t receiving the resources it needs. Placing all VMs -- including Exchange VMs -- in a DRS cluster ensures that they will perform as well as possible while also receiving necessary resources.
You’ve likely heard from many virtualization experts that Microsoft doesn’t support Exchange VMs using vMotion. However, many Exchange administrators successfully use DRS and vMotion with Exchange 2007 and higher, including on VMware Infrastructure 3, without any documented problems. Microsoft also recently changed its stance on this, noting that perceived risks of running VMware HA with Exchange Server might be unfounded. Using DRS and vMotion ensures that Exchange VMs receive the resources they need, when they need them.
5. Practice high availability.
In addition to running DRS on your cluster, you should also enable HA on the cluster. It’s a simple step that can ensure automatic failover of Exchange VMs if a virtual host goes down.
Fixing the problems
If you’ve properly prepared your infrastructure for Exchange virtualization and something goes awry, solving the problem should take less time than if you went in blindly. Remember, trouble can -- and will -- strike at any time. Let's take a look at a few common ways Exchange virtualization can break and what to do when it does.
- Physical infrastructure failures.
- Power or cooling.
- Physical server.
- Network or storage.
Power and cooling failures are the most difficult to recover from. Without redundant cooling or a generator that can be brought online quickly, power and cooling failures can cause a total data center shutdown.
A physical server failure is one of the easiest to recover from. Most data centers run multiple virtual hosts in a cluster, so VMs can be restarted on another host with VMware HA (VMHA).
A network or storage failure can be catastrophic because both are required for the virtual infrastructure to function. Have redundant systems in place for both the network and storage.
To create redundant network systems, for example, each physical server should have multiple network connections that lead to redundant switches. A redundant storage system would have multiple iSCSI network connections to redundant network switches that are then connected to redundant storage controllers on a SAN. You should also use a redundant RAID level, such as RAID 5 (striping with parity), so that data survives a physical disk failure.
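RAID 5 protects against a single disk failure with parity rather than a full second copy of the data. The short Python sketch below illustrates the idea with XOR parity over one stripe. It is a simplification for illustration only: real RAID 5 rotates parity across all disks and works on much larger blocks, and the block contents here are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three data disks (illustrative contents).
data = [b"MAIL", b"BOX1", b"LOGS"]
parity = xor_blocks(data)  # stored on a fourth disk

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'BOX1'
```

Because the array can rebuild only one missing block per stripe this way, a second disk failure before the rebuild completes still means data loss, which is why backups remain necessary.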
- Windows network infrastructure failures.
- Loss of Active Directory (AD) or DNS servers.
Windows network core services are critical to Exchange 2010 functionality and to VMware infrastructure. Without DNS or AD, several portions of a virtualized Exchange infrastructure would fail. Fortunately, it's easy to create multiple (redundant) AD domain controllers that also act as redundant DNS servers. In most enterprise networks, it’s uncommon to lose AD or DNS server connectivity, unless the entire network goes down.
- Virtual infrastructure failures.
- ESXi server crashed.
- VM not responding.
- vCenter server crashed.
You’re more likely to experience a physical server failure than a crashed ESXi server. However, if you use VMHA and an ESXi server crashes, VMHA will move VMs on that host to another host and restart them.
VMHA has VM monitoring capabilities that enable it to restart unresponsive VMs.
A vCenter server failure will not take down Exchange VMs because vCenter isn’t required for ESXi servers and VMs to keep running. Features that depend on vCenter, such as DRS and vMotion, are unavailable until it is back, so you’ll still need to resolve the vCenter issue. Often, restarting the SQL or vCenter services or rebooting the vCenter server (VM) resolves the problem.
- Exchange infrastructure.
- Exchange services are not responding or won't start.
Consult your documentation. No matter the source of the problem, you should have a documented troubleshooting method in place. There is no single method for all companies, but the process should be standardized within each company. Administrators should follow the same process to resolve problems, instead of randomly or haphazardly rebooting servers when something goes wrong. Thorough planning and documentation can curtail any failures and minimize outages.
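One way to standardize the process is to walk the failure categories above in a fixed order, from the physical layer up to Exchange itself, and stop at the first layer that fails. The Python sketch below shows only that ordering logic. The checks are placeholders that return fixed values; in practice each would be replaced by whatever probe your own documentation prescribes.

```python
def first_failing_layer(checks):
    """Run (layer, check) pairs in order and return the name of the first
    layer whose check fails, or None if every check passes."""
    for layer, check in checks:
        if not check():
            return layer
    return None

# Placeholder checks in the order the failure categories are listed above.
checks = [
    ("physical infrastructure", lambda: True),
    ("Windows network (AD/DNS)", lambda: True),
    ("virtual infrastructure", lambda: False),  # simulated failure
    ("Exchange services", lambda: True),
]

print(first_failing_layer(checks))  # virtual infrastructure
```

Working from the bottom up keeps administrators from restarting Exchange services when the real fault is a lower layer, such as storage or DNS.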
Don’t hide the problems
Once you’ve resolved any glitches or failures within your virtual Exchange infrastructure, don’t pretend nothing happened. You should document the problem that occurred, how you were notified that there was a problem in the first place and how you fixed it.
Communicate this information to all administrators and to the proper IT managers so you’re sure that if the problem recurs, you can resolve it quickly -- or avoid it altogether. For example, even if you don't know why a particular Windows service failed, you may be able to configure an automatic service restart. Additionally, if you don’t have advanced vSphere features in place like DRS and HA, now is the time to implement them.
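A consistent record format makes these write-ups easier to share and search. The Python sketch below is one possible shape for such a record, covering the three items mentioned above plus a follow-up action. The field names and the sample incident are hypothetical.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class IncidentRecord:
    problem: str         # what went wrong
    detected_by: str     # how you were notified
    resolution: str      # how you fixed it
    follow_up: str = ""  # what to change so it does not happen again

# Illustrative incident; none of these details come from a real outage.
record = IncidentRecord(
    problem="Transport service stopped on EXCH-HUB",
    detected_by="Help desk reports of undelivered mail",
    resolution="Restarted the service",
    follow_up="Configure automatic service restart",
)

print(json.dumps(asdict(record), indent=2))
```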
About the Author:
David Davis is the author of the VMware vSphere video training library from Train Signal. He has written hundreds of virtualization articles on the Web and is a vExpert, VCP, VCAP-DCA and CCIE #9369 with more than 18 years of enterprise IT experience. Visit his website: VMwareVideos.com.