IT professionals understand that resiliency is important for mission-critical applications, and availability is usually achieved with well-established technologies like clustering.
A virtual data center can be far more sensitive to availability issues than a traditional one. After all, a traditional server fault may only impair one application, but a virtual server failure can cripple multiple virtual machines (VMs). It's a nightmare scenario for administrators and managers.
The hardware needed for a resilient virtual environment is almost identical to that used in a traditional setting. Here are some considerations and best practices for virtual server resiliency.
Apply resiliency that's appropriate for the workload.
A core attribute of virtualization is that a workload can be moved between physical host servers on demand and can also be restarted from a snapshot on the storage area network (SAN). This kind of flexibility brings a new level of resiliency to the data center and allows administrators to rethink their approach to application availability.
For example, rather than leaving an application offline for hours or days awaiting server repairs or maintenance, IT administrators can migrate workloads from the troubled server and distribute them to other servers in the data center (or restart the workloads entirely) in just a few minutes. Once the work is complete, the workloads can be returned to the server.
Therefore, applications that in the past relied on cluster technology for non-virtual servers may be able to operate as single virtual workloads, lowering hardware investments and leaving cluster approaches for applications that cannot tolerate any disruption.
Use fault-tolerant or high-availability tools to simplify clustering.
While traditional cluster architectures will work for virtualization, virtual resiliency is usually implemented more easily by creating a copy of a VM on another server. A tool such as EverRun from Marathon Technologies keeps the instances synchronized. When a fault occurs with one server and heartbeat signals are disrupted, the redundant VM steps in immediately.
The interesting issue with this approach is that each of the servers can be running workloads not related to one another. Of course, any workloads not protected with fault-tolerant tools need to be migrated or restarted.
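The heartbeat idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how EverRun or any other product works internally; the class and function names, the three-second timeout and the host and VM names are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each host (hypothetical helper)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, host, now):
        # Record that a heartbeat arrived from this host at time `now` (seconds).
        self.last_seen[host] = now

    def failed_hosts(self, now):
        # A host counts as failed once its heartbeat is older than the timeout.
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def promote_standbys(failed, standby_for):
    """Map each protected VM whose primary host failed to its standby host.

    `standby_for` maps a VM name to a (primary_host, standby_host) tuple.
    """
    return {
        vm: standby
        for vm, (primary, standby) in standby_for.items()
        if primary in failed
    }


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("host-a", now=0.0)
monitor.beat("host-b", now=0.0)
monitor.beat("host-b", now=4.0)  # host-a has gone silent

failed = monitor.failed_hosts(now=5.0)
takeovers = promote_standbys(failed, {"erp": ("host-a", "host-b")})
print(failed, takeovers)
```

In this sketch, only VMs listed in the protected mapping are taken over by a standby; anything else on the failed host would still need to be migrated or restarted.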
Tailor the architecture to mitigate risks.
Resiliency often depends on eliminating single points of failure, and clustered servers generally implement redundant server components (e.g., redundant power supplies) along with redundant LAN and SAN connectivity. The flexibility of virtualization makes these architectural choices a bit less formal, but administrators can still tailor hardware deployments to achieve superior availability.
For example, important -- but not necessarily mission-critical -- VMs can be allocated to servers with redundant components or connectivity while other VMs may reside on servers with no redundant elements. In other words, there is no single rule for server or network architecture.
Monitor performance levels and balance workloads.
An application's availability is governed by its stability on a given computing platform and by the computing resources available, such as CPU cycles, memory space and I/O traffic. This is even more acute on virtual servers, where multiple workloads compete for finite computing resources. A server needs adequate computing resources for all of the VMs it hosts; otherwise, one or more VMs may suffer poor performance, experience instability or crash outright. In extreme cases, the entire server may crash and disable all of the VMs on it.
Since every virtual machine demands a certain amount of CPU, memory and I/O resources, it's important for administrators to distribute workloads to prevent any given server from being overloaded. For example, multiple CPU-intensive workloads should be hosted on different physical servers rather than on the same server. Monitoring tools, such as vFoglight from Vizioncore Inc., can track server resource use, report resource shortages and help with capacity planning.
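As a rough illustration of spreading CPU-intensive workloads, the sketch below places the heaviest VMs first, each on whichever host has the most free CPU. It is a simple greedy heuristic rather than the algorithm any particular virtualization platform uses, and the host names, capacities and VM demands are made-up values in arbitrary units.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cpu_capacity: float  # arbitrary units, assumed for illustration
    vms: list = field(default_factory=list)  # list of (vm_name, cpu_demand)

    def cpu_free(self):
        return self.cpu_capacity - sum(demand for _, demand in self.vms)


def place_vms(vms, hosts):
    """Place the heaviest VMs first, each on the host with the most free CPU.

    Returns the names of any VMs that did not fit on any host.
    """
    unplaced = []
    for name, demand in sorted(vms, key=lambda vm: vm[1], reverse=True):
        target = max(hosts, key=lambda host: host.cpu_free())
        if target.cpu_free() >= demand:
            target.vms.append((name, demand))
        else:
            unplaced.append(name)
    return unplaced


hosts = [Host("host-a", 10.0), Host("host-b", 10.0)]
vms = [("db", 6.0), ("web", 2.0), ("batch", 6.0), ("mail", 3.0)]
unplaced = place_vms(vms, hosts)

for host in hosts:
    print(host.name, [vm for vm, _ in host.vms], "free:", host.cpu_free())
```

With these sample numbers, the two CPU-heavy workloads ("db" and "batch") end up on different hosts. A real placement decision would also weigh memory and I/O, which this sketch ignores.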
Understand failover and failback behavior.
The flexibility of virtual machine migration can be problematic for server computing resources. Even when workloads are balanced, it's important for administrators to determine how VMs will fail over to other servers. If these decisions aren't made in advance, failovers may occur automatically at the discretion of the virtualization platform and cause sub-optimal workload distribution. This can lead to poor application performance and instability. Use the virtualization platform's settings to define failover and failback behavior for each VM ahead of time.
Similarly, Windows administrators must ensure that redundant workloads (governed by fault-tolerant or high-availability software) don't wind up migrating to the same physical server. Virtualization technology allows the two redundant VMs to coexist on the same hardware, but that server becomes a single point of failure.
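A check for this condition is easy to sketch. The example below assumes the administrator can export a mapping of VM names to host names and knows which VMs form redundant pairs; the function name and all of the sample data are hypothetical.

```python
def find_affinity_violations(placement, redundant_pairs):
    """Return (primary, secondary, host) for pairs that share one host.

    `placement` maps a VM name to the host it currently runs on.
    """
    violations = []
    for primary, secondary in redundant_pairs:
        host_p = placement.get(primary)
        host_s = placement.get(secondary)
        # Only flag pairs where both VMs are placed and the host is the same.
        if host_p is not None and host_p == host_s:
            violations.append((primary, secondary, host_p))
    return violations


placement = {
    "sql-primary": "host-a",
    "sql-replica": "host-a",   # both copies on one server: a single point of failure
    "mail-primary": "host-b",
    "mail-replica": "host-c",
}
pairs = [("sql-primary", "sql-replica"), ("mail-primary", "mail-replica")]

violations = find_affinity_violations(placement, pairs)
print(violations)
```

Running a check like this after every migration or failover would flag redundant copies that have drifted onto the same hardware.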
Test restoration plans.
IT administrators often take pains to implement backups, but never bother to check the restoration process. While a single VM can be restored to a server in just a few minutes, each subsequent VM may take progressively longer.
This can happen when each VM resumes operation and begins using bandwidth, leaving less bandwidth available to restore subsequent VMs. As a result, it may take several hours to restore 10, 15 or more VMs to a server, which could violate service-level agreements (SLAs) or other business needs.
Test restorations to an idle or spare server to see how long it takes. Excessive restoration time may require an administrator to redistribute workloads.
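The slowdown can be approximated before running a real test. The sketch below assumes a deliberately simple model: restores run one at a time, and every VM that has already resumed takes a fixed share of the link away from the next restore. The image sizes, link speed and per-VM load are invented figures, and an actual test restoration remains the only reliable measurement.

```python
def estimate_restore_minutes(vm_sizes_gb, link_gb_per_min, per_vm_load_gb_per_min):
    """Return the cumulative minutes elapsed after each VM finishes restoring.

    Simplified model (an assumption): VMs are restored sequentially, and each
    already-restored VM consumes `per_vm_load_gb_per_min` of the link.
    """
    elapsed = 0.0
    timeline = []
    for restored_count, size_gb in enumerate(vm_sizes_gb):
        available = link_gb_per_min - restored_count * per_vm_load_gb_per_min
        if available <= 0:
            raise ValueError("running VMs would consume the entire link")
        elapsed += size_gb / available
        timeline.append(elapsed)
    return timeline


# Four 40 GB images over a 10 GB/min link, each running VM using 2 GB/min.
timeline = estimate_restore_minutes([40.0] * 4, 10.0, 2.0)
for index, minutes in enumerate(timeline, start=1):
    print(f"VM {index} restored after {minutes:.1f} min")
```

Under these assumptions the first VM restores in 4 minutes, but the fourth alone takes 10, which shows how quickly the total can grow as more VMs share a server.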
Manage VM sprawl.
Virtual machines are so quick and simple to create that a business may find itself inundated with them. This adds computing demands, lengthens backup time, complicates disaster recovery (DR) planning and leads to unnecessary hardware purchases.
IT administrators working with Windows must manage their virtual data centers by implementing policies and procedures to oversee the lifecycle of each VM -- who needs it, why it's needed, when it's created, how long it's needed, and when it should be retired. This slows the proliferation of new VMs and ensures IT can handle the additional demands.
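One lightweight way to support such a policy is to keep a record for every VM and review it regularly. The sketch below is a minimal example, assuming an inventory maintained by hand or exported from a management tool; the field names, VM names and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class VmRecord:
    name: str
    owner: str       # who needs it
    purpose: str     # why it's needed
    created: date    # when it was created
    retire_on: date  # when it should be retired


def due_for_retirement(records, today):
    """Return the names of VMs whose planned retirement date has arrived."""
    return [record.name for record in records if record.retire_on <= today]


inventory = [
    VmRecord("test-web-01", "qa-team", "release testing",
             date(2024, 1, 10), date(2024, 3, 31)),
    VmRecord("erp-prod", "finance", "production ERP",
             date(2023, 6, 1), date(2026, 6, 1)),
]

overdue = due_for_retirement(inventory, today=date(2024, 6, 1))
print(overdue)
```

A periodic report like this gives administrators a concrete list of VMs to confirm with their owners or decommission.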
Stephen J. Bigelow, senior features writer, has more than 15 years of technical writing experience in the PC/technology industry. He holds a BSEE, CompTIA A+, Network+, Security+ and Server+ certifications and has written hundreds of articles and more than 15 feature books on computer troubleshooting, including Bigelow's PC Hardware Desk Reference and Bigelow's PC Hardware Annoyances. Contact him at firstname.lastname@example.org.