How replication solutions work
The basic idea behind replication is simple: when a change is made to protected data on a source machine, that change should be captured and replicated to a target machine somewhere else. The details of how this idea is implemented are where things get interesting. You can generally categorize replication solutions along two axes: hardware versus software and synchronous versus asynchronous. All of these solutions require a certain amount of network bandwidth between the source and target, but the type and amount required vary.
Hardware vs. software replication
Hardware-based replication systems are pretty much relegated to the world of high-end SANs because they perform replication at the disk-block level, and the replication itself is performed by the SAN controllers. When a changed block is written to a protected volume, the SAN controller captures the data and ships it to the remote SAN controller, which writes the changed block on the target replica.
This approach has some advantages -- notably that it happens well below the level where Windows and Exchange live. Assuming that the replication setup works properly, Exchange will never know that its underlying data are being replicated, which is a huge bonus. Of course, these solutions tend to be expensive because they require identical SANs on either end, and most vendors sell the replication capability as a separate option. On top of the initial hardware purchase cost, you must provide enough low-latency bandwidth to handle disk-speed replication, which might involve such exotica as long-distance runs of optical fiber and all the assorted paraphernalia that come with it.
One of the great things about the speed and power of modern CPUs is that we can use them to perform many tasks that formerly required specialized hardware. Data replication is no exception; software replication solutions work in much the same way as their hardware ancestors, just without the dedicated hardware. These solutions can be implemented in multiple ways, depending on the software vendor's design goals:
- File system filter drivers use the Windows kernel mechanisms for hooking drivers into the file system; the driver's job is to watch any changes to a specified file (or folder) and copy those changes to the target
- Block-level drivers monitor changes to volumes at the disk-block level rather than at the file level; this setup has the advantage of letting the driver combine operations in much the same way that hardware replication solutions do -- the tradeoff is that these solutions normally require you to monitor an entire volume, which may not match up well with the disk topology you're using for your databases and transaction logs
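The filter-driver approach above can be sketched in user-space Python. This is a toy polling analogue, not how real products work -- they hook the kernel's filter-driver interfaces and are notified of writes rather than polling -- and the `FileWatcher` name and design are purely illustrative:

```python
import shutil
import tempfile
from pathlib import Path

class FileWatcher:
    """Toy analogue of a file-system filter driver: detect changes to a
    watched file and copy it to a target replica (hypothetical design)."""

    def __init__(self, source: Path, target: Path):
        self.source = source
        self.target = target
        self.last_mtime = None

    def poll(self) -> bool:
        """Check the source file; replicate it if it changed since the
        last poll. Returns True when a copy was made."""
        mtime = self.source.stat().st_mtime_ns
        if mtime != self.last_mtime:
            shutil.copy2(self.source, self.target)  # ship the whole file
            self.last_mtime = mtime
            return True
        return False

# Demo: write, replicate, verify.
tmp = Path(tempfile.mkdtemp())
src, dst = tmp / "db.edb", tmp / "replica.edb"
src.write_text("mailbox data v1")
watcher = FileWatcher(src, dst)
print(watcher.poll())                       # True: first poll sees a change
print(dst.read_text() == src.read_text())   # True
```

Note that this naive version ships the entire file on every change, which is exactly the waste that the optimizations discussed next are designed to avoid.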
Vendors of both types of solutions have implemented optimizations to avoid replicating data needlessly. For example, most products are smart enough to only transmit changes to files instead of retransmitting the entire file any time it changes; in addition, many products will aggregate changes whenever possible to reduce the amount of bandwidth required.
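The change-only optimization can be illustrated with a short sketch: divide each version of a file into fixed-size blocks, hash them, and transmit only the blocks whose hashes differ. The `changed_blocks` helper and the 4 KB block size are assumptions for illustration, not any vendor's actual design:

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size

def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE):
    """Return (index, data) pairs for blocks that differ between two
    versions of a file, so only the deltas need to cross the wire."""
    deltas = []
    longest = max(len(old), len(new))
    for i in range(0, longest, block_size):
        a, b = old[i:i + block_size], new[i:i + block_size]
        if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():
            deltas.append((i // block_size, b))
    return deltas

old = b"A" * 8192
new = b"A" * 4096 + b"B" * 4096   # only the second block changed
print([i for i, _ in changed_blocks(old, new)])   # [1]
```

Here a 4 KB change costs one block of bandwidth instead of a full retransmission; aggregating several such deltas before sending reduces the cost further.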
Speaking of bandwidth -- it's the limiting factor in how well software-based solutions work. Most of them are designed to work well over LAN or WAN bandwidth, which means that they normally implement both queuing and throttling. Updates are added to a queue on the source as they happen; when bandwidth usage permits, the update at the head of the queue is transmitted. This method helps to smooth the flow of updates between source and target. However, if the link between source and target is overloaded or goes down for a long period, the queue will fill up, preventing further updates until the link comes back up, at which point you'll probably have to resynchronize source and target.
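The queue-and-overflow behavior described above can be modeled in a few lines. This is a toy sketch; the class name, capacity, and resync flag are illustrative assumptions rather than any product's actual mechanism:

```python
from collections import deque

class ReplicationQueue:
    """Toy source-side queue: updates accumulate while the link is slow
    or down; overflow marks the pair as needing full resynchronization."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pending = deque()
        self.needs_resync = False

    def enqueue(self, update: bytes) -> bool:
        if self.needs_resync:
            return False                  # replication is suspended
        if len(self.pending) >= self.capacity:
            self.needs_resync = True      # the queue overflowed
            self.pending.clear()
            return False
        self.pending.append(update)
        return True

    def drain_one(self):
        """Transmit the update at the head of the queue, bandwidth permitting."""
        return self.pending.popleft() if self.pending else None

q = ReplicationQueue(capacity=2)
print(q.enqueue(b"block A"), q.enqueue(b"block B"))   # True True
print(q.enqueue(b"block C"))                          # False: overflow
print(q.needs_resync)                                 # True
```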
Throttling is the other significant component of software-based replication; depending on the vendor, you can either specify a percentage of bandwidth usage or an absolute value. Either way, the replication controller is responsible for controlling how fast updates are sent. A useful feature to look for is the ability to vary the throttling limits by time of day so that you can change the amount of bandwidth at different times throughout the work day.
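A time-of-day throttling schedule might look like the following sketch. The 128/1024 Kbps limits and the 8:00-18:00 business window are invented values for illustration:

```python
def throttle_limit(hour: int, business_kbps: int = 128,
                   offhours_kbps: int = 1024) -> int:
    """Pick a bandwidth cap based on time of day: tight during the
    8:00-18:00 work day, generous overnight (hypothetical values)."""
    return business_kbps if 8 <= hour < 18 else offhours_kbps

def send_time_seconds(payload_kb: int, hour: int) -> float:
    """How long a payload takes to drain at the current throttle limit."""
    return payload_kb * 8 / throttle_limit(hour)  # KB -> kilobits / Kbps

print(send_time_seconds(1024, hour=12))   # 64.0 s during the work day
print(send_time_seconds(1024, hour=2))    # 8.0 s overnight
```

The same 1 MB of queued updates drains eight times faster overnight, which is why the time-of-day feature is worth looking for.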
Resynchronization exists to bring a source and target that have diverged back into step. For example, suppose you want to perform an offline defragmentation on the source machine. Because the defrag touches every database page, it would be pointless to let replication run while it proceeds; once it completes, you must resynchronize the source and target data.
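The divergence that forces a resynchronization can be detected by comparing checksums of the two copies, as in this minimal sketch (the `in_sync` helper is hypothetical):

```python
import hashlib

def in_sync(source: bytes, target: bytes) -> bool:
    """Cheap divergence check: compare digests of the two copies."""
    return hashlib.sha256(source).digest() == hashlib.sha256(target).digest()

before = b"page" * 1000
after_defrag = b"PAGE" * 1000          # the defrag rewrote every page
print(in_sync(before, before))          # True: replicas match
print(in_sync(after_defrag, before))    # False: time to resynchronize
```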
Synchronous vs. asynchronous replication
The other key difference between replication solutions is how replication happens. Suppose that a particular data block (let's call it A) on the source server is written, followed by a write request for block B. In synchronous systems, the write request for block B on the source machine will be blocked until the source replication driver gets an acknowledgment that block A has been written to the target. This process has the advantage of guaranteeing that the source and target are in lockstep, as no further writes can be applied to the target until block A is written and acknowledged. The problem, though, is that this method can impose an unacceptable degree of latency on Exchange, which generally expects its write requests to take no more than 500 milliseconds. Thus, synchronous replication is normally the exclusive province of hardware systems, which essentially hide the increased latency by using their large caches to soak up the extra time required for synchronous acknowledgement.
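Synchronous semantics can be modeled in a few lines of Python. This is a toy in-memory pair; in a real system the `_replicate` step is a network round trip to the target SAN, which is exactly where the latency comes from:

```python
class SynchronousReplicator:
    """Toy synchronous pair: a source write does not return until the
    target acknowledges that it holds the same block."""

    def __init__(self):
        self.source = {}
        self.target = {}

    def write(self, block_id: str, data: bytes) -> None:
        self.source[block_id] = data
        # The caller blocks here for a full round trip to the target:
        if not self._replicate(block_id, data):
            raise IOError(f"no acknowledgment for block {block_id}")

    def _replicate(self, block_id: str, data: bytes) -> bool:
        self.target[block_id] = data   # stands in for the network hop
        return True                    # the target's acknowledgment

pair = SynchronousReplicator()
pair.write("A", b"page-a")
# Invariant: once write() returns, the target already has the block,
# so a subsequent write of block B can never overtake block A.
print(pair.target["A"] == b"page-a")   # True
pair.write("B", b"page-b")
```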
Asynchronous systems decouple I/O on the source and I/O on the target. To revisit the earlier example, after block A is written on the source, it's queued for transmission to the target, and block B can immediately be written. The replication controller becomes responsible for making sure that blocks A, B, and so on are transmitted to the target, and the target's replication controller must ensure that the received updates are applied in the correct order. Asynchronous replication tends to be well-suited to WAN-based replication because it doesn't force the source system to suffer from transient increases in latency. Most software replication products default to asynchronous replication.
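The asynchronous variant decouples the two sides with a queue and a background sender, as this sketch shows. The class and names are illustrative; a real product also persists the queue and preserves ordering across reconnects:

```python
import queue
import threading

class AsyncReplicator:
    """Toy asynchronous pair: source writes return immediately; a sender
    thread applies queued blocks to the target in arrival order."""

    def __init__(self):
        self.target = {}
        self.outbox = queue.Queue()
        self.order_applied = []
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, block_id: str, data: bytes) -> None:
        self.outbox.put((block_id, data))   # queued; the call returns at once

    def _drain(self) -> None:
        while True:
            block_id, data = self.outbox.get()
            self.target[block_id] = data    # applied in FIFO order
            self.order_applied.append(block_id)
            self.outbox.task_done()

repl = AsyncReplicator()
repl.write("A", b"1")
repl.write("B", b"2")        # B does not wait for A's acknowledgment
repl.outbox.join()           # for demo purposes: wait for the drain
print(repl.order_applied)    # ['A', 'B']
```

The source never waits on the wire, so a slow WAN link lengthens the queue rather than the write latency -- which is why this mode suits long-distance replication.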
Replication: Pros and cons
Is replication a sensible solution for your requirements? It depends. On the positive side, replication duplicates your most critical data in an alternative location, adding significant redundancy to your environment. Software replication solutions also tend to be pretty flexible; you can easily use them to replicate file server data, Exchange databases, Exchange transaction logs, or any other data that you want to protect.
On the negative side, replication technology is not officially supported by Microsoft, which means you have to lean on your replication vendor for first-line support. In addition, both software and hardware replication require a significant amount of bandwidth for adequate Exchange performance. Hardware replication solutions also tend to be very expensive; software solutions are more affordable.
This chapter excerpt from the free e-book The Definitive Guide to Exchange Disaster Recovery and Availability, by Paul Robichaux, is printed with permission from Realtimepublishers, Copyright 2005.