There have been many times over the years when I have been called to help a company recover from some kind of disaster. What I usually find is that the company either does not have a formal disaster recovery plan in place, or they do, but it has been completely forgotten in the midst of the chaos. I've also encountered several cases where the plan is totally irrelevant to the type of disaster that has occurred. In these types of situations, it is important to improvise and do what's necessary to get the most critical systems back online quickly.
Unfortunately, I can't tell you exactly what to do if disaster strikes. Every company is set up differently, and the size and scope of the disaster also have a lot to do with the steps you would need to perform. For example, hard disk corruption and a hurricane could both be classified as disasters, but you would have to deal with them in completely different ways. Hard disk corruption might amount to a critical disaster, but it would typically affect only a single server. If a hurricane hits, you may not even have a building left, much less any servers.
Since there are so many different types of disasters, I instead want to offer some sample guidelines for getting your key systems back online as soon as possible and ensuring business continuity.
Step 1: Assess the damage
Naturally, you need to know the extent of the damage before you can go about fixing it. This means figuring out what -- if anything -- is still working. Only then will you truly know where the repairs need to be made.
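One quick way to start that assessment is to check which servers still answer on their key service ports. Here is a minimal sketch in Python; the server names and port numbers are placeholders, so substitute the systems from your own inventory:

```python
import socket

# Hypothetical inventory: each server paired with the port its key service
# listens on. Replace these names and ports with your own environment's systems.
SERVERS = {
    "mail01": 25,     # SMTP
    "db01": 1433,     # SQL Server
    "file01": 445,    # SMB file sharing
}

def is_reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def assess(servers):
    """Report which servers still answer on their service ports."""
    return {host: is_reachable(host, port) for host, port in servers.items()}

if __name__ == "__main__":
    for host, up in assess(SERVERS).items():
        print(f"{host}: {'UP' if up else 'DOWN'}")
```

A port answering doesn't prove the service is healthy, of course, but a sweep like this quickly separates the machines that are completely gone from the ones that merely need attention.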
Step 2: Prioritize your resources
I would hope you already know which servers and other components on your network are the most critical. If not, I recommend taking time soon to meet with your company executives and discuss which systems take the highest priority. You'll also want to know which systems are of the lowest importance. For example, the server that hosts the company's human resources database is important, but in a dire emergency, you could probably get by without it for a week if you had to.
Prioritizing resources doesn't just mean identifying the most critical systems, though. It also means figuring out who on your IT staff is available to help with the recovery and what their skill sets are. For example, the help desk staff may normally only help users deal with day-to-day problems, but those people do have IT backgrounds and may have skills that could come in handy for the recovery efforts.
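Once you and your executives have agreed on priorities, it helps to capture them in a form you can act on mechanically under pressure. The sketch below is one way to do that; the systems and the maximum tolerable outage figures are made-up examples, and the real numbers should come out of that meeting:

```python
# Hypothetical priority list: each entry pairs a system with the maximum
# tolerable outage, in hours, agreed on with company executives.
systems = [
    ("E-mail server", 4),
    ("HR database", 168),        # could wait about a week in a dire emergency
    ("Order-entry database", 1),
    ("File server", 24),
]

# Work the list in order of how quickly each system must be back online.
recovery_order = sorted(systems, key=lambda s: s[1])
for name, hours in recovery_order:
    print(f"{name}: must be restored within {hours} h")
```

Even a simple ranked list like this keeps the recovery effort focused on the systems the business actually needs first, rather than on whichever server happens to be easiest to fix.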
Step 3: Prepare to bring the most critical systems back online
The most difficult step is probably figuring out the requirements for bringing your systems back online and deciding what resources are available that can help make that happen. When I worked in the corporate world, I always bought servers two at a time. That way, if a critical server failed catastrophically, I could get its twin out of the closet and use it to get back online. If you don't have a bunch of spare servers on hand, though, you may have to get a little bit creative.
One time I was working on a consulting project for a company, and a few days into the project a lightning strike severely damaged some of the hardware. In order to get back online as quickly as possible, I actually cannibalized a less important server for parts so that I could get a critical server back online fast. The server I sacrificed was less powerful than the more critical one, so the critical server was running at a reduced capacity when I brought it online. Even so, using this technique allowed me to quickly make a temporary repair that got the server online. I was then able to order the parts I needed to make the repair the "right way," and scheduled some downtime for the weekend to make the fix.
Step 4: Figure out how to reduce the recovery time
Once you have a plan for getting back online, consider whether there is anything that you can do to expedite the process. After all, the company is losing a lot of money for every minute those critical servers are down.
In situations like these, I have sometimes resorted to doing a multi-step restoration. This means restoring the most critical data first and then going back and filling in the gaps later on. For example, I have taken this approach with Exchange Server on several occasions. I did what's known as a dial tone restore, which means doing just enough to allow the users to send and receive mail, but that's it. Then, once the server was functional, I did a second restore operation that restored the messages that previously existed in the users' mailboxes, along with calendar entries, task lists and things like that. The important thing was that the users were able to get back to work quickly rather than waiting for me to restore a huge database.
Although this particular example is Exchange Server-specific, the technique works with just about any server. Just last week, for instance, I had to restore one of my own servers. To speed up the recovery process, I chose not to initially restore a folder on the server's hard drive that contained a copy of the Windows installation media and several service packs. This shaved about ten minutes off the recovery time. Once the server was functional, I went back and restored the missing pieces at my leisure.
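The two-pass idea can be sketched generically. The snippet below is a simplified illustration, not a replacement for a real backup tool: it assumes a backup folder that mirrors the server's directory tree, and the folder names marked as deferrable are hypothetical examples:

```python
import shutil
from pathlib import Path

# Hypothetical layout: the backup root mirrors the server's folder tree.
# "Deferred" folders -- like a copy of the installation media -- are skipped
# on the first pass and filled in later, once the server is back in service.
DEFERRED = {"install_media", "service_packs"}

def restore(backup_root, target_root, deferred=DEFERRED, first_pass=True):
    """Copy folders from a backup tree, splitting the work into two passes."""
    backup_root, target_root = Path(backup_root), Path(target_root)
    for folder in backup_root.iterdir():
        if not folder.is_dir():
            continue
        skip = folder.name in deferred
        if first_pass == skip:
            continue  # pass 1 copies only non-deferred folders; pass 2, only deferred
        shutil.copytree(folder, target_root / folder.name, dirs_exist_ok=True)

# restore("/backups/server1", "/srv")                    # pass 1: get the server usable
# restore("/backups/server1", "/srv", first_pass=False)  # pass 2: fill in the gaps
```

The design decision is simply which folders can wait: anything the server needs to serve users goes in pass one, and everything else is deferred until after it's back in service.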
It's nearly impossible to anticipate every disaster, and responding to one can be just as difficult. Hopefully, when the worst happens, these guidelines will help you get your systems back online quickly.
ABOUT THE AUTHOR
Brien M. Posey, MCSE, has received Microsoft's Most Valuable Professional Award four times for his work with Windows Server, IIS and Exchange Server. He has served as CIO for a nationwide chain of hospitals and healthcare facilities, and was once a network administrator for Fort Knox. You can visit his personal website at www.brienposey.com.