
Testing your IT disaster recovery plan: Learn from your mistakes

Some of the major steps required when testing an IT disaster recovery plan include using existing projects as an example and learning from your mistakes. Disaster recovery expert Russell Olsen breaks down the details.

Russell Olsen
There are four major components in an IT disaster recovery plan test. Two of them are using existing projects as an example and learning from your mistakes. Previously, we discussed the other two: identifying source files and doing a verbal walkthrough of the plan.

Use existing projects

Every department has initiatives under way that let you build live disaster recovery plan testing into the project itself.

Open or upcoming projects create opportunities to carry out a quick disaster recovery check. Often, these projects come about when new hardware is being installed. New hardware means two things: (1) Someone will be configuring a server from scratch (providing a perfect time to test your rebuild procedures) and (2) an old server is being decommissioned, which presents an opportunity to test your backups and rebuild your server image.

Let's say your company has recently acquired another company and you will be migrating their users and computers into an existing organizational unit (OU) in your domain.

If you plan the project with the appropriate rollback procedures, you could test an authoritative restore of the original OU to make sure your migration procedures are adequate. It might add a few days to your project, but performing a live test (that wouldn't impact operations) would be worth a few extra days.

The key to this method of disaster recovery testing is to take advantage of what you are already doing. By adding a step here and there, you can effectively test your procedures without costly offsite testing.
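One way to make the "add a step here and there" habit concrete is to keep a small lookup of routine project activities and the disaster recovery check each one can carry. The following is a minimal sketch in Python, not a prescribed tool: the task names and the `dr_checks_for` helper are hypothetical, and the entries only restate the examples given above.

```python
# Hypothetical lookup: routine project task -> DR check it makes possible.
# The entries restate the examples from this article; the task names are
# invented for illustration.
DR_OPPORTUNITIES = {
    "build_new_server": "Follow the documented rebuild procedure from scratch",
    "decommission_old_server": "Restore its backups and rebuild the server image",
    "migrate_ou": "Test an authoritative restore of the original OU",
}


def dr_checks_for(project_tasks):
    """Return (task, check) pairs for tasks that offer a DR test opportunity."""
    return [
        (task, DR_OPPORTUNITIES[task])
        for task in project_tasks
        if task in DR_OPPORTUNITIES
    ]


if __name__ == "__main__":
    # A hardware refresh project: only two of its tasks double as DR tests.
    tasks = ["order_hardware", "build_new_server", "decommission_old_server"]
    for task, check in dr_checks_for(tasks):
        print(f"{task}: {check}")
```

Reviewing a project plan against a list like this during kickoff is a cheap way to spot where an extra test step fits.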

Learn from your mistakes

Too often we think that when performing an IT disaster recovery plan test, we must plan the test and control the input data in order to have confidence in the output. I have found that the best disaster recovery procedures come out of our daily mistakes.

Every day things go wrong -- a new domain administrator deletes an OU, the Microsoft SQL Server DBA forgets to back up the database schema (which you discover after a production database was dropped), faulty hardware causes your Exchange server to go down, and so on.

Take the opportunity to do a post-mortem after you experience these daily problems and ask the following questions:

  • When the problem occurred, was the disaster recovery plan (DRP) used to help remedy the situation?
    If the answer is no, follow up with two additional questions. First, was the resolution within the scope of the DRP? If it was not, evaluate how relevant your plan is. A plan that only helps after a complete loss of all systems is probably too broad in scope to give you the detailed information you need to handle minor disasters.

    Second, if the resolution was in scope, why wasn't the plan used? Your disaster recovery plan should be a familiar document to all members of the team. You never know who will be present to help recover after a disaster.

  • How was the problem identified?
    This is important because it helps you understand the formal and informal practices that your team uses to monitor operations. With appropriate monitoring controls, you can make sure that small disasters don't become big ones.

  • How long did it take to fully recover from the issue?
    The "how long" question lets you extrapolate how much time it might take to recover from a larger disaster and whether that time frame is acceptable. If it took three people two 16-hour days to recover from an accidental deletion of some Active Directory objects, you need to know whether 96 person-hours was too slow. How many hours would it take if multiple domain controllers at a single site were lost? Would that be acceptable?

Every organization and department will have additional questions to ask, but the idea is the same whether you manage the Active Directory group or the Microsoft SQL Server team: Get the most out of daily events. Lots of small disasters can add up to one fatal error, or they can be business as usual, depending on how they are handled.
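The post-mortem questions above can be captured in a small record so that answers are comparable from one incident to the next. This is a minimal sketch in Python under stated assumptions: the `PostMortem` class, its field names, the "help desk calls" detection source and the 40-hour target are all hypothetical. Only the three-people, two-day, 16-hour figures come from the example above.

```python
from dataclasses import dataclass


@dataclass
class PostMortem:
    """Answers to the post-mortem questions for one incident (hypothetical)."""

    incident: str
    plan_used: bool       # Was the disaster recovery plan used?
    detected_by: str      # How was the problem identified?
    people: int           # How many people worked on the recovery?
    days: int             # Over how many days?
    hours_per_day: float  # Hours each person worked per day

    def person_hours(self) -> float:
        """Total recovery effort: people x days x hours per day."""
        return self.people * self.days * self.hours_per_day

    def within_target(self, target_person_hours: float) -> bool:
        """True if the recovery effort met the agreed target."""
        return self.person_hours() <= target_person_hours


if __name__ == "__main__":
    pm = PostMortem(
        incident="Accidental deletion of Active Directory objects",
        plan_used=False,
        detected_by="help desk calls",  # assumed for illustration
        people=3,
        days=2,
        hours_per_day=16,
    )
    print(pm.person_hours())     # 3 * 2 * 16 = 96
    print(pm.within_target(40))  # 40 is an assumed target, not from the article
```

Keeping these records over time gives you a baseline for judging whether recovery from a larger disaster would fall within an acceptable time frame.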

IT disaster recovery plan testing can be a large production at an offsite location, but if you leave it as a once-a-year event, you'll miss the point. Testing daily IT events holds so much value because they offer what a large production can't: the unexpected. Ask questions, involve the team through verbal testing, take advantage of existing events and evaluate unanticipated errors. You'll be making huge strides in your preparation for a larger disaster.

Russell Olsen is the CIO of a medical data mining company. He previously worked for a Big Four accounting firm. He co-authored the research paper "A comparison of Windows 2000 and RedHat as network service providers." Russell is an MCP and GSNA.
