How to design monitoring controls to manage mistakes

Plagued by calls from angry users? Using workarounds regularly to do your job? Then it's time to design and implement controls that will address potential failures.

Russell Olsen
You have had the phone call. You know the one – it typically comes on a Friday afternoon at the end of an easy week. The call goes something like this:
You: Hello.
Angry user: Why isn't X working?
You: I'm sure there is a logical reason (blood pressure begins to climb).
Angry user: I don't care about logic. I want X working, like, yesterday!
You: I'm on it. I'll report back to you as soon as I have figured it out (all-nighter commences).
You (next day, without sleep): Joe forgot to check the box on X. I checked the box, and now everything is working.
Angry user: It doesn't make sense to me why you missed checking the box (all trust and credibility are now lost).

Why wasn't the box checked? Was it ever checked? Did someone uncheck it? Who did or did not do this? Most likely you will never really know.

No matter how well planned out, how many test scripts are run or how many times you explain it, something always goes wrong. Just like death and taxes, problems are inevitable, but you can avoid those phone calls.

Go back to that nice relaxing Friday, but this time start in the morning and pull up your daily monitoring report and see that a critical box has been unchecked by Joe. You follow up with Joe and find out it was done for a good reason, but Joe forgot to recheck the box when he was done.
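A daily report like that can be as simple as comparing today's settings against a known-good baseline. The following is a minimal sketch, not a finished tool: the setting names (`x_enabled`, `audit_logging`) are made up for illustration, and how you collect the current values from your servers is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    """One setting whose current value no longer matches the baseline."""
    setting: str
    expected: object
    actual: object


def find_drift(baseline, current):
    """Return every baseline setting whose current value differs."""
    drifts = []
    for setting, expected in sorted(baseline.items()):
        actual = current.get(setting)  # None if the setting was not collected
        if actual != expected:
            drifts.append(Drift(setting, expected, actual))
    return drifts


def format_report(drifts):
    """Render the drift list as a plain-text report, one line per setting."""
    if not drifts:
        return "No drift from baseline."
    return "\n".join(
        f"{d.setting}: expected {d.expected!r}, found {d.actual!r}" for d in drifts
    )


# Hypothetical data: "x_enabled" stands in for the box Joe unchecked.
baseline = {"x_enabled": True, "audit_logging": True}
current = {"x_enabled": False, "audit_logging": True}

print(format_report(find_drift(baseline, current)))
# prints: x_enabled: expected True, found False
```

The report only tells you that the box changed. Finding out who changed it and why is still a conversation with Joe.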

The difference in the two scenarios is timely information on critical configuration. So how do you get that to happen? Monitoring controls. The first step to having monitoring controls is to have an appropriate design. I use the following methodology when designing monitoring controls:

  • Key processes
  • Failure analysis
  • Identification of control points

    Key processes

    Everyone has some type of defined area of responsibility.

    If you can't define your areas of responsibility, stop – you have bigger problems, and you aren't ready for monitoring. The purpose of this phase is to list any major processes that fall under your responsibility. Say, for example, that you are responsible for server maintenance and administration in the Windows environment. In that case, you would most likely have the following key processes:
  • Privileged user administration (local system administrator access)
  • Server configuration (GPO, services, and so on)
  • Server patches and upgrades

    Having high-level bullet points of responsibility is a great tool. It will help you in so many ways, from managing your priorities to disaster recovery to monitoring for errors.

    Failure analysis

    The next step in the design process is to take a deeper look at each of your key processes. The goal is to ultimately identify where things are likely to fail and cause you problems. This is done by taking a process and asking a few questions.

    Let's analyze privileged user administration. Here are the questions I would consider when performing a failure analysis:

    1. Do I have a standard process? If I don't have a standard process for provisioning – and de-provisioning – administrator access, how can I ever feel comfortable that the right people have access?

    2. Is there a "workaround" you frequently use? None of us will publicly admit this, but we all have workarounds – ways to override the standard process – to get things done. I'm not saying to get rid of these, but you have to be aware that they exist.

    3. Where is this process likely to break? And if it breaks, what is the impact? Maybe you don't have a good way to reset all of the admin passwords. Have you considered that your local system administrators probably have full access to any MS SQL databases that reside within your server farm?

    You are likely to come up with several more questions that will help lead you to your final list of all the areas in which your process is likely to break and result in a problem.

    Identification of control points

    Armed with your key processes and your points of failure, you can now go through the last exercise in control planning – identifying your control points. A control point is a check that gives you that warm fuzzy feeling: knowing that it is in place and working allows you to trust the system.

    To get a clear picture of where your control points are, look over your failure analysis. You should have a quick response to each of the weaknesses. Maybe you have a weak provisioning process with lots of workarounds. If you have a program that logs administrative logins or sends an alert every time someone logs into the box with a local account, then you can still feel comfortable that the environment is secure.
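To make the local-account alert concrete, here is a rough sketch of the filtering step. It assumes login events have already been collected into dictionaries with hypothetical `host`, `domain` and `account` fields, and that a local logon is recorded with the machine's own name in the domain field. Check both assumptions against your own log source before relying on it.

```python
def local_login_alerts(events):
    """Return one alert message per login made with a local account.

    Assumption: a login is local when its domain field is the host's own name.
    """
    alerts = []
    for event in events:
        if event["domain"].lower() == event["host"].lower():
            alerts.append(
                f"ALERT: local account {event['account']} logged in to {event['host']}"
            )
    return alerts


# Hypothetical events: one domain login, one local login.
events = [
    {"host": "SRV01", "domain": "CORP", "account": "jsmith"},
    {"host": "SRV01", "domain": "SRV01", "account": "Administrator"},
]

for alert in local_login_alerts(events):
    print(alert)
# prints: ALERT: local account Administrator logged in to SRV01
```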

    If you are like most other IT managers, you will have identified some gaps at the end of this process that you don't have controls for. There is no time like the present to design and implement controls to address the potential failures.
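One way to make those gaps visible is to write down each failure point next to the controls that answer it and list whatever is left uncovered. The sketch below uses the privileged user administration example from this article. The data layout and the control name are my own illustration, not a standard format.

```python
# Failure points found during failure analysis, grouped by key process.
failure_points = {
    "Privileged user administration": [
        "No standard provisioning process",
        "Workarounds bypass the standard process",
        "No good way to reset all admin passwords",
    ],
}

# Controls currently in place, keyed by the failure point they address.
controls = {
    "No standard provisioning process": ["Alert on local account logins"],
    "Workarounds bypass the standard process": ["Alert on local account logins"],
}


def find_gaps(failure_points, controls):
    """Return, per process, the failure points that have no control assigned."""
    gaps = {}
    for process, failures in failure_points.items():
        uncovered = [f for f in failures if not controls.get(f)]
        if uncovered:
            gaps[process] = uncovered
    return gaps


for process, uncovered in find_gaps(failure_points, controls).items():
    for failure in uncovered:
        print(f"GAP in {process}: {failure}")
# prints: GAP in Privileged user administration: No good way to reset all admin passwords
```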

    With a clear picture of your process, failure and control environment, you can now look forward to the next step in the process – implementation of control monitoring. Here you can pull out your WMI scripts, Active Directory queries and other tools.
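As a preview of what such a monitor might check, here is a small sketch that compares the actual members of a server's local administrators group with an approved list. Collecting the membership (with a WMI script or an Active Directory query, for example) is deliberately left out. The function takes plain lists of names, and every account name shown is hypothetical.

```python
def review_admins(approved, actual):
    """Compare actual administrator accounts with the approved list.

    Names are compared case-insensitively and returned in lower case.
    """
    approved_set = {name.lower() for name in approved}
    actual_set = {name.lower() for name in actual}
    return {
        # Accounts that have access but were never approved.
        "unexpected": sorted(actual_set - approved_set),
        # Approved accounts that are no longer in the group.
        "missing": sorted(approved_set - actual_set),
    }


# Hypothetical data for one server.
approved = [r"CORP\alice", r"CORP\bob"]
actual = [r"CORP\alice", r"SRV01\temp_admin"]

result = review_admins(approved, actual)
print("Unexpected:", result["unexpected"])
print("Missing:", result["missing"])
```

An unexpected account is the more urgent finding, since it usually points to a workaround that was never cleaned up.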

    Russell Olsen is the CIO of a medical data mining company and previously worked for a Big Four accounting firm performing technology risk assessments. He co-authored the research paper "A comparison of Windows 2000 and Red Hat as network service providers." Russell is a CISA, GSNA and MCP.

    This was last published in October 2007
