
The inevitable Office 365 outage and your recovery plan

An IT outage is an unfortunate occurrence that can wreak havoc on an enterprise. Before an outage happens, take steps to minimize interruptions to daily operations.

Glitch. It's that dreaded yet inevitable outage, gaffe or unforeseen incident that brings your enterprise to its knees. IT downtime seems to happen more frequently these days, especially as it relates to the cloud. Take, for instance, the 2013 and 2015 Office 365 outages. In December 2015, Europe-based Microsoft customers experienced an Office 365 outage that took down numerous services including Active Directory Federation Services, Outlook Web Access and even the service health dashboard.

The business implications of IT outages are obvious, but there can also be unintended consequences that damage security, such as well-meaning employees working on their personal email systems and file sharing services, or even copying work locally to their unencrypted laptops in hopes of getting work done off-site while the system is down.

Word has it that human error related to Azure Active Directory caused the most recent four-hour Office 365 outage. Getting information on the actual root causes of such incidents always seems difficult. However, in my experience, it's rarely the software or the hardware that creates the problem. Instead, it's the person -- or persons -- involved. From something as seemingly simple as pushing a small firewall rule change that ends up taking down a highly visible e-commerce site to configuration errors that lead to the incorrect routing of traffic, as in this recent Office 365 outage, there's always a common factor: people. The term glitch is convenient and essentially serves as a distraction to excuse failures that people won't admit to.

Regardless of the responsible party for situations such as Office 365 outages, the onus is on you as the administrator to minimize the effect on your organization. In fact, someone else's error becomes your testing time. You have to demonstrate skills in contingency planning and in bringing a critical business application back online. Your users are not going to care that it's Microsoft's or another entity's problem, especially when you were the one who decided to use that service in the first place.

An Office 365 outage isn't a fun situation. The reality is, as Jim Rohn once said, failure is not a single cataclysmic event. You don't fail overnight. Instead, failure is a few errors in judgment, repeated every day. To soften the blow of an IT outage, start working toward a contingency plan. It can provide prescriptive guidance so you don't have to make ad hoc or rushed decisions in the middle of a crisis.

Form a plan for IT outages

The following are questions you need to answer -- along with your peers in IT and security as well as others in management -- so you have a plan to fall back on before an actual outage occurs:

  • What do we consider a true outage? For example, does an IT outage begin when Exchange in Office 365 becomes unavailable, or only when the whole platform goes down?
  • What will be the impact to our business? For example, the business loses communication with customers or vendors. What are the consequences of that event?
  • How can we get by? Are there systems or related services we can fail over to? Be sure to include the backup plan, additional applications that may come into use, or even locations that can be relied upon.
  • What do we have at our disposal locally that we can rely on in the interim? This could range from alternative corporate communications systems to text messaging and personal email accounts.
  • Who will we need to contact? Who else will need to get involved? Is the problem referred to a systems integrator, for example, or Microsoft directly for support?
  • What's our absolute worst-case scenario? Consider which services could go down, how long the event could last, when it might strike and how many people it would affect.
  • Is there a technology product or service we should put in place today to minimize the detrimental effects of an Office 365 or other outage? One example is to scout additional vendors for backup resources, or look into third-party add-ons for Office 365.
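
One way to keep the answers to the first and fifth questions from staying abstract is to write them down as a small, testable rule. The following is only a minimal sketch in Python: the service names, the severity levels and the contact lists are all hypothetical placeholders, not anything defined by Microsoft or Office 365, and your own plan would substitute whatever your organization agrees on.

```python
# Hypothetical inventory: replace with the services your business depends on.
ALL_SERVICES = {"exchange", "sharepoint", "skype"}
# Hypothetical choice: losing any of these alone counts as a true outage.
CRITICAL_SERVICES = {"exchange"}

# Hypothetical escalation lists, one per severity level.
CONTACTS = {
    "none": [],
    "degraded": ["help desk"],
    "outage": ["help desk", "IT management", "vendor support"],
    "full": ["help desk", "IT management", "vendor support", "executives"],
}


def classify_outage(down_services):
    """Map the set of unavailable services to an agreed severity level."""
    down = set(down_services) & ALL_SERVICES  # ignore unknown names
    if not down:
        return "none"
    if down == ALL_SERVICES:
        return "full"
    if down & CRITICAL_SERVICES:
        return "outage"
    return "degraded"


def who_to_contact(down_services):
    """Return the escalation list for the current situation."""
    return CONTACTS[classify_outage(down_services)]
```

With these placeholder values, `who_to_contact(["exchange"])` returns the help desk, IT management and vendor support, while losing only a non-critical service escalates to the help desk alone. The point is less the code than the fact that the thresholds are decided in advance rather than in the middle of a crisis.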

In today's world of computer glitches, advanced security threats and environmental unknowns, carefully crafted service-level agreements, contracts, and policies mean nothing. We know what can happen. The real question is: what are you going to do about it?
