Small problems can turn into large ones in Exchange environments that aren't regularly monitored, causing the system to deteriorate quickly to the point of outage or even total failure.
There are a few areas to watch to prevent outages in single-server Exchange environments and large enterprises with multiple servers. Here are three of the most common causes of extended Exchange outages.
A failed recovery is the most common cause of extended Exchange outages. It may sound like circular logic, but if the Exchange environment is down for multiple days, the root cause of the failure is no longer relevant. Don't let indecision and untested processes fuel a crisis. Every Exchange shop needs to have a detailed plan to recover each of the following: single mailbox, single database, single server and the entire environment.
While a number of third-party products handle disaster recovery, the tools and processes included with Microsoft Exchange and Windows Server are good options because Microsoft offers support and documentation for different disaster scenarios. Microsoft provides guidance on how to restore a single mailbox from a database backup, an Exchange server and a DAG member server. It also documents dial tone portability, which can solve failures of a mailbox database, server or entire site.
Practice these procedures regularly to understand the process and test backups. The processes to restore a database and a single mailbox are not invasive; administrators can perform these procedures on live servers. It's best to perform them on weekends or after hours to minimize the potential effect on end users.
Resolve server performance
Users and customers perceive server disconnections, frozen Outlook sessions and delivery delays as Exchange outages. Server failures can cause these symptoms, but the usual suspect is poor server performance. Administrators can save users and customers from suffering through prolonged performance problems by following this abridged cheat sheet for regular monitoring and crisis remediation.
Detect and measure latency: Use the Windows Performance Monitor to get familiar with Exchange performance counters. Don't try to watch for everything right away; track a few of the client latency counters first. These will help measure what is happening to clients:
- MSExchange RpcClientAccess\RPC Averaged Latency
< 250 ms is good
- MSExchange RpcClientAccess\RPC Requests
< 40 is good
- MSExchangeIS Client Type(*)\RPC Average Latency
< 50 ms is good for each client type
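The thresholds above can be checked from PowerShell as well as from Performance Monitor. The following is a minimal sketch, not a finished monitoring tool: `Test-CounterSample` is a hypothetical helper name, and the commented wiring assumes the counter paths exist under these names on the server being checked. `Get-Counter` is the built-in Windows PowerShell cmdlet for reading performance counters.

```powershell
# Hypothetical helper: compares one counter reading with its "good" limit.
function Test-CounterSample {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [double] $Value,
        [Parameter(Mandatory)] [double] $Limit
    )
    [pscustomobject]@{
        Path    = $Path
        Value   = $Value
        Limit   = $Limit
        Healthy = ($Value -lt $Limit)   # "good" means strictly below the limit
    }
}

# Example wiring on an Exchange server (Get-Counter is Windows-only).
# The limits are the ones listed above; the hashtable layout is an assumption.
# $limits = @{
#     '\MSExchange RpcClientAccess\RPC Averaged Latency' = 250
#     '\MSExchange RpcClientAccess\RPC Requests'         = 40
#     '\MSExchangeIS Client Type(*)\RPC Average Latency' = 50
# }
# foreach ($path in $limits.Keys) {
#     (Get-Counter -Counter $path).CounterSamples | ForEach-Object {
#         Test-CounterSample -Path $_.Path -Value $_.CookedValue -Limit $limits[$path]
#     }
# }
```

Keeping the comparison in a small function separate from the counter collection makes it easy to reuse the same check for a scheduled report or an ad hoc look during a crisis.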
Measure storage, CPU, memory and network: Once the client counters confirm latency, dig in to find the cause of the delays. Start with the most common problems and work up from there.
- LogicalDisk(*)\Avg. Disk Queue Length
< 1 per spindle is good
This is a solid go-to counter for disk problems. While it does not reveal the nature of the problem, this counter tells the admin if there is a problem. A counter that regularly remains above 1 for a few seconds or more is on the border of poor disk performance. If this counter spikes to 10 or more and remains there for longer than a few seconds, the administrator needs to address a disk performance problem.
Look at the drive with the sustained queue length and determine what is located on that drive: the page file, Exchange transaction logs, Exchange .edb files, the SMTP queue and so on. After identifying what is on that disk, dig deeper with other disk counters to measure latency and IOPS. To correct the problem, move the service or data to a faster drive, or determine if there is a hyperactive service or failing hardware.
- Processor(_Total)\% Processor Time
< 75% on average is good. Be sure to look for spikes during peak times; the daily average means very little.
- Memory\% Committed Bytes In Use
< 80% is good
- Network Interface(*)\Packets Outbound Errors
0 is good
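Two points in the guidance above are easy to miss when eyeballing a graph: a disk queue matters when it stays high for several seconds, and a CPU average can hide peaks. The sketch below shows one way to measure both from a series of samples. `Get-LongestRunAbove` is a hypothetical helper name, and the sample values are made up purely for illustration; real values would come from `Get-Counter` or a Performance Monitor log.

```powershell
# Hypothetical helper: length of the longest streak of consecutive samples
# that sit above a threshold.
function Get-LongestRunAbove {
    param(
        [double[]] $Values,
        [double]   $Threshold
    )
    $longest = 0
    $current = 0
    foreach ($v in $Values) {
        if ($v -gt $Threshold) {
            $current++
            if ($current -gt $longest) { $longest = $current }
        } else {
            $current = 0   # streak broken, start counting again
        }
    }
    $longest
}

# Made-up queue-length samples, assumed to be taken once per second.
$queue = 0.4, 0.8, 12, 14, 11, 13, 0.9, 0.5
$run   = Get-LongestRunAbove -Values $queue -Threshold 10
$stats = $queue | Measure-Object -Maximum -Average

# Report the sustained run and the peak alongside the average.
"Longest run above 10: $run samples; peak $($stats.Maximum); average $([math]::Round($stats.Average, 1))"
```

With one-second samples, the run length reads directly as seconds, which maps onto the "longer than a few seconds" rule for the disk queue counter.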
Measure Active Directory, .NET and Netlogon performance: If the hardware counters fail to explain the delays, look deeper at other factors that affect Microsoft Exchange.
Download the latest Exchange Server Role Requirements Calculator to ensure the hardware selection is appropriate for expected and measured loads. This is an excellent sizing tool, but it does not account for things such as Skype user loads, the effects of backup and third-party antivirus loads, Good/BlackBerry impact and other tools. Be prepared to add hardware as needed to support secondary services.
Storage, storage, storage
Administrators need to keep an eye on storage to ensure Exchange runs properly. Without enough storage space, Exchange will dismount a database and leave end users without access to mail.
Monitoring storage capacity is critical. Exchange 2013 and Exchange 2016 use server health sets to detect and report when free disk space gets low. Run the following command from the Exchange Management Shell to monitor storage space:
Get-ServerHealth -Identity <ServerName> -HealthSet "MailboxSpace" | ft -auto identity,targetresource,healthsetname
Look for something similar to the output below, specifically the bottom row. Once HealthCheck detects a drive is low on space, the event is noted in the Application Event Log.
This is a good safety net, but it's best to use an active report webpage to monitor storage growth if there is concern about missing these alerts. I recommend using the free script that Steve Goodman wrote.
Administrators can configure Task Scheduler to run the script every night and publish the results on an internal IIS site. Once set up, it only takes a glance to know the storage situation on the Exchange Server.
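A nightly task of this kind can be registered from PowerShell with the ScheduledTasks module that ships with Windows Server 2012 and later. The following is only a sketch: the script path, task name and start time are hypothetical placeholders, and it assumes the report script loads whatever Exchange cmdlets it needs on its own.

```powershell
# Hypothetical path and task name; adjust to the local environment.
$scriptPath = 'C:\Scripts\ExchangeStorageReport.ps1'

# Run the report script with Windows PowerShell, without loading a user profile.
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument "-NoProfile -File `"$scriptPath`""

# Fire once a day, outside business hours.
$trigger = New-ScheduledTaskTrigger -Daily -At '2:00 AM'

# Without -User, the task is registered for the account running this command.
# A dedicated service account (-User and -Password) is likely a better fit.
Register-ScheduledTask -TaskName 'Exchange Storage Report' -Action $action -Trigger $trigger
```

If the script writes its HTML output into a folder served by the internal IIS site, the page refreshes itself every night with no further steps.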
If possible, keep at least 15% free space on the database and log drives. Deleting mail only creates white space inside the databases; it doesn't shrink the database files. Be prepared to extend drive partitions, or to add larger partitions and move data.
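The 15% guideline is simple arithmetic, so it is easy to script as a quick check. The sketch below assumes capacity and free space are available in bytes; `Test-VolumeFreeSpace` is a hypothetical helper name, and the commented wiring uses the standard `Win32_Volume` WMI class, which also covers volumes mounted as folders.

```powershell
# Hypothetical helper: reports the free-space percentage of a volume and
# whether it meets the minimum (15% by default, per the guideline above).
function Test-VolumeFreeSpace {
    param(
        [Parameter(Mandatory)] [double] $CapacityBytes,
        [Parameter(Mandatory)] [double] $FreeBytes,
        [double] $MinimumFreePercent = 15
    )
    if ($CapacityBytes -le 0) { throw 'CapacityBytes must be greater than zero.' }
    $freePercent = [math]::Round(($FreeBytes / $CapacityBytes) * 100, 1)
    [pscustomobject]@{
        FreePercent = $freePercent
        Healthy     = ($freePercent -ge $MinimumFreePercent)
    }
}

# Example wiring on a Windows server (DriveType 3 = local disk):
# Get-CimInstance -ClassName Win32_Volume -Filter 'DriveType = 3 AND Capacity > 0' |
#     ForEach-Object {
#         $check = Test-VolumeFreeSpace -CapacityBytes $_.Capacity -FreeBytes $_.FreeSpace
#         [pscustomobject]@{
#             Volume      = $_.Name
#             FreePercent = $check.FreePercent
#             Healthy     = $check.Healthy
#         }
#     }
```

For example, a 1 TB database volume with 100 GB free sits at roughly 10% and would be flagged, while the same volume with 200 GB free would pass.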
Additionally, don't rely on offline defragmentation to reclaim space on database drives. The process takes far longer than many people expect, and it needs additional storage to hold the temporary files created during the defrag.
An admin who regularly monitors client latency and storage capacity and practices these recovery procedures will put the business in great shape to remediate problems and respond to crises with confidence and experience.