As we continue our series on tackling the toughest Windows server outages, the time has come to explore the different tools and techniques used to track down Windows memory leaks.
As you may know, memory leaks are caused by poorly written applications or drivers that allocate memory and then subsequently fail to de-allocate all of it. After time, this can lead to the depletion of system memory pools (paged or non-paged) causing the server to eventually hang.
Long before a Windows server hangs though, there are typically other symptoms of a memory leak. The main things to watch out for are entries in the system event log from the server service (SRV component). In particular, be on the lookout for:
Event ID 2019: The server was unable to allocate from the system nonpaged pool because the pool was empty
Event ID 2020: The server was unable to allocate from the system paged pool because the pool was empty
These two events are indicative of a Windows memory leak and need to be investigated immediately. Other signs of a memory leak include excessive pagefile utilization and diminishing available memory.
The first tool typically used to diagnose memory leaks is Perfmon, a graphical tool built into Windows. By collecting performance metrics on the appropriate counters, you can determine whether the memory leak is being caused by a user process (application) or a kernel mode driver. The performance metrics can be collected in the background with the counters being written to a log file. The log file can subsequently be read by Perfmon or the Performance Analysis of Logs (PAL) from CodePlex. Microsoft KB article 811237 explains how to setup Perfmon to log performance counters. There is also a free tool called PerfWiz from Microsoft which provides a wizard to help setup Perfmon logging.
If you suspect a user mode application is leaking memory, you can use Perfmon to collect the Process object counters, Pool Paged Bytes and Pool Nonpaged Bytes for all instances. This will display whether any processes continue to allocate paged or non-paged pool, without subsequently de-allocating it. If you suspect a kernel mode driver is leaking memory, use Perfmon to collect the Memory object counters, Pool Nonpaged Bytes and Pool Paged Bytes.
In the following example, Perfmon is being used to monitor performance counters for the memory object, namely paged and non-paged pool. By right-clicking each counter, you can adjust the scale to have both counters appear on the same graph. As you can see in Figure 1, the Pool Paged Bytes counter (red line) continues to grow without decreasing, meaning it is leaking memory. Looking at the minimum value for the paged pool counter, it appears it has gone from a value of 118 MB to a maximum value of over 350 MB.
So at this point in our example, we know we have a paged pool leak. We can then use Perfmon to examine the Process object for Pool Paged Bytes. If no processes show a corresponding increase in paged pool usage, we can conclude that a driver or kernel mode code is leaking memory.
To further isolate the memory leak, we need to determine which driver is allocating the memory. When drivers allocate memory, they insert a four-character tag into the memory pool data structure to identify which driver allocated it. By examining the various pool allocations, you can determine which drivers are responsible for allocating how much pool. To associate which tags correspond to certain drivers, see Microsoft KB article 298102. You could also install the Debugging Tools for Windows and check the following file:
\Program Files\Debugging Tools for Windows\Triage\Pooltag.txt
The Memory Pool Monitor utility (Poolmon) is a free tool from Microsoft that will watch pool allocations and display the results illustrating the corresponding drivers. In the following example, Poolmon is being used to track the leaking pool tag "Leak" at the top of the list. Poolmon shows the number of allocations, number of frees, the difference, and the number of bytes allocated. Poolmon will also show the name of the driver if it is setup properly.
Here we can see the tag "Leak" belongs to the Notmyfault.sys driver and has over 83 MB of paged pool allocated.
If all else fails and your server locks up completely due to a memory leak, you can always force a crash dump and subsequently analyze it as discussed in my previous article on why Windows servers hang. The key things to look for when analyzing the crash with the Windows Kernel Debugger (Windbg) utility are the memory pool usage and which data structures are consuming the pool.
The first command to use in the debugger is !vm 1, as seen in the following example. This command will display the current virtual memory usage, in particular the non-paged and paged pool regions. The debugger will flag any excessive pool usage and any pool allocation failures as shown in Figure 3. The trick is to compare the usage with the maximum as highlighted in yellow below. If the usage is at or near the maximum, then the server hung because it ran out of pool.
Finally, you can use the debugger to display the paged or non-paged pool data structures with the !poolused command. Various options on the command allow you to specify either paged or non-paged pool and sort the output. In the following example, the !poolused 5 command is used to display the paged pool data structures, sorted in descending order by usage. In Figure 4, you can see the pool structure with the tag "Leak" is consuming the most paged pool (over 115 MB) and is associated with the notmyfault.sys driver.
As you can see, using tools such as Perfmon, PerfWiz, PAL, Poolmon and Windbg, you can monitor the memory leak, determine whether it is paged or non-paged memory, and discover what driver or application is responsible. After that, contacting the software vendor is usually the best option to see if they have an updated driver or image available that resolves the memory leak.
ABOUT THE AUTHOR
Bruce Mackenzie-Low, MCSE/MCSA, is a systems software engineer with HP providing third-level worldwide support on Microsoft Windows-based products including Clusters and Crash Dump Analysis. With more than 20 years of computing experience at Digital, Compaq and HP, Bruce is a well known resource for resolving highly complex problems involving clusters, SANs, networking and internals.