Troubleshooting a hung or nonresponsive Windows server can be a challenging endeavor. Simply hitting the reset button is no longer a tolerated option as more companies use these servers for business-critical operations. This three-part series will explore the reasons why a Windows server may hang and provide a cookbook approach to diagnosing the underlying issues with the Windows Kernel Debugger (Windbg).
When Microsoft released the early versions of its server operating system (Windows NT 3.5x and NT4), there was no easy way to troubleshoot a hung server. Other mainstream operating systems, such as Digital Equipment Corp.'s VAX/VMS, offered ways to manually intervene by forcing a crash dump whereby the server's state could be captured at the time of the hang. This dump could then be analyzed to determine why the server hung. The only option for early Windows platforms, however, was to reset the box.
As Windows servers became more predominant in the business world, hitting the reset button became unacceptable. As a result, in Windows 2000 Server and later versions, it became possible to force a crash dump to assist with determining why the server hung. Microsoft introduced this feature in Knowledge Base article 244139. It allows a keystroke combination (right CTRL+SCROLL LOCK twice) to generate a crash dump on PS/2-type keyboards. Microsoft extended this feature in Windows Server 2003 with a hotfix to the Kbdhid.sys driver to accommodate USB-type keyboards.
Several other options now exist to force a crash dump. Microsoft provides the Windows Special Administrative Console (SAC) Crashdump command as part of Windows Emergency Management Services (EMS), which allows for "headless" servers with no local graphical console. Vendor-specific options also exist to force a crash dump including the HP Integrity server's Management Processor TC (transfer of control) command, an NMI (non-maskable interrupt) button on some Integrity models, or the Integrated Lights Out (iLO) virtual NMI button. We'll take a closer look at each of these options later in the series.
Why a server hangs
There are a variety of reasons why a server may hang, including both hardware and software issues. The most common hardware reason for a server hang is spurious interrupts by a failing device. For example, a network interface controller (NIC) may have a bad component or be attached to a bad cable causing false interrupts to occur. These interrupts occur at an elevated interrupt request level (IRQL) dominating the attention of the processor(s), leaving lower priority requests (user level) unanswered. As a result, the server appears to be hung.
Another example of a hardware-induced hang involves storage requests going unanswered. For example, consider a case where a disk drive fails, causing outstanding I/O requests to be queued up. Eventually, these pending requests trigger a cascading effect of user and system threads to hang, leading to a system-wide outage.
More often, however, server hangs are a result of software issues. These issues come in several flavors, including:
System resource depletion (e.g., out of memory pool) -- The most common type of software hang, this typically is the result of a memory leak by a driver or kernel mode thread. Resource depletion can also result from exceeding architectural limits of paged and nonpaged memory pools (typically experienced on an x86 32-bit operating system).
Deadlock conditions -- A deadlock occurs when contention exists for common resources between two or more threads. For example, a deadlock exists when one thread owns an exclusive lock on a resource that another thread wants, and that thread exclusively owns a resource that the initial thread wants.
Spinlock conditions -- Spinlock hangs are similar to deadlocks, but involve contention for a spinlock that is used to synchronize access to data structures in a multi-processor environment. Other permutations of these conditions include a driver holding a lock while performing other activities for an extended period of time. Actual examples of deadlock and spinlock hangs will be provided later.
High-priority, compute-bound threads -- A software hang can also occur if high-priority, compute-bound thread(s) are dominating the processors. Since the Windows operating system permits varying levels of thread priority, one or more threads may execute at a higher priority than typical user threads. The result is that applications and users at normal priority are starved for CPU time, causing a perceived software hang.
The big picture
So, as you can see, there are numerous reasons why a server may hang. To give you a better idea of what happens when you force a crash to generate a memory dump, and subsequently analyze the crash to determine what caused the hang, see Figure 1 below.
Starting on the left-hand side, you can see the server crashes or hangs. In the event of a crash, the server would generate a memory dump if the dumpfile and pagefile are properly configured (see Microsoft Knowledge Base articles 254649, 197379 and 889654).
In the event of a hang, manual intervention would be required to force a crash dump as previously described. In either case, the content of memory is written to the pagefile.sys before the server is rebooted. During the reboot, the pagefile.sys is written to the memory.dmp file. Finally, once the server has rebooted, you can use the Windows Kernel Debugger (Windbg) to analyze the memory dump using a symbol server (as documented in KB article 311503) to translate memory references to meaningful functions and variables.
Figure 1: Overview of memory dump process and analysis
Now that you have a better idea of why server hangs occur, the next article in this series will look at the preparation process for troubleshooting a hung Windows server.
TROUBLESHOOTING A HUNG WINDOWS SERVER
- Part 1: Why do servers hang?
- Part 2: Preparing to troubleshoot
- Part 3: Resolving the issue
ABOUT THE AUTHOR
Bruce Mackenzie-Low, MCSE/MCSA, is a systems software engineer with HP providing third-level worldwide support on Microsoft Windows-based products including Clusters and Crash Dump Analysis. With more than 20 years of computing experience at Digital, Compaq and HP, Bruce is a well known resource for resolving highly complex problems involving clusters, SANs, networking and internals.