Crash, boom, bang! Your Windows server just experienced a Blue Screen of Death (BSOD) and your helpdesk is being flooded with calls. The server is rebooting, but this is the fourth crash you've encountered this week and users are becoming unruly. To top it off, you now face spending hours on the phone, being passed around the world, with each vendor pointing to the other as the culprit.
It's time to take matters into your own hands. With a basic knowledge of crash dump analysis, and a few simple commands, you can determine which driver is involved. Then, by intelligently searching the Internet you can potentially locate a hotfix or workaround to resolve the crashes.
This three-part series will cover the tools and steps you'll need to tackle some of the toughest Windows server outages.
To begin with, the diagram in Figure 1 provides an overview of what happens when a crash occurs. As you can see, when the server crashes it writes the contents of physical memory (RAM) to the pagefile on the system partition. On reboot, the pagefile is written to the memory.dmp file, which also resides on the system partition. Finally, after the server reboots, you can then use the Windows kernel debugger (WinDbg) with Microsoft's symbol server to analyze the crash.
Three main areas need to be addressed to facilitate your crash dump analysis. First, the server must be configured to generate a crash dump when an unexpected condition or exception occurs. Next, you need to download the Windows debugger from Microsoft and set up the symbol server path. Finally, you use the debugger to analyze the crash with a few simple commands. Now, let's take a closer look at each area.
Configuring the dump
To configure your server to generate a crash dump, use the Control Panel | System applet | Advanced tab | Startup and Recovery settings shown in Figure 2. You can choose from three types of memory dump files: small, kernel or complete. By default, Windows will produce a small, "mini-dump" file when the server crashes. This may sometimes contain enough debugging information, but typically a kernel memory dump file is required. In rare circumstances, it may be necessary to configure a complete memory dump to capture the required debugging information. Please see Microsoft KB article 254649 for additional information on configuring memory dump files.
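The same setting can also be inspected from a script. The sketch below, written in Python purely for illustration, assumes the dump type is stored in the CrashDumpEnabled value under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, with 0 meaning no dump, 1 complete, 2 kernel and 3 small. The function names are hypothetical, and the registry read only works on Windows.

```python
import sys

# Assumed meanings of the CrashDumpEnabled registry value.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (mini-dump)",
}

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def describe_dump_type(value):
    """Translate a CrashDumpEnabled value into a readable description."""
    return DUMP_TYPES.get(value, "unknown setting (%d)" % value)


def read_dump_type():
    """Read the current setting from the registry (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__" and sys.platform == "win32":
    print(describe_dump_type(read_dump_type()))
```

If the script reports a small memory dump on a server that crashes repeatedly, that is a hint to switch to a kernel memory dump before the next crash.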
Installing the Windows debugger
The next step is to install the Windows kernel debugger tool, which can be downloaded for free from Microsoft. There are three versions of the debugger (x86, x64 and IA64), depending on the architecture of the server where you plan to analyze the crash. Once WinDbg is installed, you must establish the symbol path to translate memory locations into meaningful references to functions or variables used by Windows. The typical symbol path used is SRV*c:\symbols*http://msdl.microsoft.com/download/symbols. See Microsoft KB 311503 for details on establishing your debugger's symbol path.
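The symbol path has three parts separated by asterisks: the SRV keyword, a local cache directory and the symbol server URL. As a minimal sketch, the Python snippet below assembles that string and places it in the _NT_SYMBOL_PATH environment variable, which the debugger reads when no path is set explicitly. The helper name build_symbol_path is hypothetical.

```python
import os

MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def build_symbol_path(cache_dir, server=MS_SYMBOL_SERVER):
    """Return a symbol path of the form SRV*<local cache>*<symbol server>."""
    return "SRV*%s*%s" % (cache_dir, server)


symbol_path = build_symbol_path(r"c:\symbols")

# A debugger started from this process inherits the variable.
os.environ["_NT_SYMBOL_PATH"] = symbol_path
print(symbol_path)
```

The cache directory matters because symbols downloaded once are reused on later runs, so subsequent analyses start faster.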
Analyzing the crash
Now that you have configured the server to generate a memory dump and installed the debugger with the correct symbol path, you are ready to analyze a crash. There are two ways to start up the debugger: from the program group "Debugging Tools for Windows" or from a command prompt with the windbg command. From within the debugger, use the File pull-down menu to "Open crash dump…" and point the debugger to your dump file.
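When starting from the command prompt, the dump file and symbol path can be supplied up front instead of through the File menu. The following Python sketch builds such a command line using the debugger's -z (dump file) and -y (symbol path) options. The dump location shown is an assumption, as is the helper name.

```python
def windbg_open_dump_args(dump_file, symbol_path, debugger="windbg"):
    """Build the argument list that opens a crash dump in the debugger.

    -y sets the symbol path; -z names the dump file to open.
    """
    return [debugger, "-y", symbol_path, "-z", dump_file]


args = windbg_open_dump_args(
    r"C:\WINDOWS\MEMORY.DMP",  # assumed dump location; adjust to your server
    r"SRV*c:\symbols*http://msdl.microsoft.com/download/symbols",
)
print(" ".join(args))

# To launch on a machine with the debugger installed and on the PATH:
#   import subprocess; subprocess.Popen(args)
```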
When the dump file loads, you will notice the debugger's screen is divided into two regions: the output pane that occupies the majority of the window and the command prompt at the bottom. The first command to use is:
!analyze -v
This command will perform a preliminary analysis of the dump and provide you with a best guess as to which driver caused the crash. The first thing the command shows you is the bug check type (also known as a stop code) and the arguments. The bug check type is very important and should be included with your query when you search the Internet for possible causes and fixes. As we see in the following example, WinDbg displays the bug check type as an LM_SERVER_INTERNAL_ERROR (stop code 54). In this case, if you searched the Microsoft website for LM_SERVER_INTERNAL_ERROR, you would find the known issue and hotfix documented in Microsoft KB 912947. Even the first argument matches the KB article.
3: kd> !analyze -v
*****************************************************
* Bugcheck Analysis *
*****************************************************
The !analyze -v command goes on to name the driver that most likely caused the crash. In our example, WinDbg accurately calls out the srv.sys driver that caused the crash:
Probably caused by: srv.sys (srv!SrvVerifyDeviceStackSize+78 )
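If you save the debugger output to a text file, this line is easy to pull out with a script, which helps when you are comparing several crashes from the same week. The Python sketch below is one way to do it, assuming the line keeps the "Probably caused by" wording shown above. The function name is hypothetical, and lines that name a cause without a symbol in parentheses are not matched.

```python
import re

# Tolerates optional spaces around the colon and inside the parentheses.
CULPRIT_RE = re.compile(
    r"Probably caused by\s*:\s*(?P<module>\S+)\s*\(\s*(?P<symbol>[^)\s]+)\s*\)"
)


def find_probable_culprit(analysis_text):
    """Return (module, symbol) from !analyze -v output, or None if absent."""
    match = CULPRIT_RE.search(analysis_text)
    if match is None:
        return None
    return match.group("module"), match.group("symbol")


sample = "Probably caused by: srv.sys (srv!SrvVerifyDeviceStackSize+78 )"
print(find_probable_culprit(sample))
```

The module name (srv.sys here) together with the bug check name makes a good starting point for an Internet search.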
Several other useful commands provide more information about the crash, including:
- !thread – lists the currently executing thread
- kv – displays the stack trace indicating which drivers and functions were called
- lm t n – displays the list of loaded drivers and their timestamps
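These commands can also be run in one pass without typing them interactively. The sketch below, in Python, joins them with semicolons and hands them to kd, the console version of the kernel debugger, through its -c option, ending with q to quit. The dump location and helper name are assumptions; treat this as a starting point rather than a finished tool.

```python
def kd_batch_args(dump_file, symbol_path, commands, debugger="kd"):
    """Build a kd command line that runs debugger commands, then quits."""
    script = "; ".join(list(commands) + ["q"])  # q ends the session
    return [debugger, "-z", dump_file, "-y", symbol_path, "-c", script]


args = kd_batch_args(
    r"C:\WINDOWS\MEMORY.DMP",  # assumed dump location
    r"SRV*c:\symbols*http://msdl.microsoft.com/download/symbols",
    ["!analyze -v", "!thread", "kv", "lm t n"],
)
print(args[-1])

# On a machine with the debugger installed, the output could be captured with:
#   import subprocess
#   report = subprocess.run(args, capture_output=True, text=True).stdout
```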
Finally, you should be aware that the Windows debugger's online help is excellent. In particular, you can look up the stop code for the crash and read the help's recommendations for troubleshooting the issue. To find the list of stop codes, go to the Help pull-down menu and select Contents | Debugging Techniques | Bug Checks (Blue Screens) | Bug Check Code Reference. Then scan down the list to locate your stop code.
Many people think debugging a crash is better left to those with Ph.D.s, but with a basic understanding and a few simple commands, anyone can get a leg up on identifying what is contributing to or causing a server crash. It is likely that someone else has already experienced the same crash, so a thorough Internet search will probably lead to potential workarounds or patches for the issue.
Join Bruce in part two of his series where he discusses how to identify which print driver is causing your spooler to crash or hang.
ABOUT THE AUTHOR
Bruce Mackenzie-Low, MCSE/MCSA, is a systems software engineer with HP providing third-level worldwide support on Microsoft Windows-based products including Clusters and Crash Dump Analysis. With more than 20 years of computing experience at Digital, Compaq and HP, Bruce is a well known resource for resolving highly complex problems involving clusters, SANs, networking and internals.