Typically, learning about a server's history involves scanning the event logs. Unfortunately, event logs often contain either a limited amount of information or so much information that it becomes nearly impossible to track down what you are really interested in. Of course, if you are just trying to learn about the server's history, then it usually isn't worth the effort to spend hours going through the event logs.
Luckily for us, somebody in Redmond feels our pain: Windows Server 2008 comes with a utility called the Reliability and Performance Monitor. This tool, which is also included in Windows Vista, is designed to help you learn about a server's history and trace problems to specific events. In fact, I recently used it to troubleshoot some problems with one of my own servers.
You can access the utility by selecting the Reliability and Performance Monitor command from the server's Administrative Tools menu. When the Reliability and Performance Monitor opens, you will see a few different graphs showing you the server's current performance statistics, as shown in Figure A.
Figure A
The performance data is nice, but if you are interested in learning about the server's history, navigate through the console tree to Reliability and Performance | Monitoring Tools | Reliability Monitor. When you do, you will see a screen similar to the one shown in Figure B.
The first thing that you will probably notice about this screen is the large graph at the top. The graph tracks various events and gauges the server's reliability over time. Notice the index in the upper right corner of the screen. This index rates the server's current reliability on a scale from one to ten.
As you can see in the figure, my server currently has a reliability index of 6.15, which really isn't so good. Take a look at the bottom portion of the screen, though. This part shows you any events that happened today that might affect the server's reliability, and in this case nothing has happened today. If you look back at the graph, it shows that no significant events have occurred since April 21, 2008. Notice, however, that the server's reliability index was initially a ten, but there were quite a few events in April, which caused that reliability to be called into question.
Any time a significant problem is detected, the reliability index is decreased a little (or a lot, depending on the nature of the problem). Any time a day passes without any problems, the reliability index is slightly increased, meaning that the server has proven itself to be a little more reliable than the day before. Notice on the graph how the reliability index has been steadily increasing since April 21.
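To make the up-and-down behavior described above concrete, here is a minimal Python sketch of a toy index. Microsoft's actual formula is not described in this article, so the penalty and recovery amounts, as well as the names `update_index` and `simulate`, are assumptions made purely for illustration.

```python
def update_index(index, had_failure, penalty=1.0, recovery=0.1):
    """Return the next day's index under a toy model.

    penalty and recovery are made-up values, not Microsoft's real weights.
    """
    if had_failure:
        index -= penalty      # a problem pulls the index down
    else:
        index += recovery     # a clean day nudges it back up
    # Keep the result on the one-to-ten scale the tool reports.
    return max(1.0, min(10.0, index))


def simulate(days, start=10.0):
    """Apply update_index to a sequence of booleans (True = failure that day)."""
    history = []
    index = start
    for had_failure in days:
        index = update_index(index, had_failure)
        history.append(round(index, 2))
    return history


# Two bad days followed by two clean days.
print(simulate([True, True, False, False]))
```

The point of the sketch is only the shape of the curve: sharp drops on bad days and a slow climb afterward, which is the pattern the graph shows after April 21.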
Since I captured the screenshots in this article from one of my own servers, I know that most of the problems in April were related to some faulty memory chips. But suppose I were a consultant seeing this server for the first time and I wanted to know more about the problems that had been occurring. In that case, I could simply click on one of the dates displayed on the graph to get information about that day's events. If you look at Figure C, you can see that the Reliability Monitor provides me with specific information about the failures that occurred on the day in question.
As I mentioned before, most of the events on this chart were related to some faulty memory chips. Notice that most of the problems seem to clear up around April 15, but then more problems occurred on April 21.
The problems that occurred on April 21 were related to multiple power failures, but Windows doesn't know that. If you take a look at the graph, you will notice that there is a big dip in the reliability index from April 20 to April 21. In Figure D, you will see an event that occurred on April 20, but it was not an error.
The figure shows that I installed some new device drivers on April 20. It was just a coincidence that the power failures occurred on April 21, but the Reliability Monitor displays a sharp drop in the reliability index as a way of drawing your attention to the fact that an event occurred the day before that might have affected the system's reliability. The tool is trying to show you that the drivers that were installed on April 20 could have been responsible for the failures that occurred the next day. In this case they weren't, but normally that would be something you would want to investigate.
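The same kind of check can be done by hand on any list of dated events. The following Python sketch flags changes (such as driver installations) that were followed by a failure within a day. It does not read anything from Windows; the event tuples, the `kind` labels, and the function name `changes_before_failures` are all hypothetical, and the sample data simply mirrors the April 20 and April 21 dates discussed above.

```python
from datetime import date, timedelta


def changes_before_failures(events, window_days=1):
    """Return (date, description) for each change followed by a failure.

    events is a list of (date, kind, description) tuples, where kind is
    either "change" or "failure". This layout is an assumption for the
    sketch, not a format produced by the Reliability Monitor.
    """
    failure_dates = {d for d, kind, _ in events if kind == "failure"}
    suspects = []
    for d, kind, description in events:
        if kind != "change":
            continue
        # Look for a failure on any of the next window_days days.
        for offset in range(1, window_days + 1):
            if d + timedelta(days=offset) in failure_dates:
                suspects.append((d, description))
                break
    return suspects


# Illustrative data only, based on the dates in this article.
sample_events = [
    (date(2008, 4, 20), "change", "Driver installation"),
    (date(2008, 4, 21), "failure", "Unexpected shutdown"),
]

for when, what in changes_before_failures(sample_events):
    print(when.isoformat(), what)
```

As in the article's example, a flagged change is only a lead worth investigating, not proof that the change caused the failure.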
ABOUT THE AUTHOR
Brien M. Posey, MCSE, has received Microsoft's Most Valuable Professional Award four times for his work with Windows Server, IIS and Exchange Server. He has served as CIO for a nationwide chain of hospitals and healthcare facilities, and was once a network administrator for Fort Knox. You can visit his personal Web site at www.brienposey.com.
This was first published in May 2008