Troubleshooting a SAN is complex, but you can save yourself a lot of work if you do two things. First, verify that you have a SAN issue and not a generic storage issue. Second, begin the troubleshooting process at the center of the SAN so that you can quickly locate the general area of the problem.
When you're troubleshooting a SAN, you'll find that most problems aren't actually related to the SAN. Suppose you're suddenly unable to read data from the SCSI disk on your standalone PC. Several things could be causing the problem. The hard disk might have failed. Maybe you've got a bad cable or a bad disk controller. Maybe the data on the drive has been accidentally erased, or the partition has been deleted or corrupted. Just because you can't access your data doesn't mean that a hardware failure has occurred.
Let's look at this same situation in the context of a SAN. A SAN is basically a way of linking a server to a logical device on a disk array or some other storage mechanism. The SAN works by allowing the server to communicate with the storage device using SCSI commands.
Suppose the server is suddenly unable to read data off the SAN. You may have a SAN problem, but the problem might lie with the data itself rather than with the SAN. It could be that connectivity between the server and the storage unit is functional, but that the data has been erased, corrupted or disassociated from the server. In that case, you'd troubleshoot the problem the same way you would if the storage mechanism were directly attached to your server.
But what if the SAN were the problem? Your best strategy is to start the troubleshooting process in the center of the SAN and work out toward the edges.
Step 1: Start troubleshooting at the fabric level. The reason for this is that the switches are located in the center of your SAN and should have connectivity to both the server and the storage device.
Verify that the switch can communicate with the server and the storage device. If you can verify communications, you can rule out the fabric as being at fault. While examining the fabric, you should look for things like unstable links, missing devices, incorrect zoning configurations and incorrect switch configurations.
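Two of the fabric checks above, missing devices and incorrect zoning, can be sketched in a few lines of Python. This is only an illustration: FabricSnapshot, fabric_findings and the WWPNs are hypothetical names and values, and the snapshot is assumed to have been filled in by hand from whatever your switch's management tool reports. It is not a vendor format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSnapshot:
    """Hypothetical summary of what the switch reports (not a vendor format)."""
    logged_in: frozenset  # WWPNs currently logged in to the fabric
    zones: dict           # zone name -> set of member WWPNs


def fabric_findings(snapshot, server_wwpn, storage_wwpn):
    """Return a list of findings for one server/storage pair."""
    findings = []
    # Missing devices: a port that never logged in cannot talk to anything.
    for label, wwpn in (("server", server_wwpn), ("storage", storage_wwpn)):
        if wwpn not in snapshot.logged_in:
            findings.append(f"missing device: {label} port {wwpn} is not logged in")
    # Zoning: the two ports must share at least one zone to see each other.
    shared = [
        name for name, members in snapshot.zones.items()
        if server_wwpn in members and storage_wwpn in members
    ]
    if not shared:
        findings.append("zoning: no zone contains both the server and the storage port")
    return findings


# Made-up example values.
SERVER = "10:00:00:00:c9:aa:bb:01"
STORAGE = "50:06:01:60:aa:bb:cc:02"
snapshot = FabricSnapshot(
    logged_in=frozenset({SERVER}),
    zones={"zone_db": {SERVER, STORAGE}},
)
for finding in fabric_findings(snapshot, SERVER, STORAGE):
    print(finding)
```

In this made-up snapshot the zoning is correct, but the storage port never logged in, so the only finding is the missing device. An empty list would mean neither check found a fault.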
Step 2: Use diagnostic software to test the switch connection. This will tell you whether the storage device is connected to the switch. If it is not, you know the problem has to do with the storage device. It may be a physical connection issue between the switch and the storage device, or it could be that the storage software configuration is incorrect.
If the switch can communicate with the storage device, but the server can't, then you know that the problem lies somewhere between the switch and the server. This is why you start troubleshooting at the center of the SAN. A few simple tests and you eliminate half of the SAN as a possible cause of the problem (either the server side or the storage side of the network).
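The halving logic described above fits in a small decision function. The sketch below assumes you have already run the two checks from the switch and recorded each result as a simple yes or no. FaultDomain and localize_fault are hypothetical names used for illustration, not part of any diagnostic tool.

```python
from enum import Enum


class FaultDomain(Enum):
    STORAGE_SIDE = "between the switch and the storage device"
    SERVER_SIDE = "between the switch and the server"
    BOTH_SIDES = "both sides of the switch (or the switch itself)"
    CONFIG_OR_DATA = "connectivity looks fine; suspect configuration or the data"


def localize_fault(switch_sees_storage: bool, switch_sees_server: bool) -> FaultDomain:
    """Map the two center-of-the-SAN checks to the half that needs attention."""
    if not switch_sees_storage and not switch_sees_server:
        return FaultDomain.BOTH_SIDES
    if not switch_sees_storage:
        return FaultDomain.STORAGE_SIDE
    if not switch_sees_server:
        return FaultDomain.SERVER_SIDE
    # Both links are up, so the fabric is not what is blocking the server.
    return FaultDomain.CONFIG_OR_DATA


# Example: the switch reaches the storage device but not the server.
result = localize_fault(switch_sees_storage=True, switch_sees_server=False)
print(f"Look {result.value}")
```

Whichever way the checks come out, two answers are enough to tell you which half of the SAN to examine next.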
Step 3: If the problem lies between the server and the switch, check these possible causes. If you do determine that the problem is between the server and the switch, you've got your work cut out for you.
Possible causes of the problem are a bad host bus adapter (HBA) or a missing or incorrectly configured driver. The problem may also be related to the way that your server is configured to access the virtual storage device. You can start by using your hardware manufacturer's diagnostic utility. You can also run a protocol analyzer to verify that the adapter (the HBA, or the network interface card in an iSCSI SAN) is functional and that the driver is working. If the adapter appears to be functional, then the problem almost has to be configuration related.
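If the server happens to run Linux with a Fibre Channel HBA, you can get a quick first reading on the adapter and driver before reaching for the manufacturer's utility. That setup is an assumption, since nothing above says which operating system or transport is in use. The sketch below reads the port_name and port_state attributes that the Linux FC transport class exposes under /sys/class/fc_host. The function names are hypothetical, and the root path is a parameter so you can point it at a test directory.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def hba_port_states(root="/sys/class/fc_host"):
    """Collect port_name and port_state for every FC host found under root."""
    base = Path(root)
    if not base.is_dir():
        return {}  # no FC transport class: adapter absent or driver not loaded
    return {
        host.name: {
            "port_name": read_attr(host / "port_name"),
            "port_state": read_attr(host / "port_state"),
        }
        for host in sorted(base.glob("host*"))
    }


def summarize(states):
    """Turn the collected states into a one-line hint about where to look."""
    if not states:
        return "no FC HBA ports found: check the adapter and its driver"
    down = [name for name, s in states.items() if s["port_state"] != "Online"]
    if down:
        return "ports not online: " + ", ".join(down)
    return "all HBA ports online: look at configuration next"


print(summarize(hba_port_states()))
```

On a machine with no Fibre Channel adapter this simply reports that no ports were found. When every port shows as online, the problem is more likely to be configuration related.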
About the author: Brien M. Posey, MCSE, is a Microsoft Most Valuable Professional for his work with Windows 2000 Server and IIS. Brien has served as the CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. As a freelance technical writer he has written for Microsoft, CNET, ZDNet, TechTarget, MSD2D and Relevant Technologies. You can visit Brien's Web site at http://www.brienposey.com