Although the actual disk fault management process varies between organizations, depending on the policies, tools and personnel expertise available, there are some common elements of the disk replacement process that Windows administrators can follow.
First, you need to identify the faulty disk. Windows Server 2012 R2 provides several sources of disk fault and identification data, including Event Viewer logs, the Physical Disks report in Server Manager, an alerts dialog in System Center Operations Manager (SCOM) and Windows PowerShell queries. Whereas tools such as SCOM can report the specific location of a disk fault -- slot, tray and position -- other tools report a disk failure only as a physical disk number or globally unique identifier (GUID). GUIDs can be translated into physical disk numbers using the PowerShell Get-PhysicalDisk cmdlet.
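As a rough illustration of the PowerShell route, the sketch below lists physical disks that are not reporting as healthy and then looks up one disk by identifier. The GUID value is a hypothetical placeholder, and the idea that a reported GUID will appear in the disk's UniqueId or ObjectId property is an assumption to verify in your own environment.

```powershell
# List physical disks whose health status is anything other than Healthy.
Get-PhysicalDisk |
    Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, DeviceId, SerialNumber, UniqueId, HealthStatus, OperationalStatus, Usage

# Hypothetical placeholder: substitute the identifier reported in the event or alert.
$reportedId = '00000000-0000-0000-0000-000000000000'

# Match the identifier against UniqueId or ObjectId.
# Assumption: the reported GUID appears in one of these two properties.
Get-PhysicalDisk |
    Where-Object { $_.UniqueId -like "*$reportedId*" -or $_.ObjectId -like "*$reportedId*" } |
    Select-Object FriendlyName, DeviceId, SerialNumber, HealthStatus
```

The FriendlyName and SerialNumber columns are usually the most useful values to carry forward to the physical search.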
After determining which disk has failed, find it in the storage array enclosure. Many storage arrays provide LEDs that blink when a corresponding disk fails. If the array has no such indicators, technicians will need extra time to locate the correct physical disk, for example by matching its serial number.
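Where the enclosure supports it, the slot indicator can also be turned on from PowerShell. The following is a minimal sketch, assuming an enclosure that exposes its LEDs to Windows (typically through SCSI Enclosure Services) and a disk whose friendly name is the hypothetical 'PhysicalDisk5'.

```powershell
# Hypothetical friendly name: use the value returned by Get-PhysicalDisk.
$diskName = 'PhysicalDisk5'

# Record the serial number so the disk can be checked by label as well.
Get-PhysicalDisk -FriendlyName $diskName |
    Select-Object FriendlyName, SerialNumber, HealthStatus

# Ask the enclosure to light the indicator for this disk's slot.
# Assumption: the enclosure supports LED control; if it does not, this call fails.
Enable-PhysicalDiskIndication -FriendlyName $diskName
```

Comparing the serial number printed on the drive label with the value reported here is a sensible double check before pulling any disk.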
Before replacing anything, many technicians first check the disk connections by reseating the troubled disk in its slot or reattaching its cables. If this works, clear the blinking LED by resetting the physical disk's usage or removing the disk from the storage pool through a PowerShell PhysicalDisk command. If disk problems persist, replace the disk using the instructions for the particular storage array. Best practice is for the new disk's characteristics to match those of the failed disk, which prevents performance mismatches that might cause storage problems later. Replace the physical disk before removing the failed disk from any storage pool configuration, and give the new disk a chance to rebuild; otherwise, there may be data loss.
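If reseating appears to fix the problem, the disk's state can be checked and its usage reset from PowerShell. This is only a sketch: the friendly name is hypothetical, and the assumption is that the disk was marked Retired while it was unreachable.

```powershell
# Hypothetical friendly name of the reseated disk.
$diskName = 'PhysicalDisk5'

# Check how the disk reports after being reseated.
$disk = Get-PhysicalDisk -FriendlyName $diskName
$disk | Select-Object FriendlyName, HealthStatus, OperationalStatus, Usage

# If the disk is healthy again but still marked Retired,
# return it to automatic allocation within the pool.
if ($disk.HealthStatus -eq 'Healthy' -and $disk.Usage -eq 'Retired') {
    Set-PhysicalDisk -FriendlyName $diskName -Usage AutoSelect
}

# Turn the slot indicator back off (assumes the enclosure supports LED control).
Disable-PhysicalDiskIndication -FriendlyName $diskName
```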
Make sure that identical disks in the group or array all use the same firmware version. Once the new disk is in place, update its firmware to the latest accepted version used on the other disks in the group or larger array. Remember that each new firmware version can introduce changes in timing and access. While a newer version should improve the disk itself, firmware differences between disks can also introduce performance differences that might trigger unexpected or intermittent storage errors. Tools such as Server Manager or Windows PowerShell can report on disk firmware versions, and updates should follow the disk manufacturer's instructions.
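One possible way to produce the firmware report in PowerShell is to group disks by model and flag any model running more than one firmware version. This sketch only reports; it does not update anything, and it assumes that disks sharing a Model value are the "identical" disks of interest.

```powershell
# Group disks by model and list the distinct firmware versions found for each.
Get-PhysicalDisk |
    Group-Object -Property Model |
    ForEach-Object {
        $versions = @($_.Group | ForEach-Object { $_.FirmwareVersion } | Sort-Object -Unique)
        [pscustomobject]@{
            Model            = $_.Name
            DiskCount        = $_.Count
            FirmwareVersions = $versions -join ', '
            Consistent       = ($versions.Count -eq 1)   # False means mixed firmware
        }
    }
```

Any row where Consistent is False points to a model with mixed firmware that deserves a closer look.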
At this point, use Server Manager or Windows PowerShell to add the new physical disk to the storage pool, and then retire and remove the old disk from the storage pool. In the event of a complete disk failure, the failed disk should have been retired automatically. If the disk is being replaced pre-emptively -- such as in response to intermittent problems -- retire the disk first through PowerShell.
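The PowerShell version of this step might look like the sketch below. The pool name and disk friendly name are hypothetical placeholders, and the sketch assumes the replacement is the only disk currently eligible for pooling. A disk that has been physically pulled may show up under a different name or status, so confirm the right object with Get-PhysicalDisk before running anything.

```powershell
# Hypothetical names: substitute values from your own environment.
$poolName   = 'Pool01'
$failedDisk = Get-PhysicalDisk -FriendlyName 'PhysicalDisk5'

# Assumption: the new disk is the only one that is eligible for pooling.
$newDisk = Get-PhysicalDisk -CanPool $true

# Add the replacement disk to the pool.
Add-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $newDisk

# Retire the old disk so no new data is allocated to it.
$failedDisk | Set-PhysicalDisk -Usage Retired

# Rebuild the pool's virtual disks onto the remaining and new disks.
Get-StoragePool -FriendlyName $poolName | Get-VirtualDisk | Repair-VirtualDisk

# Check rebuild progress; wait until no repair jobs remain before continuing.
Get-StorageJob

# Only after the repair has finished, remove the old disk from the pool.
Remove-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $failedDisk
```

Keeping the removal as the last step reflects the earlier caution about letting the rebuild complete before the old disk leaves the pool configuration.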
As a final step in disk fault management, technicians can run a storage health test to verify the storage pool or cluster, and then dismiss any alerts.
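A basic health check can be approximated with the standard storage cmdlets, as in the sketch below. It is not a substitute for a full storage health test; it simply lists the health of pools, virtual disks and physical disks so that anything not reporting Healthy stands out.

```powershell
# Health of all non-primordial storage pools.
Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, HealthStatus, OperationalStatus

# Health of the virtual disks (storage spaces) built on those pools.
Get-VirtualDisk |
    Select-Object FriendlyName, HealthStatus, OperationalStatus

# Any physical disks still reporting a problem.
Get-PhysicalDisk |
    Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, SerialNumber, HealthStatus, OperationalStatus, Usage
```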