As many of you may know from firsthand experience, the chkdsk command is a necessary evil that Windows uses to ensure file system integrity. It is necessary because NTFS is not immune to file system corruption, and Windows relies on the tool to fix transient and permanent problems such as bad sectors, lost files, missing headers and corrupt links. It is evil because chkdsk can take a long time to execute, depending on the number of files on the volume, and because it requires exclusive access to the disk, which means users could be waiting for hours, or even days, to access their data.
Chkdsk has evolved over the years just as disk drives continue to explode in size. Back in the mid-1990s with NT 3.51, a 1 GB disk was considered a large drive. Now, we have terabyte disks that, combined with storage controller RAID functionality, allow us to configure extremely large LUNs. As disks get larger, administrators leverage the capacity for more users per disk, which translates to more user files. Unfortunately, chkdsk does not scale well when analyzing hundreds of millions of files, so administrators are reluctant to use large volumes due to increased potential downtime.
Over the years, improvements have been made to hasten chkdsk's execution time. Switches have been added to chkdsk to skip extensive index and folder structure checking. Failover clusters can also be configured to skip running chkdsk when a dirty volume is brought online. But these improvements only mask the underlying problem: Scanning a large disk with millions of files takes a very long time. The table below shows approximate chkdsk execution times for major versions of Windows.
|Operating System Version||2 Million Files||3 Million Files|
|NT4 SP6||48 hours||100+ hours|
|Windows 2000||4 hours||6 hours|
|Windows 2003||0.4 hours||0.7 hours|

|Operating System Version||200 Million Files||300 Million Files|
|Windows 2008 R2||5 hours||6.25 hours|
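To put the table in perspective, the run times can be converted into approximate scan throughput. The Python sketch below does only that arithmetic. The figures are taken from the table above; the names `CHKDSK_TIMES`, `files_per_hour` and `throughput_table` are illustrative, and treating "100+ hours" as exactly 100 hours is an assumption that makes the NT4 figure an upper bound on its throughput.

```python
# Approximate chkdsk run times from the table above, as (files, hours) pairs.
# Assumption: "100+ hours" for NT4 SP6 is treated as 100 hours (a lower bound).
CHKDSK_TIMES = {
    "NT4 SP6": [(2_000_000, 48.0), (3_000_000, 100.0)],
    "Windows 2000": [(2_000_000, 4.0), (3_000_000, 6.0)],
    "Windows 2003": [(2_000_000, 0.4), (3_000_000, 0.7)],
    "Windows 2008 R2": [(200_000_000, 5.0), (300_000_000, 6.25)],
}


def files_per_hour(files, hours):
    """Return the average number of files checked per hour."""
    return files / hours


def throughput_table(times=CHKDSK_TIMES):
    """Map each OS version to its rounded throughput for every table entry."""
    return {
        os_version: [round(files_per_hour(f, h)) for f, h in samples]
        for os_version, samples in times.items()
    }


if __name__ == "__main__":
    for os_version, rates in throughput_table().items():
        formatted = ", ".join(f"{rate:,} files/hour" for rate in rates)
        print(f"{os_version}: {formatted}")
```

By this measure, Windows 2003 checks roughly 5 million files per hour on the 2 million file volume, while Windows 2008 R2 reaches about 40 million files per hour on the 200 million file volume. Even at that rate, a volume of that size is still unavailable for hours, which is the problem the rest of this article addresses.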
In Windows Server 2012 and in Windows 8, enterprise-class customers can finally have confidence when deploying multiterabyte volumes. Chkdsk has been redesigned to run in two separate phases: an online phase for scanning the disk for errors and an offline phase for repairing the volume. This was done because the vast majority of time spent executing chkdsk is spent scanning the volume, while the repair phase only takes a few seconds.
Better yet, most of the new chkdsk functionality has been implemented transparently so you won't even know it's running. The analysis phase of chkdsk now runs as a background task. If NTFS suspects a problem in the file system, it attempts to self-heal it online. Errors of a transient nature are fixed on the fly with zero downtime. Any real corruption is flagged and logged for corrective action when it is convenient. In the meantime, the volume remains online to provide immediate access to your data.
Once every minute, the health of all physical disks is checked, and any problems are reported to event logs and management consoles, including the Action Center and the Server Manager. The corrective action usually involves remounting the drive, which takes just a few seconds. The amount of downtime for repairing corrupt volumes is now based on the number of errors to be fixed, not the size of the volume or the number of files.
Windows Failover Clusters using cluster shared volumes (CSVs) also benefit from the integrated chkdsk design by transparently fixing errors on the fly. Whenever any corruption errors are detected, I/O is transparently paused while fixes are made to repair the volume and then automatically resumed. This added resiliency makes CSVs continuously available to users with zero offline time.
The command line interface (CLI) chkdsk command is still available for fixing severely corrupt volumes. In fact, several new options have been added to support the new design, including /scan, /forceofflinefix, /spotfix and /offlinescanandfix. There is also a new cmdlet called repair-volume to offer the same chkdsk functionality with PowerShell. A brief description of the new PowerShell options is provided below.
|Repair-volume||PowerShell cmdlet that performs repairs on a volume|
|OfflineScanAndFix||Takes the volume offline to scan and fix any errors. Equivalent to chkdsk /f.|
|Scan||Scans the volume without attempting to repair it. All detected corruption is added to the $corrupt system file. Equivalent to chkdsk /scan.|
|SpotFix||Takes the volume offline briefly and then fixes only the issues that are logged in the $corrupt file. Equivalent to chkdsk /spotfix.|
Source: Microsoft TechNet
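For scripts that need to translate between the two interfaces, the equivalences in the table above can be captured in a small lookup. The following Python sketch is illustrative only: the mapping comes straight from the table, while the names `REPAIR_VOLUME_TO_CHKDSK` and `chkdsk_equivalent` are hypothetical helpers, not part of any Microsoft tooling. It only builds a command string and does not run anything.

```python
import string

# Repair-Volume option -> equivalent chkdsk switch, as listed in the table above.
REPAIR_VOLUME_TO_CHKDSK = {
    "OfflineScanAndFix": "/f",
    "Scan": "/scan",
    "SpotFix": "/spotfix",
}


def chkdsk_equivalent(drive_letter, repair_option):
    """Return the chkdsk command line that matches a Repair-Volume option."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"not a drive letter: {drive_letter!r}")

    # Match the option name case-insensitively and tolerate a leading hyphen.
    by_lower = {name.lower(): sw for name, sw in REPAIR_VOLUME_TO_CHKDSK.items()}
    key = repair_option.lstrip("-").lower()
    if key not in by_lower:
        raise ValueError(f"unknown Repair-Volume option: {repair_option!r}")

    return f"chkdsk {letter}: {by_lower[key]}"


if __name__ == "__main__":
    for option in REPAIR_VOLUME_TO_CHKDSK:
        print(f"-{option:<18} {chkdsk_equivalent('T', option)}")
```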
For example, if you suspect severe corruption with a particular volume, you can manually repair the drive by first scanning it to record any errors in the $corrupt system file. Then, when it is convenient to take the drive offline briefly, use the -SpotFix option to fix the errors:
PS C:\> repair-volume -DriveLetter T -Scan
PS C:\> repair-volume -DriveLetter T -SpotFix
For more information on the repair-volume cmdlet, use the command get-help repair-volume -full.
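If you want to drive the same two steps from a management script, one possible approach is to build the command lines first and run them only on the target server. The Python sketch below assumes `powershell.exe` is on the path and uses hypothetical helper names (`repair_volume_commands`, `run_repair`). It defaults to a dry run that only prints the commands, and it has not been validated against a production volume.

```python
import subprocess
import sys


def repair_volume_commands(drive_letter):
    """Build the scan step and the spot-fix step as argv lists."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    return [
        ["powershell.exe", "-NoProfile", "-Command",
         f"Repair-Volume -DriveLetter {letter} -{step}"]
        for step in ("Scan", "SpotFix")
    ]


def run_repair(drive_letter, dry_run=True):
    """Print each step; execute it only on Windows when dry_run is False."""
    for argv in repair_volume_commands(drive_letter):
        print(" ".join(argv))
        if not dry_run and sys.platform == "win32":
            # check=True raises CalledProcessError if PowerShell reports failure.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_repair("T")  # dry run: prints the two commands without executing them
```

In practice the two steps would rarely run back to back. The scan can happen at any time because the volume stays online, while the spot fix is better scheduled for a maintenance window because it takes the volume offline briefly.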
Windows Server 2012 has many improvements to increase the availability of your data. Now you can have very large disks with hundreds of millions of files and not have to worry about chkdsk slowing your boot time. While most of the new chkdsk functionality is implemented transparently, the CLI chkdsk tool and the new repair-volume PowerShell cmdlet provide administrators with the ability to fix volumes manually.
About the author: Bruce Mackenzie-Low, MCSE/MCSA, is a systems software engineer with HP, providing third-level worldwide support for Microsoft Windows-based products, including Clusters and Crash Dump Analysis. With more than 20 years of computing experience at Digital, Compaq and HP, Bruce is a well-known resource for resolving highly complex problems involving clusters, SANs, networking and internals.
This was first published in October 2012