Deduplication is in the box in Windows Server 2012 and Windows Server 2012 R2. This technology looks at a volume, finds content that is repeated and removes all but one copy of that content -- and it replaces the areas it has removed with "links" back to the single copy of the chunk that remains.
It is a very effective technique for reclaiming storage space on volumes that hold a lot of near-duplicate content; think ISO libraries, file shares with many versions of the same files (like for revision tracking) and one other application you might not think of: VDI deployments. The virtual machines that make up your VDI deployment contain near-exact copies of thousands of operating system libraries and executables, even when you use differencing disks. The space savings from enabling deduplication on a VDI deployment can reach over 90%, and other types of content can squeeze out 50% or more savings without an appreciable hit to performance. And it is included in your Windows license.
Windows Server 2012 R2 introduces several improvements to the data deduplication feature, including the ability to set up this feature on scale-out file servers and cluster shared volumes. The algorithm in Windows Server 2012 R2 is also specifically optimized to work with VHD and VHDX files; enhancements were made to how efficiently Windows writes to disk, and also to how quickly the optimization algorithm works through a disk, so that active VHDX files in a production deployment of a virtual desktop infrastructure are not punished from a performance standpoint. It is important to note that the Hyper-V host machine must be different from the machine hosting the storage volumes on which deduplication is enabled, as deduplication may consume more system resources than Hyper-V can tolerate with an active virtual machine load.
Data deduplication jobs
Data deduplication consists of three types of jobs that are repeated at certain intervals:
Optimization jobs. These perform the core of the analysis and actual removal of duplicate content from a volume. These jobs also perform compression on chunks of files that remain based on built-in algorithms that balance the end size of the file with the performance impact of the decompression process.
Data scrubbing jobs. Sometimes during deduplication or the subsequent compression of file chunks on a volume, data becomes corrupted. Corruption is detected through checksum validation and consistency checks on file metadata. When it is found, data scrubbing jobs attempt to repair it by restoring a copy of the data from a private backup Windows keeps of frequently accessed deduplicated content, restoring the file from a fault-tolerant volume like Windows Storage Spaces, or replacing the corrupted chunk midstream with a new, correct chunk as it is being written.
Garbage collection jobs. These jobs pick up the unoptimized, or no longer needed, pieces of files and get rid of them to increase the available space on a volume.
Enabling data deduplication and configuring it
The easiest way to get started with data deduplication is to use PowerShell to add the feature to the server on which you want it installed. You can use Server Manager through the GUI, but there is a lot of clicking and selecting. The PowerShell way to do this only involves three cmdlets, so it is clearly less work -- something I am very much in favor of.
From that machine, open a PowerShell command line with administrative rights and enter the following cmdlet:
Add-WindowsFeature -name FS-Data-Deduplication
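Before moving on, it can be worth confirming that the role service actually installed. The lines below are a minimal sketch of such a check, assuming the standard ServerManager and Deduplication modules that ship with Windows Server 2012 and later; the explicit Import-Module is normally unnecessary because the module loads on first use.

```powershell
# Show the install state of the Data Deduplication role service.
# InstallState should read "Installed" once Add-WindowsFeature has finished.
Get-WindowsFeature -Name FS-Data-Deduplication |
    Select-Object Name, InstallState

# Load the deduplication cmdlets explicitly (they usually auto-load on first use).
Import-Module Deduplication
```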
Once the right software components have been installed, a few more PowerShell cmdlets are required to get deduplication enabled on specific volumes. For example, the following cmdlet enables volume H: for data deduplication for virtual desktop infrastructure deployment purposes (hence the UsageType value of "HyperV"):
Enable-DedupVolume H: -UsageType HyperV
… whereas the following cmdlet enables regular data deduplication efforts on volume S.
Enable-DedupVolume S: -UsageType Default
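To check that both volumes picked up the intended settings, something along these lines should work. It is only a sketch: the property names are those the Deduplication module exposes on Windows Server 2012 R2 as far as I know, and the savings figures will stay at zero until the first optimization job has run.

```powershell
# List every deduplication-enabled volume with its usage type and savings so far.
Get-DedupVolume |
    Format-List Volume, Enabled, UsageType, SavedSpace, SavingsRate

# Narrow the output to a single volume.
Get-DedupVolume -Volume H:
```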
By default, once you have run these cmdlets and set up your volumes, the optimization process will run every hour and the data scrubbing and garbage collection jobs are run once a week. You can start the process on demand with the following PowerShell cmdlet:
Start-DedupJob -Volume S: -Type Optimization
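The other two job types described earlier can be kicked off on demand in the same way, and you can watch progress while they run. The following is a rough sketch using the same S: volume as above; run the jobs one at a time in practice, since they compete for the same resources.

```powershell
# Show queued and running deduplication jobs along with their progress.
Get-DedupJob

# Start the other two job types manually on volume S:.
Start-DedupJob -Volume S: -Type GarbageCollection
Start-DedupJob -Volume S: -Type Scrubbing

# After optimization completes, report space saved and the number of files optimized.
Get-DedupStatus -Volume S: | Format-List
```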
You can review the schedule Windows has set up automatically with the Get-DedupSchedule cmdlet.
The optimization job can only run weekly from here; you will need to use a custom task set up within Task Scheduler to make the optimization job run more frequently.
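One way to build such a custom task without leaving PowerShell is with the ScheduledTasks cmdlets available in Windows Server 2012 and later. Treat this as a sketch under stated assumptions rather than a recommended configuration: the task name Dedup-Optimize-S and the three run times are hypothetical, and the task is registered to run as the SYSTEM account.

```powershell
# Action: launch PowerShell and start an optimization job on volume S:.
$action = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -Command "Start-DedupJob -Volume S: -Type Optimization"'

# One daily trigger per run time (hypothetical times; adjust to suit your load).
$triggers = '06:00', '12:00', '18:00' | ForEach-Object {
    New-ScheduledTaskTrigger -Daily -At $_
}

# Register the task under a hypothetical name, running elevated as SYSTEM.
Register-ScheduledTask -TaskName 'Dedup-Optimize-S' -Action $action `
    -Trigger $triggers -User 'NT AUTHORITY\SYSTEM' -RunLevel Highest
```

Spacing the runs several hours apart leaves room for each job to finish before the next one starts on a busy volume.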
I think in an era where we are putting entire copies of systems into one giant VHDX file, and the amount of data we are storing continues to increase, deduplication will become an important part of managing our available storage and controlling the cost of storing our data. Given data deduplication is available in the box with a Windows Server license, there is little reason not to use it today.