Talk to most administrators about deduplication and the usual response is: Why? Disk space is getting cheaper all the time, with I/O speeds ramping up along with it. The discussion often ends there with a shrug.
But the problem isn't how much you're storing or how fast you can get to it. The problem is whether improvements in cost per gigabyte and I/O throughput are being outpaced by the amount of data your organization stores. The more we can store, the more we do store. And while deduplication is not a magic bullet, it is one of many strategies that can be used to cut into data storage demands.
Microsoft added a deduplication feature in Windows Server 2012, which provides a way to deduplicate the data volumes managed by a given instance of Windows Server (system and boot volumes are not supported). Instead of relegating deduplication duty to a piece of hardware or a separate software layer, it's done in the OS on both a block and file level -- meaning that many kinds of data (such as multiple instances of a virtual machine) can be successfully deduplicated with minimal overhead.
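To make the block-level idea concrete, here is a minimal Python sketch. It assumes fixed-size chunks and SHA-256 hashes purely for illustration; it is not Microsoft's algorithm, and the file names, contents and the 4-byte chunk size are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, hypothetical chunk size so the example is easy to follow


def dedupe(files, chunk_size=CHUNK_SIZE):
    """Split each file into fixed-size chunks and keep each unique chunk once."""
    chunk_store = {}  # chunk hash -> chunk bytes (stored a single time)
    recipes = {}      # file name -> ordered list of chunk hashes
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)
            recipe.append(digest)
        recipes[name] = recipe
    return chunk_store, recipes


def rebuild(name, chunk_store, recipes):
    """Reassemble a file from its recipe of chunk hashes."""
    return b"".join(chunk_store[digest] for digest in recipes[name])


# Two hypothetical VM images that share most of their content.
files = {
    "vm1.vhd": b"BOOTKERNDATAAAAA",
    "vm2.vhd": b"BOOTKERNDATABBBB",
}
chunk_store, recipes = dedupe(files)
logical_bytes = sum(len(data) for data in files.values())
physical_bytes = sum(len(chunk) for chunk in chunk_store.values())
print(logical_bytes, physical_bytes)
```

The two images share three of their four chunks, so only five unique chunks are stored. That is the effect that makes near-identical virtual machine images such good candidates.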
If you plan to implement Windows Server 2012 deduplication technology, be sure you understand these seven points:
1. Deduplication is not enabled by default
Don't upgrade to Windows Server 2012 and expect to see space savings automatically appear. Deduplication is treated as a file-and-storage service feature, rather than a core OS component. You must therefore enable it manually as the Data Deduplication role service under Server Roles | File And Storage Services | File and iSCSI Services. Once the role service is installed, deduplication still has to be configured on a volume-by-volume basis.
2. Deduplication won't burden the system
Microsoft put a fair amount of thought into setting up deduplication so it has a small system footprint and can run even on servers that have a heavy load. Here are three reasons why:
a. Content is deduplicated only after it is n days old; n is 5 by default and is user-configurable. This delay keeps the deduplicator from trying to process content that is in active, heavy use and from processing files as they're being written to disk (which would constitute a major performance hit).
b. Deduplication can be constrained by directory or file type. If you want to exclude certain kinds of files or folders from deduplication, you can specify those as well.
c. The deduplication process is self-throttling and can be run at varying priority levels. You can set the actual deduplication process to run at low priority and it will pause itself if the system is under heavy load. You can also set a window of time for the deduplicator to run at full speed, during off-hours, for example.
This way, with a little admin oversight, deduplication can be put into place on even a busy server and not impact its performance.
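The age and exclusion rules above amount to a simple eligibility check. The following Python sketch shows that logic under stated assumptions: the function name, parameters and paths are hypothetical and are not part of any Windows API. Only the five-day default comes from the text.

```python
from datetime import datetime, timedelta
from pathlib import PureWindowsPath


def is_eligible(path, last_modified, now, min_age_days=5,
                excluded_extensions=(), excluded_folders=()):
    """Return True if a file would be picked up under the rules described above.

    excluded_extensions holds lowercase suffixes such as ".pst";
    excluded_folders holds folder paths whose contents are skipped.
    """
    file_path = PureWindowsPath(path)
    # Rule a: skip files that changed too recently.
    if now - last_modified < timedelta(days=min_age_days):
        return False
    # Rule b: skip excluded file types.
    if file_path.suffix.lower() in excluded_extensions:
        return False
    # Rule b: skip anything under an excluded folder.
    for folder in excluded_folders:
        if PureWindowsPath(folder) in file_path.parents:
            return False
    return True


now = datetime(2013, 1, 10)
print(is_eligible(r"D:\Shares\report.docx", datetime(2013, 1, 1), now))
print(is_eligible(r"D:\Shares\report.docx", datetime(2013, 1, 8), now))
```

The first call covers a file last changed nine days ago, so it qualifies. The second covers a file changed two days ago, so it is skipped until it ages past the threshold.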
3. Deduplicated volumes are 'atomic units'
'Atomic units' means that all of the deduplication information about a given volume is kept on that volume, so the volume can be moved intact to another system that supports deduplication. If you move it to a system that doesn't have deduplication, you'll only be able to see the files that have not been deduplicated. The best rule is not to move a deduplicated volume unless it's to another Windows Server 2012 machine.
4. Deduplication works with BranchCache
If you have a branch server also running deduplication, it shares data about deduped files with the central server and thus cuts down on the amount of data that needs to be sent between the two.
5. Backing up deduplicated volumes can be tricky
A block-based backup solution -- e.g., a disk-image backup method -- should work as-is and will preserve all deduplication data.
File-based backups will also work, but they won't preserve deduplication data unless they're dedupe-aware. They'll back up everything in its original, discrete, undeduplicated form, so the backup media must be large enough to hold the full undeduplicated data set.
The native Windows Server Backup solution is dedupe-aware. Any third-party backup product for Windows Server 2012 should be checked to see whether deduplication awareness is present or is being added in a future revision.
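For a backup tool that is not dedupe-aware, the sizing arithmetic can be sketched as follows. The function name and the sample figures (600 GB on disk, 40 percent savings) are hypothetical and serve only to show the calculation.

```python
def file_backup_capacity_gb(stored_gb, savings_percent):
    """Estimate the capacity a non-dedupe-aware file backup needs.

    stored_gb is the space the data occupies on the deduplicated volume;
    savings_percent is the reported deduplication savings (0 to under 100).
    """
    if not 0 <= savings_percent < 100:
        raise ValueError("savings_percent must be in the range [0, 100)")
    # Undeduplicated size = deduplicated size / (1 - savings rate).
    return stored_gb * 100 / (100 - savings_percent)


print(file_backup_capacity_gb(600, 40))
```

With these figures, data that occupies 600 GB on the volume expands to 1,000 GB when backed up file by file.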
6. More is better when it comes to cores and memory
Microsoft recommends devoting at least one CPU core and 350 MB of free memory to process one volume at a time, with around 100 GB of storage processed in an hour (without interruptions) or around 2 TB a day. The more parallelism you have to spare, the more volumes you can simultaneously process.
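These figures lend themselves to a rough capacity estimate. The sketch below takes one core and 350 MB per volume and 100 GB per hour from the numbers above. The function names, the sample volume sizes and the greedy largest-first scheduling are assumptions for illustration, not how Windows schedules its jobs.

```python
import heapq

GB_PER_HOUR_PER_VOLUME = 100  # throughput figure quoted above
MB_PER_VOLUME_JOB = 350       # memory figure quoted above


def max_parallel_volumes(spare_cores, free_memory_mb):
    """Volumes that can be processed at once: one core and 350 MB each."""
    return min(spare_cores, free_memory_mb // MB_PER_VOLUME_JOB)


def estimate_hours(volume_sizes_gb, parallel_jobs):
    """Rough wall-clock estimate, assigning the largest volumes first
    to whichever job slot is least loaded (an illustrative assumption)."""
    if parallel_jobs < 1:
        raise ValueError("need at least one job slot")
    loads = [0.0] * parallel_jobs  # hours of work queued per slot
    heapq.heapify(loads)
    for size_gb in sorted(volume_sizes_gb, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size_gb / GB_PER_HOUR_PER_VOLUME)
    return max(loads)


jobs = max_parallel_volumes(spare_cores=4, free_memory_mb=1024)
print(jobs, estimate_hours([500, 300, 200], jobs))
```

In this hypothetical case memory is the limit: 1,024 MB of free memory supports only two simultaneous jobs even though four cores are spare, and the three volumes take about five hours.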
7. Deduplication mileage may vary
Microsoft has crunched its own numbers and found that the nature of the deployment affected the amount of space savings. Multiple OS instances on virtual hard disks (VHDs) exhibited a great deal of savings because of the amount of redundant material between them; user folders, less so.
In its rundown of what are good and bad candidates for deduping, Microsoft notes that live Exchange Server databases are actually poor candidates. This sounds counterintuitive; you'd think an Exchange mailbox database might have a lot of redundant data in it. But the constantly changing nature of data (messages being moved, deleted, created, etc.) offsets the gains in throughput and storage savings made by deduplication. However, an Exchange Server backup volume is a better candidate since it changes less often and can be deduplicated without visibly slowing things down.
How much you actually get from deduplication in your particular setting is the real test for whether to use it. Therefore, it's best to start provisionally, perhaps on a staging server where you can set the "crawl rate" for deduplication as high as needed, see how much space savings you get with your data and then establish a schedule for performing deduplication on your own live servers.
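Before committing a staging server, a rough feel for the savings on a sample folder can be had with a short script. This Python sketch assumes fixed-size 64 KB chunks and SHA-256 hashes, so it only approximates what a real deduplication engine would report. The function name is hypothetical.

```python
import hashlib
import os


def estimate_savings(root, chunk_size=64 * 1024):
    """Walk a folder and estimate chunk-level savings.

    Returns (logical_bytes, unique_bytes, savings_fraction).
    """
    seen = set()   # hashes of chunks already counted
    logical = 0    # bytes as the files appear to users
    unique = 0     # bytes that would need to be stored once
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            with open(os.path.join(dirpath, filename), "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    logical += len(chunk)
                    digest = hashlib.sha256(chunk).digest()
                    if digest not in seen:
                        seen.add(digest)
                        unique += len(chunk)
    savings = 0.0 if logical == 0 else 1 - unique / logical
    return logical, unique, savings
```

Calling `estimate_savings` on a copy of a representative share gives a ballpark figure to compare against what the staging server later reports.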
About the author
Serdar Yegulalp has been writing about computers and information technology for more than 15 years for a variety of publications, including InformationWeek and Windows Magazine. Check out his blog at GenjiPress.com.