Data deduplication is nothing new. Third-party vendors have used it for years to shrink backup storage and optimize WAN traffic. Even so, there has never been a native deduplication feature in the Windows operating system.
Like the third-party products that have existed for so long, Windows Server 8’s deduplication feature aims to allow more data to reside in less space. Notice that I did not say that the feature allows more data to be stored in less space. Although Windows Server 8 will support storage-level deduplication, it will also support deduplication of data that is in transit.
Even though deduplication is new to the Windows operating system, Microsoft products have used various methods of increasing storage capacity for quite some time. For instance, the Windows operating system has long supported file system (NTFS) level compression. Likewise, some previous versions of Exchange Server sought to maximize the available storage space through the use of Single Instance Storage (SIS). Although such technologies do help to decrease storage costs, neither NTFS compression nor Single Instance Storage is as efficient as Windows Server 8’s deduplication feature.
According to Microsoft’s estimates, Windows Server 8’s deduplication feature should be able to deliver an optimization ratio of 2:1 for general data storage when it ships late this year. This ratio could increase to as much as 20:1 in virtual server environments.
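To put those ratios in perspective, an optimization ratio of N:1 means the data occupies 1/N of its original space. The short Python sketch below does that arithmetic; the function name space_saved is purely illustrative and does not come from Microsoft.

```python
def space_saved(ratio):
    """Fraction of the original space freed at a given optimization ratio (2.0 means 2:1)."""
    if ratio < 1:
        raise ValueError("an optimization ratio cannot be below 1:1")
    return 1 - 1 / ratio


for ratio in (2, 20):
    print(f"{ratio}:1 frees {space_saved(ratio):.0%} of the original space")
```

In other words, 2:1 halves the space consumed, while 20:1 leaves only one-twentieth of it in use.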
How storage deduplication works
Windows Server 8’s deduplication feature will be more efficient than Single Instance Storage because SIS works at the file level. In other words, if two identical copies of a file need to exist on a server, Single Instance Storage stores only a single copy of the file and uses pointers to create the illusion that multiple copies exist. Although this technique works well for servers containing a lot of identical files, it does nothing for files that are similar but not identical.
To further illustrate this point, consider the invoices that I send to my clients each month. The invoices exist as Microsoft Word documents, and each document is identical except for the date and the invoice number. Even so, Single Instance Storage would do nothing to reduce the space consumed by these documents.
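The invoice scenario can be mimicked with a minimal Python sketch of file-level single instancing. This is not how Exchange Server implemented SIS; the use of SHA-256, the dictionary-based store, and the function name single_instance_store are assumptions made for illustration.

```python
import hashlib


def single_instance_store(files):
    """Keep one stored copy per distinct file content, keyed by a whole-file hash."""
    store = {}
    for content in files.values():
        store.setdefault(hashlib.sha256(content).hexdigest(), content)
    return store


# Three invoices: two differ only in number and date, one is an exact copy.
body = b"Consulting services rendered.\n" * 1000
invoices = {
    "invoice_101.doc": b"Invoice 101, 1 March\n" + body,
    "invoice_102.doc": b"Invoice 102, 1 April\n" + body,
    "invoice_102_copy.doc": b"Invoice 102, 1 April\n" + body,
}

store = single_instance_store(invoices)
print(f"{len(invoices)} files, {len(store)} stored copies")
```

Only the exact copy is eliminated. The two invoices that differ by a few bytes hash differently as whole files, so both are stored in full.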
Deduplication works at the block level rather than the file level. Each file is divided into small chunks. These chunks are of variable sizes, but range from 32 KB to 128 KB. Hence, a single file could be made up of many chunks.
The operating system will compute a hash for each chunk. The hash values are then compared as a way of determining which chunks are identical. When identical chunks are found, all but one copy of the chunk is deleted. The file system uses pointers to reference which chunks go with which files. One way of thinking of this process is that legacy file systems typically treat files as streams of data. However, Windows Server 8’s file system (with deduplication enabled) will treat files more as a collection of chunks.
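The chunk, hash, and pointer scheme described above can be sketched in Python. Treat this as an illustration under stated assumptions rather than Microsoft's implementation: the gear-style rolling hash used to pick chunk boundaries, SHA-256 as the chunk hash, the cut mask, and the names split_into_chunks and ChunkStore are all hypothetical. Only the 32 KB and 128 KB bounds come from the text.

```python
import hashlib
import random

MIN_CHUNK = 32 * 1024      # lower bound mentioned in the article
MAX_CHUNK = 128 * 1024     # upper bound mentioned in the article
CUT_MASK = (1 << 15) - 1   # assumption: controls how often a boundary is allowed

# Assumption: a fixed pseudo-random table for a simple "gear" rolling hash.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def split_into_chunks(data):
    """Split data at content-defined boundaries, respecting the size bounds."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & CUT_MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # the final chunk may be shorter than MIN_CHUNK
    return chunks


class ChunkStore:
    """Stores each distinct chunk once; files are ordered lists of chunk hashes."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes
        self.files = {}   # file name -> list of chunk hashes (the "pointers")

    def add(self, name, data):
        refs = []
        for chunk in split_into_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicate chunks are not stored again
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])


# Two invoices that differ only in number and date (sample data is made up).
data_rng = random.Random(1)
body = data_rng.getrandbits(8 * 300 * 1024).to_bytes(300 * 1024, "little")
invoice_a = b"Invoice 101, 1 March\n" + body
invoice_b = b"Invoice 102, 1 April\n" + body

store = ChunkStore()
store.add("invoice_101.doc", invoice_a)
store.add("invoice_102.doc", invoice_b)

logical = len(invoice_a) + len(invoice_b)
stored = sum(len(c) for c in store.chunks.values())
print(f"logical bytes: {logical}, stored bytes: {stored}")
```

In this sample, only the first chunk of each invoice differs, because it contains the invoice number and date. Every later chunk is identical in both files, so it is stored once and referenced twice.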
Incidentally, the pre-beta version of Windows Server 8 uses file system compression. Whenever possible, the individual chunks of data will be compressed to save space.
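As a rough sketch of per-chunk compression, the Python snippet below compresses a chunk and keeps the compressed form only when it is actually smaller. The article does not say which algorithm Windows Server 8 uses or how it decides what is compressible, so zlib, the size test, and the names pack_chunk and unpack_chunk are assumptions.

```python
import os
import zlib


def pack_chunk(chunk):
    """Return (is_compressed, payload), compressing only when that saves space."""
    compressed = zlib.compress(chunk)
    if len(compressed) < len(chunk):
        return True, compressed
    return False, chunk


def unpack_chunk(is_compressed, payload):
    """Reverse pack_chunk and return the original chunk bytes."""
    return zlib.decompress(payload) if is_compressed else payload


text_chunk = b"Invoice line item: consulting, 1 hour\n" * 1000
random_chunk = os.urandom(4096)  # stands in for data that is already compressed

for label, chunk in (("text", text_chunk), ("random", random_chunk)):
    is_compressed, payload = pack_chunk(chunk)
    print(f"{label}: {len(chunk)} -> {len(payload)} bytes, compressed={is_compressed}")
```

Storing a flag alongside each payload lets the reader know whether decompression is needed when the chunk is read back.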
One of the major concerns often expressed with regard to deduplication is file integrity. Although the odds are astronomical, it is theoretically possible for two dissimilar blocks of data to have identical hashes. Some third-party products solve this problem by recalculating the hash using a different and more complex formula prior to deleting duplicate chunks as a way of verifying that the chunks really are identical.
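The second-hash check that some third-party products perform might look something like the Python sketch below. The primary hash here is deliberately weak so that a collision is easy to demonstrate; real products use far stronger hashes. The function names and the choice of SHA-256 as the verifying hash are assumptions for illustration.

```python
import hashlib


def weak_hash(chunk):
    """Deliberately weak stand-in for the primary hash (sum of bytes modulo 256)."""
    return sum(chunk) % 256


def strong_hash(chunk):
    """A different, more complex hash used only to confirm a suspected match."""
    return hashlib.sha256(chunk).hexdigest()


def is_true_duplicate(new_chunk, stored_chunk):
    """Treat chunks as duplicates only if both hashes agree."""
    if weak_hash(new_chunk) != weak_hash(stored_chunk):
        return False
    return strong_hash(new_chunk) == strong_hash(stored_chunk)


stored = b"ab"
collision = b"ba"   # same bytes in a different order, so the weak hash matches
real_copy = b"ab"

print(weak_hash(stored) == weak_hash(collision))
print(is_true_duplicate(collision, stored))
print(is_true_duplicate(real_copy, stored))
```

The colliding chunk passes the first comparison but fails the second, so it would be kept rather than discarded as a duplicate.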
Although Microsoft has not specified the exact method that it will use to preserve data integrity, the Windows Server 8 Developer Preview Reviewer’s Guide indicates that the operating system “leverages checksum, consistency, and identity validation to ensure data integrity.” Furthermore, the operating system uses redundancy for certain types of data chunks as a way of preventing data loss.
As previously mentioned, Windows Server 8 will allow for the deduplication of both stored data and data in transit. Deduplication techniques similar to those described above are going to be integrated with BranchCache as a way of minimizing the amount of data that must be transmitted over WAN links. Early builds suggest that the native deduplication feature will be able to conserve a significant amount of storage space without adversely affecting file system performance.
ABOUT THE AUTHOR
Brien M. Posey, MCSE, is a Microsoft Most Valuable Professional for his work with Windows 2000 Server, Exchange Server and IIS. He has served as CIO for a nationwide chain of hospitals and was once in charge of IT security for Fort Knox. He writes regularly for TechTarget sites.
This was first published in March 2012