
Case Study: Troubleshooting Distributed File System Replication

One of the difficulties with DFSR is that many admins still try to troubleshoot it using FRS methods. MVP Gary Olsen offers a case study demonstrating the proper techniques for debugging DFSR.

Although Microsoft DFSR (Distributed File System Replication) has some great advantages, troubleshooting DFSR problems can be a bit of a mystery. Part of the problem is that most administrators who are used to File Replication Service (FRS) try to debug DFSR problems using FRS techniques.

The problem

Recently I was consulted about a situation in which a disk that DFSR data was replicated to had failed. After the disk was restored, the staging directories appeared to be filling up with backlogged files, and DFSR seemed to be replicating backwards -- from the target back to the source -- causing data loss.

The configuration was a hub site with a server, attached to a storage array. There were 20 or so remote sites, each with a server hosting network shares that users saved files to. DFSR was used to replicate each site's share data to a corresponding share on the hub server's storage array, which was then backed up.

The problem seemed to start when a disk drive on the storage array failed. At that same time, a user complained that the file he was accessing was "old" data. That is, the changes he made the previous day were not there, as if the old file had overwritten his newer version. The admins had developed a script they used to determine the health of the Distributed File System. This script showed a large number of files in the staging directories on several shares at the hub site.
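The article doesn't show the admins' actual health-check script, but the core of such a backlog check can be sketched in a few lines: compare last-write times on the source and target shares and flag files whose changes haven't arrived yet. This is a hypothetical illustration, not the admins' script or DFSR's internal algorithm.

```python
from datetime import datetime

def pending_files(source_times, target_times):
    """Return names of files whose target copy is missing or older
    than the source copy -- i.e., changes still waiting to replicate.

    source_times / target_times map file name -> last-write datetime.
    (Hypothetical helper for illustration only.)
    """
    pending = []
    for name, src_mtime in source_times.items():
        tgt_mtime = target_times.get(name)
        if tgt_mtime is None or tgt_mtime < src_mtime:
            pending.append(name)
    return sorted(pending)

src = {"report.doc": datetime(2008, 5, 2), "notes.txt": datetime(2008, 5, 1)}
tgt = {"report.doc": datetime(2008, 5, 1)}  # notes.txt not yet replicated
print(pending_files(src, tgt))  # -> ['notes.txt', 'report.doc']
```

In practice you would get the same information from the built-in `dfsrdiag backlog` command rather than walking the shares yourself.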

Searching for a data loss fix

The administrators believed the staging directories should be empty. They also believed that the files were going from the target server's share, back to the staging directory and then back to the source, thus putting old data back and replacing newer data on the source shares. To prevent any data loss, the admins disabled replication for three replication groups.

In addition, they wanted to configure the replication groups for one-way replication, so it would only replicate from source (remote site) to hub. After all, why would you ever want to replicate from the hub to the remote site?

To summarize, it all broke down like this:

  • A disk, hosting DFSR replicated shares, failed.
  • Old data apparently was being replicated from target to source.
  • There were several DFSR shares that had upward of 250,000 "files" in the staging directories. This was interpreted as a problem, as it was in FRS.
  • The admins disabled replication on four replication groups to prevent old data from replicating over new data.
  • They wanted to solve this by disabling replication from the target to the source.
  • The admins had tried to force replication in the right direction by using the DFSRadmin command to specify the isPrimary flag (as I discussed in my previous article on configuring replication groups).

To troubleshoot this problem, let's look at each of these points individually and see where it takes us.

First of all, the only way a file at the target could overwrite the same file at the source is if the target's version was modified and had a timestamp newer than the source's version. It cannot replicate old data. After checking further, administrators found that only one user had that problem, so they attributed it to a user error.
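That "newest write wins" rule can be reduced to a one-line decision. The sketch below is a deliberate simplification -- DFSR's real conflict resolution also involves version vectors and update records -- but it illustrates why a stale, unmodified copy on the target can never overwrite newer data on the source.

```python
from datetime import datetime

def winning_version(source_mtime, target_mtime):
    """Simplified sketch of the last-writer-wins conflict rule:
    the copy with the newer last-write time replaces the other.
    (DFSR's actual logic is richer; this only shows why old,
    unmodified data cannot win a conflict.)
    """
    return "target" if target_mtime > source_mtime else "source"

# The target's copy overwrites the source only if it was itself
# modified more recently.
print(winning_version(datetime(2008, 5, 2), datetime(2008, 5, 1)))  # source
```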

As for the "files" in the staging directory, these are not simply copies of the replicated files; staging entries also carry the changes being replicated along with metadata such as RDC signatures, RDC hashes and USN journal data, in addition to the file data itself.

There is not necessarily a 1:1 relationship between entries in the staging directory and the physical files being replicated. Data in the staging directory is used to determine whether the file on the source or the target is more recent, and then to replicate accordingly. In a dynamic environment with large amounts of data being replicated, it is not unusual to see a large amount of data in the staging directory. The admins, however, took this as evidence that something was wrong.
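If you do want a number rather than a guess, measuring how much data is actually sitting in a staging directory is straightforward. Here is a minimal, illustrative helper (the staging path shown in the docstring is the usual default location, but verify it for your replicated folder):

```python
import os

def staging_size_bytes(staging_path):
    """Total bytes currently held under a DFSR staging directory.

    A large number here is often normal in a busy replication group
    and not, by itself, a sign of trouble. (Illustrative helper; the
    default staging area lives under the replicated folder's
    DfsrPrivate\\Staging directory.)
    """
    total = 0
    for root, _dirs, files in os.walk(staging_path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```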

Next, they wanted to modify the replication link to be unidirectional and replicate from source to target only. Microsoft strongly warns against doing this, as it would prevent proper evaluation of the files and would probably break replication.

Note: The concept of "hub and spoke" is in the mind of the administrator. DFSR just replicates the newest data -- wherever it is -- to the other end of the replication link. It is multi-master replication and you should not attempt to change it.

It's a waste of keystrokes to use the DFSRadmin "isPrimary" flag to kick-start replication. This flag is only for the benefit of initial population of the target share. After the share is populated upon creation of the replication group, you can set this flag all you want -- but it won't make any difference. Once initial replication has occurred, this flag is automatically cleared. Setting isPrimary manually will only help if initial replication is not working.

The solution

At this point we have debunked a number of faulty assumptions that the admins made in diagnosing this problem. The fact is, the only real problem was the failed disk. While it was offline, DFSR was working just the way it was supposed to -- saving all the changes in the staging directories. This was a good thing. If the admins had realized that and consequently discovered that the "old data overwriting new" was only one user, and just left everything alone, the problem would have self-healed.

In the end, the solution was simply to get the disk back online and restore the backed up data, then enable the replication links and let it all converge. In fact, DFSR really is quite self-healing, and is built to handle a large amount of data.

Here are some good tips for troubleshooting Distributed File System Replication:

  1. There is a specific DFSR event log that will appear on DFSR servers. Use this event log when looking for errors and warnings related to DFSR.
  2. To get a good DFSR health check, use the DFSRadmin utility's health command. This tool was engineered by Microsoft to give administrators everything they need.
  3. For example: dfsradmin health new /rgname:dfs_data /refMemName:SRV1 /repName:c:\dfsreports\SRV1-DFShealth.html /fsCount:True


    rgName -- the replication group name
    refMemName -- the name of the reference member server
    repName -- the path and file name of the report
    fsCount -- specifies whether to count the files in each folder

    You can get help for this command with: dfsradmin health new /?

  4. This command puts an HTML (and optionally an XML) report in the specified directory, which makes it easy to script a program that collects reports from all Distributed File System servers.
  5. Of course, the best troubleshooting tool is knowledge. By referring to Microsoft's DFS Web site, you'll find a plethora of help articles, including:
    * A collection of DFSR frequently asked questions
    * An excellent step-by-step guide for the DFS solution in Windows Server 2003 R2
    * An information-stocked document titled Designing Distributed File Systems
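The report collection suggested in tip 4 might be sketched as follows: build the dfsradmin health command line (using the same syntax as the example above) once per member server, then run each command with your remoting tool of choice. The server and group names here are hypothetical.

```python
def health_report_command(rg_name, member, report_dir=r"c:\dfsreports"):
    """Build a dfsradmin health command line for one member server,
    following the example syntax shown earlier. Names are illustrative.
    """
    report = r"%s\%s-DFShealth.html" % (report_dir, member)
    return ("dfsradmin health new /rgname:%s /refMemName:%s "
            "/repName:%s /fsCount:True" % (rg_name, member, report))

# One command per member; each could then be executed via subprocess
# or a remote shell to gather reports from every DFS server.
for server in ["SRV1", "SRV2"]:
    print(health_report_command("dfs_data", server))
```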

Do you have an Active Directory issue or problem that you'd like Gary to write an article about? Email him at [email protected]. Note: Gary cannot answer each query personally or guarantee that all will be answered. However those queries that have widespread interest or involve common AD issues will be addressed.

Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.
