
Unwinding USN rollback when faced with AD replication failure

No matter what the cause, solving Active Directory replication failure can be a difficult task. When USN rollback is involved, however, it presents even more of a challenge.

As I noted in my previous article on solving unintentional replication failure, update sequence number (USN) rollback is a possible cause.

Microsoft's KB 875495 and KB 885875 articles indicate "it is difficult to detect and recover from a USN rollback." Some interpret this to mean that some magical error inexplicably causes a USN rollback. Remember, a USN rollback is used -- often intentionally -- to roll Active Directory back to a previous known state as a means of correcting an error such as the accidental deletion of objects.

While USN rollback is a powerful tool in the recovery of objects, unintentional USN rollback will cause replication failure and an inconsistent Active Directory that produces no errors and is virtually undetectable by normal means, such as repadmin /showrepl. This has much more severe consequences than simply disabling replication. The problem is described quite well in KB 875495, with some excellent examples, but it does take a while to understand.

What causes USN rollback?

Very simply, USN rollback is typically caused by restoring a domain controller from an image created by a product like Norton Ghost, from a volume snapshot, or from a virtual machine image used in VMware or Microsoft virtualization products. Other scenarios are listed in the KB, which states:

"The only supported way to roll back the contents of Active Directory or the local state of an Active Directory domain controller is to use an Active Directory-aware backup and restoration utility to restore a system state backup that originated from the same operating system installation and the same physical or virtual computer that is being restored."

The key to making a rollback work is resetting the invocation ID for the AD database. The invocation ID tracks the version of the database on a DC. If the invocation ID is not reset when the database is restored, it will cause gaps in Active Directory between the restored DC and other DCs.

The problem comes when you use an unsupported method -- such as an image -- to restore a DC to a previous state without resetting the invocation ID. This prevents the other DCs from replicating changes that were made after the image was recorded, up to the current time. Normally you'd think the changes would be replicated. However, the other DCs will act as if they have already replicated with the restored DC, so no changes will be replicated. In fact, they will never replicate.
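The effect can be pictured with a small toy model. The sketch below is not how Active Directory is implemented; it only assumes that a partner remembers, per invocation ID, the highest USN it has already received and asks only for changes above that number. The names `SourceDC`, `PartnerDC` and `simulate` are hypothetical and exist only for this illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SourceDC:
    """Toy model of a DC that originates changes (hypothetical, not an AD API)."""
    invocation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    highest_usn: int = 0
    changes: dict = field(default_factory=dict)  # usn -> object name

    def write(self, name):
        # Every local change consumes the next USN.
        self.highest_usn += 1
        self.changes[self.highest_usn] = name

    def snapshot(self):
        # Stands in for a disk image or VM snapshot of the DC.
        return (self.invocation_id, self.highest_usn, dict(self.changes))

    def restore(self, image, reset_invocation_id):
        inv, usn, changes = image
        self.highest_usn = usn
        self.changes = dict(changes)
        # A supported restore issues a new invocation ID; an image restore does not.
        self.invocation_id = str(uuid.uuid4()) if reset_invocation_id else inv


@dataclass
class PartnerDC:
    """Toy replication partner that tracks what it has seen from each source."""
    utd_vector: dict = field(default_factory=dict)  # invocation_id -> highest USN seen
    objects: set = field(default_factory=set)

    def pull_from(self, source):
        seen = self.utd_vector.get(source.invocation_id, 0)
        for usn in sorted(source.changes):
            if usn > seen:  # only changes above the remembered USN are requested
                self.objects.add(source.changes[usn])
        self.utd_vector[source.invocation_id] = max(seen, source.highest_usn)


def simulate(reset_invocation_id):
    """Return True if a change made after the restore reaches the partner."""
    dc1, dc2 = SourceDC(), PartnerDC()
    for i in range(3):
        dc1.write(f"user{i}")        # USNs 1 to 3
    image = dc1.snapshot()           # image recorded at USN 3
    for i in range(3, 6):
        dc1.write(f"user{i}")        # USNs 4 to 6
    dc2.pull_from(dc1)               # partner now remembers USN 6 for dc1
    dc1.restore(image, reset_invocation_id)
    dc1.write("new-after-restore")   # reuses USN 4 when the USN was rolled back
    dc2.pull_from(dc1)
    return "new-after-restore" in dc2.objects


print("image restore, same invocation ID:", simulate(False))
print("restore with new invocation ID:   ", simulate(True))
```

In this model the image restore reuses USN 4 under the old invocation ID, so the partner, which already remembers USN 6, silently skips the new change and reports nothing wrong. With a fresh invocation ID the partner has no record for the source and picks the change up.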

Tools like repadmin /showreps will show that replication is working, because replication events do not detect the failure here. You may detect inconsistencies in Active Directory, however.

For instance, suppose user accounts were created on a DC after the backup of the restored DC was taken. Because the invocation ID was not reset, the restored DC will never get those accounts, so authentication can fail when those users attempt to authenticate against the restored DC. Of course, this affects all objects, including replication topology, FSMO role holders, security groups and memberships, DNS records and others. To make matters worse, this DC will never catch up, since the other DCs will act as if the restored DC already has the missing objects.

Detecting USN rollback

The challenge is to detect and fix this. KB 875495 is actually a hotfix for servers running versions earlier than Windows Server 2003 SP1. It causes Event 2095 to be logged if a DC sends a USN that was previously known without a change in the invocation ID. When that happens, the Netlogon service on the restored domain controller is paused, preventing authentication to that DC. The KB article also provides a sample Event 2095 with an extensive description of the problem and the actions needed.

There are other ways to detect USN rollback, including the repadmin /showutdvec command, which KB 875495 describes. With this command, you can show the up-to-dateness vector table on each DC and see whether there is a discrepancy. Of course, you first have to suspect the problem, or be monitoring the event logs for Event 2095, to think of using this command. If you have restored a DC using an image, snapshot or virtual machine image, you should monitor for this error.
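The comparison itself is simple once the numbers are in hand. The following sketch assumes you have already collected, by hand or by script, the USN that each partner's up-to-dateness vector records for the suspect DC, along with the highest USN the suspect DC currently reports for itself. The function name `find_rollback_suspects` and the sample values are hypothetical; no repadmin output is parsed here.

```python
def find_rollback_suspects(local_highest_usn, partner_views):
    """Return partners that remember a higher USN for a DC than the DC itself reports.

    local_highest_usn: highest USN the suspect DC currently reports for itself.
    partner_views: mapping of partner name -> USN recorded for the suspect DC
                   in that partner's up-to-dateness vector.
    """
    return sorted(
        partner
        for partner, usn_seen in partner_views.items()
        if usn_seen > local_highest_usn
    )


# Hypothetical values: DC2 remembers USN 7200, but the restored DC is only at 5000.
suspects = find_rollback_suspects(5000, {"DC2": 7200, "DC3": 4990})
print(suspects)
```

A non-empty result means at least one partner has seen a USN from this DC that the DC no longer knows about, which is the discrepancy to look for. An empty result does not prove the DC is healthy; it only means this particular check found nothing.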

The solution here is a familiar one -- manually demote the restored DC and re-promote it. That's the only way you'll get AD to be consistent on all DCs. Of course, you can also use the Install From Media (IFM) feature to save time, but remember: You still must use "AD-aware" backup software to back up a valid DC in the domain for the IFM procedure to work. For more details on this, check out my article on Recovering a DC using Install From Media.

The best solution is not to use any of the "illegal" restoration methods noted in this article and in KB 875495, especially if you already use "AD-aware" backup software. If you are using virtualization for hosting DCs, take note of KB 888794 for more information.

Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.
