Recently I had a situation where an external time server sent incorrect time to a company's primary domain controller
and caused severe replication failures until we figured out what happened. This was an interesting exercise not only in diagnosing such a problem but also in fixing it.
Because Kerberos security is inherently dependent on all computers being within a small time skew -- five minutes by default in an Active Directory domain -- it needs a reliable time reference. Within the domain, this simply means that every computer's system clock must be within five minutes of every other computer's clock, and that time zone calculations are not a factor. Furthermore, what matters is relative agreement, not necessarily the actual time.
So if all of the computers on the network think it's 10:05 a.m. on July 5, 1999, then there is no problem. If you are scheduling meetings in Outlook, then, of course, it will be a problem. But AD certainly doesn't care because the clocks are all within five minutes of each other.
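The relative-time rule can be sketched in a few lines of Python. This is only an illustration of the comparison, not how Kerberos is implemented; the function name within_skew and the sample clock readings are hypothetical, and the readings are assumed to be already converted to UTC, which is why time zones drop out.

```python
from datetime import datetime, timedelta

# Default tolerance cited in the article: five minutes.
MAX_SKEW = timedelta(minutes=5)


def within_skew(clock_a: datetime, clock_b: datetime,
                max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if two clock readings differ by no more than max_skew."""
    return abs(clock_a - clock_b) <= max_skew


# Both machines are wrong about the year, but they agree with each other.
dc_clock = datetime(1999, 7, 5, 10, 5, 0)
client_clock = datetime(1999, 7, 5, 10, 7, 30)
print(within_skew(dc_clock, client_clock))  # True

# One machine is a full year behind the other.
good_clock = datetime(2008, 1, 21, 23, 38, 0)
bad_clock = datetime(2007, 1, 21, 23, 38, 0)
print(within_skew(good_clock, bad_clock))  # False
```

The first pair passes even though both clocks show the wrong year, which is the point of the paragraph above: only the difference between the two readings is checked.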
When considering how Windows time services and Kerberos authentication work together to provide reliable and secure communication -- not only for user logon but also for various AD services such as replication -- look at the role of an external time source. The PDC emulator is the default time server in a Windows domain. All DCs source time from the domain PDC, and all clients source time from their authenticating DC.
An option is to sync the PDC with an external time source. In the United States, the U.S. Naval Observatory is a popular one. Again, this isn't necessary for Active Directory, but many applications such as Outlook and other scheduling applications do need accurate time, so most organizations do sync to an external time source.
In the case I mentioned, the company was not in the U.S., and it had chosen to sync with its country's government-controlled time server. The initial problem was that replication was failing, and the domain controllers were logging Event ID 4 in the system event log. Interestingly, there were several different descriptions for Event 4, but the one we looked closely at was this one:
1/21/2008 11:38:24 PM 1 0 4 Kerberos
N/A Corp-DC1 The kerberos client received a
KRB_AP_ERR_MODIFIED error from the server host/Corp-PDC.Corp.net. The target name used was cifs/Corp-PDC.Corp.net. This indicates that the password used to encrypt the kerberos service ticket is different than that on the target server. Commonly, this is due to identically named machine accounts in the target realm (Corp.NET), and the client realm.
We picked up various other events, such as Event ID 1925 in the DS log with "target principal name is incorrect." In addition, DCdiag showed these replication errors:
[Replications Check,Corp-DC1] A recent replication attempt
failed: From Corp-PDC to Corp-DC1 Naming Context:
The replication generated an error (-2146893022):
The target principal name is incorrect
That event and various Netlogon errors were filling up the system log on the DCs in one site as well as the PDC. The combination of a complaint in Event 4 about passwords used to encrypt the Kerberos service ticket "being out of synch between the DC and PDC" and another error in the event log, "target principal name is incorrect," indicated that most likely the secure channel password was out of sync. Attempting to force replication produced the "target principal name is incorrect" error, and attempting to do a Net Time or Net View yielded "access denied" errors.
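The negative number in the DCdiag output is a 32-bit status code printed as a signed integer. Converting it to unsigned hexadecimal makes it easier to look up. The helper name to_hex_status below is hypothetical; the result, 0x80090322, is the value listed in the Windows security headers as SEC_E_WRONG_PRINCIPAL, whose message text is "The target principal name is incorrect."

```python
def to_hex_status(signed_code: int) -> str:
    """Format a signed 32-bit status code as unsigned hexadecimal."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement value
    # as an unsigned 32-bit number.
    return f"0x{signed_code & 0xFFFFFFFF:08X}"


# The code reported by DCdiag in the replication check.
print(to_hex_status(-2146893022))  # 0x80090322
```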
Correcting this is fairly simple: use Netdom resetpwd to reset the secure channel password between the PDC and the DCs. But another error was perplexing. Running Repadmin /Replsum /bysrc /bydest /sort:delta showed that these same DCs had not replicated for longer than the tombstone lifetime, so replication was shut off. Yet the last successful replication had been only a day or two earlier.
We decided to fix the Secure Channel password issue with the following procedure:
- Purge the Kerberos tickets:
Klist Purge (manually enter a response to prompts for each ticket)
- Disable the KDC service and reboot the DC.
- After logging back in to the DC, run this command:
Netdom resetpwd /server:Corp-PDC /userd:Administrator /passwordd:*
A confirmation message will state that the machine account has been reset.
- Set the KDC service to automatic and reboot the DC.
Note that it may be sufficient to simply stop the KDC service rather than disable it and reboot, and then start it again after running the Netdom resetpwd command, without rebooting at all. If that doesn't work, then use the reboot method shown here.
- You can map the IPC$ to generate new tickets:
Net Use \\Corp-PDC.corp.net\IPC$ /User:‹username›
- Optionally you can delete and regenerate the replication connections between the DC and PDC, but I found that isn't really necessary.
- Repeat for other DCs logging these errors.
That fixed the Kerberos errors, and Net Time and Net View worked. But why had replication failed because of the tombstone lifetime violation? Further investigation uncovered a W32Time error, shown in Figure 1.
Figure 1: A W32Time error was uncovered
This indicates that the time was set back by –31535999 seconds, which is one second short of a 365-day year (31,536,000 seconds). Then about an hour later, another Event 52 indicated that the clock was corrected by +31535999 seconds. That means the government's time server mistakenly set the clock back by one year. In the hour before it was corrected, machine passwords were reset and replication was attempted. Replication saw that the last successful replication appeared to be a year off, which is longer than the tombstone lifetime, so it stopped. We fixed this by setting the Lingering Object registry flags on the problem DCs:
ValueName = Strict Replication Consistency
Data Type = Reg_DWORD
Value Data = 0
Created the following value:
ValueName = Allow Replication with Divergent and Corrupt Partner
Data Type = Reg_DWORD
Value Data = 1
Then we waited for replication with all DCs and reset the "Strict Replication Consistency" value to 1 and deleted the "Allow replication with Divergent and corrupt partner" value. Replication was healthy again.
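The arithmetic behind the Event 52 offset and the tombstone check can be worked through as follows. This is a rough sketch, not the logic the directory service actually runs: the function name exceeds_tombstone is hypothetical, and the 60-day and 180-day tombstone lifetimes are assumed example values, since the actual setting in this forest is not stated here.

```python
from datetime import timedelta

# Offset reported by the W32Time Event 52 warnings, in seconds.
clock_jump = timedelta(seconds=31_535_999)

# A 365-day year for comparison.
one_year = timedelta(days=365)
print(int(one_year.total_seconds()))  # 31536000
print(one_year - clock_jump)          # 0:00:01


def exceeds_tombstone(apparent_gap: timedelta, tombstone_days: int) -> bool:
    """Return True if the apparent time since the last replication
    is longer than the tombstone lifetime."""
    return apparent_gap > timedelta(days=tombstone_days)


# Assumed example lifetimes; either one is far shorter than the jump.
print(exceeds_tombstone(clock_jump, 60))   # True
print(exceeds_tombstone(clock_jump, 180))  # True

# A real gap of a day or two would not have tripped the check.
print(exceeds_tombstone(timedelta(days=2), 60))  # False
```

With any plausible tombstone lifetime, a one-year jump in the clock makes a partner that replicated yesterday look long overdue, which matches what Repadmin reported.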
A week later, the exact same thing happened with DCs in several other sites. We again noticed two Event ID 52 warnings. We fixed it by resetting the secure channel password between each DC and the PDC using Netdom Resetpwd as shown previously and then opened up replication with the two registry values, resetting them after successful replication.
So how could we prevent this? Microsoft has a cool solution. KB 884776 provides registry values that disallow time changes larger than a defined amount. We used them to tell the time service not to accept a change of more than 15 minutes. You can set the limit separately for positive and negative corrections; we set it to 15 minutes each way. We haven't seen the problem since.
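The idea behind the limit can be sketched like this. It is a simplified illustration of the accept-or-reject decision, not the time service's actual code. I believe the registry values in question are named MaxPosPhaseCorrection and MaxNegPhaseCorrection and are expressed in seconds, but treat those names and units as an assumption and check the KB article; the function accept_correction is hypothetical.

```python
# 15 minutes each way, expressed in seconds (assumed unit; see the KB article).
MAX_POS_CORRECTION = 15 * 60  # largest forward change accepted
MAX_NEG_CORRECTION = 15 * 60  # largest backward change accepted


def accept_correction(offset_seconds: int,
                      max_pos: int = MAX_POS_CORRECTION,
                      max_neg: int = MAX_NEG_CORRECTION) -> bool:
    """Return True if a proposed clock change is within the allowed limits.

    A positive offset moves the clock forward; a negative one moves it back.
    """
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg


print(accept_correction(120))          # True: a two-minute nudge is fine
print(accept_correction(-31_535_999))  # False: the one-year rollback is refused
print(accept_correction(31_535_999))   # False: so is a one-year jump forward
```

With limits like these in place, the bad sample from the external server would have been rejected instead of being applied to the PDC.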
The moral of the story is that syncing to an external time source is something you have to worry about because, apparently, not all time sources are created equal. In this case, I advised them to find a better time source. They are not hard to find -- just search the internet for NTP time servers. Make sure you use an NTP time server, not SNTP.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He wrote Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Olsen is a Microsoft MVP for Windows Server-File Systems.