Troubleshooting DHCP Failback Issues on VMware

In Part 3 I discussed how to activate the scope(s), add vendor custom classes and create a hot standby for failover, along with a couple of real-world tips.

The next part of the blog discusses some issues I’ve found with running this on VMware and real-world information that you need to be aware of.

It covers the following points:

  • DHCP synchronization issues, their symptoms and how to investigate them
  • How time sync works in VMware and how it affects DHCP

Scenario

In this scenario we have DHCP globally distributed on a VMware platform in a hub and spoke topology. The spoke sites are spread all over Europe and fail over to a hub in a single geographical location. This means that the time zones across the infrastructure are different.

DHCP Symptoms that show time sync issues

During the failover discussed in Part 3, a DHCP log is generated that shows when the switchover time completes and the primary server enters a state of 'PARTNER DOWN'. Once you turn the server back on, the state should change automatically and the log should record 'RECOVER to RECOVER_WAIT'. Where a time sync issue occurs, this transition is never recorded in the log.

But what happens if you do not see this? Double-check that the server is powered on, has definitely come back online and is connected to the network. A simple ping test will prove this.

The next step is to log on, open the DHCP console and check visually what is occurring. Open the properties of the IPv4 node and select the Failover tab. If the Status shows Starting, something is wrong. The DHCP console does not provide any further detail, so the next step is to look at the event logs.
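The same status can also be read from PowerShell rather than the console. This is a minimal sketch, assuming it is run in an elevated session on the DHCP server itself and that the DhcpServer module (installed with the DHCP role or RSAT) is available:

```powershell
# List every IPv4 failover relationship on the local DHCP server
# together with its role and current state.
Get-DhcpServerv4Failover |
    Select-Object Name, PartnerServer, Mode, ServerRole, State |
    Format-Table -AutoSize
```

A State value other than Normal on a server that has been back online for a while is the cue to move on to the event logs.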

You can open the event logs for DHCP and search for Event ID 20253, which is covered fully in this article.

You can use the GUI or PowerShell to find this Event ID. My preference is PowerShell.

Event Logs (GUI)

Open the Event Viewer and go to 'Applications and Services Logs > Microsoft > Windows > DHCP-Server > Microsoft-Windows-DHCP Server Events > Admin'

Filter for EventID 20253

Event Logs (PowerShell)

Open PowerShell (ISE, Windows Terminal or VS Code) as an administrator.

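# Find DHCP failover events (IDs 202xx) that mention time synchronization and save them to a file (C:\Temp must already exist)
Get-WinEvent -FilterHashtable @{ ProviderName = 'Microsoft-Windows-DHCP-Server' } | Where-Object { ($_.Id -like '202*') -and ($_.Message -match 'time synchronization') } | Format-Table -AutoSize -Wrap | Out-File C:\Temp\DHCP_Time_Sync_Log.txt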

What Next?

If you discover that you do have a time sync issue, here's a quick summary of the next steps. I'll then discuss how to move forward.

  • You confirm that you have a time sync issue between the primary DHCP server (Spoke) and the failover partner (Hub)

    Event ID:       20253
    Date and Time:  Time of detection of time being out of sync with partner server
    Computer:       DHCP server host name
    User:           NETWORK SERVICE
    Description:    The server detected that it is out of time synchronization with partner server: <host name> for failover relationship: <relationship name>. The time is out of sync by: <# of seconds> seconds.
    OpCode:         TimeOutOfSync
    Task Category:  DHCP Failover
    Level:          Error

    (DHCP Time Synchronization Issue)

  • How to safely failback the DHCP service back to the primary server
  • Review VMware ESXi hosts and check time source
  • Review VMware Tools to confirm time sync settings for Guest OS
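To put a rough number on the skew that Event ID 20253 reports, you can compare the two clocks directly. The sketch below is an illustration only: the partner name `dhcp-hub01` is hypothetical, and it assumes PowerShell remoting is enabled on the partner. The result includes the network round trip, so treat it as an approximation.

```powershell
# Hypothetical partner name; replace with your hub (failover partner) host name.
$partner = 'dhcp-hub01'

# Read the partner's clock as UTC ticks (a plain integer survives remoting unchanged).
$remoteTicks = Invoke-Command -ComputerName $partner -ScriptBlock { [DateTime]::UtcNow.Ticks }
$localTicks  = [DateTime]::UtcNow.Ticks

# Positive means the local clock is ahead of the partner.
$skewSeconds = [TimeSpan]::FromTicks($localTicks - $remoteTicks).TotalSeconds
"Approximate clock skew against ${partner}: $([math]::Round($skewSeconds, 1)) seconds"

# Show which time source the local Windows Time service is using.
w32tm /query /source
```

Both values are taken in UTC, so the different time zones of the sites do not distort the comparison.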

Failback DHCP Service

As previously highlighted, when the DHCP service is unable to fail back, the primary server will be stuck in a status of Starting. A simple way to fail it back is the following:

  • Open Services and locate the DHCP Server service
  • Restart the service. This will force the service to fail back
  • Restarting the server itself will not trigger the failback, so make sure you restart the service
  • Alternatively use PowerShell: Restart-Service DHCPServer
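The steps above can be sketched in PowerShell as a restart followed by a short wait for the relationship to settle. This is a sketch under assumptions: it runs elevated on the primary server, the DhcpServer module is available, and the five-minute limit and 15-second interval are arbitrary values chosen for illustration.

```powershell
# Restart the DHCP Server service on the primary (spoke) server.
Restart-Service -Name DHCPServer

# Poll the failover state until every relationship reports Normal,
# or give up after five minutes.
$deadline = (Get-Date).AddMinutes(5)
do {
    Start-Sleep -Seconds 15
    $relationships = Get-DhcpServerv4Failover
    $relationships | Select-Object Name, PartnerServer, State | Format-Table -AutoSize
    $pending = $relationships | Where-Object { $_.State -ne 'Normal' }
} while ($pending -and ((Get-Date) -lt $deadline))
```

If a relationship is still not Normal when the loop ends, go back to the event log before restarting the service again.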

You can then check the DHCP server logs and the console to ensure that the failback has occurred.

# Find DHCP failover events (IDs 202xx) that mention the failover relationship and save them to a file (C:\Temp must already exist)
Get-WinEvent -FilterHashtable @{ ProviderName = 'Microsoft-Windows-DHCP-Server' } | Where-Object { ($_.Id -like '202*') -and ($_.Message -match 'Relationship') } | Format-Table -AutoSize -Wrap | Out-File C:\Temp\Relationship.txt

If the remediation has worked, you'll see an entry in the log that highlights Conflict Done. The log does not explicitly tell you that the DHCP service running on the primary server has returned to Normal, so verify this in the DHCP console by checking the status on the Failover tab. If this is now correct, you can move on to investigating the ESXi host(s) that may have the wrong time set.

Review and Investigate ESXi Hosts and VMware Tools

  • On the ESXi host, check how the time is set and whether it is using a time source
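If you manage the hosts through vCenter, this check can be done for every host in one pass. The sketch below assumes VMware PowerCLI is installed and that you have already connected with Connect-VIServer. It is one possible way to gather the information, not the only one.

```powershell
# Report each ESXi host's current clock, configured NTP servers and ntpd service state.
# Assumes an existing PowerCLI session (Connect-VIServer has already been run).
Get-VMHost | Sort-Object Name | ForEach-Object {
    $esx = $_

    # Ask the host for its current time through the vSphere DateTimeSystem API.
    $dts = Get-View -Id $esx.ExtensionData.ConfigManager.DateTimeSystem

    # The ntpd service entry shows whether NTP is running and its startup policy.
    $ntpd = Get-VMHostService -VMHost $esx | Where-Object { $_.Key -eq 'ntpd' }

    [pscustomobject]@{
        Host        = $esx.Name
        HostTime    = $dts.QueryDateTime()
        NtpServers  = (Get-VMHostNtpServer -VMHost $esx) -join ', '
        NtpdRunning = $ntpd.Running
        NtpdPolicy  = $ntpd.Policy
    }
} | Format-Table -AutoSize
```

Hosts with an empty NtpServers column or a stopped ntpd service are the first candidates to investigate.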
Real World Tips

If your DHCP servers are distributed across a wide range of geographic locations, the different time zones can introduce challenges that cause DHCP sync issues. The following are true:

  • The time on an ESXi host does not initially affect the DHCP server. However, when the Guest OS restarts it is forced to use the time of the ESXi host, which causes a time sync issue and the inability to fail back

VMware’s documentation highlights this point in the following article

  • By default, VMware Tools on a Guest OS is not set to sync time from the ESXi host(s). You can check this using the VMware documentation linked above
  • Check and confirm whether any other Guest OS relies on the ESXi host time
  • If no problems exist, update the ESXi hosts so they have a common time across all geographic locations
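The VMware Tools setting mentioned in the list above can be read from inside the Windows guest with the VMware Tools command-line utility. This sketch assumes VMware Tools is installed in its default location, so adjust the path if yours differs.

```powershell
# Default VMware Tools install path on Windows (an assumption; adjust if needed).
$toolbox = 'C:\Program Files\VMware\VMware Tools\VMwareToolboxCmd.exe'

if (Test-Path $toolbox) {
    # Prints Enabled or Disabled for periodic time sync with the ESXi host.
    & $toolbox timesync status
} else {
    Write-Warning "VMwareToolboxCmd.exe not found at $toolbox"
}
```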

This is not an ideal solution but it resolves the issue. You should spend time looking into how the change may affect other applications, but without it DHCP failbacks will fail without manual intervention.

To finish off, check for Event ID 20253. If the time sync issue has been resolved, the errors should stop being logged and you should see a set of informational events being generated instead.
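A quick way to run this final check is to filter on the Event ID and a start time directly. The 24-hour window below is an arbitrary choice for this sketch, so widen or narrow it to match when you applied the fix.

```powershell
# Look back 24 hours for time sync errors from the DHCP Server provider.
$since = (Get-Date).AddHours(-24)

$events = Get-WinEvent -FilterHashtable @{
    ProviderName = 'Microsoft-Windows-DHCP-Server'
    Id           = 20253
    StartTime    = $since
} -ErrorAction SilentlyContinue   # Get-WinEvent raises an error when nothing matches

if ($events) {
    $events | Select-Object TimeCreated, Id, LevelDisplayName, Message | Format-List
} else {
    "No Event ID 20253 entries since $since."
}
```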

I hope this information is helpful.
