[BUG] NetworkManager restart during hostname publish may cause loss of network connectivity #3356

@hakonhall

Description

There is a ~10% chance that the primary IPv4 address is removed from the eth0 network interface during the initial boot of a newly provisioned VM.

NetworkManager is what removes it. In fact, it appears to randomly remove (and sometimes add back) any of the IP addresses on eth0 (we have one primary, one secondary, and one site-local IPv6). The changes always happen right after waagent detects that the hostname has been changed to the Azure VM name, i.e. the log contains a line such as "Detected hostname change: pkrvmoq00ytv5qk -> h12077".
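To check for the symptom on a freshly booted VM, something like the following sketch could be used. It is not part of waagent; the helper names are hypothetical, and it assumes an iproute2 version whose `ip` command supports JSON output (`-j`). The addresses in the usage note are placeholders, not values from this report.

```python
import json
import subprocess


def addresses_on_interface(ip_json):
    """Return the set of addresses listed in `ip -j addr show` JSON output."""
    found = set()
    for link in json.loads(ip_json):
        # Each link object carries its addresses in the "addr_info" list.
        for info in link.get("addr_info", []):
            local = info.get("local")
            if local:
                found.add(local)
    return found


def missing_addresses(ip_json, expected):
    """Return the expected addresses that are absent from the interface."""
    return set(expected) - addresses_on_interface(ip_json)


def read_interface_json(ifname="eth0"):
    """Run `ip -j addr show dev <ifname>` and return its output as text."""
    output = subprocess.check_output(["ip", "-j", "addr", "show", "dev", ifname])
    return output.decode()
```

For example, `missing_addresses(read_interface_json("eth0"), {"10.0.0.4"})` would return `{"10.0.0.4"}` if that (placeholder) primary address has been removed, and an empty set otherwise.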

This matches the issue referred to in #3008: "When the agent publishes an updated hostname to DNS, it restarts the NM and then restarts the interface configuration manually. This can lead to a race condition...".
The fix for #3008 is for waagent to not restart the NM just before the hostname is published. We have verified that this also fixes our issue.

Unfortunately, #3032 limited the fix for #3008 to RHEL versions in the range [7, 8.6) (see factory.py#L119), because the changes beyond 8.6 "have not been stress tested on the distros which use RedhatOSModernUtil, and we have not reproduced the race condition using RedhatOSModernUtil". We have now been able to reproduce the issue on AlmaLinux 8.10.
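To make the half-open range [7, 8.6) concrete, here is a minimal sketch of such a version check. It is only an illustration and not the code in factory.py; `parse_version` and `in_half_open_range` are hypothetical names. Versions are compared as tuples of integers, so 8.10 sorts after 8.6 and falls outside the range.

```python
def parse_version(text):
    """Turn a dotted version string such as "8.10" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_half_open_range(version, lower, upper):
    """Return True if lower <= version < upper, comparing component-wise."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# 8.5 is inside [7, 8.6); 8.6 and 8.10 are not.
for candidate in ("7.9", "8.5", "8.6", "8.10"):
    print(candidate, in_half_open_range(candidate, "7", "8.6"))
```

Under this comparison, AlmaLinux 8.10 is treated like any other release at or beyond 8.6, which is why the NetworkManager restart still runs there.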

Therefore, in order to verify that the fix works for us, we had to patch the waagent egg that runs on boot as follows:

--- a/azurelinuxagent/common/osutil/redhat.py	2025-04-09 14:45:11.430807287 +0000
+++ b/azurelinuxagent/common/osutil/redhat.py	2025-04-09 14:46:15.065608134 +0000
@@ -270,5 +270,6 @@
         # RedhatOSUtil was updated to conditionally run NetworkManager restart in response to a race condition between
         # NetworkManager restart and the agent restarting the network interface during publish_hostname. Keeping the
         # NetworkManager restart in RedhatOSModernUtil because the issue was not reproduced on these versions.
-        shellutil.run("service NetworkManager restart")
+        logger.warn("patch: not restarting NetworkManager")
+        #shellutil.run("service NetworkManager restart")
         DefaultOSUtil.publish_hostname(self, hostname)

Distro and WALinuxAgent details (please complete the following information):

  • Distro and Version: AlmaLinux 8.10
  • WALinuxAgent version: 2.13.1.1
