Set separate CPU quota for extension signature validation #3488

mgunnala · 2025-11-04T21:30:46Z

Description

Issue #

This PR updates the extension signature validation code to run with a separate CPU quota (currently set to 50%), to improve validation performance and reduce impact on goal state processing time.
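For readers skimming the thread, here is a minimal sketch of how a command can be wrapped so that systemd runs it in a scope with its own CPU quota. The `systemd-run` flags are the ones visible in the hunks reviewed in this thread; the constant values and the function name `build_systemd_run_command` are hypothetical, not the agent's actual code.

```python
# Hypothetical names and values for illustration; the agent's real constants are not shown in this thread.
SIGNATURE_VALIDATION_SLICE = "example-signature-validation.slice"
SIGNATURE_VALIDATION_CPU_QUOTA = "50%"


def build_systemd_run_command(base_command):
    """Prefix base_command so systemd runs it in a scope with its own CPU quota."""
    return [
        "systemd-run",
        "--slice={0}".format(SIGNATURE_VALIDATION_SLICE),
        "--scope",
        "--property=CPUAccounting=yes",
        "--property=CPUQuota={0}".format(SIGNATURE_VALIDATION_CPU_QUOTA),
    ] + list(base_command)
```

The original command stays untouched at the end of the list, so the same `base_command` can still be run directly if the wrapper cannot be used.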

PR information

Ensure development PR is based on the develop branch.
The title of the PR is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
If applicable, the PR references the bug/issue that it fixes in the description.
New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

I have read the contribution guidelines.

azurelinuxagent/ga/signature_validation_util.py

narrieta · 2025-11-05T00:28:38Z

azurelinuxagent/ga/signature_validation_util.py

+        # the validation process may take an excessive amount of time. As a workaround, if cgroups are enabled,
+        # we run extension signature validation in a separate cgroup with its own dedicated CPU quota.
+        if CGroupConfigurator.get_instance().enabled():
+            systemd_cmd = [


Errors contacting the dbus API are not uncommon. In that case, the extension should still be executed. See the error handling in start_extension_command.

Thanks for pointing this out, I've updated the error handling logic to re-run the openssl command without 'systemd-run' in the case of systemd failures.

azurelinuxagent/ga/signature_validation_util.py

nagworld9 · 2025-11-20T18:07:06Z

azurelinuxagent/ga/signature_validation_util.py

+            try:
+                run_command(systemd_cmd, encode_output=False)
+            except CommandError as ex:
+                # If the systemd-run invocation itself failed (e.g., systemd not available, access denied, bus errors),


Updated, thanks

nagworld9 · 2025-11-20T18:09:51Z

azurelinuxagent/ga/signature_validation_util.py

+                # log a warning and fall back to running command directly. If the openssl command failed, re-raise and do not retry.
+                stderr_str = ex.stderr.decode('utf-8') if isinstance(ex.stderr, bytes) else ex.stderr
+                unit_not_found = "Unit {0} not found.".format(EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT)
+                is_systemd_failure = unit_not_found in stderr_str or EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT not in stderr_str


Can we share the same logic we use for running extensions? If we ever need to change how we detect this failure, we might miss a spot or have to change it in every place.

I've added a util function "is_systemd_failure()" in cgroupsapi.py, and used that function both in start_extension_command() and in validate_signature(). Let me know if this looks okay to you.
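A minimal sketch of such a shared helper, built only from the two conditions in the hunk above. It assumes stderr has already been decoded to a string; the version that was actually merged is not shown in this thread and may differ.

```python
def is_systemd_failure(unit_name, stderr):
    """
    Heuristic from the reviewed hunk: treat the error as a systemd-run failure when
    stderr reports the unit as not found, or when stderr never mentions the unit.
    Assumes stderr is a str.
    """
    unit_not_found = "Unit {0} not found.".format(unit_name)
    return unit_not_found in stderr or unit_name not in stderr
```

With this heuristic, an error that mentions the unit name without the "not found" message is attributed to the wrapped command rather than to systemd-run.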

nagworld9 · 2025-11-20T18:23:25Z

azurelinuxagent/ga/signature_validation_util.py

+                        message="'systemd-run' invocation failed for signature validation, falling back to direct execution. Error: '{0}'".format(ex.stderr),
+                        name=name, version=version, duration=0)
+                    # Run without systemd
+                    run_command(base_command, encode_output=False)


I already mentioned this to you; just pointing it out again so others know. When we run the command directly, the process is placed in the agent cgroup and gets the agent's limits. Today the agent CPU quota is 50%, which matches what you are doing, but that quota may change later. We need to think about what long-term solution we want.

Context: on the extension run, when systemd-run fails, we reset the quota and disable the cgroups feature in the agent, so that extensions run in the agent cgroup without limits.

Discussed in our sync: we should reset the quota and disable the cgroups feature in the agent, in the case of systemd-run failures for signature validation.

Updated accordingly
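The agreed behavior can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `run_with_quota`, `systemd_prefix`, and the `disable_cgroups` callback are hypothetical names, and the standard library's `subprocess.run` stands in for the agent's `run_command`.

```python
import subprocess


def run_with_quota(base_command, systemd_prefix, unit_name, disable_cgroups, run=subprocess.run):
    """
    Run base_command under systemd-run (systemd_prefix + base_command).
    If systemd-run itself failed, call disable_cgroups and retry directly;
    if the wrapped command failed, re-raise without retrying.
    """
    try:
        run(systemd_prefix + base_command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return
    except subprocess.CalledProcessError as ex:
        stderr = ex.stderr or b""
        if isinstance(stderr, bytes):
            stderr = stderr.decode("utf-8", errors="backslashreplace")
        # Same two conditions as the hunk reviewed earlier in this thread.
        unit_not_found = "Unit {0} not found.".format(unit_name)
        if unit_name in stderr and unit_not_found not in stderr:
            raise  # the unit is mentioned, so the wrapped command is assumed to have failed
        reason = "'systemd-run' invocation failed for signature validation: {0}".format(stderr)
    # systemd-run failure: disable cgroups first, then run the command without the wrapper
    disable_cgroups(reason=reason)
    run(base_command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
```

Passing `disable_cgroups` and `run` as parameters keeps the sketch testable without systemd; the direct run is only attempted after the disable callback has been called.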

azurelinuxagent/ga/signature_validation_util.py

narrieta · 2025-12-16T18:27:44Z

azurelinuxagent/ga/cgroupapi.py

 EXTENSION_SLICE_PREFIX = "azure-vmextensions"


+def is_systemd_failure(unit_name, stderr):


Move to azurelinuxagent/common/osutil/systemd.py and rename to is_systemd_run_failure. Also, since now this is a public method, document how it decides whether it is a systemd-run error or not.

narrieta · 2025-12-16T19:14:07Z

azurelinuxagent/ga/cgroupapi.py

+    if hasattr(stderr, 'seek') and hasattr(stderr, 'read'):
+        stderr.seek(0)
+        stderr_str = ustr(stderr.read(TELEMETRY_MESSAGE_MAX_LEN), encoding='utf-8', errors='backslashreplace')
+    elif isinstance(stderr, bytes):


why do we need both bytes and str?

narrieta · 2025-12-16T19:14:29Z

azurelinuxagent/ga/cgroupapi.py

+    
+    :param unit_name: The name of the systemd unit/scope
+    :param stderr: Error output as str, bytes, or file-like object
+    :return: True if this is a systemd failure


Suggested change

:return: True if this is a systemd failure

:return: True if this is a systemd-run failure

narrieta · 2025-12-16T19:47:59Z

azurelinuxagent/ga/signature_validation_util.py

+                            '--slice={0}'.format(EXT_SIGNATURE_VALIDATION_SLICE), '--scope', '--property=CPUAccounting=yes',
+                            '--property=CPUQuota={0}'.format(EXT_SIGNATURE_VALIDATION_CPU_QUOTA)] + base_command
+            try:
+                run_command(systemd_cmd, encode_output=False)


why encode_output=False?

narrieta · 2025-12-16T20:28:58Z

tests/ga/test_signature_validation_sudo.py

+
+        original_run_command = shellutil.run_command
+
+        def run_command_mock(cmd, *args, **kwargs):


you can use the existing 'wraps' and 'call_list' instead of implementing your own, there are several samples in the code

also, we usually mock at the level of Popen, rather than run_command (protects against changes in run_command)
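A minimal sketch of mocking at the Popen level, using only the standard library's `unittest.mock` (`wraps` and `call_args_list`). The repository's own test helpers that the reviewer refers to are not shown here and may look different; `run_tool` and `commands_seen` are hypothetical names.

```python
import subprocess
import sys
from unittest.mock import patch


def run_tool(args):
    """Stand-in for the code under test: runs a command and returns its stdout."""
    return subprocess.check_output(args)


def commands_seen(action):
    """Run action() with Popen wrapped and return the command of every process it started."""
    # wraps= forwards each call to the real Popen while the mock records the arguments
    with patch("subprocess.Popen", wraps=subprocess.Popen) as popen_mock:
        action()
    return [call[0][0] for call in popen_mock.call_args_list]
```

Because the patch sits below `run_command`, a test written this way keeps working if `run_command` changes how it invokes the process, as long as it still goes through `subprocess.Popen`.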

narrieta · 2025-12-16T20:37:48Z

tests/ga/test_signature_validation_sudo.py

+                return shellutil.run_command(cmd, *args, **kwargs)
+
+            with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=mock_run_command_with_error):
+                with self.assertRaises(SignatureValidationError):


add a message to this assert

narrieta · 2025-12-16T20:37:59Z

tests/ga/test_signature_validation_sudo.py

+            with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=run_command_with_systemd_failure):
+                validate_signature(self.vm_access_zip_path, self.vm_access_signature, self.package_name_and_version)
+
+                self.assertEqual(2, len(calls))


add a message to this assert

narrieta · 2025-12-16T20:38:06Z

tests/ga/test_signature_validation_sudo.py

+
+                self.assertEqual(2, len(calls))
+                # First command should be invoked via systemd-run
+                self.assertIn('systemd-run', ' '.join(calls[0]))


add a message to this assert

narrieta · 2025-12-16T20:38:16Z

tests/ga/test_signature_validation_sudo.py

+                # First command should be invoked via systemd-run
+                self.assertIn('systemd-run', ' '.join(calls[0]))
+                # Second command should be a direct openssl call (no systemd-run)
+                self.assertNotIn('systemd-run', ' '.join(calls[1]))


add a message to this assert

narrieta · 2025-12-16T20:41:51Z

tests/ga/test_signature_validation_sudo.py

+                # Verify that cgroups were disabled
+                self.assertEqual(1, mock_instance.disable.call_count, "disable() should have been called exactly once")
+                reason = mock_instance.disable.call_args[1]['reason']
+                self.assertIn("'systemd-run' invocation failed for signature validation", reason)


In several checks you do "foo in formatted_command" (or similar). Could you be more strict and use "startswith" or command[0] == 'foo' (or similar)?
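To illustrate why the stricter check matters: a substring test on the joined command also matches when the program name only appears inside an argument. A small sketch, with a hypothetical helper name and made-up command lines:

```python
def is_invoked_via(command, program):
    """True only when program is the executable, i.e. the first element of the argument list."""
    return len(command) > 0 and command[0] == program


wrapped = ["systemd-run", "--scope", "openssl", "cms", "-verify"]
# A direct call whose argument happens to contain the string "systemd-run"
direct = ["openssl", "cms", "-verify", "-in", "/tmp/systemd-run.sig"]

# The substring check cannot tell the two apart
assert "systemd-run" in " ".join(direct)

# The strict check can, and each assert carries a message
assert is_invoked_via(wrapped, "systemd-run"), "first command should be invoked via systemd-run"
assert not is_invoked_via(direct, "systemd-run"), "second command should be a direct openssl call"
```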

maddieford

No additional comments from my side. I'll give it another review after the open comments are resolved

mgunnala and others added 3 commits October 28, 2025 18:10

Set CPU quota for signature validation

344efef

Add UT

cfdafb3

Merge branch 'develop' into sig_quota

d22ec02

narrieta reviewed Nov 5, 2025

narrieta reviewed Nov 5, 2025

mgunnala and others added 2 commits November 17, 2025 12:03

Merge branch 'develop' into sig_quota

020de1c

Update systemd error handling

b2009b4

mgunnala commented Nov 19, 2025

mgunnala marked this pull request as ready for review November 19, 2025 22:22

mgunnala requested review from ZhidongPeng, maddieford and nagworld9 as code owners November 19, 2025 22:22

nagworld9 reviewed Nov 20, 2025

maddieford reviewed Nov 20, 2025

Add is_systemd_failure util function

fc2254c

mgunnala force-pushed the sig_quota branch from fffd387 to e4f5bda Compare November 21, 2025 19:07

Fix UT

6a34f6a

mgunnala force-pushed the sig_quota branch from e4f5bda to 6a34f6a Compare November 21, 2025 19:15

mgunnala and others added 3 commits December 2, 2025 09:34

Merge branch 'develop' into sig_quota

705251e

Disable cgroups on systemd-run error

3cfff5b

Merge branch 'develop' into sig_quota

8fcf6de

mgunnala force-pushed the sig_quota branch 2 times, most recently from 34d6d0f to 01b182d Compare December 10, 2025 20:21

Fix UT

09c2236

mgunnala force-pushed the sig_quota branch from 01b182d to 09c2236 Compare December 10, 2025 20:31

narrieta reviewed Dec 16, 2025

Merge branch 'develop' into sig_quota

ce98917

maddieford reviewed Dec 17, 2025

mgunnala added 2 commits December 18, 2025 17:31

Update UT and address comments

82b373f

Update assert msg

bab7989

Pylint

9467455

mgunnala force-pushed the sig_quota branch from 3a12663 to 9467455 Compare December 19, 2025 00:14


4 participants
