Contributor

@mgunnala mgunnala commented Nov 4, 2025

Description

Issue #


This PR updates the extension signature validation code to run with a separate CPU quota (currently set to 50%), to improve validation performance and reduce impact on goal state processing time.
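
For illustration, the wrapped command line could be assembled roughly as follows. This is a minimal sketch, not the PR's code: the helper name `build_systemd_run_command`, the unit and slice names, and the `openssl version` usage line are placeholders, and passing `--unit` is an assumption. The `--slice`, `--scope`, `CPUAccounting` and `CPUQuota` options are the ones visible in the diff.

```python
def build_systemd_run_command(base_command, unit_name, slice_name, cpu_quota="50%"):
    """Wrap base_command so that systemd runs it in a transient scope with a CPU quota."""
    return [
        "systemd-run",
        "--unit={0}".format(unit_name),  # assumption: an explicit unit name is passed
        "--slice={0}".format(slice_name),
        "--scope",
        "--property=CPUAccounting=yes",
        "--property=CPUQuota={0}".format(cpu_quota),
    ] + list(base_command)


# Placeholder unit/slice names and command, for illustration only
cmd = build_systemd_run_command(["openssl", "version"], "example-validation.scope", "example.slice")
```

Keeping the original command as the tail of the list means the same `base_command` can be reused unchanged if the command ever has to run without `systemd-run`.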

PR information

  • Ensure development PR is based on the develop branch.
  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

# the validation process may take an excessive amount of time. As a workaround, if cgroups are enabled,
# we run extension signature validation in a separate cgroup with its own dedicated CPU quota.
if CGroupConfigurator.get_instance().enabled():
    systemd_cmd = [
Member

Errors contacting the dbus API are not uncommon. In that case, the extension should still be executed. See the error handling in start_extension_command.

Contributor Author

Thanks for pointing this out, I've updated the error handling logic to re-run the openssl command without 'systemd-run' in the case of systemd failures.

@mgunnala mgunnala marked this pull request as ready for review November 19, 2025 22:22
try:
    run_command(systemd_cmd, encode_output=False)
except CommandError as ex:
    # If the systemd-run invocation itself failed (e.g., systemd not available, access denied, bus errors),
Contributor

nit: D-bus

Contributor Author

Updated, thanks

    # log a warning and fall back to running command directly. If the openssl command failed, re-raise and do not retry.
    stderr_str = ex.stderr.decode('utf-8') if isinstance(ex.stderr, bytes) else ex.stderr
    unit_not_found = "Unit {0} not found.".format(EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT)
    is_systemd_failure = unit_not_found in stderr_str or EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT not in stderr_str
Contributor

Can we share the same logic that we use for extension runs? If we need to change how we detect this at some point, we may miss a place or have to change it everywhere.

Contributor Author

I've added a util function "is_systemd_failure()" in cgroupsapi.py, and used that function both in start_extension_command() and in validate_signature(). Let me know if this looks okay to you.

message="'systemd-run' invocation failed for signature validation, falling back to direct execution. Error: '{0}'".format(ex.stderr),
name=name, version=version, duration=0)
# Run without systemd
run_command(base_command, encode_output=False)
Contributor

I already mentioned this to you; I'm pointing it out again so that others know. When we run the command directly, the process is placed in the agent cgroup and gets the agent's limits. Today the agent CPU quota is 50%, which matches what you are doing, but the quota may change later. We need to think about what we want as a long-term solution.

Context: for extension runs, when systemd-run fails, we reset the quota and disable the cgroups feature in the agent, so that extensions run in the agent cgroup without limits.

Contributor Author

Discussed in our sync: we should reset the quota and disable the cgroups feature in the agent, in the case of systemd-run failures for signature validation.

Contributor Author

Updated accordingly
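
For readers following the thread, the agreed behaviour can be sketched as below. This is an illustration under stated assumptions rather than the code in the PR: `run_with_cpu_quota`, the injected `is_systemd_run_failure` and `disable_cgroups` callables, and the use of `subprocess.check_output` in place of the agent's `run_command` are all stand-ins chosen to keep the example self-contained.

```python
import subprocess


def run_with_cpu_quota(systemd_cmd, base_command, is_systemd_run_failure, disable_cgroups,
                       run=subprocess.check_output):
    """Try the systemd-run wrapped command; on a systemd-run failure, disable cgroups and run directly."""
    try:
        return run(systemd_cmd, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as ex:
        stderr = ex.stderr or b""
        if isinstance(stderr, bytes):
            stderr = stderr.decode("utf-8", errors="backslashreplace")
        if not is_systemd_run_failure(stderr):
            # The wrapped command itself failed: re-raise and do not retry
            raise
        # Hypothetical hook standing in for resetting the quota and disabling cgroups in the agent
        disable_cgroups(reason="'systemd-run' invocation failed for signature validation: {0}".format(stderr))
        # Run without systemd
        return run(base_command, stderr=subprocess.PIPE)
```

Injecting the detection and disable callables keeps the control flow testable without a real systemd instance.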

@mgunnala mgunnala force-pushed the sig_quota branch 2 times, most recently from 34d6d0f to 01b182d Compare December 10, 2025 20:21
EXTENSION_SLICE_PREFIX = "azure-vmextensions"


def is_systemd_failure(unit_name, stderr):
Member

@narrieta narrieta Dec 16, 2025

Move to azurelinuxagent/common/osutil/systemd.py and rename to is_systemd_run_failure. Also, since now this is a public method, document how it decides whether it is a systemd-run error or not.
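
One way the requested documentation could read, as a sketch built on the heuristic from the earlier hunk (the "Unit ... not found." message, or the unit name missing from stderr). It assumes stderr has already been decoded to text, and the rationale given in the docstring is an interpretation of that heuristic, not something stated in the PR.

```python
def is_systemd_run_failure(unit_name, stderr):
    """
    Heuristic for telling a failure of 'systemd-run' itself apart from a failure
    of the command it was asked to run.

    The call is treated as a systemd-run failure when either:
      * stderr contains "Unit <unit_name> not found.", or
      * stderr does not mention the unit name at all. The assumption here is that
        systemd-run reports the unit it created on stderr before starting the
        command, so a missing unit name suggests the command was never started.

    :param unit_name: The name of the systemd unit/scope
    :param stderr: Error output, already decoded to str
    :return: True if this is a systemd-run failure
    """
    unit_not_found = "Unit {0} not found.".format(unit_name)
    return unit_not_found in stderr or unit_name not in stderr
```

Because this is string matching on stderr, it depends on systemd's message wording, which is one more reason to keep it in a single shared place.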

if hasattr(stderr, 'seek') and hasattr(stderr, 'read'):
    stderr.seek(0)
    stderr_str = ustr(stderr.read(TELEMETRY_MESSAGE_MAX_LEN), encoding='utf-8', errors='backslashreplace')
elif isinstance(stderr, bytes):
Member

why do we need both bytes and str?
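
For context, the three input shapes handled in the hunk (file-like object, bytes, str) could be collapsed into a single normalization step, after which the detection logic only ever sees text. A minimal sketch, assuming a hypothetical helper name `stderr_to_text` and a placeholder `max_len` default that stands in for `TELEMETRY_MESSAGE_MAX_LEN`:

```python
import io


def stderr_to_text(stderr, max_len=1024):
    """Return stderr as str, whether it arrives as str, bytes, or a file-like object."""
    if hasattr(stderr, "seek") and hasattr(stderr, "read"):
        # File-like object: rewind and read at most max_len bytes/characters
        stderr.seek(0)
        data = stderr.read(max_len)
    else:
        data = stderr
    if isinstance(data, bytes):
        # Undecodable bytes are kept as escape sequences instead of raising
        return data.decode("utf-8", errors="backslashreplace")
    return data


text = stderr_to_text(io.BytesIO(b"Failed to connect to bus"))
```

If callers can guarantee a single type, the helper becomes unnecessary.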

:param unit_name: The name of the systemd unit/scope
:param stderr: Error output as str, bytes, or file-like object
:return: True if this is a systemd failure
Member

Suggested change
- :return: True if this is a systemd failure
+ :return: True if this is a systemd-run failure

'--slice={0}'.format(EXT_SIGNATURE_VALIDATION_SLICE), '--scope', '--property=CPUAccounting=yes',
'--property=CPUQuota={0}'.format(EXT_SIGNATURE_VALIDATION_CPU_QUOTA)] + base_command
try:
    run_command(systemd_cmd, encode_output=False)
Member

why encode_output=False?


original_run_command = shellutil.run_command

def run_command_mock(cmd, *args, **kwargs):
Member

you can use the existing 'wraps' and 'call_list' instead of implementing your own, there are several samples in the code

also, we usually mock at the level of Popen, rather than run_command (protects against changes in run_command)
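
A rough sketch of that approach, using only the standard library: `patch(..., wraps=...)` lets the real `Popen` run while the mock records every invocation, so no hand-written recorder is needed. The helper name `run_and_record` and the Python one-liner used as the command are placeholders, `call_args_list` is the standard `unittest.mock` attribute and may not be what 'call_list' refers to in the repository, and `call.args` needs Python 3.8 or later.

```python
import subprocess
import sys
from unittest.mock import patch


def run_and_record(command):
    """Run command for real while recording every Popen invocation."""
    with patch("subprocess.Popen", wraps=subprocess.Popen) as popen_mock:
        output = subprocess.check_output(command)
    # Each recorded call keeps its positional arguments; the first one is the command
    return output, [call.args[0] for call in popen_mock.call_args_list]


output, commands = run_and_record([sys.executable, "-c", "print('ok')"])
```

Because the patch sits at `Popen`, the recording keeps working if the wrapper that eventually calls `Popen` is refactored.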

return shellutil.run_command(cmd, *args, **kwargs)

with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=mock_run_command_with_error):
    with self.assertRaises(SignatureValidationError):
Member

add a message to this assert

with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=run_command_with_systemd_failure):
    validate_signature(self.vm_access_zip_path, self.vm_access_signature, self.package_name_and_version)

self.assertEqual(2, len(calls))
Member

add a message to this assert


self.assertEqual(2, len(calls))
# First command should be invoked via systemd-run
self.assertIn('systemd-run', ' '.join(calls[0]))
Member

add a message to this assert

# First command should be invoked via systemd-run
self.assertIn('systemd-run', ' '.join(calls[0]))
# Second command should be a direct openssl call (no systemd-run)
self.assertNotIn('systemd-run', ' '.join(calls[1]))
Member

add a message to this assert

# Verify that cgroups were disabled
self.assertEqual(1, mock_instance.disable.call_count, "disable() should have been called exactly once")
reason = mock_instance.disable.call_args[1]['reason']
self.assertIn("'systemd-run' invocation failed for signature validation", reason)
Member

in several checks you do "foo in formatted_command" (or similar); could you be more strict and do "startswith" or command[0] == 'foo' (or similar)?
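
To make the difference concrete, here is a small sketch of a stricter check combined with the assert messages requested above. The helper `invoked_via`, the test class, and the command lists (including the path) are made up for the example and are not from the PR's tests.

```python
import unittest


def invoked_via(command, executable):
    """True only if the first element of the argument list is the given executable."""
    return bool(command) and command[0] == executable


class CommandCheckExample(unittest.TestCase):
    def test_strict_check_ignores_matches_in_arguments(self):
        # Placeholder: a direct openssl call whose path argument happens to contain 'systemd-run'
        direct = ["openssl", "cms", "-verify", "-in", "/tmp/systemd-run-test/signature.p7s"]
        wrapped = ["systemd-run", "--scope"] + direct

        # The substring check matches the direct command too, so it cannot tell the two apart
        self.assertIn("systemd-run", " ".join(direct),
                      "The substring check is expected to match the path argument")

        self.assertTrue(invoked_via(wrapped, "systemd-run"),
                        "The first command should be invoked via systemd-run: {0}".format(wrapped))
        self.assertFalse(invoked_via(direct, "systemd-run"),
                         "The fallback command should not be invoked via systemd-run: {0}".format(direct))
```

Including the offending command in each message means a failing run shows what was actually executed, not only that the assertion failed.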

Contributor

@maddieford maddieford left a comment

No additional comments from my side. I'll give it another review after the open comments are resolved

4 participants