-
Notifications
You must be signed in to change notification settings - Fork 388
Set separate CPU quota for extension signature validation #3488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
| # the validation process may take an excessive amount of time. As a workaround, if cgroups are enabled, | ||
| # we run extension signature validation in a separate cgroup with its own dedicated CPU quota. | ||
| if CGroupConfigurator.get_instance().enabled(): | ||
| systemd_cmd = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors contacting the dbus API are not uncommon, In that case, the extension should still be executed. See the error handling in start_extension_command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I've updated the error handling logic to re-run the openssl command without 'systemd-run' in the case of systemd failures.
| try: | ||
| run_command(systemd_cmd, encode_output=False) | ||
| except CommandError as ex: | ||
| # If the systemd-run invocation itself failed (e.g., systemd not available, access denied, bus errors), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: D-bus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks
| # log a warning and fall back to running command directly. If the openssl command failed, re-raise and do not retry. | ||
| stderr_str = ex.stderr.decode('utf-8') if isinstance(ex.stderr, bytes) else ex.stderr | ||
| unit_not_found = "Unit {0} not found.".format(EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT) | ||
| is_systemd_failure = unit_not_found in stderr_str or EXT_SIGNATURE_VALIDATION_CGROUPS_UNIT not in stderr_str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we share the same logic what we do for extensions-run? If we need to change at some point how we detect, we may miss or required to change in all places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a util function "is_systemd_failure()" in cgroupsapi.py, and used that function both in start_extension_command() and in validate_signature(). Let me know if this looks okay to you.
| message="'systemd-run' invocation failed for signature validation, falling back to direct execution. Error: '{0}'".format(ex.stderr), | ||
| name=name, version=version, duration=0) | ||
| # Run without systemd | ||
| run_command(base_command, encode_output=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already mentioned to you, just pointing out again to let others know. When we run directly, process now placed in agent cgroup and it will get agent limits. Today agent cpu at 50% which is matching what you are doing but quota may change later. We need to think about long term solution what we want to do.
Context: On the extension run, when systemd-run failures, we reset the quota and disable the cgroups feature in the agent. So that extensions run in the agent cgroup without limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in our sync: we should reset the quota and disable the cgroups feature in the agent, in the case of systemd-run failures for signature validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly
34d6d0f to
01b182d
Compare
azurelinuxagent/ga/cgroupapi.py
Outdated
| EXTENSION_SLICE_PREFIX = "azure-vmextensions" | ||
|
|
||
|
|
||
| def is_systemd_failure(unit_name, stderr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to azurelinuxagent/common/osutil/systemd.py and rename to is_systemd_run_failure. Also, since now this is a public method, document how it decides whether it is a systemd-run error or not.
azurelinuxagent/ga/cgroupapi.py
Outdated
| if hasattr(stderr, 'seek') and hasattr(stderr, 'read'): | ||
| stderr.seek(0) | ||
| stderr_str = ustr(stderr.read(TELEMETRY_MESSAGE_MAX_LEN), encoding='utf-8', errors='backslashreplace') | ||
| elif isinstance(stderr, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need both bytes and str?
azurelinuxagent/ga/cgroupapi.py
Outdated
| :param unit_name: The name of the systemd unit/scope | ||
| :param stderr: Error output as str, bytes, or file-like object | ||
| :return: True if this is a systemd failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| :return: True if this is a systemd failure | |
| :return: True if this is a systemd-run failure |
| '--slice={0}'.format(EXT_SIGNATURE_VALIDATION_SLICE), '--scope', '--property=CPUAccounting=yes', | ||
| '--property=CPUQuota={0}'.format(EXT_SIGNATURE_VALIDATION_CPU_QUOTA)] + base_command | ||
| try: | ||
| run_command(systemd_cmd, encode_output=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why encode_output=False?
|
|
||
| original_run_command = shellutil.run_command | ||
|
|
||
| def run_command_mock(cmd, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use the existing 'wraps' and 'call_list' instead of implementing your own, there are several samples in the code
also, we usually mock at the level of Popen, rather than run_command (protects against changes in run_command)
| return shellutil.run_command(cmd, *args, **kwargs) | ||
|
|
||
| with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=mock_run_command_with_error): | ||
| with self.assertRaises(SignatureValidationError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a message to this assert
| with patch("azurelinuxagent.ga.signature_validation_util.run_command", side_effect=run_command_with_systemd_failure): | ||
| validate_signature(self.vm_access_zip_path, self.vm_access_signature, self.package_name_and_version) | ||
|
|
||
| self.assertEqual(2, len(calls)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a message to this assert
|
|
||
| self.assertEqual(2, len(calls)) | ||
| # First command should be invoked via systemd-run | ||
| self.assertIn('systemd-run', ' '.join(calls[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a message to this assert
| # First command should be invoked via systemd-run | ||
| self.assertIn('systemd-run', ' '.join(calls[0])) | ||
| # Second command should be a direct openssl call (no systemd-run) | ||
| self.assertNotIn('systemd-run', ' '.join(calls[1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a message to this assert
| # Verify that cgroups were disabled | ||
| self.assertEqual(1, mock_instance.disable.call_count, "disable() should have been called exactly once") | ||
| reason = mock_instance.disable.call_args[1]['reason'] | ||
| self.assertIn("'systemd-run' invocation failed for signature validation", reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in several checks you do "foo in formatted_command" (or similar), could you be more strict and so "startswith" or command[0] == 'foo' (or similar)
maddieford
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No additional comments from my side. I'll give it another review after the open comments are resolved
Description
Issue #
This PR updates the extension signature validation code to run with a separate CPU quota (currently set to 50%), to improve validation performance and reduce impact on goal state processing time.
PR information
developbranch.Quality of Code and Contribution Guidelines