Add CSI registry allow list check to admission webhook #48011
Conversation
When using CSI-based library injection, the admission webhook now checks the registry allow list before adding CSI volumes to pods. If a library's registry is not in the allow list, the webhook skips adding the CSI volume entirely. This provides defense in depth alongside the CSI driver's own registry check — the webhook gives clean UX (no unnecessary volumes), while the CSI driver acts as a security backstop. The config key admission_controller.auto_instrumentation.csi_registry_allow_list is set via the Helm chart from the same value used for the CSI driver's DD_REGISTRY_ALLOW_LIST.
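The core of the webhook-side check can be sketched as a simple registry match against the configured allow list. This is an illustrative stand-in (the function name and shape are hypothetical, not the webhook's actual API); the real logic lives in the admission webhook's mutator:

```go
package main

import (
	"fmt"
	"strings"
)

// registryAllowed reports whether an image's registry appears in the
// allow list. An empty allow list permits every registry, which is the
// backward-compatible default described above.
func registryAllowed(image string, allowList []string) bool {
	if len(allowList) == 0 {
		return true
	}
	// The registry is the component before the first "/" in a fully
	// qualified image reference, e.g. "gcr.io/datadoghq/dd-lib-python-init:v2".
	registry := image
	if i := strings.Index(image, "/"); i >= 0 {
		registry = image[:i]
	}
	for _, allowed := range allowList {
		if registry == allowed {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"gcr.io", "registry.datadoghq.com"}
	fmt.Println(registryAllowed("gcr.io/datadoghq/dd-lib-python-init:v2", allow)) // true
	fmt.Println(registryAllowed("fake.registry.invalid/lib:v1", allow))          // false
	fmt.Println(registryAllowed("fake.registry.invalid/lib:v1", nil))            // true (empty list)
}
```

When the check fails, the webhook simply skips adding the CSI volume rather than rejecting the pod, which is why it pairs well with the CSI driver's own backstop check.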
Files inventory check summary
File checks results against ancestor 1eed9e23: Results for datadog-agent_7.79.0~devel.git.285.683994c.pipeline.105039631-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from static quality gates
Successful checks
On-wire sizes (compressed)
Force-pushed from 6dfd81c to 321c129
Use ParseEnvAsStringSlice (the established pattern in the codebase) to register an env var transformer that splits comma-separated values. This replaces the ad-hoc splitStringSlice wrapper and is consistent with how other string slice configs handle the Viper limitation (e.g. apm_config.features, apm_config.ignore_resources).
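The transformation that `ParseEnvAsStringSlice` registers can be sketched generically as a comma splitter with whitespace trimming. This is illustrative only; the agent's actual helper lives in its config package and its exact signature may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSV mimics the transformation applied to comma-separated env vars
// such as DD_APM_FEATURES: split on commas, trim whitespace, and drop
// empty entries, working around Viper's lack of native slice env vars.
func splitCSV(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(splitCSV("gcr.io, registry.datadoghq.com,,docker.io "))
	// [gcr.io registry.datadoghq.com docker.io]
}
```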
Force-pushed from 321c129 to f46dad4
Regression Detector Results
Metrics dashboard · Baseline: d3d4c0e
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.31 | [+1.25, +7.37] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.31 | [+1.25, +7.37] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.71 | [-0.89, +2.30] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.56 | [+0.51, +0.61] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | +0.35 | [+0.29, +0.40] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.22 | [+0.07, +0.36] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.08 | [-0.08, +0.24] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.06 | [-0.46, +0.58] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.03 | [-0.40, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.20, +0.21] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.20] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.02 | [-0.06, +0.02] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.06 | [-0.46, +0.35] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.09 | [-0.17, -0.01] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.18 | [-0.25, -0.11] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.29 | [-0.45, -0.12] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.30 | [-0.53, -0.08] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.40 | [-0.51, -0.29] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.51 | [-0.66, -0.35] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.57 | [-0.63, -0.50] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.60 | [-0.77, -0.42] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.63 | [-0.69, -0.57] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.77 | [-1.01, -0.53] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 674 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 272.38MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 693 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.81MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 494.24MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 205.27MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 364.58 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 417.75MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance of a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
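The three criteria above combine into a single decision rule, sketched here for illustration (the actual Regression Detector implementation is not part of this PR):

```go
package main

import "fmt"

// isRegression applies the criteria listed above: effect size of at
// least 5%, a 90% confidence interval that excludes zero, and the
// experiment not being marked erratic.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	const tolerance = 5.0
	bigEnough := deltaMeanPct >= tolerance || deltaMeanPct <= -tolerance
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// docker_containers_cpu from the table above: +4.31 [+1.25, +7.37].
	// The CI excludes zero, but the effect is below the 5% tolerance,
	// so it is not flagged.
	fmt.Println(isRegression(4.31, 1.25, 7.37, false)) // false
}
```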
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
The allow-list check now lives in the admission webhook (mutator.go) and covers all injection modes (init-container, image-volume, CSI), not just CSI. Rename the config key accordingly.
Test three cases:
- Empty allow list permits injection from any registry (backward compat)
- Registry in allow list permits injection
- Registry not in allow list blocks injection and sets error annotation
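The three cases above map naturally onto a table-driven check. A minimal sketch, using a hypothetical `allowInjection` stand-in for the webhook's decision (the real unit test exercises the webhook's actual mutator):

```go
package main

import (
	"fmt"
	"strings"
)

// allowInjection is a hypothetical stand-in for the webhook's
// allow-list decision, used only to illustrate the three test cases.
func allowInjection(registry string, allowList []string) bool {
	if len(allowList) == 0 {
		return true // backward compat: empty list permits everything
	}
	for _, a := range allowList {
		if strings.EqualFold(registry, a) {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		name      string
		registry  string
		allowList []string
		want      bool
	}{
		{"empty allow list permits any registry", "fake.registry.invalid", nil, true},
		{"registry in allow list permits injection", "gcr.io", []string{"gcr.io"}, true},
		{"registry not in allow list blocks injection", "fake.registry.invalid", []string{"gcr.io"}, false},
	}
	for _, c := range cases {
		got := allowInjection(c.registry, c.allowList)
		fmt.Printf("%s: got=%v want=%v\n", c.name, got, c.want)
	}
}
```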
…-allow-list-webhook
Replace the global.containerRegistryAllowList approach with setting DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_CONTAINER_REGISTRY_ALLOW_LIST directly via clusterAgent.env, so the test works with the current Helm chart version without needing the helm-charts PR to merge first. Two test cases:
- TestRegistryAllowListBlocked: allow list = fake.registry.invalid; injection is blocked and the error annotation is set
- TestRegistryAllowListAllowed: allow list = registry.datadoghq.com (the injector's registry); injection proceeds and traces arrive
test/new-e2e/tests/ssi/ssi_test.go
func (v *ssiSuite) TestRegistryAllowListBlocked() {
It would be great to merge both scenarios into a single test with two sub-tests, since setting up the stack takes about 5 minutes per scenario.
Or maybe InjectionAllowedByAllowList could be tested as part of another existing scenario. I know it's not perfect, but I think performance matters, and we should try to limit the number of scenarios.
Ahh, yeah, for sure. I didn't realize an UpdateEnv caused a full stack setup.
These tests should be mergeable, let me do that.
Deploy two apps in the same cluster with allow list = registry.datadoghq.com:
- registry-allow-list-allowed: uses the default injector; injection proceeds
- registry-allow-list-blocked: a pod annotation overrides the injector image to fake.registry.invalid (not in the allow list); injection is blocked
This avoids a second UpdateEnv call (and second cluster setup) by using the admission.datadoghq.com/apm-inject.custom-image annotation to point one pod at an injector registry that is not in the allow list.
The previous check only validated the injector image registry. A user could bypass the allow list by annotating a pod with admission.datadoghq.com/python-lib.custom-image pointing to an arbitrary registry. Now InjectAPMLibraries checks every library's registry against the allow list in addition to the injector's registry. Adds a unit test for the library registry case and a third e2e scenario (LibraryRegistryBlockedByAllowList) that uses a python-lib.custom-image annotation pointing to fake.registry.invalid to verify the check fires.
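The fix described above amounts to validating the injector image and every library image through the same check, so a custom-image annotation on any one library can no longer slip past. A minimal sketch with hypothetical helper names (the real logic is in InjectAPMLibraries):

```go
package main

import (
	"fmt"
	"strings"
)

// imageRegistry extracts the registry component (everything before the
// first "/") from a fully qualified image reference.
func imageRegistry(image string) string {
	if i := strings.Index(image, "/"); i >= 0 {
		return image[:i]
	}
	return image
}

// checkAllRegistries validates the injector image and every library
// image against the allow list. An empty allow list permits everything.
func checkAllRegistries(injectorImage string, libImages []string, allow map[string]bool) error {
	for _, img := range append([]string{injectorImage}, libImages...) {
		if len(allow) > 0 && !allow[imageRegistry(img)] {
			return fmt.Errorf("registry %q for image %q is not in the allow list", imageRegistry(img), img)
		}
	}
	return nil
}

func main() {
	allow := map[string]bool{"registry.datadoghq.com": true}
	err := checkAllRegistries(
		"registry.datadoghq.com/apm-inject:latest",
		[]string{"fake.registry.invalid/dd-lib-python-init:v2"},
		allow,
	)
	fmt.Println(err) // non-nil: the python library's registry is not allowed
}
```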
What Does This Do
Adds a registry allow list for CSI-based APM library injection. When the CSI injection mode is used, the admission controller webhook now checks the configured registry allow list before adding CSI volumes to pods. If a library's registry is not permitted, the webhook skips the CSI volume entirely — the pod starts cleanly without blocked volumes.
This complements the CSI driver's own registry check (see DataDog/datadog-csi-driver#36). The webhook provides clean UX (no unnecessary volumes added), while the CSI driver acts as a security backstop (returns success on rejection so pods aren't blocked).
How to Test
The Helm chart (DataDog/helm-charts, branch knusbaum/csi-driver-registry-allow-list) passes datadog-csi-driver.apm.registryAllowList to the cluster-agent as DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_CSI_REGISTRY_ALLOW_LIST.
I have several injector-dev scenarios, yet to be published.
Related PRs
Jira