Go Package Import Differences
Baseline: a6298b3

Files inventory check summary
File checks results against ancestor 5f07fe85:
Results for datadog-agent_7.78.0~devel.git.767.1659ddd.pipeline.103728784-1_amd64.deb: No change detected

Static quality checks
✅ Please find below the results from static quality gates
Successful checks
29 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)

Regression Detector Results
Baseline: a6298b3
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -4.40 | [-7.39, -1.40] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +1.47 | [-0.16, +3.10] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +1.20 | [+1.06, +1.34] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.64 | [+0.39, +0.89] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.47 | [+0.24, +0.69] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.37 | [+0.21, +0.53] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.31 | [+0.27, +0.35] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | +0.23 | [+0.17, +0.29] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.12 | [+0.07, +0.18] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.03 | [-0.37, +0.42] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.19, +0.20] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.01 | [-0.06, +0.05] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.20, +0.19] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.02 | [-0.20, +0.15] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.03 | [-0.21, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.03 | [-0.12, +0.05] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.04 | [-0.55, +0.47] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.06 | [-0.48, +0.36] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.21 | [-0.26, -0.16] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_memory | memory utilization | -0.36 | [-0.43, -0.28] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.67 | [-0.81, -0.52] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.86 | [-0.97, -0.75] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -4.40 | [-7.39, -1.40] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 719 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 276.81MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 706 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 173.44MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 494.94MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 203.59MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 367.43 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 407.76MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
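The three criteria above can be sketched as a small predicate. This is only an illustration of the stated rule, not the Regression Detector's code: the `Experiment` type, its field names, and `isRegression` are hypothetical names chosen for the sketch.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the figures reported for one table row.
// The type and its field names are hypothetical, chosen for this sketch.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool    // true if the experiment's configuration marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the explanation.
const effectSizeTolerance = 5.00

// isRegression reports whether all three criteria hold at once.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// Figures taken from the docker_containers_cpu row above.
	cpu := Experiment{Name: "docker_containers_cpu", DeltaMeanPct: -4.40, CILow: -7.39, CIHigh: -1.40}
	fmt.Println(cpu.Name, isRegression(cpu))
}
```

For the docker_containers_cpu row the interval excludes zero, but |−4.40| is below the 5.00% tolerance, so the predicate returns false, which matches the ➖ marker in the table.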
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
0f93f0b to 5d620f6
Introduces the dogtelextension OTel Collector extension and refactors otel-agent startup to support standalone mode (DD_OTEL_STANDALONE=true), enabling the otel-agent to run independently without a core Datadog Agent.

Key changes:
- dogtelextension (comp/otelcol/dogtelextension): New OTel Collector extension providing a tagger gRPC server, host metadata submission, and secrets resolution for standalone mode.
- Standalone/connected FX split (cmd/otel-agent/subcommands/run): Refactors otel-agent startup into commonAgentFxOptions plus mode-specific standaloneAgentFxOptions / connectedAgentFxOptions. Standalone mode wires local hostname, real secrets backend, local tagger, host metadata runner, and disables on-init config sync. Connected mode keeps remote hostname, remote tagger, and core-agent config sync.
- K8s tag enrichment (comp/core/workloadmeta/collectors/catalog-otel): New catalog-otel workloadmeta catalog (kubelet, containerd, docker, ECS, crio, podman) compiled into otel-agent via the new kubelet build tag. In standalone mode the infraattributes processor enriches spans, metrics, and logs with K8s tags (kube_deployment, kube_namespace, pod_name, etc.) via the local tagger. Deployments require DD_KUBERNETES_KUBELET_HOST=status.hostIP, DD_KUBELET_TLS_VERIFY=false (or CA cert), and nodes/proxy RBAC on the otel-agent ServiceAccount for K8s tag enrichment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
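As a rough illustration of the common-plus-mode-specific split described in this commit message, the sketch below composes an option list from a shared base and one of two mode-specific sets. It deliberately avoids the fx library: `Option` is a plain string standing in for an fx option, the component labels are paraphrased from the commit message, and `agentOptions`, `standaloneFromEnv`, and the "collector" entry are hypothetical names, not the PR's code. The real agent reads the setting through its config (`otel_standalone`) rather than parsing the environment directly.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Option stands in for an fx option; a string label keeps the sketch
// free of third-party imports. All labels below are illustrative.
type Option string

// commonOptions are wired in both modes (placeholder entry).
func commonOptions() []Option { return []Option{"collector"} }

// standaloneOptions mirror the standalone wiring listed in the commit message.
func standaloneOptions() []Option {
	return []Option{"local-hostname", "secrets-backend", "local-tagger", "host-metadata-runner"}
}

// connectedOptions mirror the connected wiring listed in the commit message.
func connectedOptions() []Option {
	return []Option{"remote-hostname", "remote-tagger", "config-sync"}
}

// agentOptions returns the common options followed by the mode-specific ones.
func agentOptions(standalone bool) []Option {
	opts := append([]Option{}, commonOptions()...)
	if standalone {
		return append(opts, standaloneOptions()...)
	}
	return append(opts, connectedOptions()...)
}

// standaloneFromEnv treats an unset or unparsable DD_OTEL_STANDALONE as false.
func standaloneFromEnv(getenv func(string) string) bool {
	v, err := strconv.ParseBool(getenv("DD_OTEL_STANDALONE"))
	return err == nil && v
}

func main() {
	fmt.Println(agentOptions(standaloneFromEnv(os.Getenv)))
}
```

Defaulting to connected mode when the variable is absent matches the backward-compatibility point raised in the review thread below; whether that remains the default is an open question in that thread.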
0dcb28d to 3d7b219
truthbk left a comment
Super clean bootstrap! Also love how you were able to bring in the best of both worlds with fx + actual otel extension interfaces; and that resolves the extension configuration issue very cleanly. This is awesome.
We have to talk about what the otel-agent should default to, but this is a great start.
    )
    }

    if acfg.GetBool("otel_standalone") {
So, I have some doubts with this: should we instead consider a check on otel_bundled? Or !acfg.GetBool("otel_standalone")?
On one hand this is better because it's backward compatible with our operator and helm charts. On the other it's not ideal because we'd have to set an env var when deploying with the otel operator/helm. We really do want to make a strong attempt to minimize the number of steps our OTel customers need to take on tooling we don't have full control over. Let's discuss this.
Customers would have to set env vars in the otel operator/helm already, e.g. DD_OTELCOLLECTOR_ENABLED. Setting one more env var is probably fine.
I think we should optimize to minimize the number of options a customer needs to set on the OpenTelemetry operator/helm chart. I feel like we can get away with a lot more of that transparently on the DD side.
I'm fine with merging this as-is, but I also think there's a chance we'll want to revisit this specifically.
…andalone mode
- Apply dogtelextension settings to DD agent pkgconfig only when
otel_standalone=true; connected mode leaves core agent config untouched.
- Make EnableMetadataCollection a *bool (like KubeletTLSVerify) so absence
preserves the agent default rather than forcing false.
- Add MetadataInterval default (1800 s) to comment.
- Gate standalone block with pkgconfig.GetBool("otel_standalone").
- Add TestDogtelExtensionConfig_ConnectedModeIgnored to assert dogtelextension
fields are no-ops in connected mode.
- Tests use DD_OTEL_STANDALONE=true env var for standalone test cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
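The reason for making EnableMetadataCollection a *bool can be shown with a short sketch: a plain bool cannot distinguish "not set" from "set to false", while a nil pointer can. The `Config` type here is a trimmed-down hypothetical stand-in, and `resolveBool` is an illustrative helper, not a function from the PR.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down stand-in for the extension config.
type Config struct {
	// nil means "not set in the extension config".
	EnableMetadataCollection *bool
}

// resolveBool returns the override when one is set, otherwise the agent default.
func resolveBool(override *bool, agentDefault bool) bool {
	if override == nil {
		return agentDefault
	}
	return *override
}

func main() {
	off := false
	unset := Config{}
	disabled := Config{EnableMetadataCollection: &off}

	// An absent field keeps the agent default.
	fmt.Println(resolveBool(unset.EnableMetadataCollection, true))
	// An explicit false overrides the agent default.
	fmt.Println(resolveBool(disabled.EnableMetadataCollection, true))
}
```

With a non-pointer bool, the first case would decode to false and silently disable a feature the user never mentioned.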
60768d2 to 4c5b322
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4c5b322a08
    for name, val := range extensions {
        if !strings.HasPrefix(name, "dogtel") {
            continue
        }
        extcfg := &dogtelextensionimpl.Config{}
Pick dogtel extension config deterministically
This loop returns the first dogtel* entry encountered in a Go map, but map iteration order is randomized, so configs with multiple dogtel instances (for example dogtel plus dogtel/custom) can apply different overrides across runs. In standalone mode that can silently switch hostname/secrets/kubelet settings to the wrong extension instance, especially when only one instance is actually enabled in service.extensions.
This should be a singleton so we should be OK, but this is a potentially true concern for bad manual configuration. Maybe we should log something at the DEBUG level to reflect explicitly what extension instance defined in the config is being used.
There should be at most one dogtel extension in the config; otherwise the behavior is nondeterministic. I added a check that errors out when there are multiple dogtel extensions.
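A minimal sketch of the "at most one dogtel extension" check discussed in this thread is shown below. It collects every matching key before deciding, so the result no longer depends on Go's randomized map iteration order. `findDogtelExtension` and its signature are hypothetical; the PR's actual function is getDogtelExtensionConfig, whose body is not shown here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// findDogtelExtension returns the single extension whose name starts with
// "dogtel". It returns an empty name when there is none and an error when
// there is more than one. The function name and signature are hypothetical.
func findDogtelExtension(extensions map[string]any) (string, any, error) {
	var matches []string
	for name := range extensions {
		if strings.HasPrefix(name, "dogtel") {
			matches = append(matches, name)
		}
	}
	// Sort so the error message is stable across runs.
	sort.Strings(matches)

	switch len(matches) {
	case 0:
		return "", nil, nil
	case 1:
		return matches[0], extensions[matches[0]], nil
	default:
		return "", nil, fmt.Errorf("expected at most one dogtel extension, found %d: %s",
			len(matches), strings.Join(matches, ", "))
	}
}

func main() {
	exts := map[string]any{"dogtel": nil, "dogtel/custom": nil, "health_check": nil}
	_, _, err := findDogtelExtension(exts)
	fmt.Println(err)
}
```

Returning the matched name also makes it easy to log which instance is in use, as suggested earlier in the thread.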
jeremy-hanna left a comment
👍 for agent-runtime owned files
truthbk left a comment
Looks good to me. Added a couple of nits you can feel free to ignore. I do think for the actual standalone vs connected default path we may have to make some changes, but we can do that later once we take on the deployment question more specifically. At that point we'll have a better understanding of what's better.
…er stream subscribers

- getDogtelExtensionConfig now returns an error when multiple dogtel* extension entries are found instead of silently picking one
- stopTaggerServer replaces unbounded GracefulStop() with a 5-second timeout that falls back to Stop(), preventing long-lived TaggerStreamEntities subscribers from blocking otel-agent termination
- Add unit tests for both changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@codex review
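The bounded-shutdown pattern described for stopTaggerServer can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `stopper` is a hypothetical interface covering the two methods that a gRPC server exposes (GracefulStop and Stop), `hangingServer` is a fake that imitates a server held open by a long-lived stream, and `demoTimeout` is shortened from the 5 seconds mentioned above so the example finishes quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stopper is a hypothetical interface covering the two shutdown methods used.
type stopper interface {
	GracefulStop()
	Stop()
}

// stopWithTimeout tries a graceful stop and falls back to a hard stop once the
// timeout elapses. It reports whether the graceful path completed in time.
func stopWithTimeout(s stopper, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		s.GracefulStop()
		close(done)
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		// Assumes Stop unblocks a pending GracefulStop, as the fake below does.
		s.Stop()
		<-done
		return false
	}
}

// hangingServer imitates a server kept alive by a long-lived stream:
// GracefulStop blocks until Stop is called.
type hangingServer struct {
	unblock chan struct{}
	once    sync.Once
}

func newHangingServer() *hangingServer {
	return &hangingServer{unblock: make(chan struct{})}
}

func (h *hangingServer) GracefulStop() { <-h.unblock }
func (h *hangingServer) Stop()         { h.once.Do(func() { close(h.unblock) }) }

// demoTimeout is shortened for the example.
const demoTimeout = 100 * time.Millisecond

func main() {
	graceful := stopWithTimeout(newHangingServer(), demoTimeout)
	fmt.Println("graceful:", graceful)
}
```

Waiting on `done` after the hard stop avoids leaking the goroutine, at the cost of relying on Stop to release the graceful call.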
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90662a7436
…ders list
Setting metadata_interval in the dogtel extension config was replacing
metadata_providers wholesale with a single {name: host} entry, silently
dropping any other providers (e.g. "resources") configured in datadog.yaml.
Read the existing providers first, update the host entry in place (or
append it if absent), then write back the merged list. Handle both
map[string]interface{} and the map[interface{}]interface{} type that YAML
v2 produces for maps inside sequences.
Add a regression test that pre-seeds a "resources" provider in datadog.yaml
and asserts it survives alongside the updated host interval.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
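The merge described in this commit message could look roughly like the sketch below: walk the existing providers, update the host entry in place if present, append it otherwise, and accept both map shapes. The helper names are hypothetical, and the "interval" key is an assumption about the provider entry layout; only the "name" key and the "host" and "resources" provider names come from the text above.

```go
package main

import "fmt"

// providerName extracts the "name" field from a provider entry, accepting both
// map[string]any and the map[any]any shape that YAML v2 yields inside sequences.
func providerName(entry any) (string, bool) {
	switch m := entry.(type) {
	case map[string]any:
		name, ok := m["name"].(string)
		return name, ok
	case map[any]any:
		name, ok := m["name"].(string)
		return name, ok
	}
	return "", false
}

// setInterval writes the interval into either map shape.
// The "interval" key is an assumption made for this sketch.
func setInterval(entry any, seconds int) {
	switch m := entry.(type) {
	case map[string]any:
		m["interval"] = seconds
	case map[any]any:
		m["interval"] = seconds
	}
}

// mergeHostInterval updates the host provider in place, or appends one when it
// is absent, and leaves every other provider untouched.
func mergeHostInterval(providers []any, seconds int) []any {
	merged := make([]any, 0, len(providers)+1)
	found := false
	for _, p := range providers {
		if name, ok := providerName(p); ok && name == "host" {
			setInterval(p, seconds)
			found = true
		}
		merged = append(merged, p)
	}
	if !found {
		merged = append(merged, map[string]any{"name": "host", "interval": seconds})
	}
	return merged
}

func main() {
	existing := []any{map[any]any{"name": "resources", "interval": 300}}
	fmt.Println(len(mergeHostInterval(existing, 1800)))
}
```

Starting from a list that holds only a "resources" provider, the merged list has two entries, so the pre-existing provider survives alongside the new host entry.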
Add TestFxRun_NoDatadogExporter_Standalone and its config fixture to cover the case where the otel-agent runs in standalone mode with no datadog exporter in the pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
Adds standalone mode support to the otel-agent (DD_OTEL_STANDALONE=true) and introduces the dogtelextension OTel Collector extension for Datadog Agent functionalities.
Key changes:
Motivation
Standalone Dogtel Agent
Describe how you validated your changes
Additional Notes
Deployments using infraattributes in standalone mode require: