[logs] Implement Journaling Payload to Disk for Network Outages by angel-ddog · Pull Request #48143 · DataDog/datadog-agent

angel-ddog · 2026-03-20T19:35:06Z

What does this PR do?

Adds an opt-in disk retry mechanism to the logs sender. During network outages, when the HTTP/TCP destination enters its retry loop and the sender buffer fills up, payloads that would otherwise be silently dropped are now written to disk. When connectivity recovers, the payloads are replayed in FIFO order back through the normal send path.

This feature is disabled by default. Setting logs_config.disk_retry.max_size_bytes to a non-zero value enables it.
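The store-and-replay flow described above can be sketched roughly as follows. This is a toy illustration in Python, not the Go implementation in this PR: the `DiskSpillQueue` class, its method names, and the file naming scheme are all hypothetical. Only the `.retry` extension is taken from the QA script in this PR.

```python
import os
import time
from typing import Callable, List


class DiskSpillQueue:
    """Toy spill-to-disk queue: one file per payload, replayed oldest first."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._seq = 0

    def store(self, payload: bytes) -> str:
        # Zero-padded timestamp plus a sequence number, so that sorting the
        # file names lexically gives arrival (FIFO) order. Hypothetical scheme.
        name = f"{time.time_ns():020d}-{self._seq:08d}.retry"
        self._seq += 1
        path = os.path.join(self.directory, name)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file
        return path

    def pending(self) -> List[str]:
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".retry"))
        return [os.path.join(self.directory, n) for n in names]

    def replay(self, send: Callable[[bytes], bool]) -> int:
        """Send files oldest first; stop at the first failure to keep ordering."""
        sent = 0
        for path in self.pending():
            with open(path, "rb") as f:
                payload = f.read()
            if not send(payload):
                break  # destination still unhealthy; retry later
            os.remove(path)
            sent += 1
        return sent
```

Stopping at the first failed send is one simple way to keep FIFO ordering; whether the real replay loop behaves the same way is not stated in this description.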

Motivation

Epic

During network slowdowns or complete outages, the logs pipeline drops payloads. The destination enters an infinite retry loop on the current payload, the DestinationSender buffer fills, and subsequent payloads are silently dropped. Customers lose log data with no recovery path. This change saves those payloads to disk and replays them when the network recovers.

Changes

New package: pkg/logs/sender/diskretry/

serialization.go: Binary payload serialization/deserialization with magic number, version header, and corruption detection
retrier.go: Retrier interface, DiskRetryManager (store, replay loop, disk capacity management, TTL expiry, startup reload), and noopRetrier for when disabled
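To illustrate what "magic number, version header, and corruption detection" can look like in practice, here is a minimal Python sketch. The byte layout, the `MAGIC` value, the use of CRC32, and all function names are assumptions made for illustration; the actual format in `serialization.go` may differ.

```python
import struct
import zlib

MAGIC = b"DRTY"  # hypothetical magic number, not the one used by the PR
VERSION = 1      # hypothetical format version
# Big-endian header: magic (4 bytes), version (u16), payload length (u32), CRC32 (u32)
HEADER = struct.Struct(">4sHII")


class CorruptPayloadError(ValueError):
    """Raised when a stored blob fails any integrity check."""


def serialize(payload: bytes) -> bytes:
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(MAGIC, VERSION, len(payload), crc) + payload


def deserialize(blob: bytes) -> bytes:
    if len(blob) < HEADER.size:
        raise CorruptPayloadError("truncated header")
    magic, version, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise CorruptPayloadError("bad magic number")
    if version != VERSION:
        raise CorruptPayloadError(f"unsupported version {version}")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise CorruptPayloadError("length mismatch")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise CorruptPayloadError("checksum mismatch")
    return payload
```

The magic number lets a reader reject unrelated files quickly, the version allows the layout to evolve, and the length plus checksum catch truncated or partially written files after a crash.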

Configuration

Key	Type	Default	Description
`logs_config.disk_retry.max_size_bytes`	int	`0`	Max disk space for retry files. `0` = disabled.
`logs_config.disk_retry.path`	string	`<run_path>/logs-retry`	Directory for retry files
`logs_config.disk_retry.max_disk_ratio`	float	`0.80`	Stop writing when the filesystem usage ratio exceeds this value
`logs_config.disk_retry.file_ttl_days`	int	`7`	Remove retry files older than this many days
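As a rough sketch of how the two capacity settings could interact, the Python function below combines them into a single write decision. The function names and the exact comparison rules are assumptions; only the setting names and defaults come from the table above.

```python
import shutil


def filesystem_used_ratio(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    # Treat an unreadable or zero-sized filesystem as full, to be safe.
    return usage.used / usage.total if usage.total else 1.0


def can_store(payload_size: int, used_bytes: int, max_size_bytes: int,
              fs_used_ratio: float, max_disk_ratio: float = 0.80) -> bool:
    """Hypothetical gate applied before writing a payload to the retry directory."""
    if max_size_bytes <= 0:
        return False  # max_size_bytes = 0 means the feature is disabled
    if used_bytes + payload_size > max_size_bytes:
        return False  # would exceed the retry directory budget
    if fs_used_ratio > max_disk_ratio:
        return False  # filesystem is already too full
    return True
```

For example, `can_store(100, 900, 1000, fs_used_ratio=0.5)` allows the write, while the same call with `fs_used_ratio=0.81` refuses it under the default ratio.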

Describe how you validated your changes

Manual QA:

Validated with real Datadog intake via local TCP proxy (disk-retry-qa-real.sh): confirmed during-outage logs appear in Log Explorer after recovery

Script:

#!/bin/bash
#
# Disk Retry QA Script -- Real Datadog Intake via Local Proxy
#
# Runs a local TCP proxy (localhost:18443) that forwards to Datadog's intake.
# The agent is configured to send through this proxy. Stopping/starting the
# proxy simulates a clean network outage.
#
# Flow:
#   1. Start proxy, agent sends logs through it     -> verify in Log Explorer
#   2. Kill proxy (simulates outage)                 -> payloads go to disk
#   3. Stop log writer, let pipeline drain
#   4. Restart proxy                                 -> payloads replay to Datadog
#
# Prerequisites:
#   - Build the agent: dda inv agent.build --build-exclude=systemd
#   - Valid API key in dev/dist/datadog.yaml
#   - datadog.yaml must set: logs_config.logs_dd_url: "localhost:18443"
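#   - disk_retry enabled (logs_config.disk_retry.max_size_bytes > 0), with
#     logs_config.disk_retry.path pointing at RETRY_DIR below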
#
# Usage:
#   chmod +x dev/disk-retry-qa-real.sh
#   ./dev/disk-retry-qa-real.sh

AGENT_BIN="./bin/agent/agent"
LOG_FILE="/tmp/test-disk-retry-real.log"
RETRY_DIR="/tmp/dd-logs-retry"
PROXY_PORT=18443
INTAKE_HOST="agent-http-intake.logs.datadoghq.com"
INTAKE_PORT=443
LOG_WRITER_PID=""
PROXY_PID=""

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
NC='\033[0m'

cleanup() {
    echo -e "\n${YELLOW}Cleaning up...${NC}"
    [ -n "$LOG_WRITER_PID" ] && kill "$LOG_WRITER_PID" 2>/dev/null
    [ -n "$PROXY_PID" ] && kill "$PROXY_PID" 2>/dev/null
    sleep 1
    echo "Done."
}
trap cleanup EXIT

# ── Helpers ──────────────────────────────────────────────────────────

start_proxy() {
    # TCP proxy: accepts plain TCP connections on localhost:$PROXY_PORT and
    # forwards them over TLS to the Datadog intake.
    # Uses socat if available, falls back to python.
    if command -v socat &>/dev/null; then
        socat TCP-LISTEN:$PROXY_PORT,fork,reuseaddr OPENSSL:$INTAKE_HOST:$INTAKE_PORT,verify=0 &
        PROXY_PID=$!
    else
        python3 -c "
import socket, ssl, threading, sys

def forward(src, dst):
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        # peer closed or reset the connection; fall through to cleanup
        pass
    finally:
        src.close()
        dst.close()

def handle(client):
    ctx = ssl.create_default_context()
    upstream = ctx.wrap_socket(socket.socket(), server_hostname='$INTAKE_HOST')
    upstream.connect(('$INTAKE_HOST', $INTAKE_PORT))
    t1 = threading.Thread(target=forward, args=(client, upstream), daemon=True)
    t2 = threading.Thread(target=forward, args=(upstream, client), daemon=True)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', $PROXY_PORT))
server.listen(32)
print(f'TCP proxy listening on :$PROXY_PORT -> $INTAKE_HOST:$INTAKE_PORT', flush=True)
while True:
    client, addr = server.accept()
    threading.Thread(target=handle, args=(client,), daemon=True).start()
" &
        PROXY_PID=$!
    fi
    sleep 2
    echo -e "${GREEN}Proxy started on :$PROXY_PORT -> $INTAKE_HOST:$INTAKE_PORT${NC}"
}

stop_proxy() {
    if [ -n "$PROXY_PID" ]; then
        kill "$PROXY_PID" 2>/dev/null
        sleep 1
        PROXY_PID=""
    fi
}

start_log_writer() {
    local tag=$1
    (
        i=0
        while true; do
            echo "{\"message\": \"disk-retry-qa $tag line $i\", \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\", \"phase\": \"$tag\"}" >> "$LOG_FILE"
            i=$((i + 1))
            sleep 1
        done
    ) &
    LOG_WRITER_PID=$!
}

stop_log_writer() {
    if [ -n "$LOG_WRITER_PID" ]; then
        kill "$LOG_WRITER_PID" 2>/dev/null
        sleep 1
        LOG_WRITER_PID=""
    fi
}

retry_file_count() {
    if [ -d "$RETRY_DIR" ]; then
        sudo find "$RETRY_DIR" -name "*.retry" 2>/dev/null | wc -l | tr -d ' '
    else
        echo "0"
    fi
}

retry_dir_size() {
    if [ -d "$RETRY_DIR" ]; then
        sudo du -sh "$RETRY_DIR" 2>/dev/null | cut -f1
    else
        echo "0"
    fi
}

# ── Main ─────────────────────────────────────────────────────────────

echo "============================================"
echo "  Disk Retry QA -- Real Datadog via Proxy"
echo "============================================"
echo ""

# Clean state
rm -f "$LOG_FILE" 2>/dev/null
sudo rm -rf "$RETRY_DIR" 2>/dev/null
touch "$LOG_FILE"

# Check agent binary
if [ ! -x "$AGENT_BIN" ]; then
    echo -e "${RED}Agent binary not found at $AGENT_BIN${NC}"
    echo "Build it first: dda inv agent.build --build-exclude=systemd"
    exit 1
fi

# ── Phase 1: Normal operation ────────────────────────────────────────
echo -e "${GREEN}=== Phase 1: Normal operation ===${NC}"
echo "Starting proxy..."
start_proxy

echo "Starting log writer (1 line/sec, tagged phase=pre-outage)..."
start_log_writer "pre-outage"

echo ""
echo -e "${YELLOW}Start the agent in another terminal:${NC}"
echo ""
echo "  cd $(pwd)"
echo "  sudo $AGENT_BIN run -c dev/dist/datadog.yaml"
echo ""
echo "Check Datadog Log Explorer for: service:disk-retry-test"
echo ""
read -r -p "Press ENTER once the agent is running and you see logs in Datadog... "

echo ""
echo -e "${CYAN}Retry dir: $(retry_file_count) files, $(retry_dir_size)${NC}"

# ── Phase 2: Simulate outage (45 seconds) ───────────────────────────
echo ""
echo -e "${RED}=== Phase 2: Killing proxy (45s outage) ===${NC}"
stop_log_writer
start_log_writer "during-outage"

stop_proxy
echo "Proxy killed. Agent cannot reach Datadog."
echo "Logs tagged 'during-outage' being written. Payloads should go to disk."
echo ""

for i in $(seq 1 45); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%2ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"
    sleep 1
done

PHASE2_FINAL=$(retry_file_count)
PHASE2_SIZE=$(retry_dir_size)
echo ""
if [ "$PHASE2_FINAL" -gt 0 ] 2>/dev/null; then
    echo -e "${GREEN}Phase 2 result: $PHASE2_FINAL files, $PHASE2_SIZE on disk${NC}"
else
    echo -e "${RED}Phase 2 result: NO retry files -- check agent logs${NC}"
fi

# ── Phase 3: Stop log writer ─────────────────────────────────────────
echo ""
echo -e "${YELLOW}=== Phase 3: Stopping log writer ===${NC}"
stop_log_writer
echo "Waiting 15s for pipeline to flush to disk..."
echo ""

for i in $(seq 1 15); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%2ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"
    sleep 1
done

PRE_RECOVERY=$(retry_file_count)
PRE_RECOVERY_SIZE=$(retry_dir_size)
echo ""
echo -e "${CYAN}Files before recovery: $PRE_RECOVERY ($PRE_RECOVERY_SIZE)${NC}"

# ── Phase 4: Recovery ────────────────────────────────────────────────
echo ""
echo -e "${GREEN}=== Phase 4: Restarting proxy + new logs ===${NC}"
start_log_writer "post-recovery"
start_proxy

echo "Watching for drain (5 minutes max)..."
echo ""

LAST_COUNT=$PRE_RECOVERY
for i in $(seq 1 300); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%3ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"

    if [ "$count" = "0" ]; then
        echo ""
        echo -e "  ${GREEN}All retry files replayed at ${i}s!${NC}"
        break
    fi

    if [ "$count" -lt "$LAST_COUNT" ] 2>/dev/null; then
        echo -e "  ${CYAN}  ^ files decreasing -- replay active${NC}"
    fi
    LAST_COUNT=$count

    sleep 1
done

# ── Summary ──────────────────────────────────────────────────────────
echo ""
echo "============================================"
echo "  QA Summary"
echo "============================================"
FINAL_COUNT=$(retry_file_count)
FINAL_SIZE=$(retry_dir_size)
echo "  Phase 2 (outage):   $PHASE2_FINAL files ($PHASE2_SIZE)"
echo "  Pre-recovery:       $PRE_RECOVERY files ($PRE_RECOVERY_SIZE)"
echo "  Final:              $FINAL_COUNT files ($FINAL_SIZE)"
echo ""
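# Check the fully drained case first so it is not masked by the partial case,
# and never report PASS when nothing was written to disk during the outage.
if [ "$PHASE2_FINAL" -gt 0 ] 2>/dev/null && [ "$FINAL_COUNT" = "0" ]; then
    echo -e "  ${GREEN}PASS: All payloads replayed to Datadog${NC}"
elif [ "$PHASE2_FINAL" -gt 0 ] 2>/dev/null && [ "$FINAL_COUNT" -lt "$PRE_RECOVERY" ] 2>/dev/null; then
    echo -e "  ${YELLOW}PARTIAL: Payloads stored and replaying, may need more time${NC}"
else
    echo -e "  ${RED}FAIL: No payloads stored or no replay observed -- check agent logs${NC}"
fi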
echo ""
echo "Check Datadog Log Explorer for all three phases:"
echo "  service:disk-retry-test @phase:pre-outage"
echo "  service:disk-retry-test @phase:during-outage  (replayed from disk)"
echo "  service:disk-retry-test @phase:post-recovery"
echo ""
echo "You can now stop the agent."

Screenshots of Local Output (The replayed payloads appeared on the Logs Explorer as well):

This was my datadog.yaml:

Additional Notes

Replayed payloads use a minimal Origin with empty Identifier so the auditor safely skips registry updates without panicking
Files persist across agent restarts and are reloaded on startup
Replay throughput is currently bounded by the worker queue size; optimization is planned as a follow-up
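The restart behavior noted above (files persist and are reloaded on startup, with TTL applied) could look roughly like the following Python sketch. The function name, the use of file modification time as the age, and the sort-by-name ordering are assumptions; the `.retry` extension matches the QA script and the 7-day default matches the configuration table.

```python
import os
import time
from typing import List, Optional


def reload_retry_files(directory: str, ttl_days: int = 7,
                       now: Optional[float] = None) -> List[str]:
    """Return surviving retry files in name order, deleting expired ones."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * 86400  # seconds in a day
    kept = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".retry"):
            continue  # ignore temp files and anything unrelated
        path = os.path.join(directory, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)  # older than the TTL: drop instead of replaying
            continue
        kept.append(path)
    return kept
```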

agent-platform-auto-pr · 2026-03-20T19:43:55Z

Go Package Import Differences

Baseline: a1f24cb
Comparison: 3fa04a4

binary	os	arch	change
agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
iot-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
iot-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
heroku-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent-cloudfoundry	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent-cloudfoundry	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
dogstatsd	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
dogstatsd	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
heroku-process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
otel-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
otel-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry

agent-platform-auto-pr · 2026-03-20T19:59:02Z

Files inventory check summary

File checks results against ancestor a1f24cbd:

Results for datadog-agent_7.79.0~devel.git.170.3fa04a4.pipeline.104831942-1_amd64.deb:

No change detected


agent-platform-auto-pr · 2026-03-25T17:48:01Z

Static quality checks

❌ Please find below the results from static quality gates
Comparison made with ancestor a1f24cb
📊 Static Quality Gates Dashboard
🔗 SQG Job

Error

	Quality gate	Change	Size in MiB (prev → curr → max)
❌	iot_agent_deb_amd64 (on disk)	+32.03 KiB (0.07% increase)	43.270 → 43.301 → 43.290
❌	iot_agent_rpm_amd64 (on disk)	+32.03 KiB (0.07% increase)	43.270 → 43.301 → 43.290
❌	iot_agent_suse_amd64 (on disk)	+32.03 KiB (0.07% increase)	43.270 → 43.301 → 43.290

Gate failure full details

Quality gate	Error type	Error message
iot_agent_deb_amd64	StaticQualityGateFailed	static_quality_gate_iot_agent_deb_amd64 failed! Disk size 43.3 MB exceeds limit of 43.3 MB by 11.2 KB
iot_agent_rpm_amd64	StaticQualityGateFailed	static_quality_gate_iot_agent_rpm_amd64 failed! Disk size 43.3 MB exceeds limit of 43.3 MB by 11.7 KB
iot_agent_suse_amd64	StaticQualityGateFailed	static_quality_gate_iot_agent_suse_amd64 failed! Disk size 43.3 MB exceeds limit of 43.3 MB by 11.7 KB

Static quality gates prevent the PR from merging!
You can check the static quality gates confluence page for guidance. We also have a toolbox page available to list tools useful to debug the size increase.

Successful checks

Info

	Quality gate	Change	Size in MiB (prev → curr → max)
✅	agent_deb_amd64	+179.38 KiB (0.02% increase)	752.502 → 752.677 → 753.380
✅	agent_deb_amd64_fips	+135.44 KiB (0.02% increase)	709.528 → 709.660 → 713.900
✅	agent_heroku_amd64	+32.03 KiB (0.01% increase)	313.289 → 313.320 → 320.580
✅	agent_rpm_amd64	+179.38 KiB (0.02% increase)	752.486 → 752.661 → 753.350
✅	agent_rpm_amd64_fips	+135.44 KiB (0.02% increase)	709.512 → 709.644 → 713.880
✅	agent_rpm_arm64	+164.78 KiB (0.02% increase)	730.931 → 731.092 → 735.290
✅	agent_rpm_arm64_fips	+124.65 KiB (0.02% increase)	690.966 → 691.088 → 696.840
✅	agent_suse_amd64	+179.38 KiB (0.02% increase)	752.486 → 752.661 → 753.350
✅	agent_suse_amd64_fips	+135.44 KiB (0.02% increase)	709.512 → 709.644 → 713.880
✅	agent_suse_arm64	+164.78 KiB (0.02% increase)	730.931 → 731.092 → 735.290
✅	agent_suse_arm64_fips	+124.65 KiB (0.02% increase)	690.966 → 691.088 → 696.840
✅	docker_agent_amd64	+179.38 KiB (0.02% increase)	812.805 → 812.980 → 815.700
✅	docker_agent_arm64	+164.77 KiB (0.02% increase)	816.020 → 816.181 → 821.970
✅	docker_agent_jmx_amd64	+179.38 KiB (0.02% increase)	1003.720 → 1003.895 → 1006.580
✅	docker_agent_jmx_arm64	+164.77 KiB (0.02% increase)	995.714 → 995.875 → 1001.570
✅	docker_cluster_agent_amd64	+32.03 KiB (0.02% increase)	203.930 → 203.961 → 206.270
✅	docker_cluster_agent_arm64	+64.03 KiB (0.03% increase)	218.358 → 218.420 → 220.000
✅	docker_dogstatsd_amd64	+32.0 KiB (0.08% increase)	39.238 → 39.270 → 39.380
✅	docker_dogstatsd_arm64	+64.0 KiB (0.17% increase)	37.445 → 37.507 → 37.940
✅	dogstatsd_deb_amd64	+32.0 KiB (0.10% increase)	29.881 → 29.913 → 30.610
✅	dogstatsd_deb_arm64	+36.0 KiB (0.13% increase)	28.030 → 28.065 → 29.110
✅	dogstatsd_rpm_amd64	+32.0 KiB (0.10% increase)	29.881 → 29.913 → 30.610
✅	dogstatsd_suse_amd64	+32.0 KiB (0.10% increase)	29.881 → 29.913 → 30.610
✅	iot_agent_deb_arm64	+28.03 KiB (0.07% increase)	40.320 → 40.348 → 40.920
✅	iot_agent_deb_armhf	+32.02 KiB (0.08% increase)	41.064 → 41.095 → 41.100

2 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	docker_cws_instrumentation_amd64	7.142 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB

On-wire sizes (compressed)

	Quality gate	Change	Size in MiB (prev → curr → max)
❌	iot_agent_deb_amd64	+10.82 KiB (0.09% increase)	11.399 → 11.409 → 12.040
❌	iot_agent_rpm_amd64	+5.07 KiB (0.04% increase)	11.419 → 11.424 → 12.060
❌	iot_agent_suse_amd64	+5.07 KiB (0.04% increase)	11.419 → 11.424 → 12.060
✅	agent_deb_amd64	+60.77 KiB (0.03% increase)	174.739 → 174.798 → 178.360
✅	agent_deb_amd64_fips	-5.15 KiB (0.00% reduction)	165.337 → 165.332 → 172.790
✅	agent_heroku_amd64	neutral	75.004 MiB → 79.970
✅	agent_rpm_amd64	+49.93 KiB (0.03% increase)	177.517 → 177.566 → 181.830
✅	agent_rpm_amd64_fips	+64.22 KiB (0.04% increase)	167.570 → 167.632 → 173.370
✅	agent_rpm_arm64	+142.7 KiB (0.09% increase)	159.463 → 159.602 → 163.060
✅	agent_rpm_arm64_fips	+110.02 KiB (0.07% increase)	151.329 → 151.436 → 156.170
✅	agent_suse_amd64	+49.93 KiB (0.03% increase)	177.517 → 177.566 → 181.830
✅	agent_suse_amd64_fips	+64.22 KiB (0.04% increase)	167.570 → 167.632 → 173.370
✅	agent_suse_arm64	+142.7 KiB (0.09% increase)	159.463 → 159.602 → 163.060
✅	agent_suse_arm64_fips	+110.02 KiB (0.07% increase)	151.329 → 151.436 → 156.170
✅	docker_agent_amd64	+57.2 KiB (0.02% increase)	268.097 → 268.152 → 272.480
✅	docker_agent_arm64	+116.18 KiB (0.04% increase)	255.266 → 255.379 → 261.060
✅	docker_agent_jmx_amd64	+63.1 KiB (0.02% increase)	336.744 → 336.806 → 341.100
✅	docker_agent_jmx_arm64	+98.83 KiB (0.03% increase)	319.907 → 320.004 → 325.620
✅	docker_cluster_agent_amd64	+12.69 KiB (0.02% increase)	71.366 → 71.378 → 72.920
✅	docker_cluster_agent_arm64	+9.24 KiB (0.01% increase)	66.996 → 67.005 → 68.220
✅	docker_cws_instrumentation_amd64	neutral	2.999 MiB → 3.330
✅	docker_cws_instrumentation_arm64	neutral	2.729 MiB → 3.090
✅	docker_dogstatsd_amd64	+11.38 KiB (0.07% increase)	15.173 → 15.184 → 15.820
✅	docker_dogstatsd_arm64	+14.53 KiB (0.10% increase)	14.487 → 14.502 → 14.830
✅	dogstatsd_deb_amd64	+6.76 KiB (0.08% increase)	7.894 → 7.901 → 8.790
✅	dogstatsd_deb_arm64	+10.4 KiB (0.15% increase)	6.777 → 6.787 → 7.710
✅	dogstatsd_rpm_amd64	+7.72 KiB (0.10% increase)	7.903 → 7.911 → 8.800
✅	dogstatsd_suse_amd64	+7.72 KiB (0.10% increase)	7.903 → 7.911 → 8.800
✅	iot_agent_deb_arm64	+8.72 KiB (0.09% increase)	9.702 → 9.710 → 10.450
✅	iot_agent_deb_armhf	+5.5 KiB (0.05% increase)	9.940 → 9.945 → 10.620

cit-pr-commenter-54b7da · 2026-03-25T18:10:55Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 8f0fd8b3-3867-4e24-9faf-bbadc4247f99

Baseline: a1f24cb
Comparison: 3fa04a4
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
❌	docker_containers_cpu	% cpu utilization	+6.86	[+3.78, +9.95]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
❌	docker_containers_cpu	% cpu utilization	+6.86	[+3.78, +9.95]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+1.28	[+1.12, +1.43]	1	Logs
➖	quality_gate_metrics_logs	memory utilization	+0.60	[+0.37, +0.83]	1	Logs bounds checks dashboard
➖	otlp_ingest_metrics	memory utilization	+0.54	[+0.38, +0.70]	1	Logs
➖	file_tree	memory utilization	+0.44	[+0.39, +0.50]	1	Logs
➖	docker_containers_memory	memory utilization	+0.41	[+0.32, +0.49]	1	Logs
➖	ddot_metrics	memory utilization	+0.19	[+0.02, +0.37]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+0.05	[-0.12, +0.23]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	+0.04	[-0.10, +0.18]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.01	[-0.38, +0.40]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	+0.01	[-0.20, +0.21]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.10, +0.11]	1	Logs
➖	quality_gate_logs	% cpu utilization	+0.00	[-1.61, +1.62]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	-0.00	[-0.23, +0.22]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.01	[-0.10, +0.07]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.01	[-0.22, +0.19]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.01	[-0.51, +0.48]	1	Logs
➖	otlp_ingest_logs	memory utilization	-0.03	[-0.13, +0.07]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.04	[-0.48, +0.40]	1	Logs
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.08	[-0.13, -0.02]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.10	[-0.14, -0.07]	1	Logs bounds checks dashboard
➖	quality_gate_idle	memory utilization	-0.32	[-0.38, -0.27]	1	Logs bounds checks dashboard
➖	ddot_logs	memory utilization	-0.38	[-0.44, -0.32]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	observed_value	links
✅	docker_containers_cpu	simple_check_run	10/10	703 ≥ 26
✅	docker_containers_memory	memory_usage	10/10	273.55MiB ≤ 370MiB
✅	docker_containers_memory	simple_check_run	10/10	560 ≥ 26
✅	file_to_blackhole_0ms_latency	memory_usage	10/10	0.19GiB ≤ 1.20GiB
✅	file_to_blackhole_0ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10	0.23GiB ≤ 1.20GiB
✅	file_to_blackhole_1000ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_100ms_latency	memory_usage	10/10	0.20GiB ≤ 1.20GiB
✅	file_to_blackhole_100ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_500ms_latency	memory_usage	10/10	0.21GiB ≤ 1.20GiB
✅	file_to_blackhole_500ms_latency	missed_bytes	10/10	0B = 0B
✅	quality_gate_idle	intake_connections	10/10	3 = 3	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	173.61MiB ≤ 175MiB	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	3 = 3	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	495.82MiB ≤ 550MiB	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	203.05MiB ≤ 220MiB	bounds checks dashboard
✅	quality_gate_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	358.52 ≤ 2000	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	419.49MiB ≤ 475MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
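The three criteria above can be written as a small predicate. This Python sketch only restates the rule from this comment; the function name and signature are illustrative and are not part of the Regression Detector.

```python
from typing import Tuple


def is_regression(delta_mean_pct: float, ci: Tuple[float, float],
                  erratic: bool, tolerance_pct: float = 5.0) -> bool:
    """Apply the three criteria: effect size, CI excludes zero, not erratic."""
    low, high = ci
    big_enough = abs(delta_mean_pct) >= tolerance_pct
    ci_excludes_zero = low > 0 or high < 0
    return big_enough and ci_excludes_zero and not erratic
```

Applied to the table above, `docker_containers_cpu` (+6.86, CI [+3.78, +9.95]) meets the first two criteria but is listed among the experiments ignored for regressions, so `is_regression(6.86, (3.78, 9.95), erratic=True)` is false. `tcp_syslog_to_blackhole` (+1.28, CI [+1.12, +1.43]) has a CI that excludes zero but falls below the 5.00% tolerance, so it is not flagged either.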

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd249d52a

pkg/logs/sender/worker.go

pkg/logs/sender/diskretry/retrier.go

pkg/logs/sender/diskretry/serialization.go

s-alad

lgtm for agent-config file

Implement Journaling Payload to Disk

d5942dc

github-actions bot added the long review (PR is complex, plan time to review it) label Mar 20, 2026

dd-octo-sts bot added the internal (Identify a non-fork PR), team/agent-log-pipelines, and team/agent-configuration labels Mar 20, 2026

angel-ddog added this to the 7.79.0 milestone Mar 20, 2026

angel-ddog added 3 commits March 24, 2026 15:53

Fixed linter issues

6f461bc

Merge branch 'main' of github.com:DataDog/datadog-agent into angel/lo…

9895dbc

…gs-to-disk merged main into my branch

copyright linter error, and go mod lint fixed. Including reno release…

569c354

… with this.

angel-ddog added the qa/done (QA done before merge and regressions are covered by tests) label Mar 25, 2026

angel-ddog added 2 commits March 25, 2026 15:39

Merge branch 'main' of github.com:DataDog/datadog-agent into angel/lo…

ab419b4

…gs-to-disk merging into my branch

Bug Fix: Retried payloads now replay for dual shipping and single shi…

4fd249d

…pping

angel-ddog changed the title ~~Implement Journaling Payload to Disk~~ [logs] Implement Journaling Payload to Disk for Network Outages Mar 25, 2026

angel-ddog marked this pull request as ready for review March 25, 2026 20:23

angel-ddog requested review from a team as code owners March 25, 2026 20:23

angel-ddog requested a review from s-alad March 25, 2026 20:23

chatgpt-codex-connector bot reviewed Mar 25, 2026

pkg/logs/sender/worker.go (resolved)

pkg/logs/sender/diskretry/retrier.go (resolved)

pkg/logs/sender/diskretry/serialization.go (resolved)

s-alad approved these changes Mar 26, 2026

Fixed worker duplicate payload, added TTL on reload of live payloads,…

3fa04a4

… added message count bound to prevent OOM

angel-ddog wants to merge 7 commits into main from angel/logs-to-disk
