feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement #469

Open
IngmarVG-IB wants to merge 4 commits into NVIDIA:main from IngmarVG-IB:feat/dns-aid-discovery

IngmarVG-IB commented Mar 27, 2026

Summary

  • Adds DNS-AID agent discovery via SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01
  • Adds OpenShell caller-side policy enforcement (Layer 1) consuming dns-aid-core's PolicyDocument schema
  • Integrates both into the /v1/agents endpoint with feature gating

Story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it."

Motivation / Context

AICR agents in Kubernetes need a standard discovery layer to find peers and a security boundary to control which connections are permitted. DNS-AID provides discovery via DNS SVCB lookups; OpenShell evaluates target agents' published policy documents before allowing the calling agent to connect.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • API server (/v1/agents endpoint)
  • Core libraries (pkg/discovery, pkg/openshell, pkg/errors, pkg/defaults)

Implementation Notes

DNS-AID Discovery (pkg/discovery/)

  • Discoverer: DNS SVCB queries for _{name}._{protocol}._agents.{domain}
  • Publisher: K8s ConfigMap-based agent registration with create-or-update semantics
  • TXT-based index at _agents.{domain} for listing all agents
  • Server lifecycle hooks (OnStart/OnShutdown) for auto-publish/deregister
  • Feature-gated via AICR_DISCOVERY_ENABLED=true
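For reviewers who want to see the naming scheme concretely, here is a minimal sketch of how the two owner names above could be assembled. The helper names, the trailing-dot normalization, and the `billing`/`mcp` example values are assumptions for illustration, not code from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// agentQueryName builds the SVCB owner name for a single agent using the
// _{name}._{protocol}._agents.{domain} pattern from the description.
// Hypothetical helper: pkg/discovery may structure this differently.
func agentQueryName(name, protocol, domain string) string {
	domain = strings.TrimSuffix(domain, ".") // accept both "example.com" and "example.com."
	return fmt.Sprintf("_%s._%s._agents.%s.", name, protocol, domain)
}

// indexQueryName builds the owner name of the TXT index that lists all agents.
func indexQueryName(domain string) string {
	return "_agents." + strings.TrimSuffix(domain, ".") + "."
}

func main() {
	// Example values are made up for the sketch.
	fmt.Println(agentQueryName("billing", "mcp", "example.com")) // _billing._mcp._agents.example.com.
	fmt.Println(indexQueryName("example.com"))                   // _agents.example.com.
}
```

Returning fully qualified names (with the trailing dot) keeps resolver search domains from being appended, which is one reason a sketch like this normalizes the input first.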

OpenShell Policy Enforcement (pkg/openshell/)

  • All 16 native policy rules matching dns-aid-core's schema
  • Three enforcement modes: strict (deny on violation), permissive (log + allow, default), disabled
  • Fail-open on policy fetch errors (matches dns-aid-core behavior)
  • SSRF-protected fetcher: HTTPS-only, rejects private/loopback IPs
  • TTL-based cache (5m default) with bounded eviction (max 1024 entries) and singleflight coalescing
  • Enforcement layer filtering: each rule is tagged Layer 0, 1, or 2, and OpenShell evaluates only the rules applicable to Layer 1
  • Realm isolation detection with cross-realm access logging
  • CEL custom rules deferred to future phase (logged as unsupported)
  • Controlled via OPENSHELL_MODE env var with startup validation
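The interaction between the three modes and fail-open behavior can be summarized in a small decision function. The sketch below is an illustration only: the type and function names are hypothetical, treating an empty `OPENSHELL_MODE` as permissive and matching values case-sensitively are both assumptions, and the real Guard in pkg/openshell may differ.

```go
package main

import (
	"fmt"
	"os"
)

// Mode is a hypothetical type for the three enforcement modes.
type Mode string

const (
	ModeStrict     Mode = "strict"
	ModePermissive Mode = "permissive"
	ModeDisabled   Mode = "disabled"
)

// ParseMode validates the raw OPENSHELL_MODE value at startup.
// Assumption: an unset variable falls back to permissive, the stated default.
func ParseMode(raw string) (Mode, error) {
	switch Mode(raw) {
	case "":
		return ModePermissive, nil
	case ModeStrict, ModePermissive, ModeDisabled:
		return Mode(raw), nil
	default:
		return "", fmt.Errorf("invalid OPENSHELL_MODE %q (want strict, permissive, or disabled)", raw)
	}
}

// Allow reports whether a connection may proceed, given the mode, the policy
// fetch error (nil on success), and the number of rule violations found.
func Allow(mode Mode, fetchErr error, violations int) bool {
	if mode == ModeDisabled {
		return true // no evaluation at all
	}
	if fetchErr != nil {
		return true // fail-open: an unreachable policy never blocks the caller
	}
	if violations == 0 {
		return true
	}
	return mode != ModeStrict // permissive logs and allows; strict denies
}

func main() {
	mode, err := ParseMode(os.Getenv("OPENSHELL_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // refuse to start with an unknown mode
	}
	fmt.Println("mode:", mode, "allow with 2 violations:", Allow(mode, nil, 2))
}
```

Validating at startup rather than per request means a typo in the environment variable surfaces immediately instead of silently degrading to a weaker mode.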

Integration

  • /v1/agents evaluates OpenShell policy for each discovered agent with a policy URI
  • Guard accepts nil, so there is zero behavioral change when discovery is disabled
  • New error codes: POLICY_DENIED, POLICY_FETCH
  • New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)
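As a rough illustration of the fetcher these defaults bound, the sketch below shows one way an HTTPS-only, private/loopback-rejecting URL check could look. It is not the PR's implementation: `checkPolicyURL` and `isBlockedIP` are hypothetical names, reading 64KB as 64 KiB is an assumption, and blocking link-local and unspecified addresses goes beyond what the description states.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"time"
)

// Values from the PR description; 64KB is read as 64 KiB (assumption).
const (
	PolicyFetchTimeout = 3 * time.Second
	PolicyCacheTTL     = 5 * time.Minute
	PolicyMaxBytes     = 64 << 10
)

var errBlockedURL = errors.New("policy URL rejected")

// isBlockedIP reports whether ip is in a range the fetcher should not contact.
// Loopback and private come from the description; the rest are added assumptions.
func isBlockedIP(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsPrivate() ||
		ip.IsLinkLocalUnicast() || ip.IsUnspecified()
}

// checkPolicyURL enforces HTTPS and rejects literal IPs in blocked ranges.
// Hostnames still need the same check on every resolved address at dial time.
func checkPolicyURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("%w: %v", errBlockedURL, err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("%w: scheme %q is not https", errBlockedURL, u.Scheme)
	}
	host := u.Hostname()
	if host == "" {
		return fmt.Errorf("%w: missing host", errBlockedURL)
	}
	if ip := net.ParseIP(host); ip != nil && isBlockedIP(ip) {
		return fmt.Errorf("%w: address %s is in a blocked range", errBlockedURL, ip)
	}
	return nil
}

func main() {
	fmt.Println("timeout:", PolicyFetchTimeout, "ttl:", PolicyCacheTTL, "max bytes:", PolicyMaxBytes)
	for _, raw := range []string{
		"https://example.com/policy.json",
		"http://example.com/policy.json",
		"https://127.0.0.1/policy.json",
		"https://10.0.0.5/policy.json",
	} {
		fmt.Printf("%-35s -> %v\n", raw, checkPolicyURL(raw))
	}
}
```

A URL-level check like this is only the first line of defense. A hostname that resolves to a private address would pass it, so a complete fetcher would also validate the resolved IP when the connection is dialed.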

Key Design Decisions

  • Fail-open: If a policy document can't be fetched, the connection is allowed. This matches dns-aid-core and is correct for a discovery system where policy is advisory at Layer 1.
  • No Python dependency: OpenShell is a pure Go reimplementation consuming the same JSON schema as dns-aid-core's Python SDK.
  • Numeric TLS version comparison: Parses major.minor as integers (not lexicographic) to correctly handle versions like 1.12.
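To make the TLS version point concrete, here is a minimal sketch of a numeric major.minor comparison next to the lexicographic comparison it replaces. The function names are hypothetical and the evaluator's actual code may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTLSVersion splits a "major.minor" string into two integers.
func parseTLSVersion(v string) (major, minor int, err error) {
	majStr, minStr, ok := strings.Cut(v, ".")
	if !ok {
		return 0, 0, fmt.Errorf("malformed TLS version %q", v)
	}
	if major, err = strconv.Atoi(majStr); err != nil {
		return 0, 0, fmt.Errorf("malformed TLS major in %q: %w", v, err)
	}
	if minor, err = strconv.Atoi(minStr); err != nil {
		return 0, 0, fmt.Errorf("malformed TLS minor in %q: %w", v, err)
	}
	return major, minor, nil
}

// tlsAtLeast reports whether got >= minimum, comparing major then minor as integers.
func tlsAtLeast(got, minimum string) (bool, error) {
	gMaj, gMin, err := parseTLSVersion(got)
	if err != nil {
		return false, err
	}
	mMaj, mMin, err := parseTLSVersion(minimum)
	if err != nil {
		return false, err
	}
	if gMaj != mMaj {
		return gMaj > mMaj, nil
	}
	return gMin >= mMin, nil
}

func main() {
	ok, _ := tlsAtLeast("1.12", "1.3")
	fmt.Println("numeric:       1.12 >= 1.3 ->", ok)             // true
	fmt.Println("lexicographic: 1.12 >= 1.3 ->", "1.12" >= "1.3") // false
}
```

The string comparison fails because it compares the characters `1` and `3` after the dot and stops there, so a double-digit minor version sorts below a single-digit one.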

Testing

make test  # all tests pass with -race
  • pkg/openshell: 84.2% coverage (30 tests)
    • Evaluator: 15 table-driven tests covering all 16 rules, layer filtering, multiple violations
    • Fetcher: SSRF validation, caching, content-type/size rejection, non-200 handling
    • Guard: all 3 modes, fail-open, compliant caller, mode validation
    • Availability: midnight wrap, timezone handling, malformed input fail-open
    • TLS version: numeric comparison including double-digit minor versions
  • pkg/discovery: Mock DNS server tests for SVCB parsing, index listing, publisher
  • pkg/api: Handler tests with nil guard, domain validation, empty index
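The "midnight wrap" case in the availability tests is easy to get wrong, so a small sketch may help reviewers. The PR does not show the availability schema; representing a window as start and end minutes after midnight in a given time zone is an assumption made purely for illustration, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// minuteInWindow reports whether now lies in the daily window [start, end),
// all expressed as minutes after midnight. If start > end, the window wraps
// past midnight (for example 22:00 to 06:00).
func minuteInWindow(now, start, end int) bool {
	if start <= end {
		return now >= start && now < end
	}
	return now >= start || now < end
}

// inWindow converts t into loc before checking, so the window is interpreted
// in the policy's time zone rather than the caller's.
func inWindow(t time.Time, start, end int, loc *time.Location) bool {
	local := t.In(loc)
	return minuteInWindow(local.Hour()*60+local.Minute(), start, end)
}

func main() {
	// A fixed-offset zone avoids depending on the system tz database.
	loc := time.FixedZone("UTC+1", 3600)
	t := time.Date(2026, 3, 27, 22, 30, 0, 0, time.UTC) // 23:30 in UTC+1

	fmt.Println(inWindow(t, 22*60, 6*60, loc))      // true: inside the 22:00-06:00 window
	fmt.Println(inWindow(t, 9*60, 17*60, loc))      // false: outside 09:00-17:00
	fmt.Println(inWindow(t, 22*60, 6*60, time.UTC)) // true: 22:30 UTC is also inside
}
```

Splitting the pure minute arithmetic from the time zone conversion keeps the wrap logic trivially table-testable.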

Risk Assessment

Medium — New packages with no existing consumers beyond the /v1/agents endpoint. Feature-gated behind AICR_DISCOVERY_ENABLED and OPENSHELL_MODE env vars, so zero impact on existing functionality when disabled. Default mode is permissive (log-only), providing a safe rollout path.

Checklist

  • Tests pass locally with -race
  • Linter passes (go vet clean, gofmt clean)
  • No tests skipped or disabled
  • New tests added for new functionality
  • Changes follow existing patterns (functional options, pkg/errors, slog, pkg/defaults)
  • Commits are cryptographically signed (git commit -S)

Introduce pkg/discovery for DNS-AID based agent discovery using SVCB
records (RFC 9460) with private-use SvcParamKeys per
draft-mozleywilliams-dnsop-dnsaid-01. Agents publish themselves via K8s
ConfigMaps and discover peers through DNS lookups.

- pkg/discovery: Discoverer (DNS SVCB resolver), Publisher (K8s
  ConfigMap-based registration), SVCB parser for keys 65400-65408
- pkg/server: WithOnStart/WithOnShutdown lifecycle hooks as functional
  options, executed during server start/shutdown respectively
- pkg/api: /v1/agents endpoint listing discovered agents in a domain,
  with input validation and feature-gated via AICR_DISCOVERY_ENABLED
- pkg/defaults: discovery timeout constants and ServerDefaultPort
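For context on the lifecycle hooks, the following is a minimal sketch of how `WithOnStart`/`WithOnShutdown` functional options might be wired. Only the two option names come from this commit message; the `Server`, `Hook`, and `Option` types, the hook signature, and the error-handling policy are assumptions.

```go
package main

import (
	"context"
	"fmt"
)

// Hook is a hypothetical lifecycle callback signature.
type Hook func(ctx context.Context) error

// Server is a stand-in for the real pkg/server type.
type Server struct {
	onStart    []Hook
	onShutdown []Hook
}

// Option configures a Server at construction time.
type Option func(*Server)

func WithOnStart(h Hook) Option {
	return func(s *Server) { s.onStart = append(s.onStart, h) }
}

func WithOnShutdown(h Hook) Option {
	return func(s *Server) { s.onShutdown = append(s.onShutdown, h) }
}

func NewServer(opts ...Option) *Server {
	s := &Server{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// Start runs start hooks in registration order and stops at the first error.
func (s *Server) Start(ctx context.Context) error {
	for _, h := range s.onStart {
		if err := h(ctx); err != nil {
			return fmt.Errorf("on-start hook: %w", err)
		}
	}
	return nil
}

// Shutdown runs every shutdown hook even if one fails and returns the first error.
func (s *Server) Shutdown(ctx context.Context) error {
	var first error
	for _, h := range s.onShutdown {
		if err := h(ctx); err != nil && first == nil {
			first = fmt.Errorf("on-shutdown hook: %w", err)
		}
	}
	return first
}

// runDemo registers hypothetical publish/deregister hooks and records their order.
func runDemo() []string {
	var events []string
	srv := NewServer(
		WithOnStart(func(context.Context) error { events = append(events, "publish"); return nil }),
		WithOnShutdown(func(context.Context) error { events = append(events, "deregister"); return nil }),
	)
	ctx := context.Background()
	if err := srv.Start(ctx); err != nil {
		events = append(events, "start failed")
	}
	if err := srv.Shutdown(ctx); err != nil {
		events = append(events, "shutdown failed")
	}
	return events
}

func main() {
	fmt.Println(runDemo()) // prints [publish deregister]
}
```

Running all shutdown hooks regardless of earlier failures is a deliberate choice in this sketch, so that a failed cleanup step does not leave a stale agent registration behind.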
IngmarVG-IB requested a review from a team as a code owner March 27, 2026 23:10

copy-pr-bot bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Welcome to AICR, @IngmarVG-IB! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

Strip rdatapolicy plugin mentions from pkg/discovery doc comments
and the k8s-discovery agent definition. These features are under
embargo and should not be referenced in public code.

OpenShell evaluates target agents' policy documents (served at their
dns-aid-core policy URI, SvcParamKey 65403) as Layer 1 caller-side
enforcement before allowing connections. This completes the security
story: "OpenShell secures the agent boundary; DNS-AID resolves what's
inside it."

New pkg/openshell package implements:
- All 16 native policy rules matching dns-aid-core's PolicyDocument schema
- Three enforcement modes: strict (deny), permissive (log, default), disabled
- Fail-open on policy fetch errors (matches dns-aid-core behavior)
- SSRF-protected fetcher with HTTPS-only, private IP rejection
- TTL-based cache with bounded eviction and singleflight coalescing
- Enforcement layer filtering (Layer 0/1/2 per rule)
- Realm isolation detection with cross-realm access logging

Integration:
- /v1/agents endpoint evaluates OpenShell policy per discovered agent
- OPENSHELL_MODE env var controls enforcement (strict/permissive/disabled)
- Guard accepts nil safely, so there is zero behavioral change when discovery is disabled

New error codes: POLICY_DENIED, POLICY_FETCH
New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Test coverage: 84.2% (30 tests across evaluator, fetcher, guard)
IngmarVG-IB changed the title from "feat(discovery): add DNS-AID agent discovery and server lifecycle hooks" to "feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement" Mar 28, 2026
mchmarny requested a review from lockwobr April 2, 2026 11:25
Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>