diff --git a/components/egress/METRIC.md b/components/egress/METRIC.md
new file mode 100644
index 00000000..239f1fdb
--- /dev/null
+++ b/components/egress/METRIC.md
@@ -0,0 +1,89 @@
+# Egress Sidecar Metrics
+
+This document describes the Prometheus metrics exposed by the Egress Sidecar: name, type, description, and optional labels.
+All metrics use the prefix `opensandbox_egress_` and are exposed over HTTP at `GET /metrics` (same port as the policy server, default `:18080`).
+
+---
+
+## 1. DNS Proxy (Layer 1)
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `opensandbox_egress_dns_queries_total` | Counter | Total DNS queries handled by the proxy, by result. | `result`: `allowed` (policy allowed and forward succeeded), `denied` (policy denied, NXDOMAIN returned), `forward_error` (policy allowed but upstream DNS failed). |
+| `opensandbox_egress_dns_forward_duration_seconds` | Histogram / Summary | Latency in seconds of forwarding DNS queries to upstream. | For Summary, `quantile`; for Histogram, default buckets. |
+
+---
+
+## 2. Policy and Runtime
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `opensandbox_egress_policy_updates_total` | Counter | Number of successful policy updates via `POST /policy`. | None. |
+| `opensandbox_egress_policy_rule_count` | Gauge | Current number of egress rules in the active policy. | Optional: `default_action` (`allow` / `deny`). |
+| `opensandbox_egress_enforcement_mode` | Gauge | Current enforcement mode for observability (OSEP R6). Value is 1; label distinguishes mode. | `mode`: `dns` (DNS proxy only) or `dns+nft` (DNS + nftables). |
+
+---
+
+## 3. nftables (Layer 2, dns+nft mode)
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `opensandbox_egress_nft_apply_total` | Counter | Number of nftables ApplyStatic (static rule apply) operations. | `result`: `success` or `failure`. On failure the sidecar falls back to DNS-only mode. |
+| `opensandbox_egress_nft_resolved_ips_added_total` | Counter | Number of resolved IPs added to the nftables dynamic set (count of IPs or invocations, implementation-defined). | Optional: `domain` (use with care to avoid high cardinality). |
+| `opensandbox_egress_nft_doh_dot_packets_dropped_total` | Counter | Number of packets dropped due to DoH/DoT blocking. | `reason`: `dot_853` (DoT port 853), `doh_443` (DoH over 443 when enabled). |
+
+---
+
+## 4. Violations and Security (aligned with OSEP R7 / violation logging)
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `opensandbox_egress_violations_total` | Counter | Number of policy denials (e.g. DNS NXDOMAIN). Can be instrumented alongside violation logs. | `type`: `dns_deny`; add e.g. `l2_deny` for L2 denials if implemented. |
+
+---
+
+## 5. Process / Runtime (optional)
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|--------|
+| `opensandbox_egress_info` | Gauge | Constant 1; labels identify the instance and environment in Prometheus. | See "Instance identification" below: `instance_id` (recommended), `enforcement_mode`, `version`, etc. |
+| `opensandbox_egress_uptime_seconds` | Gauge | Process uptime in seconds. | None. |
+
+---
+
+## Instance identification (keeping metrics per container)
+
+Each sidecar container corresponds to a different sandbox; metrics must be distinguishable per instance and must not be mixed in the same time series. How this works depends on how metrics are collected.
+
+### Instance ID source
+
+Instance identification is **provided only via an environment variable**; the sidecar only reads the variable and does not distinguish between Kubernetes and Docker:
+
+- **Env var**: `OPENSANDBOX_EGRESS_INSTANCE_ID`
+- **Meaning**: Unique ID for this sidecar instance (e.g. sandbox_id, pod name, container_id), **injected by the orchestrator when creating the container**.
+- **Examples**:
+  - Kubernetes: set via Downward API in the Pod, e.g. `OPENSANDBOX_EGRESS_INSTANCE_ID=$(POD_NAME).$(POD_NAMESPACE)` or `$(POD_UID)`.
+  - Docker / OpenSandbox server: pass when creating the container, e.g. `-e OPENSANDBOX_EGRESS_INSTANCE_ID=`.
+
+Implementation notes:
+
+- Attach the **same set of instance labels** to all metrics: read `OPENSANDBOX_EGRESS_INSTANCE_ID` and use it as the `instance_id` label, consistent with `opensandbox_egress_info`.
+- If the variable is unset, `instance_id` falls back to the hostname (and is empty only if the hostname cannot be read). **When pushing metrics, setting it explicitly is strongly recommended**; otherwise multiple instances may share the same grouping key.
+
+---
+
+## Metric types
+
+- **Counter**: Monotonically increasing value; use for request counts, error counts, etc. Query with `rate()` or `increase()` to get a per-second rate or a delta.
+- **Gauge**: Current value that can go up or down; use for current rule count, mode, uptime, etc.
+- **Histogram**: Bucketed observations (e.g. latency); quantiles are estimated at query time with `histogram_quantile()`, and the bucket counters work with `rate()`.
+- **Summary**: Quantiles computed in the application and exposed; use for distribution metrics like latency.
+
+---
+
+## Exposure
+
+- **Endpoint**: Same port as the policy server, default `GET http:///metrics` (e.g. `http://127.0.0.1:18080/metrics`).
+- **Format**: Prometheus text format (`text/plain; charset=utf-8`).
+- **Collection**: Because the sidecar lifecycle is short, either scrape at a short interval from within the same Pod, or push on exit or periodically (e.g. Pushgateway, OTLP). See [README](README.md) and observability notes.
+- **Instance separation**: Metrics from different container instances are separated by the labels defined in "Instance identification" (e.g. `instance_id`) or by scrape target identity; see the "Instance identification" section above.
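
Reviewer note on the METRIC.md above: the Exposure section says the endpoint serves the Prometheus text format, so a consumer can read individual series with very little code. The sketch below is a minimal, standard-library-only illustration of that idea and is not part of this change. The sample payload, the `sandbox-1` instance ID, and the helper name `parseSamples` are all hypothetical, and the parser assumes sample lines carry no timestamps and that label values contain no spaces.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleText is a hypothetical /metrics payload in the Prometheus text
// format; the values and the instance_id are made up for illustration.
const sampleText = `# HELP opensandbox_egress_dns_queries_total Total DNS queries handled by the proxy, by result.
# TYPE opensandbox_egress_dns_queries_total counter
opensandbox_egress_dns_queries_total{instance_id="sandbox-1",result="allowed"} 7
opensandbox_egress_dns_queries_total{instance_id="sandbox-1",result="denied"} 2
opensandbox_egress_uptime_seconds{instance_id="sandbox-1"} 12.5
`

// parseSamples maps each series (metric name plus label set, exactly as
// written in the payload) to its value. Simplifying assumptions: no
// timestamps after the value and no spaces inside label values.
func parseSamples(text string) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out[line[:i]] = v
	}
	return out, sc.Err()
}

func main() {
	samples, err := parseSamples(sampleText)
	if err != nil {
		panic(err)
	}
	denied := samples[`opensandbox_egress_dns_queries_total{instance_id="sandbox-1",result="denied"}`]
	fmt.Println(denied) // prints 2
}
```

For anything beyond a quick check, a real exposition-format parser is the safer choice, since this sketch ignores escaping rules and timestamps.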
diff --git a/components/egress/go.mod b/components/egress/go.mod
index c7304491..15051334 100644
--- a/components/egress/go.mod
+++ b/components/egress/go.mod
@@ -4,12 +4,22 @@ go 1.24.0
 
 require (
 	github.com/miekg/dns v1.1.61
-	golang.org/x/sys v0.31.0
+	github.com/prometheus/client_golang v1.23.2
+	golang.org/x/sys v0.35.0
 )
 
 require (
+	github.com/beorn7/perks v1.0.1 // indirect
+	github.com/cespare/xxhash/v2 v2.3.0 // indirect
+	github.com/kr/text v0.2.0 // indirect
+	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
+	github.com/prometheus/client_model v0.6.2 // indirect
+	github.com/prometheus/common v0.66.1 // indirect
+	github.com/prometheus/procfs v0.16.1 // indirect
+	go.yaml.in/yaml/v2 v2.4.2 // indirect
 	golang.org/x/mod v0.18.0 // indirect
-	golang.org/x/net v0.38.0 // indirect
-	golang.org/x/sync v0.7.0 // indirect
+	golang.org/x/net v0.43.0 // indirect
+	golang.org/x/sync v0.13.0 // indirect
 	golang.org/x/tools v0.22.0 // indirect
+	google.golang.org/protobuf v1.36.8 // indirect
 )
diff --git a/components/egress/go.sum b/components/egress/go.sum
index 459f89cd..72638050 100644
--- a/components/egress/go.sum
+++ b/components/egress/go.sum
@@ -1,12 +1,56 @@
+github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
+github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
+github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
+github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
+github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
+github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
+github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
+github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
+github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
+github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo=
+github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ=
+github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
+github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
+github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
+github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
+github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc=
+github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw=
 github.com/miekg/dns v1.1.61 h1:nLxbwF3XxhwVSm8g9Dghm9MHPaUZuqhPiGL+675ZmEs=
 github.com/miekg/dns v1.1.61/go.mod h1:mnAarhS3nWaW+NVP2wTkYVIZyHNJ098SJZUki3eykwQ=
+github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
+github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
+github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
+github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
+github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o=
+github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg=
+github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk=
+github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE=
+github.com/prometheus/common v0.66.1 h1:h5E0h5/Y8niHc5DlaLlWLArTQI7tMrsfQjHV+d9ZoGs=
+github.com/prometheus/common v0.66.1/go.mod h1:gcaUsgf3KfRSwHY4dIMXLPV0K/Wg1oZ8+SbZk/HH/dA=
+github.com/prometheus/procfs v0.16.1 h1:hZ15bTNuirocR6u0JZ6BAHHmwS1p8B4P6MRqxtzMyRg=
+github.com/prometheus/procfs v0.16.1/go.mod h1:teAbpZRB1iIAJYREa1LsoWUXykVXA1KlTmWl8x/U+Is=
+github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
+github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog=
+github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
+github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
+go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
+go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
+go.yaml.in/yaml/v2 v2.4.2 h1:DzmwEr2rDGHl7lsFgAHxmNz/1NlQ7xLIrlN2h5d1eGI=
+go.yaml.in/yaml/v2 v2.4.2/go.mod h1:081UH+NErpNdqlCXm3TtEran0rJZGxAYx9hb/ELlsPU=
 golang.org/x/mod v0.18.0 h1:5+9lSbEzPSdWkH32vYPBwEpX8KwDbM52Ud9xBUvNlb0=
 golang.org/x/mod v0.18.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
-golang.org/x/net v0.38.0 h1:vRMAPTMaeGqVhG5QyLJHqNDwecKTomGeqbnfZyKlBI8=
-golang.org/x/net v0.38.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8=
-golang.org/x/sync v0.7.0 h1:YsImfSBoP9QPYL0xyKJPq0gcaJdG3rInoqxTWbfQu9M=
-golang.org/x/sync v0.7.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
-golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik=
-golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=
+golang.org/x/net v0.43.0 h1:lat02VYK2j4aLzMzecihNvTlJNQUq316m2Mr9rnM6YE=
+golang.org/x/net v0.43.0/go.mod h1:vhO1fvI4dGsIjh73sWfUVjj3N7CA9WkKJNQm2svM6Jg=
+golang.org/x/sync v0.13.0 h1:AauUjRAJ9OSnvULf/ARrrVywoJDy0YS2AwQ98I37610=
+golang.org/x/sync v0.13.0/go.mod h1:1dzgHSNfp02xaA81J2MS99Qcpr2w7fw1gpm99rleRqA=
+golang.org/x/sys v0.35.0 h1:vz1N37gP5bs89s7He8XuIYXpyY0+QlsKmzipCbUtyxI=
+golang.org/x/sys v0.35.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=
 golang.org/x/tools v0.22.0 h1:gqSGLZqv+AI9lIQzniJ0nZDRG5GBPsSi+DRNHWNz6yA=
 golang.org/x/tools v0.22.0/go.mod h1:aCwcsjqvq7Yqt6TNyX7QMU2enbQ/Gt0bo6krSeEri+c=
+google.golang.org/protobuf v1.36.8 h1:xHScyCOEuuwZEc6UtSOvPbAT4zRh0xcNRYekJwfqyMc=
+google.golang.org/protobuf v1.36.8/go.mod h1:fuxRtAxBytpl4zzqUh6/eyUujkJdNiuEkXntxiD/uRU=
+gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
+gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
+gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
diff --git a/components/egress/main.go b/components/egress/main.go
index bca6d33b..d6b4eafe 100644
--- a/components/egress/main.go
+++ b/components/egress/main.go
@@ -25,6 +25,7 @@ import (
 	"github.com/alibaba/opensandbox/egress/pkg/constants"
 	"github.com/alibaba/opensandbox/egress/pkg/dnsproxy"
 	"github.com/alibaba/opensandbox/egress/pkg/iptables"
+	"github.com/alibaba/opensandbox/egress/pkg/metrics"
 )
 
 func main() {
@@ -39,11 +40,13 @@ func main() {
 
 	allowIPs := AllowIPsForNft("/etc/resolv.conf")
 	mode := parseMode()
+	metrics.SetEnforcementMode(mode)
 	nftMgr := createNftManager(mode)
 	proxy, err := dnsproxy.New(initialRules, "")
 	if err != nil {
 		log.Fatalf("failed to init dns proxy: %v", err)
 	}
+	metrics.SetPolicyRuleCount(initialRules.DefaultAction, len(initialRules.Egress))
 	if err := proxy.Start(ctx); err != nil {
 		log.Fatalf("failed to start dns proxy: %v", err)
 	}
diff --git a/components/egress/nft.go b/components/egress/nft.go
index 74ce65a1..5a909bb2 100644
--- a/components/egress/nft.go
+++ b/components/egress/nft.go
@@ -23,6 +23,7 @@ import (
 
 	"github.com/alibaba/opensandbox/egress/pkg/constants"
 	"github.com/alibaba/opensandbox/egress/pkg/dnsproxy"
+	"github.com/alibaba/opensandbox/egress/pkg/metrics"
 	"github.com/alibaba/opensandbox/egress/pkg/nftables"
 	"github.com/alibaba/opensandbox/egress/pkg/policy"
 )
@@ -43,12 +44,16 @@ func setupNft(ctx context.Context, nftMgr nftApplier, initialPolicy *policy.Netw
 	}
 	policyWithNS := initialPolicy.WithExtraAllowIPs(nameserverIPs)
 	if err := nftMgr.ApplyStatic(ctx, policyWithNS); err != nil {
+		metrics.NftApplyTotal.WithLabelValues(metrics.ResultFailure).Inc()
 		log.Fatalf("nftables static apply failed: %v", err)
 	}
+	metrics.NftApplyTotal.WithLabelValues(metrics.ResultSuccess).Inc()
 	log.Printf("nftables static policy applied (table inet opensandbox)")
 	proxy.SetOnResolved(func(domain string, ips []nftables.ResolvedIP) {
 		if err := nftMgr.AddResolvedIPs(ctx, ips); err != nil {
 			log.Printf("[dns] add resolved IPs to nft failed: %v", err)
+		} else {
+			metrics.NftResolvedIPsAddedTotal.Add(float64(len(ips)))
 		}
 	})
 }
diff --git a/components/egress/pkg/constants/configuration.go b/components/egress/pkg/constants/configuration.go
index d4d44fe2..58b7e74d 100644
--- a/components/egress/pkg/constants/configuration.go
+++ b/components/egress/pkg/constants/configuration.go
@@ -15,13 +15,14 @@
 package constants
 
 const (
-	EnvBlockDoH443    = "OPENSANDBOX_EGRESS_BLOCK_DOH_443"
-	EnvDoHBlocklist   = "OPENSANDBOX_EGRESS_DOH_BLOCKLIST" // comma-separated IP/CIDR
-	EnvEgressMode     = "OPENSANDBOX_EGRESS_MODE"          // dns | dns+nft
-	EnvEgressHTTPAddr = "OPENSANDBOX_EGRESS_HTTP_ADDR"
-	EnvEgressToken    = "OPENSANDBOX_EGRESS_TOKEN"
-	EnvEgressRules    = "OPENSANDBOX_EGRESS_RULES"
-	EnvMaxNameservers = "OPENSANDBOX_EGRESS_MAX_NS"
+	EnvBlockDoH443      = "OPENSANDBOX_EGRESS_BLOCK_DOH_443"
+	EnvDoHBlocklist     = "OPENSANDBOX_EGRESS_DOH_BLOCKLIST" // comma-separated IP/CIDR
+	EnvEgressMode       = "OPENSANDBOX_EGRESS_MODE"          // dns | dns+nft
+	EnvEgressHTTPAddr   = "OPENSANDBOX_EGRESS_HTTP_ADDR"
+	EnvEgressToken      = "OPENSANDBOX_EGRESS_TOKEN"
+	EnvEgressRules      = "OPENSANDBOX_EGRESS_RULES"
+	EnvEgressInstanceID = "OPENSANDBOX_EGRESS_INSTANCE_ID" // unique instance id for metrics instance_id label
+	EnvMaxNameservers   = "OPENSANDBOX_EGRESS_MAX_NS"
 )
 
 const (
diff --git a/components/egress/pkg/dnsproxy/proxy.go b/components/egress/pkg/dnsproxy/proxy.go
index bd761612..d40f4de5 100644
--- a/components/egress/pkg/dnsproxy/proxy.go
+++ b/components/egress/pkg/dnsproxy/proxy.go
@@ -26,6 +26,7 @@ import (
 
 	"github.com/miekg/dns"
 
+	"github.com/alibaba/opensandbox/egress/pkg/metrics"
 	"github.com/alibaba/opensandbox/egress/pkg/nftables"
 	"github.com/alibaba/opensandbox/egress/pkg/policy"
 )
@@ -109,20 +110,26 @@ func (p *Proxy) serveDNS(w dns.ResponseWriter, r *dns.Msg) {
 	currentPolicy := p.policy
 	p.policyMu.RUnlock()
 	if currentPolicy != nil && currentPolicy.Evaluate(domain) == policy.ActionDeny {
+		metrics.DNSQueriesTotal.WithLabelValues(metrics.ResultDenied).Inc()
+		metrics.ViolationsTotal.WithLabelValues(metrics.ViolationTypeDNSDeny).Inc()
 		resp := new(dns.Msg)
 		resp.SetRcode(r, dns.RcodeNameError)
 		_ = w.WriteMsg(resp)
 		return
 	}
 
+	start := time.Now()
 	resp, err := p.forward(r)
+	metrics.DNSForwardDurationSeconds.Observe(time.Since(start).Seconds())
 	if err != nil {
+		metrics.DNSQueriesTotal.WithLabelValues(metrics.ResultForwardError).Inc()
 		log.Printf("[dns] forward error for %s: %v", domain, err)
 		fail := new(dns.Msg)
 		fail.SetRcode(r, dns.RcodeServerFailure)
 		_ = w.WriteMsg(fail)
 		return
 	}
+	metrics.DNSQueriesTotal.WithLabelValues(metrics.ResultAllowed).Inc()
 	p.maybeNotifyResolved(domain, resp)
 	_ = w.WriteMsg(resp)
 }
diff --git a/components/egress/pkg/metrics/metrics.go b/components/egress/pkg/metrics/metrics.go
new file mode 100644
index 00000000..44a79e39
--- /dev/null
+++ b/components/egress/pkg/metrics/metrics.go
@@ -0,0 +1,238 @@
+// Copyright 2026 Alibaba Group Holding Ltd.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package metrics
+
+import (
+	"os"
+	"time"
+
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promauto"
+
+	"github.com/alibaba/opensandbox/egress/pkg/constants"
+)
+
+const (
+	namespace = "opensandbox"
+	subsystem = "egress"
+)
+
+var (
+	// instanceID is set once at init from OPENSANDBOX_EGRESS_INSTANCE_ID or hostname.
+	instanceID string
+	// startTime is used for uptime_seconds.
+	startTime time.Time
+)
+
+func init() {
+	if v := os.Getenv(constants.EnvEgressInstanceID); v != "" {
+		instanceID = v
+	} else {
+		hostname, _ := os.Hostname()
+		instanceID = hostname
+	}
+	startTime = time.Now()
+}
+
+// InstanceID returns the instance identifier for this sidecar (from env or hostname).
+func InstanceID() string { return instanceID }
+
+func constLabels() prometheus.Labels {
+	return prometheus.Labels{"instance_id": instanceID}
+}
+
+// DNS layer (Layer 1)
+var (
+	// DNSQueriesTotal counts DNS queries by result: allowed, denied, forward_error.
+	DNSQueriesTotal = promauto.NewCounterVec(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "dns_queries_total",
+			Help:        "Total DNS queries handled by the proxy, by result.",
+			ConstLabels: constLabels(),
+		},
+		[]string{"result"},
+	)
+
+	// DNSForwardDurationSeconds is the latency of upstream DNS forward.
+	DNSForwardDurationSeconds = promauto.NewHistogram(
+		prometheus.HistogramOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "dns_forward_duration_seconds",
+			Help:        "Latency of forwarding DNS queries to upstream.",
+			ConstLabels: constLabels(),
+			Buckets:     prometheus.DefBuckets,
+		},
+	)
+)
+
+// Policy and runtime
+var (
+	// PolicyUpdatesTotal counts successful POST /policy updates.
+	PolicyUpdatesTotal = promauto.NewCounter(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "policy_updates_total",
+			Help:        "Total number of successful policy updates via POST /policy.",
+			ConstLabels: constLabels(),
+		},
+	)
+
+	// PolicyRuleCount is the current number of egress rules in the active policy.
+	PolicyRuleCount = promauto.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "policy_rule_count",
+			Help:        "Current number of egress rules in the active policy.",
+			ConstLabels: constLabels(),
+		},
+		[]string{"default_action"},
+	)
+
+	// EnforcementMode is 1 for the current mode (label: mode=dns or dns+nft).
+	EnforcementMode = promauto.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "enforcement_mode",
+			Help:        "Current enforcement mode (1 for the active mode).",
+			ConstLabels: constLabels(),
+		},
+		[]string{"mode"},
+	)
+)
+
+// nftables (Layer 2)
+var (
+	// NftApplyTotal counts nftables ApplyStatic calls by result: success, failure.
+	NftApplyTotal = promauto.NewCounterVec(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "nft_apply_total",
+			Help:        "Total number of nftables static rule apply operations.",
+			ConstLabels: constLabels(),
+		},
+		[]string{"result"},
+	)
+
+	// NftResolvedIPsAddedTotal counts IPs added to nftables dynamic set from DNS.
+	NftResolvedIPsAddedTotal = promauto.NewCounter(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "nft_resolved_ips_added_total",
+			Help:        "Total number of resolved IPs added to nftables dynamic allow set.",
+			ConstLabels: constLabels(),
+		},
+	)
+
+	// NftDohDotPacketsDroppedTotal counts packets dropped by DoH/DoT blocking.
+	NftDohDotPacketsDroppedTotal = promauto.NewCounterVec(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "nft_doh_dot_packets_dropped_total",
+			Help:        "Total packets dropped due to DoH/DoT blocking.",
+			ConstLabels: constLabels(),
+		},
+		[]string{"reason"},
+	)
+)
+
+// Violations (R7)
+var (
+	// ViolationsTotal counts policy denials (e.g. DNS NXDOMAIN).
+	ViolationsTotal = promauto.NewCounterVec(
+		prometheus.CounterOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "violations_total",
+			Help:        "Total number of policy violations (e.g. DNS denied).",
+			ConstLabels: constLabels(),
+		},
+		[]string{"type"},
+	)
+)
+
+// Process / runtime info
+var (
+	// EgressInfo is 1 with labels identifying this instance (enforcement_mode, version).
+	EgressInfo = promauto.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "info",
+			Help:        "Info metric with labels for instance and environment.",
+			ConstLabels: constLabels(),
+		},
+		[]string{"enforcement_mode", "version"},
+	)
+
+	// UptimeSeconds is process uptime in seconds (updated on each scrape via GaugeFunc).
+	UptimeSeconds = promauto.NewGaugeFunc(
+		prometheus.GaugeOpts{
+			Namespace:   namespace,
+			Subsystem:   subsystem,
+			Name:        "uptime_seconds",
+			Help:        "Process uptime in seconds.",
+			ConstLabels: constLabels(),
+		},
+		func() float64 { return time.Since(startTime).Seconds() },
+	)
+)
+
+// Result label values for DNS and nft.
+const (
+	ResultAllowed      = "allowed"
+	ResultDenied       = "denied"
+	ResultForwardError = "forward_error"
+	ResultSuccess      = "success"
+	ResultFailure      = "failure"
+)
+
+// Violation type label values.
+const (
+	ViolationTypeDNSDeny = "dns_deny"
+)
+
+// DoH/DoT drop reason label values.
+const (
+	ReasonDot853 = "dot_853"
+	ReasonDoh443 = "doh_443"
+)
+
+// Version may be set at build time (-ldflags).
+var Version = "0.0.0"
+
+// SetEnforcementMode sets the current mode for opensandbox_egress_enforcement_mode and opensandbox_egress_info.
+// Should be called once from main after mode is known.
+func SetEnforcementMode(mode string) {
+	EnforcementMode.Reset()
+	EnforcementMode.WithLabelValues(mode).Set(1)
+	EgressInfo.Reset()
+	EgressInfo.WithLabelValues(mode, Version).Set(1)
+}
+
+// SetPolicyRuleCount updates the policy_rule_count gauge for the given default_action.
+// Call when policy is loaded or updated.
+func SetPolicyRuleCount(defaultAction string, ruleCount int) {
+	PolicyRuleCount.Reset()
+	PolicyRuleCount.WithLabelValues(defaultAction).Set(float64(ruleCount))
+}
diff --git a/components/egress/policy_server.go b/components/egress/policy_server.go
index 25117dce..b1fa095a 100644
--- a/components/egress/policy_server.go
+++ b/components/egress/policy_server.go
@@ -28,8 +28,10 @@ import (
 	"time"
 
 	"github.com/alibaba/opensandbox/egress/pkg/constants"
+	"github.com/alibaba/opensandbox/egress/pkg/metrics"
 	"github.com/alibaba/opensandbox/egress/pkg/nftables"
 	"github.com/alibaba/opensandbox/egress/pkg/policy"
+	"github.com/prometheus/client_golang/prometheus/promhttp"
 )
 
 type policyUpdater interface {
@@ -58,6 +60,7 @@ func startPolicyServer(ctx context.Context, proxy policyUpdater, nft nftApplier,
 	mux := http.NewServeMux()
 	handler := &policyServer{proxy: proxy, nft: nft, token: token, enforcementMode: enforcementMode, nameserverIPs: nameserverIPs}
 	mux.HandleFunc("/policy", handler.handlePolicy)
+	mux.HandleFunc("/metrics", promhttp.Handler().ServeHTTP)
 	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
 		w.WriteHeader(http.StatusOK)
 		_, _ = w.Write([]byte("ok"))
@@ -146,11 +149,15 @@ func (s *policyServer) handlePost(w http.ResponseWriter, r *http.Request) {
 		if s.nft != nil {
 			defWithNS := def.WithExtraAllowIPs(s.nameserverIPs)
 			if err := s.nft.ApplyStatic(r.Context(), defWithNS); err != nil {
+				metrics.NftApplyTotal.WithLabelValues(metrics.ResultFailure).Inc()
 				http.Error(w, fmt.Sprintf("failed to apply nftables: %v", err), http.StatusInternalServerError)
 				return
 			}
+			metrics.NftApplyTotal.WithLabelValues(metrics.ResultSuccess).Inc()
 		}
 		s.proxy.UpdatePolicy(def)
+		metrics.PolicyUpdatesTotal.Inc()
+		metrics.SetPolicyRuleCount(def.DefaultAction, len(def.Egress))
 		writeJSON(w, http.StatusOK, map[string]any{
 			"status": "ok",
 			"mode":   "deny_all",
@@ -167,11 +174,15 @@ func (s *policyServer) handlePost(w http.ResponseWriter, r *http.Request) {
 	if s.nft != nil {
 		polWithNS := pol.WithExtraAllowIPs(s.nameserverIPs)
 		if err := s.nft.ApplyStatic(r.Context(), polWithNS); err != nil {
+			metrics.NftApplyTotal.WithLabelValues(metrics.ResultFailure).Inc()
 			http.Error(w, fmt.Sprintf("failed to apply nftables policy: %v", err), http.StatusInternalServerError)
 			return
 		}
+		metrics.NftApplyTotal.WithLabelValues(metrics.ResultSuccess).Inc()
 	}
 	s.proxy.UpdatePolicy(pol)
+	metrics.PolicyUpdatesTotal.Inc()
+	metrics.SetPolicyRuleCount(pol.DefaultAction, len(pol.Egress))
 	writeJSON(w, http.StatusOK, map[string]any{
 		"status": "ok",
 		"mode":   modeFromPolicy(pol),
diff --git a/components/egress/tests/smoke-dns.sh b/components/egress/tests/smoke-dns.sh
index 082e337f..b11966ae 100755
--- a/components/egress/tests/smoke-dns.sh
+++ b/components/egress/tests/smoke-dns.sh
@@ -72,4 +72,6 @@ info "Test: allowed domain should succeed (api.github.com)"
 run_in_app -I https://api.github.com --max-time 10 >/dev/null 2>&1 || fail "api.github.com should succeed"
 pass "api.github.com allowed"
 
-info "All smoke tests passed."
\ No newline at end of file
+info "All smoke tests passed."
+info "Fetching metrics..."
+curl -sf "http://127.0.0.1:${POLICY_PORT}/metrics" || true
\ No newline at end of file
diff --git a/components/egress/tests/smoke-dynamic-ip.sh b/components/egress/tests/smoke-dynamic-ip.sh
index 22947779..a2c09a6c 100755
--- a/components/egress/tests/smoke-dynamic-ip.sh
+++ b/components/egress/tests/smoke-dynamic-ip.sh
@@ -81,3 +81,5 @@ else
 fi
 
 info "All smoke tests (dynamic IP) passed."
+info "Fetching metrics..."
+curl -sf "http://127.0.0.1:${POLICY_PORT}/metrics" || true
diff --git a/components/egress/tests/smoke-nft.sh b/components/egress/tests/smoke-nft.sh
index e2052963..58517c89 100755
--- a/components/egress/tests/smoke-nft.sh
+++ b/components/egress/tests/smoke-nft.sh
@@ -90,4 +90,6 @@ else
   pass "DoT 853 blocked"
 fi
 
-info "All smoke tests passed."
\ No newline at end of file
+info "All smoke tests passed."
+info "Fetching metrics..."
+curl -sf "http://127.0.0.1:${POLICY_PORT}/metrics" || true
\ No newline at end of file
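
Reviewer note on the smoke scripts: they fetch `/metrics` but discard failures with `|| true`, so a missing metric would go unnoticed. A stricter check could look roughly like the Go sketch below, which is an illustration only and not part of this change. The helper names `fetchMetrics` and `missingMetrics` are hypothetical, and `main` uses an `httptest` server with a made-up payload as a stand-in for the sidecar so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchMetrics GETs the given URL and returns the response body as text.
func fetchMetrics(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// missingMetrics returns the names in want that have no sample line in body.
// A sample line starts with the metric name followed by "{" or a space, so
// histogram families must be checked via their _bucket/_sum/_count names.
func missingMetrics(body string, want []string) []string {
	var missing []string
	lines := strings.Split(body, "\n")
	for _, name := range want {
		found := false
		for _, line := range lines {
			if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Stand-in for the sidecar: serves one hypothetical sample.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "opensandbox_egress_uptime_seconds{instance_id=\"sandbox-1\"} 3.2\n")
	}))
	defer srv.Close()

	body, err := fetchMetrics(srv.URL + "/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Println(missingMetrics(body, []string{
		"opensandbox_egress_uptime_seconds",
		"opensandbox_egress_policy_updates_total",
	}))
}
```

One caveat if a check like this were adopted: with the Prometheus Go client, label-vector metrics such as `opensandbox_egress_dns_queries_total` only appear in the output once a label combination has been used, so the list of required names would need to be limited to metrics that are always populated by the time the check runs.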