Feat/enhance agents and add docs #33

Merged
YounBrand merged 8 commits into main from feat/enhance-agents-and-add-docs on Apr 9, 2026
Conversation

CjKhaled (Collaborator) commented Apr 8, 2026

Enhance Agents: Metrics, Tools, Evals, Logging & Documentation

Summary

Adds end-to-end observability, evaluation infrastructure, and comprehensive documentation to the agents module. The three LLM-powered agents (budget, debt, investing) now have persistent metrics, structured logging, an offline eval harness, improved tool definitions, and a 16-section reference doc — all without changing the existing API contract or breaking any of the 1235 existing tests.


What changed

1. Agent Metrics — Separate DynamoDB Table

Every agent.invoke() returns a metrics object that was previously discarded. Now it's persisted to a new AgentMetrics table after every successful invocation.

What's stored per invocation:

  • Token usage (input / output / cache read / cache write) for cost monitoring
  • Wall-clock duration + model latency for latency benchmarking
  • Cycle count + average cycle duration for reasoning efficiency tracking
  • Per-tool stats (callCount, successCount, errorCount, averageTimeMs, successRate) for tool performance analysis

Key design decisions:

  • Table: AgentMetrics with userId (PK) + metricId ULID (SK), plus a GSI on agentType + createdAt for trend queries
  • Fire-and-forget: metrics save failure never blocks proposal creation — the proposal is the critical path
  • Correlated with logs via invocationId ULID (added later in the logging work)
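The fire-and-forget decision above can be sketched in a few lines. The record shape below is trimmed down and the stub stands in for the real repository call, which writes to the AgentMetrics DynamoDB table:

```typescript
// Illustrative sketch of the fire-and-forget metrics save. The stub and the
// reduced record shape exist only to show the pattern; the real
// AgentMetricsRecord carries more fields.

interface AgentMetricsRecord {
  userId: string;                               // partition key
  metricId: string;                             // ULID sort key
  agentType: "budget" | "debt" | "investing";
  createdAt: string;
  inputTokens: number;
  outputTokens: number;
}

// Stand-in for the repository call; rejects to simulate a save failure.
async function saveAgentMetrics(record: AgentMetricsRecord): Promise<void> {
  if (record.inputTokens < 0) throw new Error("validation failed");
}

const warnings: string[] = []; // stand-in for logger.warn

// The proposal (critical path) returns immediately; a metrics save failure
// is logged but never propagated to the caller.
function persistMetricsInBackground(record: AgentMetricsRecord): void {
  void saveAgentMetrics(record).catch((err: Error) => {
    warnings.push(`metrics save failed: ${err.message}`);
  });
}
```

The `void` prefix makes the intent explicit: the promise is deliberately not awaited, and the `.catch` guarantees a failed save can never become an unhandled rejection.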

Files:

  • server/src/db/tables.ts — table + GSI constants
  • server/src/scripts/setup-tables.ts — local table creation
  • terraform/dynamodb.tf + terraform/variables.tf — infra definition
  • server/src/modules/agents/agents.types.ts — AgentMetricsRecord, StoredToolMetrics, AgentInvokeResult<T>
  • server/src/modules/agents/agents.repository.ts — saveAgentMetrics()
  • server/src/modules/agents/core/{budget,debt,investing}-agent.ts — return { output, metrics } instead of raw output
  • server/src/modules/agents/agents.service.ts — buildMetricsRecord() + fire-and-forget save in all three run*Agent functions

2. Tool Improvements

Code review–driven improvements to server/src/modules/agents/core/tools.ts:

  • Removed JSON.parse(JSON.stringify()) hack from 3 tool callbacks — returns objects directly
  • Added structured error handling — all tools return { error, message, retryable } instead of throwing, so the agent can retry or switch strategy
  • Added isEmpty flags to all collection returns — prevents the agent from hallucinating about empty vs. missing data
  • Improved descriptions — added "Call once per session", "Do NOT use this for…", and sequencing hints to reduce redundant tool calls
  • New getUserFinancialSnapshot tool — fetches accounts + holdings + liabilities in parallel via Promise.allSettled, tolerates partial failures. Registered in all three agents.
  • Updated system prompts in core/prompts.ts to reference the snapshot tool
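The structured-error and isEmpty conventions can be sketched with a single hypothetical tool; the name and exact field set below are illustrative, not the actual tools.ts definitions:

```typescript
// Illustrative sketch: tools return errors as data instead of throwing, and
// collection results carry an explicit isEmpty flag. Names are hypothetical
// stand-ins for the real tools.ts definitions.

interface ToolError {
  error: true;
  message: string;
  retryable: boolean; // lets the agent choose: retry vs. switch strategy
}

interface LiabilityList {
  isEmpty: boolean; // "user has no liabilities" vs. "fetch failed"
  liabilities: { name: string; balance: number }[];
}

async function getLiabilitiesTool(
  fetchLiabilities: () => Promise<{ name: string; balance: number }[]>
): Promise<LiabilityList | ToolError> {
  try {
    const liabilities = await fetchLiabilities();
    return { isEmpty: liabilities.length === 0, liabilities };
  } catch (err) {
    // Returned (not thrown) so the model sees the failure and can adapt.
    return { error: true, message: (err as Error).message, retryable: true };
  }
}
```

The explicit isEmpty flag matters because, to a model reading serialized JSON, an empty array and a failed fetch can look identical.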

3. Evaluation Harness

Offline eval system for validating agent decision quality against known financial scenarios. Evals hit the real Anthropic API but mock the service layer so no DB/Plaid I/O occurs.

Architecture:

  • Hard constraints (binary) — derived from system prompt rules (e.g. "sum must equal income", "investments must be nonzero"). Failing = agent is broken.
  • Soft scores (weighted 0–1) — grade decision quality (e.g. 50/30/20 adherence, avalanche ordering correlation, account priority waterfall). Aggregated across N runs.
  • Non-determinism handling — each case runs 5 times by default; pass rates and score statistics (mean/stddev) are computed per constraint.
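The hard/soft split can be illustrated with a simplified budget case; the rule, weights, and types below are examples, not the actual scorers in scoring/:

```typescript
// Simplified sketch of the hard-constraint / soft-score split. The real
// scorers in server/src/evals/scoring/ use richer case types.

interface RunResult {
  categories: Record<string, number>; // allocated dollars per category
  income: number;
}

// Hard constraint (binary), derived from a system-prompt rule:
// the allocations must sum to income, within a small tolerance.
function sumEqualsIncome(run: RunResult, toleranceCents = 1): boolean {
  const total = Object.values(run.categories).reduce((a, b) => a + b, 0);
  return Math.abs(total - run.income) <= toleranceCents / 100;
}

// Soft score (0–1): adherence to a 50/30/20 target split, where 1 means
// the allocation shares match the targets exactly.
function fiftyThirtyTwentyScore(run: RunResult): number {
  const target = { needs: 0.5, wants: 0.3, savings: 0.2 };
  let deviation = 0;
  for (const [category, share] of Object.entries(target)) {
    deviation += Math.abs((run.categories[category] ?? 0) / run.income - share);
  }
  return Math.max(0, 1 - deviation);
}

// Collapse N non-deterministic runs into summary statistics.
function aggregate(scores: number[]): { mean: number; stddev: number } {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```

A failed hard constraint means the agent is broken regardless of its soft scores; the soft scores only rank the quality of runs that already satisfy every hard constraint.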

Structure:

server/src/evals/
  eval.types.ts              — case / score / result types
  eval.config.ts             — thresholds, tolerances, runs-per-case
  fixtures/mock-data.ts      — factory functions matching real service return types
  cases/{budget,debt,investing}.cases.ts  — 6/5/6 scenario cases
  scoring/shared.ts          — tolerance helpers, amortization math, rank correlation
  scoring/{budget,debt,investing}.scorer.ts  — pure scoring functions
  runners/aggregator.ts      — collapses N runs into stats
  runners/eval-runner.ts     — generic runner with service mock installation
  runners/{budget,debt,investing}.eval.ts  — vitest entry points

Commands:

npm run eval           # all agents
npm run eval:budget    # budget only
npm run eval:debt
npm run eval:investing

Evals use a separate vitest.eval.config.ts with 120s timeout and no DynamoDB global setup. They are never picked up by npm test.
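A plausible shape for that config, based on the constraints above (the include glob is an assumption, not the actual file contents):

```typescript
// Hypothetical sketch of vitest.eval.config.ts — the actual file may differ.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed glob: only eval entry points match, so `npm test`
    // (which uses the default vitest config) never picks them up.
    include: ["src/evals/runners/*.eval.ts"],
    testTimeout: 120_000, // 120s — live Anthropic calls are slow
    // No globalSetup here: evals mock the service layer, so the DynamoDB
    // setup used by the main test config is deliberately omitted.
  },
});
```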

4. Agent Logging

Structured pino logging wired into the Strands SDK and the agent service layer:

  • lib/agent-logger.ts — singleton that creates the pino root logger and calls configureLogging() from the Strands SDK, routing SDK-internal warnings/errors through our pipeline
  • Per-invocation child loggers bound with OTel-style fields: agent.type, agent.invocation_id, user.id
  • Log points: invocation start (debug), success (info with token/duration/cycle summary), failure (error), metrics persistence failure (warn), proposal lifecycle transitions (info)
  • Silent .catch(() => {}) replaced with a warn log — metrics loss is still acceptable but no longer invisible
  • invocationId added to AgentMetricsRecord so logs and stored metrics are joinable
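The child-logger pattern can be sketched without the pino dependency; in the real code, pino's child() performs the binding merge shown here:

```typescript
// Dependency-free sketch of the per-invocation child-logger pattern. In the
// real code, pino's child() merges bound fields into every emitted line;
// this toy logger does the same merge so the idea is visible.

type Fields = Record<string, unknown>;

interface Logger {
  child(extra: Fields): Logger;
  info(msg: string, extra?: Fields): Fields; // returns the emitted line
}

function makeLogger(bindings: Fields = {}): Logger {
  return {
    child(extra) {
      // Child loggers inherit and extend the parent's bound fields.
      return makeLogger({ ...bindings, ...extra });
    },
    info(msg, extra = {}) {
      // A real logger would serialize this object to stdout.
      return { ...bindings, ...extra, msg };
    },
  };
}

// Per-invocation child bound with the OTel-style correlation fields.
const root = makeLogger({ service: "agents" });
const log = root.child({
  "agent.type": "budget",
  "agent.invocation_id": "01J...ULID", // hypothetical ULID
  "user.id": "user-123",
});
```

Because the correlation fields are bound once at invocation start, every log line from that invocation carries them automatically, which is what makes logs joinable with the stored metrics via invocationId.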

5. Documentation

Comprehensive 16-section reference doc at server/docs/agents.md covering: overview, directory layout, HTTP API, the three agents, tools, proposal lifecycle, orchestration service, LLM configuration, logging, metrics, evaluation harness, types reference, service dependencies, configuration, testing, and design principles.


Testing

| Suite | Tests | Status |
| --- | --- | --- |
| Existing unit + integration tests | 1235 | ✅ All pass |
| New agents.repository.test.ts tests (saveAgentMetrics) | 6 | ✅ |
| New agents.service.test.ts tests (metrics save, fire-and-forget) | 4 | ✅ |
| New tools.test.ts (error handling, isEmpty, snapshot, descriptions) | 25 | ✅ |
| New eval scorer unit tests (shared + budget + debt + investing) | 89 | ✅ |
| New eval runner unit tests (aggregator + eval-runner) | 13 | ✅ |
| Total after PR | ~1372 | ✅ All pass |

All tests follow TDD — failing tests were written first, then implementations were added to make them pass.

Evals (npm run eval:*) are integration tests that hit the real Anthropic API and are run separately — they are not part of npm test.


Files changed

New files

| Path | Purpose |
| --- | --- |
| server/src/lib/agent-logger.ts | Pino singleton + Strands SDK log wiring |
| server/src/evals/** (17 files) | Full eval harness |
| server/vitest.eval.config.ts | Eval-specific vitest config |
| server/docs/agents.md | Comprehensive documentation |

Modified files

| Path | Change |
| --- | --- |
| server/src/db/tables.ts | Added AgentMetrics table + GSI constants |
| server/src/scripts/setup-tables.ts | Added AgentMetrics table creation |
| terraform/dynamodb.tf | Added aws_dynamodb_table.agent_metrics |
| terraform/variables.tf | Added agent_metrics_table variable |
| server/src/modules/agents/agents.types.ts | Added metrics types, invocationId |
| server/src/modules/agents/agents.repository.ts | Added saveAgentMetrics() |
| server/src/modules/agents/agents.service.ts | Metrics threading, logging, lifecycle logs |
| server/src/modules/agents/core/tools.ts | Error handling, isEmpty, snapshot tool, descriptions |
| server/src/modules/agents/core/{budget,debt,investing}-agent.ts | Return { output, metrics }, register snapshot tool |
| server/src/modules/agents/core/prompts.ts | Reference snapshot tool in STEP 1 |
| server/package.json | Added eval scripts |
| server/src/modules/agents/agents-unit-tests/*.test.ts | Updated for new types + new tests |

How to verify

  1. Unit tests: cd server && npm test — all 1372+ tests pass
  2. Type check: npx tsc --noEmit — no new errors (pre-existing tools.ts SDK type issues are unrelated)
  3. Local DynamoDB: npx tsx src/scripts/setup-tables.ts — the AgentMetrics table appears
  4. Evals (requires ANTHROPIC_API_KEY): npm run eval:budget — runs agent against scored scenarios
  5. Logging: npm run dev + POST /api/agent/budget — structured log lines with correlation IDs appear

Notes

  • No API contract changes — all existing endpoints return the same shapes
  • No migration needed — DynamoDB is schemaless; the new AgentMetrics table is additive
  • Evals are opt-in — they cost real API calls (~85 calls for a full run) and should be run manually or on a nightly schedule, never on every PR
  • Cost: the metrics table adds ~200 bytes per proposal; the eval results directory is .gitignored

YounBrand merged commit a372c10 into main on Apr 9, 2026 (1 check passed)