Feat/enhance agents and add docs #33

Merged
YounBrand merged 8 commits into main from feat/enhance-agents-and-add-docs on Apr 9, 2026
Conversation

CjKhaled (Collaborator) commented Apr 8, 2026

Enhance Agents: Metrics, Tools, Evals, Logging & Documentation

Summary

Adds end-to-end observability, evaluation infrastructure, and comprehensive documentation to the agents module. The three LLM-powered agents (budget, debt, investing) now have persistent metrics, structured logging, an offline eval harness, improved tool definitions, and a 16-section reference doc — all without changing the existing API contract or breaking any of the 1235 existing tests.


What changed

1. Agent Metrics — Separate DynamoDB Table

Every agent.invoke() returns a metrics object that was previously discarded. Now it's persisted to a new AgentMetrics table after every successful invocation.

What's stored per invocation:

  • Token usage (input / output / cache read / cache write) for cost monitoring
  • Wall-clock duration + model latency for latency benchmarking
  • Cycle count + average cycle duration for reasoning efficiency tracking
  • Per-tool stats (callCount, successCount, errorCount, averageTimeMs, successRate) for tool performance analysis

Key design decisions:

  • Table: AgentMetrics with userId (PK) + metricId ULID (SK), plus a GSI on agentType + createdAt for trend queries
  • Fire-and-forget: metrics save failure never blocks proposal creation — the proposal is the critical path
  • Correlated with logs via invocationId ULID (added later in the logging work)
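The fire-and-forget decision above can be sketched in a few lines. The record shape below is trimmed down and the stub stands in for the real repository call, which writes to the AgentMetrics DynamoDB table:

```typescript
// Illustrative sketch of the fire-and-forget metrics save. The stub and the
// reduced record shape exist only to show the pattern; the real
// AgentMetricsRecord carries more fields.

interface AgentMetricsRecord {
  userId: string;                               // partition key
  metricId: string;                             // ULID sort key
  agentType: "budget" | "debt" | "investing";
  createdAt: string;
  inputTokens: number;
  outputTokens: number;
}

// Stand-in for the repository call; rejects to simulate a save failure.
async function saveAgentMetrics(record: AgentMetricsRecord): Promise<void> {
  if (record.inputTokens < 0) throw new Error("validation failed");
}

const warnings: string[] = []; // stand-in for logger.warn

// The proposal (critical path) returns immediately; a metrics save failure
// is logged but never propagated to the caller.
function persistMetricsInBackground(record: AgentMetricsRecord): void {
  void saveAgentMetrics(record).catch((err: Error) => {
    warnings.push(`metrics save failed: ${err.message}`);
  });
}
```

The `void` prefix makes the intent explicit: the promise is deliberately not awaited, and the `.catch` guarantees a failed save can never become an unhandled rejection.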

Files:

  • server/src/db/tables.ts — table + GSI constants
  • server/src/scripts/setup-tables.ts — local table creation
  • terraform/dynamodb.tf + terraform/variables.tf — infra definition
  • server/src/modules/agents/agents.types.ts — AgentMetricsRecord, StoredToolMetrics, AgentInvokeResult<T>
  • server/src/modules/agents/agents.repository.ts — saveAgentMetrics()
  • server/src/modules/agents/core/{budget,debt,investing}-agent.ts — return { output, metrics } instead of raw output
  • server/src/modules/agents/agents.service.ts — buildMetricsRecord() + fire-and-forget save in all three run*Agent functions

2. Tool Improvements

Code review–driven improvements to server/src/modules/agents/core/tools.ts:

  • Removed JSON.parse(JSON.stringify()) hack from 3 tool callbacks — returns objects directly
  • Added structured error handling — all tools return { error, message, retryable } instead of throwing, so the agent can retry or switch strategy
  • Added isEmpty flags to all collection returns — prevents the agent from hallucinating about empty vs. missing data
  • Improved descriptions — added "Call once per session", "Do NOT use this for…", and sequencing hints to reduce redundant tool calls
  • New getUserFinancialSnapshot tool — fetches accounts + holdings + liabilities in parallel via Promise.allSettled, tolerates partial failures. Registered in all three agents.
  • Updated system prompts in core/prompts.ts to reference the snapshot tool
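The structured-error and isEmpty conventions can be sketched with a single hypothetical tool; the name and exact field set below are illustrative, not the actual tools.ts definitions:

```typescript
// Illustrative sketch: tools return errors as data instead of throwing, and
// collection results carry an explicit isEmpty flag. Names are hypothetical
// stand-ins for the real tools.ts definitions.

interface ToolError {
  error: true;
  message: string;
  retryable: boolean; // lets the agent choose: retry vs. switch strategy
}

interface LiabilityList {
  isEmpty: boolean; // "user has no liabilities" vs. "fetch failed"
  liabilities: { name: string; balance: number }[];
}

async function getLiabilitiesTool(
  fetchLiabilities: () => Promise<{ name: string; balance: number }[]>
): Promise<LiabilityList | ToolError> {
  try {
    const liabilities = await fetchLiabilities();
    return { isEmpty: liabilities.length === 0, liabilities };
  } catch (err) {
    // Returned (not thrown) so the model sees the failure and can adapt.
    return { error: true, message: (err as Error).message, retryable: true };
  }
}
```

The explicit isEmpty flag matters because, to a model reading serialized JSON, an empty array and a failed fetch can look identical.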

3. Evaluation Harness

Offline eval system for validating agent decision quality against known financial scenarios. Evals hit the real Anthropic API but mock the service layer so no DB/Plaid I/O occurs.

Architecture:

  • Hard constraints (binary) — derived from system prompt rules (e.g. "sum must equal income", "investments must be nonzero"). Failing = agent is broken.
  • Soft scores (weighted 0–1) — grade decision quality (e.g. 50/30/20 adherence, avalanche ordering correlation, account priority waterfall). Aggregated across N runs.
  • Non-determinism handling — each case runs 5 times by default; pass rates and score statistics (mean/stddev) are computed per constraint.
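The hard/soft split can be illustrated with a simplified budget case; the rule, weights, and types below are examples, not the actual scorers in scoring/:

```typescript
// Simplified sketch of the hard-constraint / soft-score split. The real
// scorers in server/src/evals/scoring/ use richer case types.

interface RunResult {
  categories: Record<string, number>; // allocated dollars per category
  income: number;
}

// Hard constraint (binary), derived from a system-prompt rule:
// the allocations must sum to income, within a small tolerance.
function sumEqualsIncome(run: RunResult, toleranceCents = 1): boolean {
  const total = Object.values(run.categories).reduce((a, b) => a + b, 0);
  return Math.abs(total - run.income) <= toleranceCents / 100;
}

// Soft score (0–1): adherence to a 50/30/20 target split, where 1 means
// the allocation shares match the targets exactly.
function fiftyThirtyTwentyScore(run: RunResult): number {
  const target = { needs: 0.5, wants: 0.3, savings: 0.2 };
  let deviation = 0;
  for (const [category, share] of Object.entries(target)) {
    deviation += Math.abs((run.categories[category] ?? 0) / run.income - share);
  }
  return Math.max(0, 1 - deviation);
}

// Collapse N non-deterministic runs into summary statistics.
function aggregate(scores: number[]): { mean: number; stddev: number } {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```

A failed hard constraint means the agent is broken regardless of its soft scores; the soft scores only rank the quality of runs that already satisfy every hard constraint.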

Structure:

server/src/evals/
  eval.types.ts              — case / score / result types
  eval.config.ts             — thresholds, tolerances, runs-per-case
  fixtures/mock-data.ts      — factory functions matching real service return types
  cases/{budget,debt,investing}.cases.ts  — 6/5/6 scenario cases
  scoring/shared.ts          — tolerance helpers, amortization math, rank correlation
  scoring/{budget,debt,investing}.scorer.ts  — pure scoring functions
  runners/aggregator.ts      — collapses N runs into stats
  runners/eval-runner.ts     — generic runner with service mock installation
  runners/{budget,debt,investing}.eval.ts  — vitest entry points

Commands:

npm run eval           # all agents
npm run eval:budget    # budget only
npm run eval:debt
npm run eval:investing

Evals use a separate vitest.eval.config.ts with 120s timeout and no DynamoDB global setup. They are never picked up by npm test.
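A plausible shape for that config, based on the constraints above (the include glob is an assumption, not the actual file contents):

```typescript
// Hypothetical sketch of vitest.eval.config.ts — the actual file may differ.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed glob: only eval entry points match, so `npm test`
    // (which uses the default vitest config) never picks them up.
    include: ["src/evals/runners/*.eval.ts"],
    testTimeout: 120_000, // 120s — live Anthropic calls are slow
    // No globalSetup here: evals mock the service layer, so the DynamoDB
    // setup used by the main test config is deliberately omitted.
  },
});
```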

4. Agent Logging

Structured pino logging wired into the Strands SDK and the agent service layer:

  • lib/agent-logger.ts — singleton that creates the pino root logger and calls configureLogging() from the Strands SDK, routing SDK-internal warnings/errors through our pipeline
  • Per-invocation child loggers bound with OTel-style fields: agent.type, agent.invocation_id, user.id
  • Log points: invocation start (debug), success (info with token/duration/cycle summary), failure (error), metrics persistence failure (warn), proposal lifecycle transitions (info)
  • Silent .catch(() => {}) replaced with a warn log — metrics loss is still acceptable but no longer invisible
  • invocationId added to AgentMetricsRecord so logs and stored metrics are joinable
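The child-logger pattern can be sketched without the pino dependency; in the real code, pino's child() performs the binding merge shown here:

```typescript
// Dependency-free sketch of the per-invocation child-logger pattern. In the
// real code, pino's child() merges bound fields into every emitted line;
// this toy logger does the same merge so the idea is visible.

type Fields = Record<string, unknown>;

interface Logger {
  child(extra: Fields): Logger;
  info(msg: string, extra?: Fields): Fields; // returns the emitted line
}

function makeLogger(bindings: Fields = {}): Logger {
  return {
    child(extra) {
      // Child loggers inherit and extend the parent's bound fields.
      return makeLogger({ ...bindings, ...extra });
    },
    info(msg, extra = {}) {
      // A real logger would serialize this object to stdout.
      return { ...bindings, ...extra, msg };
    },
  };
}

// Per-invocation child bound with the OTel-style correlation fields.
const root = makeLogger({ service: "agents" });
const log = root.child({
  "agent.type": "budget",
  "agent.invocation_id": "01J...ULID", // hypothetical ULID
  "user.id": "user-123",
});
```

Because the correlation fields are bound once at invocation start, every log line from that invocation carries them automatically, which is what makes logs joinable with the stored metrics via invocationId.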

5. Documentation

Comprehensive 16-section reference doc at server/docs/agents.md covering: overview, directory layout, HTTP API, the three agents, tools, proposal lifecycle, orchestration service, LLM configuration, logging, metrics, evaluation harness, types reference, service dependencies, configuration, testing, and design principles.


Testing

| Suite | Tests | Status |
| --- | --- | --- |
| Existing unit + integration tests | 1235 | ✅ All pass |
| New agents.repository.test.ts tests (saveAgentMetrics) | 6 | ✅ |
| New agents.service.test.ts tests (metrics save, fire-and-forget) | 4 | ✅ |
| New tools.test.ts (error handling, isEmpty, snapshot, descriptions) | 25 | ✅ |
| New eval scorer unit tests (shared + budget + debt + investing) | 89 | ✅ |
| New eval runner unit tests (aggregator + eval-runner) | 13 | ✅ |
| Total after PR | ~1372 | ✅ All pass |

All tests follow TDD — failing tests were written first, then implementations were added to make them pass.

Evals (npm run eval:*) are integration tests that hit the real Anthropic API and are run separately — they are not part of npm test.


Files changed

New files

| Path | Purpose |
| --- | --- |
| server/src/lib/agent-logger.ts | Pino singleton + Strands SDK log wiring |
| server/src/evals/** (17 files) | Full eval harness |
| server/vitest.eval.config.ts | Eval-specific vitest config |
| server/docs/agents.md | Comprehensive documentation |

Modified files

| Path | Change |
| --- | --- |
| server/src/db/tables.ts | Added AgentMetrics table + GSI constants |
| server/src/scripts/setup-tables.ts | Added AgentMetrics table creation |
| terraform/dynamodb.tf | Added aws_dynamodb_table.agent_metrics |
| terraform/variables.tf | Added agent_metrics_table variable |
| server/src/modules/agents/agents.types.ts | Added metrics types, invocationId |
| server/src/modules/agents/agents.repository.ts | Added saveAgentMetrics() |
| server/src/modules/agents/agents.service.ts | Metrics threading, logging, lifecycle logs |
| server/src/modules/agents/core/tools.ts | Error handling, isEmpty, snapshot tool, descriptions |
| server/src/modules/agents/core/{budget,debt,investing}-agent.ts | Return { output, metrics }, register snapshot tool |
| server/src/modules/agents/core/prompts.ts | Reference snapshot tool in STEP 1 |
| server/package.json | Added eval scripts |
| server/src/modules/agents/agents-unit-tests/*.test.ts | Updated for new types + new tests |

How to verify

  1. Unit tests: cd server && npm test — all 1372+ tests pass
  2. Type check: npx tsc --noEmit — no new errors (pre-existing tools.ts SDK type issues are unrelated)
  3. Local DynamoDB: npx tsx src/scripts/setup-tables.ts — the AgentMetrics table appears
  4. Evals (requires ANTHROPIC_API_KEY): npm run eval:budget — runs agent against scored scenarios
  5. Logging: npm run dev + POST /api/agent/budget — structured log lines with correlation IDs appear

Notes

  • No API contract changes — all existing endpoints return the same shapes
  • No migration needed — DynamoDB is schemaless; the new AgentMetrics table is additive
  • Evals are opt-in — they cost real API calls (~85 calls for a full run) and should be run manually or on a nightly schedule, never on every PR
  • Cost: the metrics table adds ~200 bytes per proposal; the eval results directory is .gitignored

YounBrand merged commit a372c10 into main on Apr 9, 2026 (1 check passed)