Merged
YounBrand
approved these changes
Apr 9, 2026
Enhance Agents: Metrics, Tools, Evals, Logging & Documentation
Summary
Adds end-to-end observability, evaluation infrastructure, and comprehensive documentation to the agents module. The three LLM-powered agents (budget, debt, investing) now have persistent metrics, structured logging, an offline eval harness, improved tool definitions, and a 16-section reference doc — all without changing the existing API contract or breaking any of the 1235 existing tests.
What changed
1. Agent Metrics — Separate DynamoDB Table
Every `agent.invoke()` returns a `metrics` object that was previously discarded. Now it's persisted to a new `AgentMetrics` table after every successful invocation.

What's stored per invocation:
Key design decisions:
- `AgentMetrics` with `userId` (PK) + `metricId` ULID (SK), plus a GSI on `agentType + createdAt` for trend queries
- `invocationId` ULID (added later in the logging work)

Files:
- `server/src/db/tables.ts`: table + GSI constants
- `server/src/scripts/setup-tables.ts`: local table creation
- `terraform/dynamodb.tf` + `terraform/variables.tf`: infra definition
- `server/src/modules/agents/agents.types.ts`: `AgentMetricsRecord`, `StoredToolMetrics`, `AgentInvokeResult<T>`
- `server/src/modules/agents/agents.repository.ts`: `saveAgentMetrics()`
- `server/src/modules/agents/core/{budget,debt,investing}-agent.ts`: return `{ output, metrics }` instead of raw output
- `server/src/modules/agents/agents.service.ts`: `buildMetricsRecord()` + fire-and-forget save in all three `run*Agent` functions

2. Tool Improvements
Code review-driven improvements to `server/src/modules/agents/core/tools.ts`:
- Removed the `JSON.parse(JSON.stringify())` hack from 3 tool callbacks; they now return objects directly
- Tools return `{ error, message, retryable }` instead of throwing, so the agent can retry or switch strategy
- Added `isEmpty` flags to all collection returns, which prevents the agent from hallucinating about empty vs. missing data
- Added a `getUserFinancialSnapshot` tool that fetches accounts + holdings + liabilities in parallel via `Promise.allSettled` and tolerates partial failures. Registered in all three agents.
- Updated `core/prompts.ts` to reference the snapshot tool

3. Evaluation Harness
Offline eval system for validating agent decision quality against known financial scenarios. Evals hit the real Anthropic API but mock the service layer so no DB/Plaid I/O occurs.
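The idea of mocking the service layer while still scoring real agent output can be sketched as below. This is only an illustration: `FinancialService`, `Scenario`, `makeMockService`, `runScenario`, and `stubAgent` are hypothetical names that do not come from this PR, and the agent step is replaced by a deterministic stub so the sketch runs without an API key.

```typescript
// Hypothetical shapes for illustration only; none of these names come from the PR.
interface Account { id: string; balance: number }

interface FinancialService {
  getAccounts(userId: string): Promise<Account[]>;
}

interface Scenario {
  name: string;
  accounts: Account[];
  // Upper bound the proposed monthly budget must respect to pass.
  maxBudget: number;
}

// The mock serves fixture data from memory, so no DB or Plaid call is made.
function makeMockService(scenario: Scenario): FinancialService {
  return {
    getAccounts: async () => scenario.accounts,
  };
}

// Stand-in for the agent. In a real harness this step would call the LLM.
type AgentFn = (
  service: FinancialService,
  userId: string,
) => Promise<{ monthlyBudget: number }>;

interface EvalResult { scenario: string; passed: boolean; proposed: number }

// Runs one scenario: build the mock, invoke the agent, score the output.
async function runScenario(agent: AgentFn, scenario: Scenario): Promise<EvalResult> {
  const service = makeMockService(scenario);
  const output = await agent(service, "eval-user");
  return {
    scenario: scenario.name,
    passed: output.monthlyBudget >= 0 && output.monthlyBudget <= scenario.maxBudget,
    proposed: output.monthlyBudget,
  };
}

// Deterministic stub agent: proposes one tenth of the total balance.
const stubAgent: AgentFn = async (service, userId) => {
  const accounts = await service.getAccounts(userId);
  const total = accounts.reduce((sum, a) => sum + a.balance, 0);
  return { monthlyBudget: total / 10 };
};
```

Swapping `stubAgent` for a function that invokes the real agent keeps the scoring logic unchanged, which is the main reason to inject the service rather than patch it globally.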
Architecture:
Structure:
Commands:
Evals use a separate `vitest.eval.config.ts` with 120s timeout and no DynamoDB global setup. They are never picked up by `npm test`.

4. Agent Logging
Structured pino logging wired into the Strands SDK and the agent service layer:
- `lib/agent-logger.ts`: singleton that creates the pino root logger and calls `configureLogging()` from the Strands SDK, routing SDK-internal warnings/errors through our pipeline
- Correlation fields on log lines: `agent.type`, `agent.invocation_id`, `user.id`
- `.catch(() => {})` replaced with a warn log: metrics loss is still acceptable but no longer invisible
- `invocationId` added to `AgentMetricsRecord` so logs and stored metrics are joinable

5. Documentation
Comprehensive 16-section reference doc at `server/docs/agents.md` covering: overview, directory layout, HTTP API, the three agents, tools, proposal lifecycle, orchestration service, LLM configuration, logging, metrics, evaluation harness, types reference, service dependencies, configuration, testing, and design principles.

Testing
- `agents.repository.test.ts` tests (`saveAgentMetrics`)
- `agents.service.test.ts` tests (metrics save, fire-and-forget)
- `tools.test.ts` (error handling, isEmpty, snapshot, descriptions)

All tests follow TDD: failing tests were written first, then implementations were added to make them pass.
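To make the tool behaviours covered by `tools.test.ts` concrete (structured errors, `isEmpty` flags, a snapshot that tolerates partial failures), here is a minimal sketch. The shapes `ToolError` and `Collection`, the helper `toSection`, and the function `getSnapshot` are hypothetical; the PR's actual tool signatures are not shown in this description, and treating every failure as retryable is an assumption of the sketch.

```typescript
// Hypothetical fetchers and shapes; the PR's real tool signatures are not shown.
interface ToolError { error: true; message: string; retryable: boolean }
interface Collection<T> { items: T[]; isEmpty: boolean }

type Fetcher<T> = (userId: string) => Promise<T[]>;

// Convert one settled result: data becomes a collection with an explicit
// isEmpty flag, a rejection becomes a structured error instead of a throw.
function toSection<T>(result: PromiseSettledResult<T[]>): Collection<T> | ToolError {
  if (result.status === "fulfilled") {
    return { items: result.value, isEmpty: result.value.length === 0 };
  }
  const message =
    result.reason instanceof Error ? result.reason.message : String(result.reason);
  // Assumption: every fetch failure is treated as retryable in this sketch.
  return { error: true, message, retryable: true };
}

async function getSnapshot(
  userId: string,
  fetchers: {
    accounts: Fetcher<unknown>;
    holdings: Fetcher<unknown>;
    liabilities: Fetcher<unknown>;
  },
) {
  // The promise returned by allSettled never rejects, so one failing
  // source cannot discard the results of the other two.
  const [accounts, holdings, liabilities] = await Promise.allSettled([
    fetchers.accounts(userId),
    fetchers.holdings(userId),
    fetchers.liabilities(userId),
  ]);
  return {
    accounts: toSection(accounts),
    holdings: toSection(holdings),
    liabilities: toSection(liabilities),
  };
}
```

A test in this style would feed one rejecting fetcher and assert that the other two sections still carry data, and that an empty array is reported with `isEmpty: true` rather than as an error.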
Evals (`npm run eval:*`) are integration tests that hit the real Anthropic API and are run separately. They are not part of `npm test`.

Files changed
New files
- `server/src/lib/agent-logger.ts`
- `server/src/evals/**` (17 files)
- `server/vitest.eval.config.ts`
- `server/docs/agents.md`

Modified files
| File | Change |
| --- | --- |
| `server/src/db/tables.ts` | `AgentMetrics` table + GSI constants |
| `server/src/scripts/setup-tables.ts` | `AgentMetrics` table creation |
| `terraform/dynamodb.tf` | `aws_dynamodb_table.agent_metrics` |
| `terraform/variables.tf` | `agent_metrics_table` variable |
| `server/src/modules/agents/agents.types.ts` | `invocationId` |
| `server/src/modules/agents/agents.repository.ts` | `saveAgentMetrics()` |
| `server/src/modules/agents/agents.service.ts` | |
| `server/src/modules/agents/core/tools.ts` | |
| `server/src/modules/agents/core/{budget,debt,investing}-agent.ts` | `{ output, metrics }`, register snapshot tool |
| `server/src/modules/agents/core/prompts.ts` | |
| `server/package.json` | |
| `server/src/modules/agents/agents-unit-tests/*.test.ts` | |

How to verify
- `cd server && npm test`: all 1372+ tests pass
- `npx tsc --noEmit`: no new errors (pre-existing `tools.ts` SDK type issues are unrelated)
- `npx tsx src/scripts/setup-tables.ts`: `AgentMetrics` table appears
- With `ANTHROPIC_API_KEY` set, `npm run eval:budget`: runs agent against scored scenarios
- `npm run dev` + `POST /api/agent/budget`: structured log lines with correlation IDs appear

Notes
- `AgentMetrics` table is additive
- `.gitignore`d
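For reviewers, the fire-and-forget metrics save with a warn log (sections 1 and 4) can be sketched roughly as follows. This is not the code in `agents.service.ts`: `MetricsRecord`, `WarnLogger`, `SaveFn`, and `saveMetricsInBackground` are hypothetical names, the record fields are limited to those mentioned in this description, and the `agentType` values are assumed. The logger interface mirrors the `warn(object, message)` call shape that pino supports, so a pino logger could be passed in.

```typescript
// Hypothetical record shape; only userId, metricId, invocationId and
// agentType are mentioned in the PR, and the agentType values are assumed.
interface MetricsRecord {
  userId: string;
  metricId: string;
  invocationId: string;
  agentType: "budget" | "debt" | "investing";
}

// Minimal logger interface matching the warn(object, message) call shape.
interface WarnLogger {
  warn(fields: Record<string, unknown>, msg: string): void;
}

// Assumes save is an async function, so failures surface as rejections.
type SaveFn = (record: MetricsRecord) => Promise<void>;

// Starts the save without blocking the caller. A failure is logged and
// swallowed, so a metrics outage cannot fail the agent response. The
// promise is returned only so tests can await it; callers ignore it.
function saveMetricsInBackground(
  save: SaveFn,
  record: MetricsRecord,
  log: WarnLogger,
): Promise<void> {
  return save(record).catch((err: unknown) => {
    log.warn(
      {
        "agent.type": record.agentType,
        "agent.invocation_id": record.invocationId,
        "user.id": record.userId,
        err: err instanceof Error ? err.message : String(err),
      },
      "agent metrics save failed",
    );
  });
}
```

In a request handler the call would typically be written as `void saveMetricsInBackground(save, record, logger)`, which makes the intent to not await explicit. Because the same `invocationId` appears in both the log line and the stored record, a failed save can still be traced back to its invocation.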