From 01dc8f177d5f9f79f18581d6c87b1977a27d6e22 Mon Sep 17 00:00:00 2001 From: Meghana Ashok Date: Wed, 18 Mar 2026 11:16:41 -0700 Subject: [PATCH 1/2] Add Agentcore evaluation integration documents --- docs/AGENTCORE_EVALUATIONS_GUIDE.md | 1187 +++++++++++++++++++++++++++ 1 file changed, 1187 insertions(+) create mode 100644 docs/AGENTCORE_EVALUATIONS_GUIDE.md diff --git a/docs/AGENTCORE_EVALUATIONS_GUIDE.md b/docs/AGENTCORE_EVALUATIONS_GUIDE.md new file mode 100644 index 0000000..0982644 --- /dev/null +++ b/docs/AGENTCORE_EVALUATIONS_GUIDE.md @@ -0,0 +1,1187 @@ +# Amazon Bedrock AgentCore Evaluations Guide + +This guide covers how to use Amazon Bedrock AgentCore's evaluation capabilities to assess and monitor AI agent performance. It includes instructions for launching on-demand evaluations, configuring online (live) evaluations, understanding available metrics, downloading results, using AI-powered analysis to improve your agent, and building dashboards. + +--- + +## Table of Contents + +1. [Prerequisites](#prerequisites) +2. [SDK Installation](#sdk-installation) +3. [Online (Live) Evaluations](#online-live-evaluations) +4. [On-Demand Evaluations](#on-demand-evaluations) +5. [Built-in Evaluators Reference](#built-in-evaluators-reference) +6. [Custom Evaluators](#custom-evaluators) +7. [Evaluation Results & Format](#evaluation-results--format) +8. [Downloading & Querying Results](#downloading--querying-results) +9. [AI-Powered Analysis](#ai-powered-analysis) +10. [Retrieving Sessions, Traces & Spans (Observability)](#retrieving-sessions-traces--spans-observability) +11. [Building a Dashboard](#building-a-dashboard) +12. [Best Practices](#best-practices) +13. [References](#references) + +--- + +## Prerequisites + +- An Amazon Bedrock AgentCore runtime with a deployed agent +- AWS credentials with permissions for: + - `bedrock-agentcore:*` (control and data plane) + - `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:PutRolePolicy`, `iam:GetRole`, `iam:PassRole` (scoped to `AgentCoreEvalsSDK-*` roles) + - `logs:DescribeLogGroups`, `logs:FilterLogEvents`, `logs:GetLogEvents` (for evaluation result log groups) + - `bedrock:InvokeModel` (for LLM-as-judge evaluations) +- Python 3.10+ with boto3 + +> **Region availability:** Cross-region inference profiles (e.g., `us.anthropic.claude-sonnet-4-5-20250929-v1:0`) are system-defined and use an empty account ID in the ARN. Your IAM policy must include both `arn:aws:bedrock:*:*:inference-profile/*` and `arn:aws:bedrock:*::inference-profile/*` to cover account-level and system-level profiles. Check [Bedrock supported regions](https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-regions.html) for model availability. + +### IAM Policy Example + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "bedrock-agentcore:CreateEvaluator", + "bedrock-agentcore:ListEvaluators", + "bedrock-agentcore:GetEvaluator", + "bedrock-agentcore:UpdateEvaluator", + "bedrock-agentcore:DeleteEvaluator", + "bedrock-agentcore:CreateOnlineEvaluation", + "bedrock-agentcore:ListOnlineEvaluations", + "bedrock-agentcore:GetOnlineEvaluation", + "bedrock-agentcore:UpdateOnlineEvaluation", + "bedrock-agentcore:DeleteOnlineEvaluation", + "bedrock-agentcore:Evaluate", + "bedrock-agentcore:ListEvaluationResults" + ], + "Resource": "*" + }, + { + "Effect": "Allow", + "Action": [ + "iam:CreateRole", + "iam:AttachRolePolicy", + "iam:PutRolePolicy", + "iam:GetRole", + "iam:PassRole" + ], + "Resource": "arn:aws:iam::*:role/AgentCoreEvalsSDK-*" + }, + { + "Effect": "Allow", + "Action": [ + "logs:DescribeLogGroups", + "logs:FilterLogEvents", + "logs:GetLogEvents" + ], + "Resource": "arn:aws:logs:*:*:log-group:/aws/bedrock-agentcore/evaluations/*" + }, + { + "Effect": "Allow", + "Action": [ + "bedrock:InvokeModel" + ], + "Resource": [ + "arn:aws:bedrock:*:*:inference-profile/*", + "arn:aws:bedrock:*::inference-profile/*" + ] + } + ] +} +``` + +--- + +## SDK Installation + +Install the AgentCore starter toolkit: + +```bash +pip install bedrock-agentcore-starter-toolkit +``` + +Initialize the evaluation client: + +```python +from bedrock_agentcore_starter_toolkit import Evaluation + +eval_client = Evaluation(region="us-east-1") +``` + +--- + +## Online (Live) Evaluations + +Online evaluations continuously monitor your agent by sampling a percentage of live sessions and running evaluators automatically. Results are written to CloudWatch Logs. + +```mermaid +sequenceDiagram + participant You + participant SDK as Evaluation SDK + participant AC as AgentCore + participant CW as CloudWatch Logs + + You->>SDK: create_online_config(agent_id, evaluators, sampling_rate) + SDK->>AC: CreateOnlineEvaluationConfig + AC-->>SDK: config_id, execution_status + SDK-->>You: Config created ✓ + + loop Every sampled session + AC->>CW: Read agent runtime logs + AC->>AC: Run LLM-as-judge evaluators + AC->>CW: Write evaluation results + end + + You->>SDK: list_online_configs() + SDK->>AC: ListOnlineEvaluationConfigs + AC-->>You: configs[] +``` + +### Create an Online Evaluation Config + +Use this when you want to start continuously monitoring your agent's quality in a given environment. + +> **Note:** `auto_create_execution_role=True` is what triggers automatic IAM role creation. The SDK creates a role named `AgentCoreEvalsSDK-{region}-{hash}` with permissions to read your agent's CloudWatch logs and invoke Bedrock models for LLM-as-judge scoring. + +```python +from bedrock_agentcore_starter_toolkit import Evaluation + +eval_client = Evaluation(region="us-east-1") + +# agent_id is the last segment of your runtime ARN +# e.g., "my-agent-abc123" from "arn:aws:bedrock-agentcore:us-east-1:123456789:runtime/my-agent-abc123" +agent_id = "my-agent-abc123" # Replace with your actual agent ID + +response = eval_client.create_online_config( + agent_id=agent_id, + config_name="production_eval", + sampling_rate=10.0, # Evaluate 10% of sessions + evaluator_list=[ + "Builtin.Helpfulness", + "Builtin.Correctness", + "Builtin.GoalSuccessRate", + ], + config_description="Production evaluation with core metrics", + auto_create_execution_role=True, # Creates IAM role automatically + enable_on_create=True, # Start evaluating immediately +) + +config_id = response["onlineEvaluationConfigId"] # Save this — you'll need it to query metrics +print(f"Created config: {config_id}") +print(f"Status: {response['executionStatus']}") +``` + +> **Timing:** Config creation typically completes in 10–30 seconds. Evaluation results start appearing in CloudWatch 2–5 minutes after your agent handles its next session. + +### Enable / Disable an Online Evaluation + +Use this to temporarily pause evaluations without losing the configuration (e.g., during maintenance windows or cost control). + +```python +# Disable evaluation (pause without deleting) +eval_client.update_online_config( + config_id="your-config-id", # Replace with actual config ID + execution_status="DISABLED" +) + +# Re-enable evaluation +eval_client.update_online_config( + config_id="your-config-id", + execution_status="ENABLED" +) +``` + +### Update Sampling Rate + +Use this to adjust cost vs. coverage tradeoff without recreating the config. + +```python +eval_client.update_online_config( + config_id="your-config-id", + sampling_rate=5.0 # Reduce to 5% for cost savings +) +``` + +### List All Configs + +```python +configs = eval_client.list_online_configs() +for config in configs.get("onlineEvaluationConfigs", []): + print(f"{config['onlineEvaluationConfigName']}: {config['executionStatus']} " + f"(sampling: {config.get('samplingRate', 'N/A')}%)") +``` + +### Delete a Config + +```python +eval_client.delete_online_config(config_id="your-config-id") +``` + +--- + +## On-Demand Evaluations + +On-demand evaluations let you evaluate a specific session immediately and get results back synchronously. The SDK retrieves the session's traces from AgentCore Observability automatically. + +```mermaid +sequenceDiagram + participant You + participant SDK as Evaluation SDK + participant Obs as Observability + participant AC as AgentCore + + You->>SDK: eval_client.run(agent_id, session_id, evaluators) + SDK->>Obs: Retrieve session traces & spans + Obs-->>SDK: trace_data + SDK->>AC: Evaluate(spans, evaluator_id) + AC->>AC: Run LLM-as-judge + AC-->>SDK: score, label, explanation + SDK-->>You: Results (synchronous) +``` + +### Evaluate a Single Session + +Use this to spot-check a specific conversation, debug a user-reported issue, or validate agent behavior after a prompt change. + +> **Timing:** Each evaluator takes 5–15 seconds per session. Running 3 evaluators on one session typically completes in 15–45 seconds. + +```python +from bedrock_agentcore_starter_toolkit import Evaluation + +eval_client = Evaluation(region="us-east-1") + +agent_id = "my-agent-abc123" # Replace with your actual agent ID +session_id = "session-456" # Replace with actual session ID + +# Run one or more evaluators against a session +results = eval_client.run( + agent_id=agent_id, + session_id=session_id, + evaluators=["Builtin.Helpfulness", "Builtin.Correctness"] +) + +for result in results.results: + print(f"Evaluator: {result.evaluator_name}") + print(f" Score: {result.value:.2f}") + print(f" Label: {result.label}") + print(f" Explanation: {result.explanation}") + if hasattr(result, 'token_usage') and result.token_usage: + print(f" Tokens: {result.token_usage}") + print() +``` + +### Evaluate Using the Low-Level boto3 API + +Use this when you need fine-grained control — for example, evaluating specific spans rather than an entire session, or when you've already retrieved spans yourself. + +```python +import boto3 + +bedrock_agentcore = boto3.client("bedrock-agentcore") + +response = bedrock_agentcore.evaluate( + evaluatorId="Builtin.Helpfulness", + evaluationInput={ + "sessionSpans": [ + { + "traceId": "abc123", + "spanId": "def456", + "name": "agent_response", + "startTimeUnixNano": 1708128000000000000, + "endTimeUnixNano": 1708128001000000000, + "attributes": { + "session.id": "session-456" + }, + "status": {"code": "OK"}, + "scope": {"name": "bedrock-agentcore"} + } + ] + }, + evaluationTarget={ + "traceIds": ["abc123"], # Optional: scope to specific traces + "spanIds": ["def456"] # Optional: scope to specific spans + } +) + +for result in response.get("evaluationResults", []): + print(f"Score: {result['value']}, Label: {result['label']}") + print(f"Explanation: {result['explanation']}") +``` + +### Batch Evaluation (Multiple Sessions) + +Use this to evaluate a set of sessions after a prompt change or deployment, to compare before/after quality. + +> **Timing:** Batch evaluation is sequential — expect ~10 seconds per session per evaluator. Evaluating 10 sessions with 2 evaluators takes roughly 3–4 minutes. + +```python +agent_id = "my-agent-abc123" +session_ids = ["session-001", "session-002", "session-003"] # Replace with actual IDs +evaluators = ["Builtin.Helpfulness", "Builtin.Correctness"] + +all_results = [] +for session_id in session_ids: + try: + results = eval_client.run( + agent_id=agent_id, + session_id=session_id, + evaluators=evaluators + ) + all_results.append({ + "session_id": session_id, + "results": [ + { + "evaluator": r.evaluator_name, + "score": r.value, + "label": r.label, + "explanation": r.explanation + } + for r in results.results + ] + }) + except Exception as e: + print(f"Failed to evaluate {session_id}: {e}") + +print(f"Evaluated {len(all_results)} sessions successfully") +``` + +--- + +## Built-in Evaluators Reference + +AgentCore provides 15 built-in evaluators across three evaluation levels. + +### Evaluation Level Decision Tree + +| Your Goal | Use This Level | Why | +|---|---|---| +| Measure overall task completion | SESSION | Evaluates the entire conversation end-to-end | +| Assess individual response quality | TRACE | Evaluates each agent turn independently | +| Validate tool usage correctness | TOOL_CALL | Evaluates specific tool invocations | + +### Quality & Relevance (TRACE level) + +| Evaluator ID | What It Measures | +|---|---| +| `Builtin.Helpfulness` | How useful and valuable the response is from the user's perspective | +| `Builtin.Correctness` | Whether the information in the response is factually accurate | +| `Builtin.Faithfulness` | Whether the response stays true to the provided context without hallucination | +| `Builtin.Coherence` | Logical flow and consistency of the response | +| `Builtin.Conciseness` | Whether the response is appropriately brief without unnecessary information | +| `Builtin.InstructionFollowing` | Whether the response adheres to all explicit instructions in the user's input | +| `Builtin.ContextRelevance` | How relevant the retrieved context is to the user's query | +| `Builtin.ResponseRelevance` | How well the response addresses the specific question or request | + +### Safety (TRACE level) + +| Evaluator ID | What It Measures | +|---|---| +| `Builtin.Harmfulness` | Detects potentially harmful or unsafe content in responses | +| `Builtin.Maliciousness` | Identifies malicious intent or attempts to manipulate users | +| `Builtin.Stereotyping` | Detects stereotypical or biased language in responses | +| `Builtin.Refusal` | Tracks when the agent appropriately refuses inappropriate requests | + +### Tool Usage (TOOL_CALL level) + +| Evaluator ID | What It Measures | +|---|---| +| `Builtin.ToolSelectionAccuracy` | Whether the agent selected the correct tools for the task | +| `Builtin.ToolParameterAccuracy` | Whether the agent used tools with correct parameters | + +### Session-Level + +| Evaluator ID | What It Measures | +|---|---| +| `Builtin.GoalSuccessRate` | Whether the agent successfully completed all user goals in the conversation | + +### Evaluator Selection Guide + +Use this matrix to decide which evaluators to enable based on your use case: + +| Use Case | Recommended Evaluators | +|---|---| +| General-purpose chatbot | `Helpfulness`, `Correctness`, `GoalSuccessRate` | +| RAG / knowledge retrieval | `Faithfulness`, `ContextRelevance`, `Correctness` | +| Tool-using agent | `ToolSelectionAccuracy`, `ToolParameterAccuracy`, `GoalSuccessRate` | +| Safety-critical application | `Harmfulness`, `Maliciousness`, `Stereotyping`, `Refusal` | +| Instruction-following tasks | `InstructionFollowing`, `Coherence`, `Conciseness` | + +### Listing Available Evaluators Programmatically + +```python +evaluators = eval_client.list_evaluators() +for evaluator in evaluators.get("evaluatorSummaries", []): + print(f"{evaluator['evaluatorId']}: {evaluator.get('evaluatorName', '')}") +``` + +--- + +## Custom Evaluators + +Use custom evaluators when built-in ones don't cover your domain-specific quality criteria (e.g., financial accuracy, medical safety, brand voice compliance). + +```python +import boto3 + +control_client = boto3.client("bedrock-agentcore-control") + +response = control_client.create_evaluator( + evaluatorName="domain_accuracy", + description="Evaluates domain-specific accuracy for financial queries", + evaluationLevel="TRACE", + inferenceConfig={ + "modelId": "us.anthropic.claude-sonnet-4-5-20250929-v1:0", + "maxTokens": 500, + "temperature": 1.0 + }, + instructions="""Evaluate the agent's response for domain-specific accuracy +in financial contexts. Consider: +1. Are financial terms used correctly? +2. Are calculations accurate? +3. Are regulatory references correct? + +{{input}} {{output}}""", + ratingScale={ + "type": "NUMERIC", + "min": 0.0, + "max": 1.0, + "description": "0 = completely inaccurate, 1 = fully accurate" + } +) + +evaluator_arn = response["evaluatorArn"] +print(f"Created custom evaluator: {evaluator_arn}") +``` + +Custom evaluators can then be used in both online and on-demand evaluations just like built-in ones. + +--- + +## Evaluation Results & Format + +### Online Evaluation Results (CloudWatch Logs) + +Online evaluation results are written to CloudWatch Logs at: + +``` +/aws/bedrock-agentcore/evaluations/results/{config-id} +``` + +> **Note:** The log group name uses the config ID (e.g., `a1b2c3d4-...`), not the config name. You can find the config ID from `create_online_config()` response or `list_online_configs()`. + +Each log event contains a JSON object with OpenTelemetry-style attributes: + +```json +{ + "attributes": { + "gen_ai.evaluation.name": "Builtin.Helpfulness", + "gen_ai.evaluation.score.value": 0.83, + "gen_ai.evaluation.score.label": "Very Helpful", + "gen_ai.evaluation.explanation": "The response directly addresses the user's question with relevant and actionable information..." + }, + "traceId": "abc123def456", + "spanId": "789ghi", + "sessionId": "session-456", + "timestamp": "2026-02-17T00:42:42.086Z" +} +``` + +### Key Fields + +| Field | Description | +|---|---| +| `attributes.gen_ai.evaluation.name` | Evaluator ID (e.g., `Builtin.Helpfulness`) | +| `attributes.gen_ai.evaluation.score.value` | Numeric score from 0.0 to 1.0 | +| `attributes.gen_ai.evaluation.score.label` | Human-readable label (e.g., "Very Helpful") | +| `attributes.gen_ai.evaluation.explanation` | Detailed reasoning for the score | +| `traceId` | The trace that was evaluated | +| `spanId` | The specific span that was evaluated | +| `sessionId` | The agent session ID | + +### On-Demand Evaluation Results + +On-demand results are returned synchronously in the API response: + +```json +{ + "evaluatorId": "Builtin.Helpfulness", + "evaluatorName": "Builtin.Helpfulness", + "value": 0.83, + "label": "Very Helpful", + "explanation": "The response directly addresses...", + "tokenUsage": { + "inputTokens": 958, + "outputTokens": 211, + "totalTokens": 1169 + } +} +``` + +--- + +## Downloading & Querying Results + +### Query Results from CloudWatch (Python) + +Use this to pull evaluation results for analysis, export, or dashboard display. + +> **CloudWatch Logs quota:** `FilterLogEvents` is limited to 5 requests per second per account per region. For large result sets, add pagination delays or use CloudWatch Logs Insights for faster queries. + +```python +import boto3 +import json +from datetime import datetime, timedelta + +cloudwatch_logs = boto3.client("logs") + +config_id = "your-config-id" # Replace with actual config ID from create/list +log_group = f"/aws/bedrock-agentcore/evaluations/results/{config_id}" + +end_time = datetime.utcnow() +start_time = end_time - timedelta(days=7) + +results = [] +next_token = None + +while True: + params = { + "logGroupName": log_group, + "startTime": int(start_time.timestamp() * 1000), + "endTime": int(end_time.timestamp() * 1000), + "limit": 10000, + } + if next_token: + params["nextToken"] = next_token + + response = cloudwatch_logs.filter_log_events(**params) + + for event in response.get("events", []): + try: + log_data = json.loads(event["message"]) + results.append(log_data) + except json.JSONDecodeError: + continue + + next_token = response.get("nextToken") + if not next_token: + break + +print(f"Retrieved {len(results)} evaluation results") +``` + +### Export Results to CSV + +Use this to share results with stakeholders or import into spreadsheet tools. + +```python +import csv + +with open("evaluation_results.csv", "w", newline="") as f: + writer = csv.writer(f) + writer.writerow(["session_id", "trace_id", "evaluator", "score", "label", "explanation"]) + + for result in results: + attrs = result.get("attributes", {}) + writer.writerow([ + result.get("sessionId", ""), + result.get("traceId", ""), + attrs.get("gen_ai.evaluation.name", ""), + attrs.get("gen_ai.evaluation.score.value", ""), + attrs.get("gen_ai.evaluation.score.label", ""), + attrs.get("gen_ai.evaluation.explanation", ""), + ]) +``` + +### Compute Aggregate Metrics from Results + +Use this to build summary statistics for dashboards or reports. + +```python +from collections import defaultdict + +evaluator_metrics = defaultdict(lambda: {"count": 0, "total_score": 0.0}) + +for result in results: + attrs = result.get("attributes", {}) + evaluator = attrs.get("gen_ai.evaluation.name", "unknown") + score = attrs.get("gen_ai.evaluation.score.value", 0.0) + + evaluator_metrics[evaluator]["count"] += 1 + evaluator_metrics[evaluator]["total_score"] += score + +# Print per-evaluator averages +for evaluator, metrics in evaluator_metrics.items(): + avg = metrics["total_score"] / metrics["count"] if metrics["count"] > 0 else 0 + print(f"{evaluator}: avg={avg:.2f} ({metrics['count']} evaluations)") +``` + +### Score Distribution + +Use this to understand the shape of your score data — are most scores clustered high, or is there a long tail of low scores? + +```python +def compute_score_distribution(results): + """Compute score distribution across standard bins.""" + bins = { + "0.0-0.2": 0, + "0.2-0.4": 0, + "0.4-0.6": 0, + "0.6-0.8": 0, + "0.8-1.0": 0, + } + + for result in results: + score = result.get("attributes", {}).get("gen_ai.evaluation.score.value", 0.0) + if score < 0.2: + bins["0.0-0.2"] += 1 + elif score < 0.4: + bins["0.2-0.4"] += 1 + elif score < 0.6: + bins["0.4-0.6"] += 1 + elif score < 0.8: + bins["0.6-0.8"] += 1 + else: + bins["0.8-1.0"] += 1 + + return bins + +distribution = compute_score_distribution(results) +for range_label, count in distribution.items(): + print(f" {range_label}: {count}") +``` + +--- + +## AI-Powered Analysis + +Once you have evaluation results, you can use Amazon Bedrock foundation models to automatically analyze patterns in low-scoring evaluations and generate system prompt improvements. This closes the feedback loop: + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Evaluate │────▶│ Identify │────▶│ Improve │────▶│ Deploy & │ +│ Agent │ │ Patterns │ │ Prompt │ │ Re-evaluate│ +└─────────────┘ └─────────────┘ └─────────────┘ └──────┬──────┘ + ▲ │ + └───────────────────────────────────────────────────────────┘ + Continuous improvement loop +``` + +```mermaid +sequenceDiagram + participant You + participant CW as CloudWatch Logs + participant FM as Foundation Model (Bedrock) + + You->>CW: Query low-scoring evaluation results + CW-->>You: Evaluation logs (scores, explanations) + You->>FM: Analyze patterns in low-scoring results + FM-->>You: Patterns, summary, recommendations + You->>FM: Generate improved prompt based on patterns + FM-->>You: Improved prompt + change explanations + You->>You: Review, apply, and re-evaluate +``` + +### Pattern Analysis + +Use this after you've accumulated evaluation results and want to understand why your agent is scoring low on certain evaluators. + +> **Timing:** Pattern analysis with ~50 evaluation results typically takes 15–30 seconds depending on model and input size. + +```python +import boto3 +import json + +bedrock_runtime = boto3.client("bedrock-runtime") + +def analyze_evaluation_patterns(low_scoring_results: list[dict]) -> dict: + """ + Analyze low-scoring evaluation results to identify failure patterns. + + Args: + low_scoring_results: Evaluation result logs from CloudWatch + (filtered to scores below your threshold, e.g., <= 0.5) + + Returns: + Analysis with patterns, summary, and recommendations + """ + # Format results for the model — include evaluator name, score, and explanation + formatted = [] + for result in low_scoring_results[:50]: # Limit to avoid token overflow + attrs = result.get("attributes", {}) + formatted.append({ + "session_id": attrs.get("session.id", "unknown"), + "evaluator": attrs.get("gen_ai.evaluation.name", "unknown"), + "score": attrs.get("gen_ai.evaluation.score.value", 0.0), + "label": attrs.get("gen_ai.evaluation.score.label", ""), + "explanation": attrs.get("gen_ai.evaluation.explanation", ""), + }) + + system_prompt = """You are an expert at analyzing agent evaluation data to identify +patterns and root causes of poor performance. For each pattern you identify: +1. Describe the pattern clearly +2. Count how frequently it occurs +3. List affected session IDs +4. Provide concrete evidence from the evaluation explanations + +Return JSON: {"patterns": [...], "summary": "...", "recommendations": [...]}""" + + response = bedrock_runtime.invoke_model( + modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0", + body=json.dumps({ + "anthropic_version": "bedrock-2023-05-31", + "max_tokens": 4096, + "system": system_prompt, + "messages": [{"role": "user", "content": f""" +Analyze these {len(formatted)} low-scoring evaluations and identify common patterns: + +{json.dumps(formatted, indent=2)} + +Focus on: which evaluators score low consistently, common issues in explanations, +and actionable patterns across sessions."""}], + "temperature": 0.7, + }), + ) + + response_body = json.loads(response["body"].read()) + analysis_text = response_body["content"][0]["text"] + + # Parse JSON from response (strip markdown fences if present) + analysis_text = analysis_text.strip().strip("`").removeprefix("json").strip() + return json.loads(analysis_text) +``` + +### System Prompt Improvement + +Use this after pattern analysis to automatically generate a better system prompt that addresses the identified issues. + +> **Timing:** Prompt improvement generation typically takes 20–60 seconds, as the model produces a complete revised prompt with explanations. + +```python +def generate_prompt_improvement(current_prompt: str, analysis: dict) -> dict: + """ + Generate an improved system prompt based on analysis of failure patterns. + + Args: + current_prompt: The agent's current system prompt + analysis: Output from analyze_evaluation_patterns() + + Returns: + Dict with improvedPrompt and list of changes with reasoning + """ + system_prompt = """You are an expert at improving system prompts for AI agents. +Generate specific improvements based on identified failure patterns. +For each change, explain the reasoning and expected impact. + +Return JSON: {"improvedPrompt": "...", "changes": [{"section": "...", +"reasoning": "...", "impact": "..."}]}""" + + # Truncate pattern evidence to stay within token limits + analysis_summary = { + "summary": analysis.get("summary", ""), + "patterns": [ + { + "pattern": p["pattern"], + "frequency": p["frequency"], + "evidence": p["evidence"][:500], + } + for p in analysis.get("patterns", [])[:10] + ], + "recommendations": analysis.get("recommendations", [])[:10], + } + + response = bedrock_runtime.invoke_model( + modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0", + body=json.dumps({ + "anthropic_version": "bedrock-2023-05-31", + "max_tokens": 8192, + "system": system_prompt, + "messages": [{"role": "user", "content": f""" +Current System Prompt: +{current_prompt} + +Performance Analysis: +{json.dumps(analysis_summary, indent=2)} + +Generate an improved prompt that addresses these issues."""}], + "temperature": 0.7, + }), + ) + + response_body = json.loads(response["body"].read()) + result_text = response_body["content"][0]["text"] + result_text = result_text.strip().strip("`").removeprefix("json").strip() + return json.loads(result_text) +``` + +### End-to-End Workflow + +Putting it all together — from querying results to generating an improved prompt: + +```python +# 1. Query low-scoring evaluation results (see "Downloading & Querying Results") +results = query_evaluation_results("your-config-id", days=30) +low_scoring = [ + r for r in results + if r.get("attributes", {}).get("gen_ai.evaluation.score.value", 1.0) <= 0.5 +] +print(f"Found {len(low_scoring)} low-scoring evaluations") + +# 2. Analyze patterns (~15-30 seconds) +analysis = analyze_evaluation_patterns(low_scoring) +print(f"Identified {len(analysis['patterns'])} patterns") +for pattern in analysis["patterns"]: + print(f" - {pattern['pattern']} (frequency: {pattern['frequency']})") + +# 3. Generate prompt improvement (~20-60 seconds) +current_prompt = "You are a helpful assistant..." # your agent's current prompt +improvement = generate_prompt_improvement(current_prompt, analysis) +print(f"\nSuggested {len(improvement['changes'])} changes:") +for change in improvement["changes"]: + print(f" Section: {change['section']}") + print(f" Reasoning: {change['reasoning']}") + print(f" Impact: {change['impact']}\n") + +# 4. Review and apply the improved prompt +print("Improved prompt:") +print(improvement["improvedPrompt"]) +``` + +--- + +## Retrieving Sessions, Traces & Spans (Observability) + +The `bedrock-agentcore-starter-toolkit` includes an Observability client that retrieves session data (traces and spans) from AgentCore. This is useful for building session explorers, trace viewers, and feeding data into evaluations. + +### Session / Trace / Span Hierarchy + +``` +Session (conversation) +├── Trace 1 (user turn) +│ ├── Span: agent_planning (root) +│ ├── Span: tool_call → search_api +│ ├── Span: tool_result ← search_api +│ └── Span: agent_response +├── Trace 2 (user turn) +│ ├── Span: agent_planning (root) +│ └── Span: agent_response +└── Trace 3 (user turn) + ├── Span: agent_planning (root) + ├── Span: tool_call → database_query + ├── Span: tool_result ← database_query + └── Span: agent_response +``` + +### Initialize the Observability Client + +```python +from bedrock_agentcore_starter_toolkit import Observability + +agent_id = "my-agent-abc123" # Replace with your actual agent ID +obs_client = Observability(agent_id=agent_id, region="us-east-1") +``` + +### List Traces and Spans for a Session + +Use this to inspect what happened during a specific conversation — which tools were called, how long each step took, and whether any errors occurred. + +```python +trace_data = obs_client.list(session_id="session-456") # Replace with actual session ID + +# trace_data.traces → dict: {trace_id: [list of spans]} +# trace_data.spans → flat list of all spans across all traces +# trace_data.start_time → session start time in nanoseconds + +print(f"Traces: {len(trace_data.traces)}") +print(f"Total spans: {len(trace_data.spans)}") + +for trace_id, spans in trace_data.traces.items(): + print(f"\nTrace {trace_id}: {len(spans)} spans") + for span in sorted(spans, key=lambda s: s.start_time_unix_nano or 0): + print(f" {span.span_name} ({span.duration_ms}ms) " + f"parent={span.parent_span_id or 'root'}") +``` + +### Span Properties + +Each span object returned by the SDK has these key properties: + +| Property | Type | Description | +|---|---|---| +| `span_id` | `str` | Unique span identifier | +| `trace_id` | `str` | Parent trace identifier | +| `parent_span_id` | `str \| None` | Parent span ID (`None` for root spans) | +| `span_name` | `str` | Name of the operation (e.g., `"agent_response"`, `"tool_call"`) | +| `start_time_unix_nano` | `int` | Start time in nanoseconds since epoch | +| `end_time_unix_nano` | `int` | End time in nanoseconds since epoch | +| `duration_ms` | `float` | Duration in milliseconds | +| `status_code` | `str` | Status (`"OK"`, `"ERROR"`, `"UNSET"`) | +| `attributes` | `dict` | OpenTelemetry attributes (model ID, token counts, etc.) | + +### Formatting Trace Data for Display + +Use this helper to convert SDK trace data into a JSON-serializable structure suitable for API responses or UI rendering. + +```python +from datetime import datetime + +def format_session_for_display(trace_data) -> dict: + """Format SDK trace data into a JSON-serializable structure.""" + formatted_traces = [] + + for trace_id, spans in trace_data.traces.items(): + if not spans: + continue + + spans.sort(key=lambda s: s.start_time_unix_nano or 0) + + trace_start = min(s.start_time_unix_nano for s in spans if s.start_time_unix_nano) + trace_end = max(s.end_time_unix_nano for s in spans if s.end_time_unix_nano) + + formatted_traces.append({ + "traceId": trace_id, + "startTime": datetime.fromtimestamp(trace_start / 1e9).isoformat(), + "endTime": datetime.fromtimestamp(trace_end / 1e9).isoformat(), + "durationMs": (trace_end - trace_start) / 1e6, + "spans": [ + { + "spanId": s.span_id, + "traceId": s.trace_id, + "parentSpanId": s.parent_span_id, + "name": s.span_name, + "startTime": datetime.fromtimestamp(s.start_time_unix_nano / 1e9).isoformat(), + "endTime": datetime.fromtimestamp(s.end_time_unix_nano / 1e9).isoformat(), + "durationMs": s.duration_ms, + "status": s.status_code or "UNSET", + "attributes": s.attributes or {}, + } + for s in spans + ], + }) + + formatted_traces.sort(key=lambda t: t["startTime"]) + return { + "traceCount": len(formatted_traces), + "spanCount": len(trace_data.spans), + "traces": formatted_traces, + } +``` + +--- + +## Building a Dashboard + +With evaluation results and observability data, you can build a dashboard to monitor agent performance. Here's a high-level overview of the key components and data flow. + +### Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Dashboard UI │ +│ │ +│ ┌───────────────┐ ┌──────────────┐ ┌────────────────────┐ │ +│ │ Summary Tiles │ │ Score Dist. │ │ Per-Evaluator │ │ +│ │ • Total evals │ │ Chart │ │ Metrics │ │ +│ │ • Avg score │ │ (bar chart) │ │ (color-coded) │ │ +│ │ • Low/High │ │ │ │ │ │ +│ └───────────────┘ └──────────────┘ └────────────────────┘ │ +│ │ +│ ┌──────────────────────────┐ ┌───────────────────────────┐ │ +│ │ Session Explorer │ │ On-Demand Eval Panel │ │ +│ │ • Browse sessions │ │ • Select evaluators │ │ +│ │ • Filter by score/date │ │ • Run against session │ │ +│ │ • View trace timeline │ │ • View scores/explanations│ │ +│ └──────────────────────────┘ └───────────────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ AI Analysis Panel │ │ +│ │ • Trigger pattern analysis on low-scoring results │ │ +│ │ • View identified patterns and recommendations │ │ +│ │ • Generate and review prompt improvements │ │ +│ └──────────────────────────────────────────────────────────┘ │ +└──────────────────────────────┬──────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ Backend API │ +│ │ +│ Evaluation SDK CloudWatch Logs Observability SDK │ +│ • list_online_configs • filter_log_events • obs.list() │ +│ • run() (on-demand) • (eval results) • (traces/spans) │ +│ • create/update/delete │ +└──────────────────────────────────────────────────────────────────┘ +``` + +### Key Dashboard Components + +| Component | Data Source | What It Shows | +|---|---|---| +| Summary Tiles | Aggregated CloudWatch results | Total evaluations, average score, low-score count (< 0.5), high-score count (≥ 0.8) | +| Score Distribution | Aggregated CloudWatch results | Bar chart of scores across bins (0.0–0.2, 0.2–0.4, etc.) | +| Per-Evaluator Metrics | Aggregated CloudWatch results | Each evaluator's average score and count, color-coded (green ≥ 0.8, yellow ≥ 0.6, red < 0.6) | +| Session Explorer | Observability SDK | Browsable list of sessions with scores, trace counts, timestamps | +| Trace Viewer | Observability SDK | Timeline visualization of traces and spans, showing parent-child hierarchy | +| On-Demand Eval Panel | Evaluation SDK `run()` | Select evaluators, run against a session, display scores and explanations | +| AI Analysis Panel | Bedrock `invoke_model` | Pattern analysis, recommendations, prompt improvement suggestions | + +### Data Flow + +```mermaid +graph TD + A[Dashboard loads] --> B[Fetch evaluation configs] + B --> C{Configs found?} + C -->|Yes| D[Fetch metrics per config from CloudWatch] + C -->|No| E[Show setup banner] + D --> F[Aggregate across configs] + F --> G[Display summary tiles + charts + evaluator breakdown] + + H[User selects session] --> I[Fetch traces via Observability SDK] + I --> J[Display trace timeline viewer] + + K[User clicks Run Evaluation] --> L[Call eval_client.run per evaluator] + L --> M[Display scores + explanations] + + N[User triggers AI Analysis] --> O[Query low-scoring results from CloudWatch] + O --> P[Send to foundation model for pattern analysis] + P --> Q[Display patterns + recommendations] + Q --> R[Generate prompt improvement] +``` + +### Aggregating Metrics Across Configs + +If you have multiple evaluation configs (e.g., separate configs for different evaluator sets), aggregate their metrics for a unified dashboard view. + +```python +def aggregate_metrics(all_config_metrics: list[dict]) -> dict: + """Combine metrics from multiple evaluation configs.""" + total_evals = 0 + weighted_score_sum = 0.0 + combined_distribution = {} + combined_evaluators = {} + + for metrics in all_config_metrics: + count = metrics["totalEvaluations"] + if count == 0: + continue + + total_evals += count + weighted_score_sum += metrics["averageScore"] * count + + for bin_key, bin_count in metrics.get("scoreDistribution", {}).items(): + combined_distribution[bin_key] = combined_distribution.get(bin_key, 0) + bin_count + + for eval_id, eval_metrics in metrics.get("evaluatorMetrics", {}).items(): + if eval_id not in combined_evaluators: + combined_evaluators[eval_id] = {"count": 0, "totalScore": 0.0} + combined_evaluators[eval_id]["count"] += eval_metrics["count"] + combined_evaluators[eval_id]["totalScore"] += eval_metrics["totalScore"] + + # Compute averages + for eval_id in combined_evaluators: + m = combined_evaluators[eval_id] + m["averageScore"] = m["totalScore"] / m["count"] if m["count"] > 0 else 0.0 + + return { + "totalEvaluations": total_evals, + "averageScore": weighted_score_sum / total_evals if total_evals > 0 else 0.0, + "scoreDistribution": combined_distribution, + "evaluatorMetrics": combined_evaluators, + } +``` + +--- + +## Best Practices + +### Cost Optimization + +#### Sampling Rate Strategy + +| Environment | Recommended Rate | Rationale | +|---|---|---| +| Development | 100% | Full visibility during testing | +| Staging | 25–50% | Good coverage for QA | +| Production | 5–10% | Cost-effective monitoring | + +#### Evaluator Selection Strategy + +Start with a small set of core evaluators and expand as needed: + +1. **Start with 3 core evaluators**: `Helpfulness`, `Correctness`, `GoalSuccessRate` +2. **Add safety evaluators for production**: `Harmfulness`, `Maliciousness` +3. **Add quality evaluators as needed**: `Faithfulness`, `Coherence`, `InstructionFollowing` +4. **Add tool evaluators if using tools**: `ToolSelectionAccuracy`, `ToolParameterAccuracy` + +Each online evaluation config supports up to 10 evaluators. + +#### Token Usage + +- Built-in evaluators use efficient prompts (~1,000 tokens per evaluation) +- SESSION-level evaluators cost more (evaluate entire conversation) +- TRACE-level evaluators are more granular (per response) +- TOOL_CALL-level evaluators are the most targeted (per tool invocation) + +#### CloudWatch Logs Costs + +Online evaluation results are stored in CloudWatch Logs, which incurs: +- **Ingestion:** $0.50 per GB ingested +- **Storage:** $0.03 per GB per month (default retention: never expire) + +To control costs, set a retention policy on evaluation log groups: + +```python +cloudwatch_logs.put_retention_policy( + logGroupName=f"/aws/bedrock-agentcore/evaluations/results/{config_id}", + retentionInDays=90 # Keep results for 90 days +) +``` + +### Operational Tips + +#### Time Estimates for Common Operations + +| Operation | Typical Duration | +|---|---| +| Create online evaluation config | 10–30 seconds | +| First results appear in CloudWatch | 2–5 minutes after next agent session | +| On-demand evaluation (1 evaluator, 1 session) | 5–15 seconds | +| On-demand evaluation (3 evaluators, 1 session) | 15–45 seconds | +| Batch evaluation (10 sessions × 2 evaluators) | 3–4 minutes | +| AI pattern analysis (~50 results) | 15–30 seconds | +| AI prompt improvement generation | 20–60 seconds | + +#### Naming Conventions + +Use consistent naming across your codebase: + +| Concept | Python (SDK) | JSON (API responses) | +|---|---|---| +| Agent identifier | `agent_id` | `agentId` | +| Config identifier | `config_id` | `onlineEvaluationConfigId` | +| Session identifier | `session_id` | `sessionId` | +| Evaluator identifier | `evaluator_id` | `evaluatorId` | + +> **Note:** The SDK uses `snake_case` for parameters. API responses from AgentCore use `camelCase`. The examples in this guide use `snake_case` for Python variables and `camelCase` when showing JSON responses. + +#### Troubleshooting + +| Symptom | Likely Cause | Fix | +|---|---|---| +| No metrics appearing | Results take 2–5 min to appear | Wait and refresh; verify `executionStatus` is `ENABLED` | +| 403 on evaluation API calls | Missing IAM permissions | Add `bedrock-agentcore:*` to your role | +| `ResourceNotFoundException` on log group | No evaluations have run yet | Invoke your agent to generate sessions, then wait | +| Low evaluation counts vs. session count | Sampling rate too low | Increase `sampling_rate` or generate more agent traffic | +| `FilterLogEvents` throttled | CloudWatch 5 req/sec limit | Add pagination delays or use CloudWatch Logs Insights | + +--- + +## References + +- [AgentCore Evaluations Documentation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html) +- [CreateOnlineEvaluationConfig API](https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_CreateOnlineEvaluationConfig.html) +- [Evaluate API](https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_Evaluate.html) +- [AgentCore Samples Repository](https://github.com/awslabs/amazon-bedrock-agentcore-samples) +- [Fullstack Solution Template for AgentCore](https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/fullstack-solution-template-for-agentcore) +- [CloudWatch Logs Pricing](https://aws.amazon.com/cloudwatch/pricing/) +- [Bedrock Supported Regions](https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-regions.html) From d6e6e1c135149e3873e9c8d8825f833ab7935511 Mon Sep 17 00:00:00 2001 From: Meghana Ashok Date: Wed, 18 Mar 2026 11:16:41 -0700 Subject: [PATCH 2/2] Add Agentcore evaluation integration documents --- CONTRIBUTORS.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index 52b44df..297cdd6 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -14,3 +14,6 @@ - Andrew Ang - Tesfagabir Meharizghi - Gauhar Bains +- Meghana Ashok +- Suren Gunturu +- Samaneh Aminikhanghahi \ No newline at end of file