# Ground Truth Evaluations with Custom Evaluators

## Introduction

This tutorial demonstrates end-to-end evaluation of an agentic application using
[**Amazon Bedrock AgentCore Evaluations**](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html) with ground-truth reference inputs. It covers
the two primary evaluation interfaces — `EvaluationClient` and
`OnDemandEvaluationDatasetRunner` — and shows how to create **custom LLM-as-a-judge
evaluators** that use ground-truth placeholders to tailor scoring criteria to your
application domain.

The tutorial deploys an **HR Assistant agent** for Acme Corp — a
[Strands Agents](https://strandsagents.com/) application that helps employees with PTO
management, HR policy lookups, benefits information, and pay stub retrieval. Its tools
return deterministic mock data, so agent behavior is reproducible across runs.

### Key concepts covered

| Concept | Description |
|---|---|
| `EvaluationClient` | Evaluate specific existing CloudWatch sessions against ground-truth references |
| `OnDemandEvaluationDatasetRunner` | Define a test dataset, auto-invoke the agent per scenario, and evaluate the results |
| `ReferenceInputs` | Supply `expected_response`, `expected_trajectory`, and `assertions` as ground truth |
| Custom evaluators | Create LLM-as-a-judge evaluators with domain-specific instructions and ground-truth placeholders |

> **Further reading**
> - [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators)
> - [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html)

---

## Architecture

```
Tutorial Notebook (groundtruth_evaluations.ipynb)

Step 1 ──► bedrock-agentcore-starter-toolkit
  │           CodeBuild builds image, pushes to ECR
  └──► AgentCore Runtime (HR Assistant Agent)
              invoke_agent_runtime()

Step 2 ──► bedrock-agentcore-control ──► Custom Evaluators
              create_evaluator()

Step 3 ──► AgentCore Runtime (generate sessions)
  │           OTel spans ──► CloudWatch Logs

Step 4 ──► EvaluationClient.run()
  │           CloudWatchAgentSpanCollector reads spans
  └──► Evaluate API ──► Built-in + Custom Evaluators
          └──► Scores & Explanations

Step 5 ──► OnDemandEvaluationDatasetRunner.run()
  │           Invokes agent per scenario
  │           Waits for CloudWatch ingestion
  └──► Evaluate API ──► Built-in + Custom Evaluators
          └──► Per-scenario Results
```

**Component roles**

| Component | Role |
|---|---|
| AgentCore Runtime | Hosts the containerised HR Assistant, emits OTel spans to CloudWatch |
| CloudWatch Logs | Stores session spans; queried by `CloudWatchAgentSpanCollector` |
| `bedrock-agentcore-control` | Control plane — creates custom evaluators and agent runtimes |
| Evaluate API (`bedrock-agentcore`) | Data plane — scores sessions against evaluator definitions |
| Starter Toolkit | Builds the Docker image via CodeBuild and registers the runtime; no local Docker required |
---

## Prerequisites

- **Python 3.10+** with the packages in `requirements.txt`
- **AWS credentials** configured (e.g. via `aws configure` or environment variables) with
  permissions for:
  - `bedrock-agentcore:*` — invoke agent runtime and call Evaluate API
  - `bedrock-agentcore-control:CreateAgentRuntime`, `UpdateAgentRuntime`,
    `GetAgentRuntime`, `CreateEvaluator` — deploy agent and register evaluators
  - `logs:FilterLogEvents`, `logs:DescribeLogGroups`, `logs:StartQuery`,
    `logs:GetQueryResults` — read CloudWatch spans
  - `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`,
    `ecr:InitiateLayerUpload`, `ecr:PutImage` — push container image
  - `codebuild:StartBuild`, `codebuild:BatchGetBuilds` — image build via CodeBuild
  - `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:PassRole` — auto-create execution roles
  - `s3:PutObject`, `s3:GetObject` — CodeBuild source upload
- **No local Docker required** — the starter toolkit builds the container image via
  AWS CodeBuild

Install dependencies:

```bash
pip install -r requirements.txt
```
---

## Usage

### Run the notebook

Open and run [`groundtruth_evaluations.ipynb`](groundtruth_evaluations.ipynb) top to bottom.
The notebook is safe to re-run: it updates the existing agent runtime in place and
creates fresh custom evaluators with a unique suffix to avoid naming conflicts.

```bash
jupyter notebook groundtruth_evaluations.ipynb
```

Or execute non-interactively:

```bash
jupyter nbconvert --to notebook --execute --inplace groundtruth_evaluations.ipynb
```

### Notebook walkthrough

| Step | Cell(s) | What happens |
|---|---|---|
| **1 — Install** | `install` | Installs `bedrock-agentcore`, `strands-agents`, and other dependencies |
| **2 — Configure** | `setup` | Creates a boto3 session and sets `REGION` |
| **3a — Deploy agent** | `nn72gdo2s4h`, `deploy`, `wait-deploy`, `agent-config` | Writes `hr_assistant_agent.py`, builds the image via CodeBuild, creates/updates the AgentCore Runtime, polls until `READY` (sketched below) |
| **3b — Create evaluators** | `76hyptexblj` | Creates `HRResponseSimilarity` (TRACE) and `HRAssertionChecker` (SESSION) custom evaluators via `bedrock-agentcore-control` |
| **4 — Invoke agent** | `invoke-single`, `invoke-multi`, `invoke-onboard` | Runs 5 sessions (single- and multi-turn), waits 60 s for CloudWatch ingestion |
| **5 — EvaluationClient** | `ec-*` | Evaluates each session by session ID using built-in and custom evaluators |
| **6 — DatasetRunner** | `runner-*` | Defines a 5-scenario dataset, invokes the agent per scenario, waits 180 s, evaluates all scenarios |
| **7 — Cleanup** | `cleanup` | (Commented out) Deletes the agent runtime |

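The Step 3a deploy cells reduce to the starter toolkit's configure/launch/status flow.
A condensed sketch, assuming the `Runtime` helper from `bedrock-agentcore-starter-toolkit`;
the agent name is illustrative, and argument names may vary by toolkit version:

```python
from bedrock_agentcore_starter_toolkit import Runtime

agentcore_runtime = Runtime()
agentcore_runtime.configure(
    entrypoint="hr_assistant_agent.py",       # written by the %%writefile cell
    agent_name="hr_assistant_eval_tutorial",  # illustrative name
    requirements_file="requirements.txt",
    region="us-east-1",
    auto_create_execution_role=True,          # IAM execution role created for you
    auto_create_ecr=True,                     # ECR repository created for you
)
agentcore_runtime.launch()                    # builds the image via CodeBuild

# Poll until the runtime endpoint reports READY before invoking it
print(agentcore_runtime.status().endpoint["status"])
```
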
### Using `EvaluationClient` directly

```python
from datetime import timedelta

from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs

ec = EvaluationClient(region_name="us-east-1")

results = ec.run(
    evaluator_ids=["Builtin.Correctness", "Builtin.GoalSuccessRate", MY_CUSTOM_EVAL_ID],
    session_id="<session-id>",
    agent_id="<agent-id>",
    look_back_time=timedelta(hours=2),
    reference_inputs=ReferenceInputs(
        expected_response="Employee EMP-001 has 10 remaining PTO days.",
        assertions=["Agent called get_pto_balance", "Agent reported 10 remaining days"],
        expected_trajectory=["get_pto_balance"],
    ),
)
```
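
The structure of the returned results depends on the SDK version. A minimal inspection
sketch, assuming each result exposes an evaluator name, a score, and an explanation;
treat these attribute names as assumptions and print one raw result first to confirm:

```python
# Attribute names below are assumptions, not a documented contract --
# print(results[0]) to see what your SDK version actually returns.
for r in results:
    print(f"{r.evaluator_name}: score={r.score}")
    print(f"  explanation: {r.explanation}")
```
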
| 151 | + |
| 152 | +### Using `OnDemandEvaluationDatasetRunner` directly |
| 153 | + |
| 154 | +```python |
| 155 | +from bedrock_agentcore.evaluation import ( |
| 156 | + Dataset, PredefinedScenario, Turn, |
| 157 | + EvaluationRunConfig, EvaluatorConfig, |
| 158 | + OnDemandEvaluationDatasetRunner, |
| 159 | + CloudWatchAgentSpanCollector, |
| 160 | +) |
| 161 | + |
| 162 | +dataset = Dataset(scenarios=[ |
| 163 | + PredefinedScenario( |
| 164 | + scenario_id="pto-check", |
| 165 | + turns=[Turn( |
| 166 | + input="What is the PTO balance for EMP-001?", |
| 167 | + expected_response="EMP-001 has 10 remaining PTO days.", |
| 168 | + )], |
| 169 | + expected_trajectory=["get_pto_balance"], |
| 170 | + assertions=["Agent reported 10 remaining PTO days"], |
| 171 | + ), |
| 172 | +]) |
| 173 | + |
| 174 | +runner = OnDemandEvaluationDatasetRunner(region="us-east-1") |
| 175 | +result = runner.run( |
| 176 | + config=EvaluationRunConfig( |
| 177 | + evaluator_config=EvaluatorConfig(evaluator_ids=["Builtin.Correctness"]), |
| 178 | + evaluation_delay_seconds=180, |
| 179 | + ), |
| 180 | + dataset=dataset, |
| 181 | + agent_invoker=my_invoker_fn, |
| 182 | + span_collector=CloudWatchAgentSpanCollector(log_group_name=CW_LOG_GROUP, region="us-east-1"), |
| 183 | +) |
| 184 | +``` |
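
`my_invoker_fn` is left undefined above. One possible shape, assuming the runner calls
it with the turn input and a session ID and expects the agent's reply text back (check
your SDK version for the exact callback signature); the data-plane call itself is the
standard `invoke_agent_runtime`:

```python
import json

import boto3

dp = boto3.client("bedrock-agentcore", region_name="us-east-1")
AGENT_RUNTIME_ARN = "<agent-runtime-arn>"  # from the deploy step

def my_invoker_fn(prompt: str, session_id: str) -> str:
    # Runtime session IDs must be long enough for the service to accept;
    # a full UUID4 string works.
    resp = dp.invoke_agent_runtime(
        agentRuntimeArn=AGENT_RUNTIME_ARN,
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": prompt}),
    )
    # Non-streaming responses arrive as a readable body
    return resp["response"].read().decode("utf-8")
```
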
| 185 | + |
| 186 | +--- |
| 187 | + |
| 188 | +## Sample Prompts |
| 189 | + |
| 190 | +The following prompts are used in the notebook. They can also be sent directly to a |
| 191 | +deployed HR Assistant to generate sessions for evaluation. |
| 192 | + |
| 193 | +### Single-turn |
| 194 | + |
| 195 | +| Prompt | Expected tool | Expected outcome | |
| 196 | +|---|---|---| |
| 197 | +| `What is the current PTO balance for employee EMP-001?` | `get_pto_balance` | 10 remaining days (15 total, 5 used) | |
| 198 | +| `Please submit a PTO request for EMP-001 from 2026-04-14 to 2026-04-16 for a family vacation.` | `submit_pto_request` | Approved, request ID `PTO-2026-001` | |
| 199 | +| `Can you pull up the January 2026 pay stub for employee EMP-001?` | `get_pay_stub` | Gross $8,333.33, net $5,362.50 | |
| 200 | +| `What is the company PTO policy?` | `lookup_hr_policy` | 15 days/year, 2-day advance notice, 5-day rollover | |
| 201 | +| `How does the 401k match work?` | `get_benefits_summary` | 100% match up to 4%, 50% on next 2%, 3-year vesting | |
| 202 | +| `Check the PTO balance for EMP-002 and if they have at least 2 days, submit a request for 2026-05-26 to 2026-05-27.` | `get_pto_balance` → `submit_pto_request` | 3 days remaining → request approved | |
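
Each single-turn prompt can be replayed as its own session, for example with the
hypothetical invoker sketched in the DatasetRunner section (one fresh session ID per
prompt):

```python
import uuid

single_turn_prompts = [
    "What is the current PTO balance for employee EMP-001?",
    "What is the company PTO policy?",
    "How does the 401k match work?",
    # ...remaining prompts from the table above
]
for prompt in single_turn_prompts:
    # A new UUID per prompt keeps each exchange in a separate session
    print(my_invoker_fn(prompt, session_id=str(uuid.uuid4())))
```
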
| 203 | + |
| 204 | +### Multi-turn |
| 205 | + |
| 206 | +**PTO planning (3 turns)** |
| 207 | +1. `How many PTO days do I have left? My employee ID is EMP-001.` |
| 208 | +2. `Great. I'd like to take December 23 to December 25 off. Please submit a request.` |
| 209 | +3. `Remind me — what is the policy on rolling over unused PTO?` |
| 210 | + |
| 211 | +Expected trajectory: `get_pto_balance` → `submit_pto_request` → `lookup_hr_policy` |
| 212 | + |
| 213 | +**New employee onboarding (4 turns)** |
| 214 | +1. `I just joined the company. What is the remote work policy?` |
| 215 | +2. `How much PTO do I get as a new employee?` |
| 216 | +3. `What life insurance benefit does the company provide?` |
| 217 | +4. `Can you check the current PTO balance for employee EMP-042?` |
| 218 | + |
| 219 | +Expected trajectory: `lookup_hr_policy` → `lookup_hr_policy` → `get_benefits_summary` → `get_pto_balance` |
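
A multi-turn conversation reuses one runtime session ID for every turn so that all
spans land in the same CloudWatch session. A sketch with the same hypothetical invoker:

```python
import uuid

onboarding_turns = [
    "I just joined the company. What is the remote work policy?",
    "How much PTO do I get as a new employee?",
    "What life insurance benefit does the company provide?",
    "Can you check the current PTO balance for employee EMP-042?",
]
session_id = str(uuid.uuid4())  # shared by all four turns
for turn in onboarding_turns:
    print(my_invoker_fn(turn, session_id=session_id))
```
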
| 220 | + |
| 221 | +--- |
| 222 | + |
| 223 | +## Custom Evaluators with Ground Truth |
| 224 | + |
| 225 | +Custom evaluators let you define evaluation criteria in natural language. The service |
| 226 | +substitutes **ground-truth placeholders** from `ReferenceInputs` before scoring. |
| 227 | + |
| 228 | +### Placeholder reference |
| 229 | + |
| 230 | +| Level | Placeholder | Populated from | |
| 231 | +|---|---|---| |
| 232 | +| TRACE | `{assistant_turn}` | Agent's actual response for that turn | |
| 233 | +| TRACE | `{expected_response}` | `ReferenceInputs.expected_response` | |
| 234 | +| TRACE | `{context}` | Conversation context preceding the turn | |
| 235 | +| SESSION | `{actual_tool_trajectory}` | Tools the agent called during the session | |
| 236 | +| SESSION | `{expected_tool_trajectory}` | `ReferenceInputs.expected_trajectory` | |
| 237 | +| SESSION | `{assertions}` | `ReferenceInputs.assertions` | |
| 238 | +| SESSION | `{available_tools}` | Tools available to the agent | |
| 239 | + |
| 240 | +### Creating a custom evaluator |
| 241 | + |
| 242 | +```python |
| 243 | +import boto3, uuid |
| 244 | + |
| 245 | +cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1") |
| 246 | + |
| 247 | +# Trace-level: response similarity using ground-truth placeholders |
| 248 | +result = cp.create_evaluator( |
| 249 | + evaluatorName=f"ResponseSimilarity_{uuid.uuid4().hex[:8]}", |
| 250 | + level="TRACE", |
| 251 | + evaluatorConfig={ |
| 252 | + "llmAsAJudge": { |
| 253 | + "instructions": ( |
| 254 | + "Compare the agent's response with the expected response.\n" |
| 255 | + "Agent response: {assistant_turn}\n" |
| 256 | + "Expected response: {expected_response}\n\n" |
| 257 | + "Rate how closely the responses match on a scale of 0 to 1." |
| 258 | + ), |
| 259 | + "ratingScale": { |
| 260 | + "numerical": [ |
| 261 | + {"value": 0.0, "label": "not_similar", |
| 262 | + "definition": "Response is factually different from expected."}, |
| 263 | + {"value": 0.5, "label": "partially_similar", |
| 264 | + "definition": "Response partially matches expected."}, |
| 265 | + {"value": 1.0, "label": "highly_similar", |
| 266 | + "definition": "Response is semantically equivalent to expected."}, |
| 267 | + ] |
| 268 | + }, |
| 269 | + "modelConfig": { |
| 270 | + "bedrockEvaluatorModelConfig": { |
| 271 | + "modelId": "us.amazon.nova-lite-v1:0", |
| 272 | + "inferenceConfig": {"maxTokens": 512}, |
| 273 | + } |
| 274 | + }, |
| 275 | + } |
| 276 | + }, |
| 277 | +) |
| 278 | +custom_evaluator_id = result["evaluatorId"] |
| 279 | +``` |
| 280 | + |
| 281 | +Pass `custom_evaluator_id` to `EvaluationClient.run()` or `EvaluatorConfig` like any |
| 282 | +built-in evaluator ID. Seed the level cache to avoid an extra `get_evaluator` lookup: |
| 283 | + |
| 284 | +```python |
| 285 | +eval_client._evaluator_level_cache[custom_evaluator_id] = "TRACE" |
| 286 | +``` |
| 287 | + |
| 288 | +### Custom evaluators in this tutorial |
| 289 | + |
| 290 | +| Evaluator | Level | Placeholders used | Where used | |
| 291 | +|---|---|---|---| |
| 292 | +| `HRResponseSimilarity` | TRACE | `{assistant_turn}`, `{expected_response}` | EvaluationClient (Steps 5a, 5b), DatasetRunner (Step 6) | |
| 293 | +| `HRAssertionChecker` | SESSION | `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}` | EvaluationClient (Step 5d, multi-turn), DatasetRunner (Step 6) | |
| 294 | + |
| 295 | +> **Note:** SESSION-level custom evaluators require a session with multiple tool calls to |
| 296 | +> extract a meaningful trajectory. They are used on multi-turn sessions in Step 5d and on |
| 297 | +> all DatasetRunner scenarios in Step 6, where a 180-second ingestion delay ensures span |
| 298 | +> data is complete before evaluation. |
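
For symmetry with the TRACE example above, a sketch of a SESSION-level evaluator in the
spirit of `HRAssertionChecker`, built with the same `create_evaluator` shape; the name,
instructions, and labels below are illustrative, not the tutorial's exact definition:

```python
import uuid

import boto3

cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1")

# Session-level: assertion checking using the SESSION placeholders
result = cp.create_evaluator(
    evaluatorName=f"AssertionChecker_{uuid.uuid4().hex[:8]}",
    level="SESSION",
    evaluatorConfig={
        "llmAsAJudge": {
            "instructions": (
                "Check whether this session satisfies the ground-truth assertions.\n"
                "Actual tool calls: {actual_tool_trajectory}\n"
                "Expected tool calls: {expected_tool_trajectory}\n"
                "Assertions: {assertions}\n\n"
                "Rate 0 if any assertion is violated, 1 if all assertions hold."
            ),
            "ratingScale": {
                "numerical": [
                    {"value": 0.0, "label": "fails",
                     "definition": "One or more assertions are violated."},
                    {"value": 1.0, "label": "passes",
                     "definition": "All assertions hold for this session."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    "modelId": "us.amazon.nova-lite-v1:0",
                    "inferenceConfig": {"maxTokens": 512},
                }
            },
        }
    },
)
session_evaluator_id = result["evaluatorId"]
```

As with the TRACE evaluator, seed
`eval_client._evaluator_level_cache[session_evaluator_id] = "SESSION"` when using it
with `EvaluationClient`.
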

---

## Built-in Evaluators

| Evaluator | Level | Ground truth required |
|---|---|---|
| `Builtin.Correctness` | TRACE | `expected_response` |
| `Builtin.Helpfulness` | TRACE | None |
| `Builtin.ResponseRelevance` | TRACE | None |
| `Builtin.GoalSuccessRate` | SESSION | `assertions` |
| `Builtin.TrajectoryExactOrderMatch` | SESSION | `expected_trajectory` |
| `Builtin.TrajectoryInOrderMatch` | SESSION | `expected_trajectory` |
| `Builtin.TrajectoryAnyOrderMatch` | SESSION | `expected_trajectory` |

**Evaluation levels:**
- **TRACE** — one result per conversational turn (agent response)
- **SESSION** — one result per complete conversation

---

## Files

| File | Description |
|---|---|
| `groundtruth_evaluations.ipynb` | Main tutorial notebook — self-contained, end-to-end |
| `requirements.txt` | Python dependencies installed into the agent container |

`hr_assistant_agent.py` and `.bedrock_agentcore.yaml` are generated when the notebook
runs (by the `%%writefile` cell and the starter toolkit, respectively).

---

## Clean Up

### Delete the agent runtime

Uncomment and run the cleanup cell in the notebook:

```python
agentcore_runtime.delete()
```

Or via the AWS CLI (runtime deletion is a control-plane operation):

```bash
aws bedrock-agentcore-control delete-agent-runtime \
    --agent-runtime-id hr_assistant_eval_tutorial-xfZ3yiH356 \
    --region us-east-1
```

### Delete custom evaluators

```python
import boto3

cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1")
for evaluator_id in [CUSTOM_RESPONSE_SIMILARITY_ID, CUSTOM_ASSERTION_CHECKER_ID]:
    cp.delete_evaluator(evaluatorId=evaluator_id)
    print(f"Deleted {evaluator_id}")
```

### Delete the ECR repository

```bash
aws ecr delete-repository \
    --repository-name bedrock-agentcore-hr_assistant_eval_tutorial \
    --region us-east-1 \
    --force
```

### Delete the CloudWatch log group

```bash
aws logs delete-log-group \
    --log-group-name /aws/bedrock-agentcore/runtimes/hr_assistant_eval_tutorial-xfZ3yiH356-DEFAULT \
    --region us-east-1
```

---

## Additional Resources

- [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators)
- [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html)
- [Amazon Bedrock AgentCore Developer Guide](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/)
- [Strands Agents SDK](https://strandsagents.com/)
- [Build reliable AI agents with Amazon Bedrock AgentCore Evaluations](https://aws.amazon.com/blogs/machine-learning/build-reliable-ai-agents-with-amazon-bedrock-agentcore-evaluations/)