# Ground Truth Evaluations with Custom Evaluators

## Introduction

This tutorial demonstrates end-to-end evaluation of an agentic application using
[**Amazon Bedrock AgentCore Evaluations**](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html) with ground-truth reference inputs. It covers
the two primary evaluation interfaces — `EvaluationClient` and
`OnDemandEvaluationDatasetRunner` — and shows how to create **custom LLM-as-a-judge
evaluators** that use ground-truth placeholders to tailor scoring criteria to your
application domain.

The tutorial deploys an **HR Assistant agent** for Acme Corp — a
[Strands Agents](https://strandsagents.com/) application that helps employees with PTO
management, HR policy lookups, benefits information, and pay stub retrieval. Its tools
return deterministic mock data, making evaluation results fully reproducible.

### Key concepts covered

| Concept | Description |
|---|---|
| `EvaluationClient` | Evaluate specific existing CloudWatch sessions against ground-truth references |
| `OnDemandEvaluationDatasetRunner` | Define a test dataset, auto-invoke the agent per scenario, and evaluate the results |
| `ReferenceInputs` | Supply `expected_response`, `expected_trajectory`, and `assertions` as ground truth |
| Custom evaluators | Create LLM-as-a-judge evaluators with domain-specific instructions and ground-truth placeholders |

> **Further reading**
> - [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators)
> - [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html)

---

## Architecture

```
Tutorial Notebook (groundtruth_evaluations.ipynb)

Step 1 ──► bedrock-agentcore-starter-toolkit
   │          CodeBuild builds image, pushes to ECR
   └──► AgentCore Runtime (HR Assistant Agent)
              invoke_agent_runtime()

Step 2 ──► bedrock-agentcore-control ──► Custom Evaluators
              create_evaluator()

Step 3 ──► AgentCore Runtime (generate sessions)
              OTel spans ──► CloudWatch Logs

Step 4 ──► EvaluationClient.run()
   │          CloudWatchAgentSpanCollector reads spans
   └──► Evaluate API ──► Built-in + Custom Evaluators
              └──► Scores & Explanations

Step 5 ──► OnDemandEvaluationDatasetRunner.run()
   │          Invokes agent per scenario
   │          Waits for CloudWatch ingestion
   └──► Evaluate API ──► Built-in + Custom Evaluators
              └──► Per-scenario Results
```

**Component roles**

| Component | Role |
|---|---|
| AgentCore Runtime | Hosts the containerised HR Assistant, emits OTel spans to CloudWatch |
| CloudWatch Logs | Stores session spans; queried by `CloudWatchAgentSpanCollector` |
| `bedrock-agentcore-control` | Control plane — creates custom evaluators and agent runtimes |
| Evaluate API (`bedrock-agentcore`) | Data plane — scores sessions against evaluator definitions |
| Starter Toolkit | Builds the Docker image via CodeBuild and registers the runtime; no local Docker required |

---

## Prerequisites

- **Python 3.10+** with the packages in `requirements.txt`
- **AWS credentials** configured (e.g. via `aws configure` or environment variables) with
  permissions for:
  - `bedrock-agentcore:*` — invoke agent runtime and call Evaluate API
  - `bedrock-agentcore-control:CreateAgentRuntime`, `UpdateAgentRuntime`,
    `GetAgentRuntime`, `CreateEvaluator` — deploy agent and register evaluators
  - `logs:FilterLogEvents`, `logs:DescribeLogGroups`, `logs:StartQuery`,
    `logs:GetQueryResults` — read CloudWatch spans
  - `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`,
    `ecr:InitiateLayerUpload`, `ecr:PutImage` — push container image
  - `codebuild:StartBuild`, `codebuild:BatchGetBuilds` — image build via CodeBuild
  - `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:PassRole` — auto-create execution roles
  - `s3:PutObject`, `s3:GetObject` — CodeBuild source upload
- **No local Docker required** — the starter toolkit builds the container image via
  AWS CodeBuild

Install dependencies:

```bash
pip install -r requirements.txt
```
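Before running the notebook, it can help to confirm which identity and region boto3 resolves. A quick sanity check (assumes credentials are already configured; nothing here is specific to this tutorial):

```python
import boto3

# Confirm which identity and region boto3 resolves before deploying anything.
session = boto3.Session()
print("Region:", session.region_name)
print("Identity:", session.client("sts").get_caller_identity()["Arn"])
```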
---
## Usage

### Run the notebook

Open and run [`groundtruth_evaluations.ipynb`](groundtruth_evaluations.ipynb) top-to-bottom.
Each cell is idempotent — re-running the notebook updates the existing agent runtime and
creates fresh custom evaluators with a unique suffix to avoid naming conflicts.

```bash
jupyter notebook groundtruth_evaluations.ipynb
```

Or execute non-interactively:

```bash
jupyter nbconvert --to notebook --execute --inplace groundtruth_evaluations.ipynb
```

### Notebook walkthrough

| Step | Cell(s) | What happens |
|---|---|---|
| **1 — Install** | `install` | Installs `bedrock-agentcore`, `strands-agents`, and other dependencies |
| **2 — Configure** | `setup` | Creates a boto3 session and sets `REGION` |
| **3a — Deploy agent** | `nn72gdo2s4h`, `deploy`, `wait-deploy`, `agent-config` | Writes `hr_assistant_agent.py`, builds image via CodeBuild, creates/updates the AgentCore Runtime, polls until `READY` |
| **3b — Create evaluators** | `76hyptexblj` | Creates `HRResponseSimilarity` (TRACE) and `HRAssertionChecker` (SESSION) custom evaluators via `bedrock-agentcore-control` |
| **4 — Invoke agent** | `invoke-single`, `invoke-multi`, `invoke-onboard` | Runs 5 sessions (single- and multi-turn), waits 60 s for CloudWatch ingestion |
| **5 — EvaluationClient** | `ec-*` | Evaluates each session by session ID using built-in and custom evaluators |
| **6 — DatasetRunner** | `runner-*` | Defines a 5-scenario dataset, invokes the agent per scenario, waits 180 s, evaluates all scenarios |
| **7 — Cleanup** | `cleanup` | (Commented out) Deletes the agent runtime |
### Using `EvaluationClient` directly

```python
from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs
from datetime import timedelta

ec = EvaluationClient(region_name="us-east-1")

results = ec.run(
    # Built-in and custom evaluator IDs can be mixed; MY_CUSTOM_EVAL_ID is the
    # ID returned by create_evaluator (see "Creating a custom evaluator" below).
    evaluator_ids=["Builtin.Correctness", "Builtin.GoalSuccessRate", MY_CUSTOM_EVAL_ID],
    session_id="<session-id>",
    agent_id="<agent-id>",
    look_back_time=timedelta(hours=2),
    # Ground truth the evaluators score the session against
    reference_inputs=ReferenceInputs(
        expected_response="Employee EMP-001 has 10 remaining PTO days.",
        assertions=["Agent called get_pto_balance", "Agent reported 10 remaining days"],
        expected_trajectory=["get_pto_balance"],
    ),
)
```
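In this sketch, `look_back_time` bounds how far back in CloudWatch Logs the client searches for the session's spans, and the returned results carry each evaluator's scores and explanations, as surfaced in Step 5 of the notebook.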
### Using `OnDemandEvaluationDatasetRunner` directly

```python
from bedrock_agentcore.evaluation import (
    Dataset, PredefinedScenario, Turn,
    EvaluationRunConfig, EvaluatorConfig,
    OnDemandEvaluationDatasetRunner,
    CloudWatchAgentSpanCollector,
)

# Each scenario bundles its turn inputs with ground truth:
# expected response, expected tool trajectory, and assertions.
dataset = Dataset(scenarios=[
    PredefinedScenario(
        scenario_id="pto-check",
        turns=[Turn(
            input="What is the PTO balance for EMP-001?",
            expected_response="EMP-001 has 10 remaining PTO days.",
        )],
        expected_trajectory=["get_pto_balance"],
        assertions=["Agent reported 10 remaining PTO days"],
    ),
])

runner = OnDemandEvaluationDatasetRunner(region="us-east-1")
result = runner.run(
    config=EvaluationRunConfig(
        evaluator_config=EvaluatorConfig(evaluator_ids=["Builtin.Correctness"]),
        # Give CloudWatch time to ingest spans before evaluating
        evaluation_delay_seconds=180,
    ),
    dataset=dataset,
    agent_invoker=my_invoker_fn,  # callable that sends a turn to your agent (sketched below)
    span_collector=CloudWatchAgentSpanCollector(log_group_name=CW_LOG_GROUP, region="us-east-1"),
)
```
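In the example above, `my_invoker_fn` and `CW_LOG_GROUP` are placeholders you supply. One possible invoker, shown as a sketch only: it assumes the runner passes the turn's input text plus a session ID and expects the agent's reply text back (check the SDK's expected callable signature), and that the deployed agent's handler accepts a `{"prompt": ...}` JSON payload.

```python
import json

import boto3

dp = boto3.client("bedrock-agentcore", region_name="us-east-1")

def my_invoker_fn(prompt: str, session_id: str) -> str:
    """Hypothetical invoker; the exact signature the runner expects is an assumption."""
    resp = dp.invoke_agent_runtime(
        agentRuntimeArn=AGENT_RUNTIME_ARN,  # placeholder: ARN of your deployed runtime
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": prompt}),
    )
    # Non-streaming responses arrive as a readable body
    return resp["response"].read().decode("utf-8")
```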
---
## Sample Prompts

The following prompts are used in the notebook. They can also be sent directly to a
deployed HR Assistant to generate sessions for evaluation.
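For example, a single prompt can be sent with the data plane's `invoke_agent_runtime` API. A minimal sketch, assuming the agent's handler accepts a `{"prompt": ...}` JSON payload and `AGENT_RUNTIME_ARN` holds your deployed runtime's ARN:

```python
import json
import uuid

import boto3

dp = boto3.client("bedrock-agentcore", region_name="us-east-1")

resp = dp.invoke_agent_runtime(
    agentRuntimeArn=AGENT_RUNTIME_ARN,   # placeholder: your runtime's ARN
    runtimeSessionId=str(uuid.uuid4()),  # a sufficiently long, unique session ID
    payload=json.dumps({"prompt": "What is the current PTO balance for employee EMP-001?"}),
)
print(resp["response"].read().decode("utf-8"))
```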
### Single-turn

| Prompt | Expected tool(s) | Expected outcome |
|---|---|---|
| `What is the current PTO balance for employee EMP-001?` | `get_pto_balance` | 10 remaining days (15 total, 5 used) |
| `Please submit a PTO request for EMP-001 from 2026-04-14 to 2026-04-16 for a family vacation.` | `submit_pto_request` | Approved, request ID `PTO-2026-001` |
| `Can you pull up the January 2026 pay stub for employee EMP-001?` | `get_pay_stub` | Gross $8,333.33, net $5,362.50 |
| `What is the company PTO policy?` | `lookup_hr_policy` | 15 days/year, 2-day advance notice, 5-day rollover |
| `How does the 401k match work?` | `get_benefits_summary` | 100% match up to 4%, 50% on next 2%, 3-year vesting |
| `Check the PTO balance for EMP-002 and if they have at least 2 days, submit a request for 2026-05-26 to 2026-05-27.` | `get_pto_balance` → `submit_pto_request` | 3 days remaining → request approved |

### Multi-turn

**PTO planning (3 turns)**
1. `How many PTO days do I have left? My employee ID is EMP-001.`
2. `Great. I'd like to take December 23 to December 25 off. Please submit a request.`
3. `Remind me — what is the policy on rolling over unused PTO?`

Expected trajectory: `get_pto_balance` → `submit_pto_request` → `lookup_hr_policy`

**New employee onboarding (4 turns)**
1. `I just joined the company. What is the remote work policy?`
2. `How much PTO do I get as a new employee?`
3. `What life insurance benefit does the company provide?`
4. `Can you check the current PTO balance for employee EMP-042?`

Expected trajectory: `lookup_hr_policy` → `lookup_hr_policy` → `get_benefits_summary` → `get_pto_balance`

---
## Custom Evaluators with Ground Truth

Custom evaluators let you define evaluation criteria in natural language. The service
substitutes **ground-truth placeholders** from `ReferenceInputs` before scoring.

### Placeholder reference

| Level | Placeholder | Populated from |
|---|---|---|
| TRACE | `{assistant_turn}` | Agent's actual response for that turn |
| TRACE | `{expected_response}` | `ReferenceInputs.expected_response` |
| TRACE | `{context}` | Conversation context preceding the turn |
| SESSION | `{actual_tool_trajectory}` | Tools the agent called during the session |
| SESSION | `{expected_tool_trajectory}` | `ReferenceInputs.expected_trajectory` |
| SESSION | `{assertions}` | `ReferenceInputs.assertions` |
| SESSION | `{available_tools}` | Tools available to the agent |
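Conceptually, the substitution behaves like Python string formatting: the service fills each placeholder from the session and from `ReferenceInputs` before the judge model sees the instructions. An illustrative sketch (this happens inside the service; the code below is not its implementation):

```python
# Illustrative only: shows what a TRACE-level judge prompt looks like
# after the service substitutes placeholder values.
template = (
    "Compare the agent's response with the expected response.\n"
    "Agent response: {assistant_turn}\n"
    "Expected response: {expected_response}"
)
judge_prompt = template.format(
    assistant_turn="EMP-001 has 10 PTO days remaining.",              # from the session
    expected_response="Employee EMP-001 has 10 remaining PTO days.",  # from ReferenceInputs
)
print(judge_prompt)
```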
### Creating a custom evaluator

```python
import boto3, uuid

cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1")

# Trace-level: response similarity using ground-truth placeholders
result = cp.create_evaluator(
    evaluatorName=f"ResponseSimilarity_{uuid.uuid4().hex[:8]}",
    level="TRACE",
    evaluatorConfig={
        "llmAsAJudge": {
            "instructions": (
                "Compare the agent's response with the expected response.\n"
                "Agent response: {assistant_turn}\n"
                "Expected response: {expected_response}\n\n"
                "Rate how closely the responses match on a scale of 0 to 1."
            ),
            "ratingScale": {
                "numerical": [
                    {"value": 0.0, "label": "not_similar",
                     "definition": "Response is factually different from expected."},
                    {"value": 0.5, "label": "partially_similar",
                     "definition": "Response partially matches expected."},
                    {"value": 1.0, "label": "highly_similar",
                     "definition": "Response is semantically equivalent to expected."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    "modelId": "us.amazon.nova-lite-v1:0",
                    "inferenceConfig": {"maxTokens": 512},
                }
            },
        }
    },
)
custom_evaluator_id = result["evaluatorId"]
```
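The tutorial's SESSION-level `HRAssertionChecker` is created the same way, using the SESSION placeholders from the table above. A sketch along those lines, with illustrative instructions and rating scale rather than the notebook's exact definitions:

```python
# Session-level: checks assertions against the observed tool trajectory.
# Instruction wording below is illustrative, not the notebook's exact text.
assertion_result = cp.create_evaluator(
    evaluatorName=f"AssertionChecker_{uuid.uuid4().hex[:8]}",
    level="SESSION",
    evaluatorConfig={
        "llmAsAJudge": {
            "instructions": (
                "The agent called these tools in order: {actual_tool_trajectory}\n"
                "The expected trajectory was: {expected_tool_trajectory}\n"
                "Check each of the following assertions against the session:\n"
                "{assertions}\n\n"
                "Score 1 if every assertion holds, 0.5 if some hold, 0 if none hold."
            ),
            "ratingScale": {
                "numerical": [
                    {"value": 0.0, "label": "fails", "definition": "No assertions hold."},
                    {"value": 0.5, "label": "partial", "definition": "Some assertions hold."},
                    {"value": 1.0, "label": "passes", "definition": "All assertions hold."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    "modelId": "us.amazon.nova-lite-v1:0",
                    "inferenceConfig": {"maxTokens": 512},
                }
            },
        }
    },
)
```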
Pass `custom_evaluator_id` to `EvaluationClient.run()` or `EvaluatorConfig` like any
built-in evaluator ID. Seed the level cache to avoid an extra `get_evaluator` lookup:

```python
eval_client._evaluator_level_cache[custom_evaluator_id] = "TRACE"
```
### Custom evaluators in this tutorial

| Evaluator | Level | Placeholders used | Where used |
|---|---|---|---|
| `HRResponseSimilarity` | TRACE | `{assistant_turn}`, `{expected_response}` | EvaluationClient (Steps 5a, 5b), DatasetRunner (Step 6) |
| `HRAssertionChecker` | SESSION | `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}` | EvaluationClient (Step 5d, multi-turn), DatasetRunner (Step 6) |

> **Note:** SESSION-level custom evaluators require a session with multiple tool calls to
> extract a meaningful trajectory. They are used on multi-turn sessions in Step 5d and on
> all DatasetRunner scenarios in Step 6, where a 180-second ingestion delay ensures span
> data is complete before evaluation.

---
## Built-in Evaluators

| Evaluator | Level | Ground truth required |
|---|---|---|
| `Builtin.Correctness` | TRACE | `expected_response` |
| `Builtin.Helpfulness` | TRACE | None |
| `Builtin.ResponseRelevance` | TRACE | None |
| `Builtin.GoalSuccessRate` | SESSION | `assertions` |
| `Builtin.TrajectoryExactOrderMatch` | SESSION | `expected_trajectory` |
| `Builtin.TrajectoryInOrderMatch` | SESSION | `expected_trajectory` |
| `Builtin.TrajectoryAnyOrderMatch` | SESSION | `expected_trajectory` |

**Evaluation levels:**
- **TRACE** — one result per conversational turn (agent response)
- **SESSION** — one result per complete conversation

---
## Files

| File | Description |
|---|---|
| `groundtruth_evaluations.ipynb` | Main tutorial notebook — self-contained, end-to-end |
| `requirements.txt` | Python dependencies installed into the agent container |

`hr_assistant_agent.py` and `.bedrock_agentcore.yaml` are generated at runtime (by the
`%%writefile` notebook cell and the starter toolkit, respectively) and are excluded from
version control.

---
## Clean Up

### Delete the agent runtime

Uncomment and run the cleanup cell in the notebook:

```python
agentcore_runtime.delete()
```

Or via the AWS CLI (note that deleting a runtime is a control-plane operation):

```bash
aws bedrock-agentcore-control delete-agent-runtime \
  --agent-runtime-id hr_assistant_eval_tutorial-xfZ3yiH356 \
  --region us-east-1
```

### Delete custom evaluators

```python
import boto3

cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1")
for evaluator_id in [CUSTOM_RESPONSE_SIMILARITY_ID, CUSTOM_ASSERTION_CHECKER_ID]:
    cp.delete_evaluator(evaluatorId=evaluator_id)
    print(f"Deleted {evaluator_id}")
```

### Delete the ECR repository

```bash
aws ecr delete-repository \
  --repository-name bedrock-agentcore-hr_assistant_eval_tutorial \
  --region us-east-1 \
  --force
```

### Delete the CloudWatch log group

```bash
aws logs delete-log-group \
  --log-group-name /aws/bedrock-agentcore/runtimes/hr_assistant_eval_tutorial-xfZ3yiH356-DEFAULT \
  --region us-east-1
```

---
## Additional Resources

- [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators)
- [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html)
- [Amazon Bedrock AgentCore Developer Guide](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/)
- [Strands Agents SDK](https://strandsagents.com/)
- [Build reliable AI agents with Amazon Bedrock AgentCore Evaluations](https://aws.amazon.com/blogs/machine-learning/build-reliable-ai-agents-with-amazon-bedrock-agentcore-evaluations/)
