diff --git a/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/.gitignore b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/.gitignore new file mode 100644 index 000000000..b17e1aefb --- /dev/null +++ b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/.gitignore @@ -0,0 +1,5 @@ +# Generated by the %%writefile cell in groundtruth_evaluations.ipynb +hr_assistant_agent.py + +# Generated by bedrock-agentcore-starter-toolkit on first deploy +.bedrock_agentcore.yaml diff --git a/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/README.md b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/README.md new file mode 100644 index 000000000..c4a34dd64 --- /dev/null +++ b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/README.md @@ -0,0 +1,385 @@ +# Ground Truth Evaluations with Custom Evaluators + +## Introduction + +This tutorial demonstrates end-to-end evaluation of an agentic application using +[**Amazon Bedrock AgentCore Evaluations**](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html) with ground-truth reference inputs. It covers +the two primary evaluation interfaces — `EvaluationClient` and +`OnDemandEvaluationDatasetRunner` — and shows how to create **custom LLM-as-a-judge +evaluators** that use ground-truth placeholders to tailor scoring criteria to your +application domain. + +The tutorial deploys an **HR Assistant agent** for Acme Corp — a +[Strands Agents](https://strandsagents.com/) application that helps employees with PTO +management, HR policy lookups, benefits information, and pay stub retrieval. Its tools +return deterministic mock data, making evaluation results fully reproducible. 
+ +### Key concepts covered + +| Concept | Description | +|---|---| +| `EvaluationClient` | Evaluate specific existing CloudWatch sessions against ground-truth references | +| `OnDemandEvaluationDatasetRunner` | Define a test dataset, auto-invoke the agent per scenario, and evaluate the results | +| `ReferenceInputs` | Supply `expected_response`, `expected_trajectory`, and `assertions` as ground truth | +| Custom evaluators | Create LLM-as-a-judge evaluators with domain-specific instructions and ground-truth placeholders | + + +> **Further reading** +> - [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators) +> - [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html) + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ Tutorial Notebook (groundtruth_evaluations.ipynb) │ +│ │ +│ Step 1 ──► bedrock-agentcore-starter-toolkit │ +│ │ CodeBuild builds image, pushes to ECR │ +│ └──► AgentCore Runtime (HR Assistant Agent) │ +│ │ invoke_agent_runtime() │ +│ Step 2 ──► bedrock-agentcore-control ──► Custom Evaluators │ +│ create_evaluator() │ +│ │ +│ Step 3 ──► AgentCore Runtime (generate sessions) │ +│ │ OTel spans ──► CloudWatch Logs │ +│ │ +│ Step 4 ──► EvaluationClient.run() │ +│ │ CloudWatchAgentSpanCollector reads spans │ +│ └──► Evaluate API ──► Built-in + Custom Evaluators │ +│ └──► Scores & Explanations │ +│ │ +│ Step 5 ──► OnDemandEvaluationDatasetRunner.run() │ +│ │ Invokes agent per scenario │ +│ │ Waits for CloudWatch ingestion │ +│ └──► Evaluate API ──► Built-in + Custom Evaluators │ +│ └──► Per-scenario Results │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +**Component roles** + +| Component | Role | +|---|---| +| AgentCore Runtime | Hosts the containerised HR Assistant, emits OTel spans to 
CloudWatch | +| CloudWatch Logs | Stores session spans; queried by `CloudWatchAgentSpanCollector` | +| `bedrock-agentcore-control` | Control plane — creates custom evaluators and agent runtimes | +| Evaluate API (`bedrock-agentcore`) | Data plane — scores sessions against evaluator definitions | +| Starter Toolkit | Builds the Docker image via CodeBuild and registers the runtime; no local Docker required | + +--- + +## Prerequisites + +- **Python 3.10+** with the packages in `requirements.txt` +- **AWS credentials** configured (e.g. via `aws configure` or environment variables) with + permissions for: + - `bedrock-agentcore:*` — invoke agent runtime and call Evaluate API + - `bedrock-agentcore-control:CreateAgentRuntime`, `UpdateAgentRuntime`, + `GetAgentRuntime`, `CreateEvaluator` — deploy agent and register evaluators + - `logs:FilterLogEvents`, `logs:DescribeLogGroups`, `logs:StartQuery`, + `logs:GetQueryResults` — read CloudWatch spans + - `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`, + `ecr:InitiateLayerUpload`, `ecr:PutImage` — push container image + - `codebuild:StartBuild`, `codebuild:BatchGetBuilds` — image build via CodeBuild + - `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:PassRole` — auto-create execution roles + - `s3:PutObject`, `s3:GetObject` — CodeBuild source upload +- **No local Docker required** — the starter toolkit builds the container image via + AWS CodeBuild + +Install dependencies: + +```bash +pip install -r requirements.txt +``` + +--- + +## Usage + +### Run the notebook + +Open and run [`groundtruth_evaluations.ipynb`](groundtruth_evaluations.ipynb) top-to-bottom. +Each cell is idempotent — re-running the notebook updates the existing agent runtime and +creates fresh custom evaluators with a unique suffix to avoid naming conflicts. 
+ +```bash +jupyter notebook groundtruth_evaluations.ipynb +``` + +Or execute non-interactively: + +```bash +jupyter nbconvert --to notebook --execute --inplace groundtruth_evaluations.ipynb +``` + +### Notebook walkthrough + +| Step | Cell(s) | What happens | +|---|---|---| +| **1 — Install** | `install` | Installs `bedrock-agentcore`, `strands-agents`, and other dependencies | +| **2 — Configure** | `setup` | Creates a boto3 session and sets `REGION` | +| **3a — Deploy agent** | `nn72gdo2s4h`, `deploy`, `wait-deploy`, `agent-config` | Writes `hr_assistant_agent.py`, builds image via CodeBuild, creates/updates the AgentCore Runtime, polls until `READY` | +| **3b — Create evaluators** | `76hyptexblj` | Creates `HRResponseSimilarity` (TRACE) and `HRAssertionChecker` (SESSION) custom evaluators via `bedrock-agentcore-control` | +| **4 — Invoke agent** | `invoke-single`, `invoke-multi`, `invoke-onboard` | Runs 5 sessions (single- and multi-turn), waits 60 s for CloudWatch ingestion | +| **5 — EvaluationClient** | `ec-*` | Evaluates each session by session ID using built-in and custom evaluators | +| **6 — DatasetRunner** | `runner-*` | Defines a 5-scenario dataset, invokes the agent per scenario, waits 180 s, evaluates all scenarios | +| **7 — Cleanup** | `cleanup` | (Commented out) Deletes the agent runtime | + +### Using `EvaluationClient` directly + +```python +from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs +from datetime import timedelta + +ec = EvaluationClient(region_name="us-east-1") + +results = ec.run( + evaluator_ids=["Builtin.Correctness", "Builtin.GoalSuccessRate", MY_CUSTOM_EVAL_ID], + session_id="", + agent_id="", + look_back_time=timedelta(hours=2), + reference_inputs=ReferenceInputs( + expected_response="Employee EMP-001 has 10 remaining PTO days.", + assertions=["Agent called get_pto_balance", "Agent reported 10 remaining days"], + expected_trajectory=["get_pto_balance"], + ), +) +``` + +### Using 
`OnDemandEvaluationDatasetRunner` directly + +```python +from bedrock_agentcore.evaluation import ( + Dataset, PredefinedScenario, Turn, + EvaluationRunConfig, EvaluatorConfig, + OnDemandEvaluationDatasetRunner, + CloudWatchAgentSpanCollector, +) + +dataset = Dataset(scenarios=[ + PredefinedScenario( + scenario_id="pto-check", + turns=[Turn( + input="What is the PTO balance for EMP-001?", + expected_response="EMP-001 has 10 remaining PTO days.", + )], + expected_trajectory=["get_pto_balance"], + assertions=["Agent reported 10 remaining PTO days"], + ), +]) + +runner = OnDemandEvaluationDatasetRunner(region="us-east-1") +result = runner.run( + config=EvaluationRunConfig( + evaluator_config=EvaluatorConfig(evaluator_ids=["Builtin.Correctness"]), + evaluation_delay_seconds=180, + ), + dataset=dataset, + agent_invoker=my_invoker_fn, + span_collector=CloudWatchAgentSpanCollector(log_group_name=CW_LOG_GROUP, region="us-east-1"), +) +``` + +--- + +## Sample Prompts + +The following prompts are used in the notebook. They can also be sent directly to a +deployed HR Assistant to generate sessions for evaluation. 
+ +### Single-turn + +| Prompt | Expected tool | Expected outcome | +|---|---|---| +| `What is the current PTO balance for employee EMP-001?` | `get_pto_balance` | 10 remaining days (15 total, 5 used) | +| `Please submit a PTO request for EMP-001 from 2026-04-14 to 2026-04-16 for a family vacation.` | `submit_pto_request` | Approved, request ID `PTO-2026-001` | +| `Can you pull up the January 2026 pay stub for employee EMP-001?` | `get_pay_stub` | Gross $8,333.33, net $5,362.50 | +| `What is the company PTO policy?` | `lookup_hr_policy` | 15 days/year, 2-day advance notice, 5-day rollover | +| `How does the 401k match work?` | `get_benefits_summary` | 100% match up to 4%, 50% on next 2%, 3-year vesting | +| `Check the PTO balance for EMP-002 and if they have at least 2 days, submit a request for 2026-05-26 to 2026-05-27.` | `get_pto_balance` → `submit_pto_request` | 3 days remaining → request approved | + +### Multi-turn + +**PTO planning (3 turns)** +1. `How many PTO days do I have left? My employee ID is EMP-001.` +2. `Great. I'd like to take December 23 to December 25 off. Please submit a request.` +3. `Remind me — what is the policy on rolling over unused PTO?` + +Expected trajectory: `get_pto_balance` → `submit_pto_request` → `lookup_hr_policy` + +**New employee onboarding (4 turns)** +1. `I just joined the company. What is the remote work policy?` +2. `How much PTO do I get as a new employee?` +3. `What life insurance benefit does the company provide?` +4. `Can you check the current PTO balance for employee EMP-042?` + +Expected trajectory: `lookup_hr_policy` → `lookup_hr_policy` → `get_benefits_summary` → `get_pto_balance` + +--- + +## Custom Evaluators with Ground Truth + +Custom evaluators let you define evaluation criteria in natural language. The service +substitutes **ground-truth placeholders** from `ReferenceInputs` before scoring. 
+ +### Placeholder reference + +| Level | Placeholder | Populated from | +|---|---|---| +| TRACE | `{assistant_turn}` | Agent's actual response for that turn | +| TRACE | `{expected_response}` | `ReferenceInputs.expected_response` | +| TRACE | `{context}` | Conversation context preceding the turn | +| SESSION | `{actual_tool_trajectory}` | Tools the agent called during the session | +| SESSION | `{expected_tool_trajectory}` | `ReferenceInputs.expected_trajectory` | +| SESSION | `{assertions}` | `ReferenceInputs.assertions` | +| SESSION | `{available_tools}` | Tools available to the agent | + +### Creating a custom evaluator + +```python +import boto3, uuid + +cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1") + +# Trace-level: response similarity using ground-truth placeholders +result = cp.create_evaluator( + evaluatorName=f"ResponseSimilarity_{uuid.uuid4().hex[:8]}", + level="TRACE", + evaluatorConfig={ + "llmAsAJudge": { + "instructions": ( + "Compare the agent's response with the expected response.\n" + "Agent response: {assistant_turn}\n" + "Expected response: {expected_response}\n\n" + "Rate how closely the responses match on a scale of 0 to 1." + ), + "ratingScale": { + "numerical": [ + {"value": 0.0, "label": "not_similar", + "definition": "Response is factually different from expected."}, + {"value": 0.5, "label": "partially_similar", + "definition": "Response partially matches expected."}, + {"value": 1.0, "label": "highly_similar", + "definition": "Response is semantically equivalent to expected."}, + ] + }, + "modelConfig": { + "bedrockEvaluatorModelConfig": { + "modelId": "us.amazon.nova-lite-v1:0", + "inferenceConfig": {"maxTokens": 512}, + } + }, + } + }, +) +custom_evaluator_id = result["evaluatorId"] +``` + +Pass `custom_evaluator_id` to `EvaluationClient.run()` or `EvaluatorConfig` like any +built-in evaluator ID. 
Seed the level cache to avoid an extra `get_evaluator` lookup: + +```python +eval_client._evaluator_level_cache[custom_evaluator_id] = "TRACE" +``` + +### Custom evaluators in this tutorial + +| Evaluator | Level | Placeholders used | Where used | +|---|---|---|---| +| `HRResponseSimilarity` | TRACE | `{assistant_turn}`, `{expected_response}` | EvaluationClient (Steps 5a, 5b), DatasetRunner (Step 6) | +| `HRAssertionChecker` | SESSION | `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}` | EvaluationClient (Step 5d, multi-turn), DatasetRunner (Step 6) | + +> **Note:** SESSION-level custom evaluators require a session with multiple tool calls to +> extract a meaningful trajectory. They are used on multi-turn sessions in Step 5d and on +> all DatasetRunner scenarios in Step 6, where a 180-second ingestion delay ensures span +> data is complete before evaluation. + +--- + +## Built-in Evaluators + +| Evaluator | Level | Ground truth required | +|---|---|---| +| `Builtin.Correctness` | TRACE | `expected_response` | +| `Builtin.Helpfulness` | TRACE | None | +| `Builtin.ResponseRelevance` | TRACE | None | +| `Builtin.GoalSuccessRate` | SESSION | `assertions` | +| `Builtin.TrajectoryExactOrderMatch` | SESSION | `expected_trajectory` | +| `Builtin.TrajectoryInOrderMatch` | SESSION | `expected_trajectory` | +| `Builtin.TrajectoryAnyOrderMatch` | SESSION | `expected_trajectory` | + +**Evaluation levels:** +- **TRACE** — one result per conversational turn (agent response) +- **SESSION** — one result per complete conversation + +--- + +## Files + +| File | Description | +|---|---| +| `groundtruth_evaluations.ipynb` | Main tutorial notebook — self-contained, end-to-end | +| `requirements.txt` | Python dependencies installed into the agent container | + +`hr_assistant_agent.py` and `.bedrock_agentcore.yaml` are generated at runtime (by the `%%writefile` notebook cell and the starter toolkit respectively) + +--- + +## Clean Up + +### Delete the agent runtime 
+ +Uncomment and run the cleanup cell in the notebook: + +```python +agentcore_runtime.delete() +``` + +Or via the AWS CLI: + +```bash +aws bedrock-agentcore delete-agent-runtime \ + --agent-runtime-id hr_assistant_eval_tutorial-xfZ3yiH356 \ + --region us-east-1 +``` + +### Delete custom evaluators + +```python +import boto3 + +cp = boto3.client("bedrock-agentcore-control", region_name="us-east-1") +for evaluator_id in [CUSTOM_RESPONSE_SIMILARITY_ID, CUSTOM_ASSERTION_CHECKER_ID]: + cp.delete_evaluator(evaluatorId=evaluator_id) + print(f"Deleted {evaluator_id}") +``` + +### Delete the ECR repository + +```bash +aws ecr delete-repository \ + --repository-name bedrock-agentcore-hr_assistant_eval_tutorial \ + --region us-east-1 \ + --force +``` + +### Delete CloudWatch log group + +```bash +aws logs delete-log-group \ + --log-group-name /aws/bedrock-agentcore/runtimes/hr_assistant_eval_tutorial-xfZ3yiH356-DEFAULT \ + --region us-east-1 +``` + +--- + +## Additional Resources + +- [Ground-truth evaluations — custom evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html#gt-custom-evaluators) +- [Dataset-based evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/dataset-evaluations.html) +- [Amazon Bedrock AgentCore Developer Guide](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/) +- [Strands Agents SDK](https://strandsagents.com/) +- [Build reliable AI agents with Amazon Bedrock AgentCore Evaluations](https://aws.amazon.com/blogs/machine-learning/build-reliable-ai-agents-with-amazon-bedrock-agentcore-evaluations/) diff --git a/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/groundtruth_evaluations.ipynb b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/groundtruth_evaluations.ipynb new file mode 100644 index 000000000..2b08cc7c6 --- /dev/null +++ 
b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/groundtruth_evaluations.ipynb @@ -0,0 +1,1385 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "intro-md", + "metadata": {}, + "source": [ + "# HR Assistant Agent — Ground Truth Evaluations\n", + "\n", + "This notebook demonstrates evaluation of an agentic application with ground truth using Amazon Bedrock AgentCore Evaluations:\n", + "\n", + "| Interface | When to use |\n", + "|---|---|\n", + "| **EvaluationClient** | You already have agent sessions in CloudWatch. Evaluate specific sessions against reference inputs. |\n", + "| **OnDemandEvaluationDatasetRunner** | You have a test dataset. You want to invoke the agent for every scenario and evaluate the results. |\n", + "\n", + "### The HR Assistant Agent\n", + "\n", + "We'll deploy an **HR Assistant** for Acme Corp — a Strands agent that helps employees with:\n", + "- PTO balance checks and time-off requests\n", + "- HR policy lookups (PTO, remote work, parental leave)\n", + "- Benefits information (health, dental, vision, 401k)\n", + "- Pay stub retrieval\n", + "\n", + "### What You'll Learn\n", + "- How to use `EvaluationClient` to evaluate existing agent sessions logged in Amazon CloudWatch with ground-truth references\n", + "- How to use `OnDemandEvaluationDatasetRunner` to run automated dataset evaluations\n", + "- How to interpret evaluation results across built-in evaluators (Correctness, GoalSuccessRate, Trajectory)\n", + "\n", + "### Tutorial Details\n", + "\n", + "| Information | Details |\n", + "|---|---|\n", + "| Agent framework | Strands Agents |\n", + "| Runtime | Amazon Bedrock AgentCore Runtime |\n", + "| Evaluation SDK | `bedrock-agentcore` |\n", + "| AWS services | AgentCore Runtime, AgentCore Evaluations, CloudWatch Logs |\n", + "\n", + "### Prerequisites\n", + "- Python 3.10+\n", + "- AWS credentials with permissions for AgentCore, Lambda, CloudWatch, ECR, IAM\n", + "- Docker running locally (for agent container 
image build)" + ] + }, + { + "cell_type": "markdown", + "id": "install-md", + "metadata": {}, + "source": [ + "## Step 1: Install Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "install", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:08:32.996391Z", + "iopub.status.busy": "2026-03-31T18:08:32.996290Z", + "iopub.status.idle": "2026-03-31T18:08:34.578367Z", + "shell.execute_reply": "2026-03-31T18:08:34.577735Z" + } + }, + "outputs": [], + "source": [ + "!pip install -r requirements.txt -q" + ] + }, + { + "cell_type": "markdown", + "id": "setup-md", + "metadata": {}, + "source": [ + "## Step 2: Configuration\n", + "\n", + "Import libraries and configure your AWS session." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "setup", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:08:34.580804Z", + "iopub.status.busy": "2026-03-31T18:08:34.580612Z", + "iopub.status.idle": "2026-03-31T18:08:34.719835Z", + "shell.execute_reply": "2026-03-31T18:08:34.719215Z" + } + }, + "outputs": [], + "source": [ + "import boto3\n", + "import json\n", + "import time\n", + "import uuid\n", + "from datetime import timedelta\n", + "from boto3.session import Session\n", + "from IPython.display import display, Markdown\n", + "\n", + "region = \"aws_region\" # Add AWS region here \n", + "boto_session = Session(region_name=region)\n", + "REGION = boto_session.region_name\n", + "\n", + "print(f\"Region : {REGION}\")" + ] + }, + { + "cell_type": "markdown", + "id": "deploy-md", + "metadata": {}, + "source": [ + "## Step 3: Deploy the HR Assistant Agent\n", + "\n", + "We deploy the HR Assistant to **AgentCore Runtime** using the AWS SDK directly (no starter toolkit).\n", + "The deployment steps are:\n", + "\n", + "1. **ECR authentication** — use `boto3` to get a temporary ECR token and run `docker login`\n", + "2. 
**Docker build** — build a container image from `hr_assistant_agent.py` and the local `Dockerfile`\n", + "3. **Docker push** — push the image to Amazon ECR\n", + "4. **CreateAgentRuntime / UpdateAgentRuntime** — register or update the agent endpoint via `bedrock-agentcore-control`\n", + "\n", + "If a `.bedrock_agentcore.yaml` from a previous run is present, its ECR repository, IAM role, and\n", + "existing agent ID are reused so the cell is idempotent (re-running triggers an update, not a duplicate create)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "nn72gdo2s4h", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:08:34.721827Z", + "iopub.status.busy": "2026-03-31T18:08:34.721686Z", + "iopub.status.idle": "2026-03-31T18:08:34.728424Z", + "shell.execute_reply": "2026-03-31T18:08:34.727890Z" + } + }, + "outputs": [], + "source": "%%writefile hr_assistant_agent.py\n\"\"\"\nHR Assistant Agent — Strands agent deployed on Bedrock AgentCore Runtime.\n\nTools (deterministic / mock data for reproducible evaluations):\n get_pto_balance — remaining PTO days for an employee\n submit_pto_request — request time off\n lookup_hr_policy — company policy documents\n get_benefits_summary — health, dental, vision, 401k, life insurance details\n get_pay_stub — pay stub for a given period\n\"\"\"\n\nimport logging\nimport re\n\nfrom bedrock_agentcore.runtime import BedrockAgentCoreApp\nfrom strands import Agent, tool\nfrom strands.models import BedrockModel\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\napp = BedrockAgentCoreApp()\n\n# ---------------------------------------------------------------------------\n# Mock data\n# ---------------------------------------------------------------------------\n\n_PTO_BALANCES = {\n \"EMP-001\": {\"total_days\": 15, \"used_days\": 5, \"remaining_days\": 10},\n \"EMP-002\": {\"total_days\": 15, \"used_days\": 12, \"remaining_days\": 3},\n \"EMP-042\": 
{\"total_days\": 20, \"used_days\": 7, \"remaining_days\": 13},\n}\n\n_HR_POLICIES = {\n \"pto\": (\n \"PTO Policy: Full-time employees accrue 15 days of PTO per year (20 days after 3 years). \"\n \"PTO requests must be submitted at least 2 business days in advance. \"\n \"Unused PTO up to 5 days rolls over to the next year. \"\n \"PTO cannot be taken in advance of accrual.\"\n ),\n \"remote_work\": (\n \"Remote Work Policy: Employees may work remotely up to 3 days per week with manager approval. \"\n \"Core collaboration hours are 10am-3pm local time. \"\n \"A dedicated workspace with reliable internet (25 Mbps+) is required. \"\n \"Employees must be reachable via Slack and email during core hours.\"\n ),\n \"parental_leave\": (\n \"Parental Leave Policy: Primary caregivers receive 16 weeks of fully paid parental leave. \"\n \"Secondary caregivers receive 6 weeks of fully paid parental leave. \"\n \"Leave may begin up to 2 weeks before the expected birth or adoption date. \"\n \"Benefits continue unchanged during parental leave.\"\n ),\n \"code_of_conduct\": (\n \"Code of Conduct: All employees are expected to treat colleagues, customers, and partners \"\n \"with respect and professionalism. Harassment, discrimination, and retaliation of any kind \"\n \"are strictly prohibited. Violations should be reported to HR or via the anonymous hotline.\"\n ),\n}\n\n_BENEFITS = {\n \"health\": (\n \"Health Insurance: The company covers 90% of premiums for employee-only coverage and 75% \"\n \"for family coverage. Plans available: Blue Shield PPO, Kaiser HMO, and HDHP with HSA. \"\n \"Annual deductible: $500 (PPO), $0 (HMO), $1,500 (HDHP). \"\n \"Open enrollment is each November for the following calendar year.\"\n ),\n \"dental\": (\n \"Dental Insurance: 100% coverage for preventive care (cleanings, X-rays). \"\n \"80% coverage for basic restorative care (fillings, extractions). \"\n \"50% coverage for major restorative care (crowns, bridges). 
\"\n \"Annual maximum benefit: $2,000 per person. Orthodontia lifetime maximum: $1,500.\"\n ),\n \"vision\": (\n \"Vision Insurance: Annual eye exam covered in full. \"\n \"Frames or contacts allowance: $200 per year. \"\n \"Laser vision correction discount: 15% off at participating providers.\"\n ),\n \"401k\": (\n \"401(k) Plan: The company matches 100% of employee contributions up to 4% of salary. \"\n \"An additional 50% match on the next 2% (total effective match up to 5%). \"\n \"Employees are eligible to contribute immediately; company match vests over 3 years. \"\n \"2026 IRS contribution limit: $23,500 (under 50), $31,000 (age 50+).\"\n ),\n \"life_insurance\": (\n \"Life Insurance: Basic life insurance of 2x annual salary provided at no cost. \"\n \"Employees may purchase supplemental coverage up to 5x salary during open enrollment. \"\n \"Accidental death and dismemberment (AD&D) coverage equal to basic life benefit is included.\"\n ),\n}\n\n_PAY_STUBS = {\n (\"EMP-001\", \"2025-12\"): {\n \"gross_pay\": 8333.33,\n \"federal_tax\": 1458.33,\n \"state_tax\": 416.67,\n \"social_security\": 516.67,\n \"medicare\": 120.83,\n \"health_premium\": 125.00,\n \"401k_contribution\": 333.33,\n \"net_pay\": 5362.50,\n \"period\": \"December 2025\",\n },\n (\"EMP-001\", \"2026-01\"): {\n \"gross_pay\": 8333.33,\n \"federal_tax\": 1458.33,\n \"state_tax\": 416.67,\n \"social_security\": 516.67,\n \"medicare\": 120.83,\n \"health_premium\": 125.00,\n \"401k_contribution\": 333.33,\n \"net_pay\": 5362.50,\n \"period\": \"January 2026\",\n },\n (\"EMP-042\", \"2026-01\"): {\n \"gross_pay\": 10416.67,\n \"federal_tax\": 1875.00,\n \"state_tax\": 520.83,\n \"social_security\": 645.83,\n \"medicare\": 151.04,\n \"health_premium\": 200.00,\n \"401k_contribution\": 416.67,\n \"net_pay\": 6607.30,\n \"period\": \"January 2026\",\n },\n}\n\n_PTO_REQUEST_COUNTER = {\"n\": 0}\n\n\n# ---------------------------------------------------------------------------\n# Strands tools\n# 
---------------------------------------------------------------------------\n\n\n@tool\ndef get_pto_balance(employee_id: str) -> dict:\n \"\"\"\n Return the current PTO balance for an employee.\n\n Args:\n employee_id: Employee identifier (e.g. EMP-001)\n\n Returns:\n Dict with total_days, used_days, and remaining_days.\n \"\"\"\n balance = _PTO_BALANCES.get(employee_id)\n if balance:\n return {\"employee_id\": employee_id, **balance}\n return {\"employee_id\": employee_id, \"error\": f\"Employee {employee_id} not found.\"}\n\n\n@tool\ndef submit_pto_request(\n employee_id: str,\n start_date: str,\n end_date: str,\n reason: str = \"Personal time off\",\n) -> dict:\n \"\"\"\n Submit a PTO request for an employee.\n\n Args:\n employee_id: Employee identifier (e.g. EMP-001)\n start_date: First day of leave in YYYY-MM-DD format\n end_date: Last day of leave in YYYY-MM-DD format\n reason: Optional reason for the request\n\n Returns:\n Dict with request_id, status, and confirmation message.\n \"\"\"\n _PTO_REQUEST_COUNTER[\"n\"] += 1\n request_id = f\"PTO-2026-{_PTO_REQUEST_COUNTER['n']:03d}\"\n return {\n \"request_id\": request_id,\n \"employee_id\": employee_id,\n \"start_date\": start_date,\n \"end_date\": end_date,\n \"reason\": reason,\n \"status\": \"APPROVED\",\n \"message\": f\"PTO request {request_id} approved for {employee_id} from {start_date} to {end_date}.\",\n }\n\n\n@tool\ndef lookup_hr_policy(topic: str) -> dict:\n \"\"\"\n Look up a company HR policy document by topic.\n\n Args:\n topic: Policy topic. Supported values: pto, remote_work, parental_leave, code_of_conduct\n\n Returns:\n Dict with topic and policy_text.\n \"\"\"\n key = topic.lower().replace(\" \", \"_\").replace(\"-\", \"_\")\n text = _HR_POLICIES.get(key)\n if text:\n return {\"topic\": topic, \"policy_text\": text}\n return {\n \"topic\": topic,\n \"error\": f\"Policy '{topic}' not found. 
Available: {list(_HR_POLICIES.keys())}\",\n }\n\n\n@tool\ndef get_benefits_summary(benefit_type: str) -> dict:\n \"\"\"\n Return a summary of a specific employee benefit.\n\n Args:\n benefit_type: Type of benefit. Supported values: health, dental, vision, 401k, life_insurance\n\n Returns:\n Dict with benefit_type and summary text.\n \"\"\"\n key = benefit_type.lower().replace(\" \", \"_\").replace(\"-\", \"_\")\n text = _BENEFITS.get(key)\n if text:\n return {\"benefit_type\": benefit_type, \"summary\": text}\n return {\n \"benefit_type\": benefit_type,\n \"error\": f\"Benefit '{benefit_type}' not found. Available: {list(_BENEFITS.keys())}\",\n }\n\n\n@tool\ndef get_pay_stub(employee_id: str, period: str) -> dict:\n \"\"\"\n Retrieve a pay stub for an employee for a specific pay period.\n\n Args:\n employee_id: Employee identifier (e.g. EMP-001)\n period: Pay period in YYYY-MM format (e.g. 2026-01)\n\n Returns:\n Dict with gross pay, deductions, and net pay.\n \"\"\"\n stub = _PAY_STUBS.get((employee_id, period))\n if stub:\n return {\"employee_id\": employee_id, **stub}\n return {\n \"employee_id\": employee_id,\n \"period\": period,\n \"error\": f\"Pay stub not found for {employee_id} period {period}.\",\n }\n\n\n# ---------------------------------------------------------------------------\n# Agent\n# ---------------------------------------------------------------------------\n\nSYSTEM_PROMPT = \"\"\"You are a helpful HR Assistant for Acme Corp.\n\nYou help employees with:\n- Checking PTO (paid time off) balances\n- Submitting PTO requests\n- Looking up HR policies (PTO, remote work, parental leave, code of conduct)\n- Understanding employee benefits (health, dental, vision, 401k, life insurance)\n- Retrieving pay stub information\n\nAlways use the available tools to answer questions accurately. 
Do not make up\npolicy details, benefit amounts, or pay information — look them up.\nBe concise, professional, and friendly.\"\"\"\n\n_MODEL = BedrockModel(model_id=\"us.amazon.nova-lite-v1:0\")\n_TOOLS = [\n get_pto_balance,\n submit_pto_request,\n lookup_hr_policy,\n get_benefits_summary,\n get_pay_stub,\n]\n\n# Session cache: session_id -> Agent (preserves conversation history across turns)\n_SESSION_AGENTS: dict[str, Agent] = {}\n\n\n@app.entrypoint\nasync def invoke(payload, context):\n \"\"\"Handle an agent invocation from AgentCore Runtime.\"\"\"\n prompt = payload.get(\"prompt\", \"\")\n session_id = context.session_id\n logger.info(\"Received prompt (session=%s): %s\", session_id, prompt[:80])\n\n if session_id and session_id in _SESSION_AGENTS:\n agent = _SESSION_AGENTS[session_id]\n else:\n agent = Agent(model=_MODEL, tools=_TOOLS, system_prompt=SYSTEM_PROMPT)\n if session_id:\n _SESSION_AGENTS[session_id] = agent\n\n parts = []\n async for event in agent.stream_async(prompt):\n if \"data\" in event:\n parts.append(str(event[\"data\"]))\n response = \"\".join(parts)\n # Strip inline ... 
blocks so spans contain only the final answer\n response = re.sub(\n r\".*?\", \"\", response, flags=re.DOTALL\n ).strip()\n return response\n\n\nif __name__ == \"__main__\":\n app.run()" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deploy", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:08:34.729896Z", + "iopub.status.busy": "2026-03-31T18:08:34.729788Z", + "iopub.status.idle": "2026-03-31T18:09:25.948136Z", + "shell.execute_reply": "2026-03-31T18:09:25.947045Z" + } + }, + "outputs": [], + "source": [ + "from bedrock_agentcore_starter_toolkit import Runtime\n", + "\n", + "_REGION = REGION or \"us-east-1\"\n", + "\n", + "agentcore_runtime = Runtime()\n", + "agentcore_runtime.configure(\n", + " entrypoint=\"hr_assistant_agent.py\",\n", + " agent_name=\"hr_assistant_eval_tutorial\",\n", + " region=_REGION,\n", + " auto_create_execution_role=True,\n", + " auto_create_ecr=True,\n", + " requirements_file=\"requirements.txt\",\n", + " non_interactive=True,\n", + ")\n", + "print(\"Configuration complete.\")\n", + "\n", + "print(\"\\nDeploying HR Assistant Agent ...\")\n", + "print(\" This takes ~5 minutes on first run (image build + push + runtime creation).\")\n", + "print()\n", + "\n", + "_launch = agentcore_runtime.launch(auto_update_on_conflict=True)\n", + "\n", + "print(f\"\\nLaunch complete.\")\n", + "print(f\" agent_id : {_launch.agent_id}\")\n", + "print(f\" agent_arn : {_launch.agent_arn}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "wait-deploy", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:09:25.951112Z", + "iopub.status.busy": "2026-03-31T18:09:25.950863Z", + "iopub.status.idle": "2026-03-31T18:09:26.424226Z", + "shell.execute_reply": "2026-03-31T18:09:26.423531Z" + } + }, + "outputs": [], + "source": [ + "import time\n", + "\n", + "print(\"Waiting for agent to reach READY status ...\")\n", + "\n", + "_POLL_INTERVAL = 15 # seconds between status 
checks\n", + "_MAX_WAIT = 600 # 10-minute timeout\n", + "\n", + "_elapsed = 0\n", + "while _elapsed < _MAX_WAIT:\n", + " _status_result = agentcore_runtime.status()\n", + " _agent_info = _status_result.agent or {}\n", + " _agent_status = _agent_info.get(\"status\", \"UNKNOWN\")\n", + " print(f\" [{_elapsed:>3}s] status = {_agent_status}\")\n", + "\n", + " if _agent_status in (\"READY\", \"ACTIVE\"):\n", + " print(f\"\\nAgent is {_agent_status}. Proceeding.\")\n", + " break\n", + " if _agent_status in (\"FAILED\", \"CREATE_FAILED\", \"UPDATE_FAILED\"):\n", + " raise RuntimeError(\n", + " f\"Agent deployment failed with status '{_agent_status}'.\\n\"\n", + " f\"Details: {_agent_info}\"\n", + " )\n", + "\n", + " time.sleep(_POLL_INTERVAL)\n", + " _elapsed += _POLL_INTERVAL\n", + "else:\n", + " raise TimeoutError(\n", + " f\"Agent did not reach READY status within {_MAX_WAIT}s. \"\n", + " \"Check the AgentCore console for details.\"\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "agent-config", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:09:26.425933Z", + "iopub.status.busy": "2026-03-31T18:09:26.425801Z", + "iopub.status.idle": "2026-03-31T18:09:26.430602Z", + "shell.execute_reply": "2026-03-31T18:09:26.430128Z" + } + }, + "outputs": [], + "source": [ + "AGENT_ID = _launch.agent_id\n", + "AGENT_ARN = _launch.agent_arn\n", + "CW_LOG_GROUP = f\"/aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT\"\n", + "\n", + "agentcore_client = boto3.client(\"bedrock-agentcore\", region_name=_REGION)\n", + "\n", + "print(f\"AGENT_ID : {AGENT_ID}\")\n", + "print(f\"AGENT_ARN : {AGENT_ARN}\")\n", + "print(f\"CW_LOG_GROUP : {CW_LOG_GROUP}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "store-agent", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:09:26.431991Z", + "iopub.status.busy": "2026-03-31T18:09:26.431884Z", + "iopub.status.idle": "2026-03-31T18:09:26.436625Z", + 
"shell.execute_reply": "2026-03-31T18:09:26.436137Z" + } + }, + "outputs": [], + "source": [ + "# Persist agent info\n", + "%store AGENT_ID\n", + "%store AGENT_ARN\n", + "%store CW_LOG_GROUP\n", + "%store REGION" + ] + }, + { + "cell_type": "markdown", + "id": "91312ba2", + "metadata": {}, + "source": [ + "## Step 3: Invoke the Agent to Generate Sessions\n", + "\n", + "Before we can evaluate, we need agent sessions with CloudWatch spans. We'll invoke the agent\n", + "for several scenarios and record the session IDs for use with `EvaluationClient`.\n", + "\n", + "Each session corresponds to one evaluation scenario." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fb84d7e", + "metadata": {}, + "outputs": [], + "source": [ + "def invoke_agent(prompt: str, session_id: str) -> str:\n", + " \"\"\"Send a single prompt to the HR assistant and return its text response.\"\"\"\n", + " resp = agentcore_client.invoke_agent_runtime(\n", + " agentRuntimeArn=AGENT_ARN,\n", + " qualifier=\"DEFAULT\",\n", + " runtimeSessionId=session_id,\n", + " payload=json.dumps({\"prompt\": prompt}).encode(\"utf-8\"),\n", + " )\n", + " raw = resp[\"response\"].read().decode(\"utf-8\")\n", + " parts = []\n", + " for line in raw.splitlines():\n", + " if line.startswith(\"data: \"):\n", + " chunk = line[len(\"data: \"):]\n", + " try:\n", + " chunk = json.loads(chunk)\n", + " except Exception:\n", + " pass\n", + " parts.append(str(chunk))\n", + " return \"\".join(parts) if parts else raw\n", + "\n", + "\n", + "def run_session(turns: list[str], session_prefix: str) -> str:\n", + " \"\"\"Invoke a multi-turn session and return its session ID.\"\"\"\n", + " session_id = f\"{session_prefix}-{uuid.uuid4()}\"\n", + " print(f\"Session: {session_id}\")\n", + " for turn_input in turns:\n", + " print(f\" > {turn_input[:70]}\")\n", + " response = invoke_agent(turn_input, session_id)\n", + " print(f\" < {response[:100]}\")\n", + " return session_id" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "a7ca4e6c", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Single-turn sessions ---\n", + "\n", + "print(\"=== Single-Turn Sessions ===\")\n", + "\n", + "session_pto_balance = run_session(\n", + " [\"What is the current PTO balance for employee EMP-001?\"],\n", + " \"pto-balance-check\"\n", + ")\n", + "\n", + "session_submit_pto = run_session(\n", + " [\"Please submit a PTO request for employee EMP-001 from 2026-04-14 to 2026-04-16 for a family vacation.\"],\n", + " \"submit-pto-request\"\n", + ")\n", + "\n", + "session_pay_stub = run_session(\n", + " [\"Can you pull up the January 2026 pay stub for employee EMP-001?\"],\n", + " \"pay-stub-lookup\"\n", + ")\n", + "\n", + "print(\"\\nSingle-turn sessions created.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c603619", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Multi-turn session: PTO planning ---\n", + "\n", + "print(\"=== Multi-Turn Session: PTO Planning ===\")\n", + "\n", + "session_pto_planning = run_session(\n", + " [\n", + " \"How many PTO days do I have left? My employee ID is EMP-001.\",\n", + " \"Great. I'd like to take December 23 to December 25 off. Please submit a request.\",\n", + " \"Remind me — what is the policy on rolling over unused PTO?\",\n", + " ],\n", + " \"pto-planning-session\"\n", + ")\n", + "\n", + "print(\"\\nMulti-turn session created.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42ff04a5", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Multi-turn session: New employee onboarding ---\n", + "\n", + "print(\"=== Multi-Turn Session: New Employee Onboarding ===\")\n", + "\n", + "session_onboarding = run_session(\n", + " [\n", + " \"I just joined the company. 
What is the remote work policy?\",\n", + " \"How much PTO do I get as a new employee?\",\n", + " \"What life insurance benefit does the company provide?\",\n", + " \"Can you check the current PTO balance for employee EMP-042?\",\n", + " ],\n", + " \"new-employee-onboarding\"\n", + ")\n", + "\n", + "print(\"\\nAll sessions created. Waiting 60s for CloudWatch log ingestion...\")\n", + "time.sleep(60)\n", + "print(\"Ready to evaluate.\")" + ] + }, + { + "cell_type": "markdown", + "id": "eval-client-md", + "metadata": {}, + "source": [ + "## Step 5: EvaluationClient — Evaluate Existing Sessions\n", + "\n", + "`EvaluationClient` is the right tool when you **already have agent sessions** logged in CloudWatch and you want to test them against your ground truth in a ad-hoc manner.\n", + "It looks up the agent's spans for a given `session_id` and runs evaluators against them. For these evaluations, you can pass in an expected response, assertions and expected trajectory. You can use the Built-in evaluators as well as the custom evaluators.\n", + "\n", + "### Ground-Truth Reference Inputs\n", + "\n", + "`ReferenceInputs` lets you supply optional ground truth:\n", + "\n", + "| Field | Evaluators that use it | Description |\n", + "|---|---|---|\n", + "| `expected_response` | `Builtin.Correctness` | The ideal response text |\n", + "| `expected_trajectory` | `Builtin.TrajectoryExactOrderMatch`, `Builtin.TrajectoryInOrderMatch`, `Builtin.TrajectoryAnyOrderMatch` | Ordered list of tool names |\n", + "| `assertions` | `Builtin.GoalSuccessRate` | Free-text assertions the session should satisfy |\n", + "\n", + "Evaluators that don't need ground truth (`Helpfulness`, `ResponseRelevance`) can be included in the same call.\n", + "Each evaluator only reads the fields it needs." 
+ ] + }, + { + "cell_type": "markdown", + "id": "4d583593", + "metadata": {}, + "source": [ + "## Create Custom (LLM-as-a-Judge) Evaluators\n", + "\n", + "In addition to built-in evaluators, you can define your own evaluation criteria using\n", + "**LLM-as-a-Judge custom evaluators**. These accept natural language instructions that\n", + "can reference **ground truth placeholders** automatically substituted at evaluation time.\n", + "\n", + "### Ground truth placeholders\n", + "\n", + "| Level | Available placeholders |\n", + "|---|---|\n", + "| **TRACE** | `{context}`, `{assistant_turn}`, `{expected_response}` |\n", + "| **SESSION** | `{context}`, `{available_tools}`, `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}` |\n", + "\n", + "For example, a trace-level evaluator comparing response similarity would include\n", + "`{assistant_turn}` and `{expected_response}` in its instructions. When the evaluator runs,\n", + "the service substitutes those placeholders with the actual agent output and the\n", + "`expectedResponse` from `ReferenceInputs`.\n", + "\n", + "### What we'll create\n", + "\n", + "| Evaluator | Level | Placeholders | Description |\n", + "|---|---|---|---|\n", + "| `HRResponseSimilarity` | TRACE | `{assistant_turn}`, `{expected_response}` | How closely the agent's response matches the expected answer |\n", + "| `HRAssertionChecker` | SESSION | `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}` | Whether the agent called the right tools and satisfied all session assertions |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99775e89", + "metadata": {}, + "outputs": [], + "source": [ + "import uuid\n", + "\n", + "_SUFFIX = uuid.uuid4().hex[:8]\n", + "_cp = boto3.client(\"bedrock-agentcore-control\", region_name=_REGION)\n", + "\n", + "# ---------------------------------------------------------------------------\n", + "# Trace-level: HRResponseSimilarity\n", + "# Compares the agent's 
response to the expected_response reference input.\n", + "# {assistant_turn} → actual agent output\n", + "# {expected_response} → expectedResponse field in ReferenceInputs\n", + "# ---------------------------------------------------------------------------\n", + "print(\"Creating HRResponseSimilarity (TRACE) ...\")\n", + "_resp_sim = _cp.create_evaluator(\n", + " evaluatorName=f\"HRResponseSimilarity_{_SUFFIX}\",\n", + " level=\"TRACE\",\n", + " evaluatorConfig={\n", + " \"llmAsAJudge\": {\n", + " \"instructions\": (\n", + " \"Compare the agent's response with the expected response.\\n\"\n", + " \"Agent response: {assistant_turn}\\n\"\n", + " \"Expected response: {expected_response}\\n\\n\"\n", + " \"Rate how closely the agent's response matches the expected response. \"\n", + " \"Focus on whether the key facts, numbers, and conclusions agree.\"\n", + " ),\n", + " \"ratingScale\": {\n", + " \"numerical\": [\n", + " {\n", + " \"value\": 0.0,\n", + " \"label\": \"not_similar\",\n", + " \"definition\": \"Response is factually different or missing key information from the expected response.\",\n", + " },\n", + " {\n", + " \"value\": 0.5,\n", + " \"label\": \"partially_similar\",\n", + " \"definition\": \"Response captures some expected content but omits or misrepresents parts.\",\n", + " },\n", + " {\n", + " \"value\": 1.0,\n", + " \"label\": \"highly_similar\",\n", + " \"definition\": \"Response is semantically equivalent to the expected response — all key facts match.\",\n", + " },\n", + " ]\n", + " },\n", + " \"modelConfig\": {\n", + " \"bedrockEvaluatorModelConfig\": {\n", + " \"modelId\": \"us.amazon.nova-lite-v1:0\",\n", + " \"inferenceConfig\": {\"maxTokens\": 512},\n", + " }\n", + " },\n", + " }\n", + " },\n", + ")\n", + "CUSTOM_RESPONSE_SIMILARITY_ID = _resp_sim[\"evaluatorId\"]\n", + "print(f\" evaluatorId : {CUSTOM_RESPONSE_SIMILARITY_ID}\")\n", + "\n", + "# ---------------------------------------------------------------------------\n", + "# Session-level: 
HRAssertionChecker\n", + "# Evaluates tool trajectory compliance and assertion satisfaction.\n", + "# {actual_tool_trajectory} → tools the agent actually called\n", + "# {expected_tool_trajectory} → expectedTrajectory from ReferenceInputs\n", + "# {assertions} → assertions list from ReferenceInputs\n", + "# ---------------------------------------------------------------------------\n", + "print(\"\\nCreating HRAssertionChecker (SESSION) ...\")\n", + "_assert_chk = _cp.create_evaluator(\n", + " evaluatorName=f\"HRAssertionChecker_{_SUFFIX}\",\n", + " level=\"SESSION\",\n", + " evaluatorConfig={\n", + " \"llmAsAJudge\": {\n", + " \"instructions\": (\n", + " \"Evaluate whether the agent fulfilled the session requirements.\\n\\n\"\n", + " \"Expected tool trajectory: {expected_tool_trajectory}\\n\"\n", + " \"Actual tool trajectory: {actual_tool_trajectory}\\n\"\n", + " \"Assertions to verify: {assertions}\\n\\n\"\n", + " \"Score the agent on how well it followed the expected tool trajectory \"\n", + " \"and satisfied every listed assertion.\"\n", + " ),\n", + " \"ratingScale\": {\n", + " \"numerical\": [\n", + " {\n", + " \"value\": 0.0,\n", + " \"label\": \"failed\",\n", + " \"definition\": \"Agent did not follow the trajectory and failed most assertions.\",\n", + " },\n", + " {\n", + " \"value\": 0.5,\n", + " \"label\": \"partial\",\n", + " \"definition\": \"Agent partially followed the trajectory or satisfied only some assertions.\",\n", + " },\n", + " {\n", + " \"value\": 1.0,\n", + " \"label\": \"passed\",\n", + " \"definition\": \"Agent followed the expected trajectory and satisfied all assertions.\",\n", + " },\n", + " ]\n", + " },\n", + " \"modelConfig\": {\n", + " \"bedrockEvaluatorModelConfig\": {\n", + " \"modelId\": \"us.amazon.nova-lite-v1:0\",\n", + " \"inferenceConfig\": {\"maxTokens\": 512},\n", + " }\n", + " },\n", + " }\n", + " },\n", + ")\n", + "CUSTOM_ASSERTION_CHECKER_ID = _assert_chk[\"evaluatorId\"]\n", + "print(f\" evaluatorId : 
{CUSTOM_ASSERTION_CHECKER_ID}\")\n", + "\n", + "print(f\"\\nCustom evaluators ready:\")\n", + "print(f\" HRResponseSimilarity (TRACE) : {CUSTOM_RESPONSE_SIMILARITY_ID}\")\n", + "print(f\" HRAssertionChecker (SESSION) : {CUSTOM_ASSERTION_CHECKER_ID}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eval-client-init", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:11:21.683849Z", + "iopub.status.busy": "2026-03-31T18:11:21.683641Z", + "iopub.status.idle": "2026-03-31T18:11:21.747795Z", + "shell.execute_reply": "2026-03-31T18:11:21.746884Z" + } + }, + "outputs": [], + "source": [ + "from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs\n", + "\n", + "eval_client = EvaluationClient(region_name=REGION)\n", + "\n", + "print(f\"EvaluationClient initialised (region={REGION})\")\n", + "print(f\" {CUSTOM_RESPONSE_SIMILARITY_ID} → TRACE (custom: HRResponseSimilarity)\")\n", + "print(f\" {CUSTOM_ASSERTION_CHECKER_ID} → SESSION (custom: HRAssertionChecker)\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "print-helper", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:11:21.749705Z", + "iopub.status.busy": "2026-03-31T18:11:21.749545Z", + "iopub.status.idle": "2026-03-31T18:11:21.753571Z", + "shell.execute_reply": "2026-03-31T18:11:21.752959Z" + } + }, + "outputs": [], + "source": [ + "# Helper function for printing\n", + "def display_eval_results(label: str, results: list) -> None:\n", + " \"\"\"Pretty-print EvaluationClient results as a markdown table.\"\"\"\n", + " rows = [\"| Evaluator | Value | Label | Explanation |\",\n", + " \"|---|---|---|---|\"]\n", + " for r in results:\n", + " evaluator = r.get(\"evaluatorId\", \"\")[:40]\n", + " value = str(r.get(\"value\", r.get(\"score\", \"N/A\")))\n", + " lbl = str(r.get(\"label\", r.get(\"rating\", \"\")))\n", + " explanation = (r.get(\"explanation\", r.get(\"reason\", \"\")) or \"\")[:120].replace(\"\\n\", 
\" \")\n", + " error_code = r.get(\"errorCode\")\n", + " if error_code:\n", + " lbl = f\"ERR:{error_code}\"\n", + " explanation = (r.get(\"errorMessage\", \"\") or \"\")[:120]\n", + " rows.append(f\"| `{evaluator}` | {value} | {lbl} | {explanation} |\")\n", + "\n", + " if len(rows) == 2: # only header rows, no data\n", + " rows.append(\"| No results — session may be too recent or spans not yet visible | | | |\")\n", + "\n", + " md = f\"### {label}\\n\\n\" + \"\\n\".join(rows)\n", + " display(Markdown(md))" + ] + }, + { + "cell_type": "markdown", + "id": "ec-single-md", + "metadata": {}, + "source": [ + "### 5a. Single-Turn: PTO Balance — Correctness + Helpfulness + Custom ResponseSimilarity\n", + "\n", + "We evaluate the PTO balance response against a known expected answer using `Builtin.Correctness`\n", + "and the custom `HRResponseSimilarity` evaluator (which uses the `{assistant_turn}` and\n", + "`{expected_response}` placeholders). Both measure factual accuracy but use different scoring rubrics." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec-pto-balance", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:11:21.755187Z", + "iopub.status.busy": "2026-03-31T18:11:21.755069Z", + "iopub.status.idle": "2026-03-31T18:11:43.660604Z", + "shell.execute_reply": "2026-03-31T18:11:43.659651Z" + } + }, + "outputs": [], + "source": [ + "pto_balance_results = eval_client.run(\n", + " evaluator_ids=[\n", + " \"Builtin.Correctness\", # TRACE: compares with provided expected response\n", + " \"Builtin.Helpfulness\", # TRACE: no ground truth needed\n", + " \"Builtin.ResponseRelevance\", # TRACE: no ground truth needed\n", + " CUSTOM_RESPONSE_SIMILARITY_ID, # TRACE: custom — uses {assistant_turn} + {expected_response}\n", + " ],\n", + " session_id=session_pto_balance,\n", + " agent_id=AGENT_ID,\n", + " look_back_time=timedelta(hours=2),\n", + " reference_inputs=ReferenceInputs(\n", + " expected_response=\"Employee EMP-001 has 10 remaining PTO days out of 15 total (5 days used).\",\n", + " ),\n", + ")\n", + "\n", + "display_eval_results(\"PTO Balance — Correctness + Quality + Custom ResponseSimilarity\", pto_balance_results)" + ] + }, + { + "cell_type": "markdown", + "id": "ec-traj-md", + "metadata": {}, + "source": [ + "### 5b. Single-Turn: PTO Submission — Assertions + Trajectory + Custom AssertionChecker\n", + "\n", + "This cell runs both built-in trajectory evaluators **and** the custom `HRAssertionChecker`\n", + "(which uses `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, and `{assertions}` placeholders)\n", + "plus the custom `HRResponseSimilarity` for the response. This lets you compare built-in vs. custom\n", + "scoring side by side." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec-submit-pto", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:11:43.665034Z", + "iopub.status.busy": "2026-03-31T18:11:43.664870Z", + "iopub.status.idle": "2026-03-31T18:11:58.887975Z", + "shell.execute_reply": "2026-03-31T18:11:58.887479Z" + } + }, + "outputs": [], + "source": [ + "submit_pto_results = eval_client.run(\n", + " evaluator_ids=[\n", + " \"Builtin.GoalSuccessRate\", # SESSION: built-in assertion evaluator\n", + " \"Builtin.TrajectoryExactOrderMatch\", # SESSION: built-in trajectory evaluator\n", + " \"Builtin.TrajectoryAnyOrderMatch\", # SESSION: built-in trajectory evaluator\n", + " \"Builtin.Correctness\", # TRACE: built-in response accuracy\n", + " CUSTOM_RESPONSE_SIMILARITY_ID, # TRACE (custom): {assistant_turn} + {expected_response}\n", + " ],\n", + " session_id=session_submit_pto,\n", + " agent_id=AGENT_ID,\n", + " look_back_time=timedelta(hours=2),\n", + " reference_inputs=ReferenceInputs(\n", + " expected_trajectory=[\"submit_pto_request\"],\n", + " assertions=[\n", + " \"Agent called submit_pto_request for employee EMP-001\",\n", + " \"Agent confirmed the PTO request was approved\",\n", + " \"Agent provided a request ID (e.g. PTO-2026-001)\",\n", + " ],\n", + " expected_response=\"PTO request submitted and approved for EMP-001 from 2026-04-14 to 2026-04-16.\",\n", + " ),\n", + ")\n", + "\n", + "display_eval_results(\"PTO Submission — Built-in + Custom ResponseSimilarity\", submit_pto_results)" + ] + }, + { + "cell_type": "markdown", + "id": "ec-paystub-md", + "metadata": {}, + "source": [ + "### 5c. Single-Turn: Pay Stub — Factual Correctness\n", + "\n", + "Factual data retrieval scenarios are well-suited for `Builtin.Correctness` combined with\n", + "`Builtin.GoalSuccessRate`. The expected_response provides the ground truth figures." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec-pay-stub", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:11:58.889746Z", + "iopub.status.busy": "2026-03-31T18:11:58.889635Z", + "iopub.status.idle": "2026-03-31T18:12:01.774714Z", + "shell.execute_reply": "2026-03-31T18:12:01.773612Z" + } + }, + "outputs": [], + "source": [ + "pay_stub_results = eval_client.run(\n", + " evaluator_ids=[\n", + " \"Builtin.Correctness\",\n", + " \"Builtin.GoalSuccessRate\",\n", + " ],\n", + " session_id=session_pay_stub,\n", + " agent_id=AGENT_ID,\n", + " look_back_time=timedelta(hours=2),\n", + " reference_inputs=ReferenceInputs(\n", + " expected_response=\"EMP-001 January 2026: gross pay $8,333.33, net pay $5,362.50.\",\n", + " assertions=[\n", + " \"Agent called get_pay_stub for EMP-001 period 2026-01\",\n", + " \"Agent reported the correct gross pay of $8,333.33\",\n", + " \"Agent reported the correct net pay of $5,362.50\",\n", + " ],\n", + " ),\n", + ")\n", + "\n", + "display_eval_results(\"Pay Stub Lookup — Correctness + GoalSuccessRate\", pay_stub_results)" + ] + }, + { + "cell_type": "markdown", + "id": "ec-multi-md", + "metadata": {}, + "source": [ + "### 5d. Multi-Turn: PTO Planning Session (3 turns) + Custom AssertionChecker\n", + "\n", + "For multi-turn sessions, `EvaluationClient` fetches all spans for the session and evaluates\n", + "the complete conversation. The trajectory and assertions apply across all turns.\n", + "\n", + "This scenario also exercises the custom `HRAssertionChecker` evaluator (SESSION level),\n", + "which uses `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, and `{assertions}`\n", + "placeholders. A 3-turn session with distinct tool calls per turn gives the evaluator\n", + "a rich trajectory to compare against the expected sequence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec-multi-pto", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:01.777633Z", + "iopub.status.busy": "2026-03-31T18:12:01.777461Z", + "iopub.status.idle": "2026-03-31T18:12:06.100420Z", + "shell.execute_reply": "2026-03-31T18:12:06.099869Z" + } + }, + "outputs": [], + "source": [ + "pto_planning_results = eval_client.run(\n", + " evaluator_ids=[\n", + " \"Builtin.GoalSuccessRate\",\n", + " \"Builtin.TrajectoryExactOrderMatch\",\n", + " \"Builtin.TrajectoryInOrderMatch\",\n", + " \"Builtin.TrajectoryAnyOrderMatch\",\n", + " \"Builtin.Helpfulness\",\n", + " CUSTOM_ASSERTION_CHECKER_ID, # SESSION (custom): {actual_tool_trajectory} + {expected_tool_trajectory} + {assertions}\n", + " ],\n", + " session_id=session_pto_planning,\n", + " agent_id=AGENT_ID,\n", + " look_back_time=timedelta(hours=2),\n", + " reference_inputs=ReferenceInputs(\n", + " expected_trajectory=[\"get_pto_balance\", \"submit_pto_request\", \"lookup_hr_policy\"],\n", + " assertions=[\n", + " \"Agent correctly reported 10 remaining PTO days for EMP-001 in turn 1\",\n", + " \"Agent submitted a PTO request for December 23-25, 2026 in turn 2\",\n", + " \"Agent correctly stated the 5-day PTO rollover limit in turn 3\",\n", + " ],\n", + " ),\n", + ")\n", + "\n", + "display_eval_results(\"PTO Planning — Multi-Turn (3 turns) + Custom AssertionChecker\", pto_planning_results)" + ] + }, + { + "cell_type": "markdown", + "id": "runner-md", + "metadata": {}, + "source": [ + "## Step 6: OnDemandEvaluationDatasetRunner — Automated Dataset Evaluation\n", + "\n", + "`OnDemandEvaluationDatasetRunner` is the right tool when you have a **test dataset** and want to:\n", + "1. Automatically invoke your agent for each scenario\n", + "2. Collect CloudWatch spans\n", + "3. 
Run evaluators against each scenario's results\n", + "\n", + "This is ideal for regression testing, CI/CD pipelines, and batch evaluation against curated datasets.\n", + "\n", + "### Dataset structure\n", + "\n", + "A dataset consists of **scenarios**, each with one or more **turns**. Optional ground-truth fields:\n", + "- `Turn.expected_response` — per-turn expected answer\n", + "- `PreDefinedScenario.expected_trajectory` — ordered list of tool names\n", + "- `PreDefinedScenario.assertions` — session-level assertions\n", + "\n", + "### How OnDemandEvaluationDatasetRunner works\n", + "\n", + "```\n", + "For each scenario:\n", + " 1. Create a new session ID\n", + " 2. Call your agent_invoker function for each turn\n", + " 3. Wait for CloudWatch spans to appear (evaluation_delay_seconds)\n", + " 4. Submit spans + ground truth to the evaluation service\n", + " 5. Collect and return results\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-imports", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:09.650245Z", + "iopub.status.busy": "2026-03-31T18:12:09.649779Z", + "iopub.status.idle": "2026-03-31T18:12:09.655809Z", + "shell.execute_reply": "2026-03-31T18:12:09.655250Z" + } + }, + "outputs": [], + "source": [ + "from bedrock_agentcore.evaluation import (\n", + " AgentInvokerInput,\n", + " AgentInvokerOutput,\n", + " CloudWatchAgentSpanCollector,\n", + " Dataset,\n", + " EvaluationRunConfig,\n", + " OnDemandEvaluationDatasetRunner,\n", + " EvaluatorConfig,\n", + " Turn,\n", + " PredefinedScenario,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-invoker", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:09.657880Z", + "iopub.status.busy": "2026-03-31T18:12:09.657740Z", + "iopub.status.idle": "2026-03-31T18:12:09.663480Z", + "shell.execute_reply": "2026-03-31T18:12:09.662844Z" + } + }, + "outputs": [], + "source": [ + "def 
agent_invoker(invoker_input: AgentInvokerInput) -> AgentInvokerOutput:\n", + " \"\"\"\n", + " Called by OnDemandEvaluationDatasetRunner once per turn. Invoke the HR assistant\n", + " and return the text response.\n", + "\n", + " AgentInvokerInput fields:\n", + " - payload: The turn input (str or dict) from the dataset.\n", + " - session_id: Framework-managed session ID, stable across all turns\n", + " in a scenario. Pass it to your agent for conversation continuity.\n", + " \"\"\"\n", + " payload = invoker_input.payload\n", + " body = {\"prompt\": payload} if isinstance(payload, str) else payload\n", + "\n", + " resp = agentcore_client.invoke_agent_runtime(\n", + " agentRuntimeArn=AGENT_ARN,\n", + " qualifier=\"DEFAULT\",\n", + " runtimeSessionId=invoker_input.session_id,\n", + " payload=json.dumps(body).encode(\"utf-8\"),\n", + " )\n", + "\n", + " raw = resp[\"response\"].read().decode(\"utf-8\")\n", + " parts = []\n", + " for line in raw.splitlines():\n", + " if line.startswith(\"data: \"):\n", + " chunk = line[len(\"data: \"):]\n", + " try:\n", + " chunk = json.loads(chunk)\n", + " except Exception:\n", + " pass\n", + " parts.append(str(chunk))\n", + " return AgentInvokerOutput(agent_output=\"\".join(parts) if parts else raw)" + ] + }, + { + "cell_type": "markdown", + "id": "runner-dataset-md", + "metadata": {}, + "source": [ + "### 6a. Define the Evaluation Dataset\n", + "\n", + "We define scenarios inline. A mix of single-turn and multi-turn scenarios exercises\n", + "different aspects of the agent." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-dataset", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:09.665778Z", + "iopub.status.busy": "2026-03-31T18:12:09.665579Z", + "iopub.status.idle": "2026-03-31T18:12:09.670891Z", + "shell.execute_reply": "2026-03-31T18:12:09.670266Z" + } + }, + "outputs": [], + "source": [ + "dataset = Dataset(\n", + " scenarios=[\n", + " # --- Single-turn: PTO balance ---\n", + " PredefinedScenario(\n", + " scenario_id=\"pto-balance-check\",\n", + " turns=[\n", + " Turn(\n", + " input=\"What is the current PTO balance for employee EMP-001?\",\n", + " expected_response=\"Employee EMP-001 has 10 remaining PTO days out of 15 total (5 days used).\",\n", + " )\n", + " ],\n", + " expected_trajectory=[\"get_pto_balance\"],\n", + " assertions=[\n", + " \"Agent called get_pto_balance with employee_id=EMP-001\",\n", + " \"Agent reported 10 remaining PTO days\",\n", + " ],\n", + " ),\n", + "\n", + " # --- Single-turn: HR policy lookup ---\n", + " PredefinedScenario(\n", + " scenario_id=\"pto-policy-lookup\",\n", + " turns=[\n", + " Turn(\n", + " input=\"What is the company PTO policy?\",\n", + " expected_response=\"Full-time employees accrue 15 days of PTO per year. Requests must be submitted at least 2 business days in advance. 
Up to 5 unused days roll over each year.\",\n", + " )\n", + " ],\n", + " expected_trajectory=[\"lookup_hr_policy\"],\n", + " assertions=[\n", + " \"Agent called lookup_hr_policy with topic=pto\",\n", + " \"Agent mentioned the 15-day annual accrual for full-time employees\",\n", + " \"Agent mentioned the 2 business day advance notice requirement\",\n", + " ],\n", + " ),\n", + "\n", + " # --- Single-turn: 401k benefits ---\n", + " PredefinedScenario(\n", + " scenario_id=\"401k-info\",\n", + " turns=[\n", + " Turn(\n", + " input=\"How does the 401k match work?\",\n", + " expected_response=\"The company matches 100% of contributions up to 4% of salary, plus 50% on the next 2%, for a total effective match of up to 5%. The match vests over 3 years.\",\n", + " )\n", + " ],\n", + " expected_trajectory=[\"get_benefits_summary\"],\n", + " assertions=[\n", + " \"Agent called get_benefits_summary with benefit_type=401k\",\n", + " \"Agent correctly described the 4% full match and 50% match on next 2%\",\n", + " \"Agent mentioned the 3-year vesting schedule\",\n", + " ],\n", + " ),\n", + "\n", + " # --- Single-turn: check balance then submit PTO ---\n", + " PredefinedScenario(\n", + " scenario_id=\"check-and-submit-pto\",\n", + " turns=[\n", + " Turn(\n", + " input=\"Check the PTO balance for EMP-002, and if they have at least 2 days, submit a request for 2026-05-26 to 2026-05-27.\",\n", + " expected_response=\"EMP-002 has 3 remaining PTO days. 
PTO request submitted and approved for 2026-05-26 to 2026-05-27.\",\n", + " )\n", + " ],\n", + " expected_trajectory=[\"get_pto_balance\", \"submit_pto_request\"],\n", + " assertions=[\n", + " \"Agent first called get_pto_balance for EMP-002\",\n", + " \"Agent confirmed 3 remaining days is sufficient\",\n", + " \"Agent then called submit_pto_request for the correct dates\",\n", + " ],\n", + " ),\n", + "\n", + " # --- Multi-turn: benefits exploration ---\n", + " PredefinedScenario(\n", + " scenario_id=\"benefits-exploration\",\n", + " turns=[\n", + " Turn(\n", + " input=\"Can you walk me through the health insurance options?\",\n", + " expected_response=\"The company covers 90% of premiums for employee-only coverage. Three plans are available: Blue Shield PPO, Kaiser HMO, and HDHP with HSA.\",\n", + " ),\n", + " Turn(\n", + " input=\"What about dental?\",\n", + " expected_response=\"The dental plan covers 100% of preventive care, 80% of basic restorative care, and 50% of major work, with a $2,000 annual maximum.\",\n", + " ),\n", + " Turn(\n", + " input=\"And how much does the company contribute to the 401k?\",\n", + " expected_response=\"The company matches 100% up to 4% of salary, plus 50% on the next 2%, for a total effective match of up to 5%.\",\n", + " ),\n", + " ],\n", + " expected_trajectory=[\"get_benefits_summary\", \"get_benefits_summary\", \"get_benefits_summary\"],\n", + " assertions=[\n", + " \"Agent called get_benefits_summary three times across the conversation\",\n", + " \"Agent correctly described health, dental, and 401k benefits in their respective turns\",\n", + " \"Agent maintained conversational context across all three turns\",\n", + " ],\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "print(f\"Dataset contains {len(dataset.scenarios)} scenarios.\")" + ] + }, + { + "cell_type": "markdown", + "id": "runner-config-md", + "metadata": {}, + "source": [ + "### 6b. 
Configure and Run OnDemandEvaluationDatasetRunner" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-config", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:09.672709Z", + "iopub.status.busy": "2026-03-31T18:12:09.672562Z", + "iopub.status.idle": "2026-03-31T18:12:09.683717Z", + "shell.execute_reply": "2026-03-31T18:12:09.683225Z" + } + }, + "outputs": [], + "source": [ + "# Span collector: polls CloudWatch for OTel spans emitted by the agent\n", + "span_collector = CloudWatchAgentSpanCollector(\n", + " log_group_name=CW_LOG_GROUP,\n", + " region=REGION,\n", + " max_wait_seconds=180,\n", + " poll_interval_seconds=15,\n", + ")\n", + "\n", + "# Evaluator level cache — built-ins + custom evaluators\n", + "EVALUATOR_LEVELS = {\n", + " \"Builtin.GoalSuccessRate\": \"SESSION\",\n", + " \"Builtin.TrajectoryExactOrderMatch\": \"SESSION\",\n", + " \"Builtin.TrajectoryInOrderMatch\": \"SESSION\",\n", + " \"Builtin.TrajectoryAnyOrderMatch\": \"SESSION\",\n", + " \"Builtin.Correctness\": \"TRACE\",\n", + "}\n", + "# Custom evaluators created earlier: HR Response Similarity and HR Assertion Checker\n", + "EVALUATOR_LEVELS[CUSTOM_RESPONSE_SIMILARITY_ID] = \"TRACE\"\n", + "EVALUATOR_LEVELS[CUSTOM_ASSERTION_CHECKER_ID] = \"SESSION\"\n", + "\n", + "# Evaluator configuration — mix of built-in and custom evaluators\n", + "config = EvaluationRunConfig(\n", + " evaluator_config=EvaluatorConfig(\n", + " evaluator_ids=[\n", + " \"Builtin.Correctness\", # TRACE — expected_response\n", + " \"Builtin.GoalSuccessRate\", # SESSION — assertions\n", + " \"Builtin.TrajectoryExactOrderMatch\", # SESSION — expected_trajectory\n", + " \"Builtin.TrajectoryInOrderMatch\", # SESSION — expected_trajectory\n", + " \"Builtin.TrajectoryAnyOrderMatch\", # SESSION — expected_trajectory\n", + " CUSTOM_RESPONSE_SIMILARITY_ID, # TRACE (custom) — {assistant_turn} + {expected_response}\n", + " CUSTOM_ASSERTION_CHECKER_ID, # 
SESSION (custom) — {actual_tool_trajectory} + {assertions}\n", + " ]\n", + " ),\n", + " evaluation_delay_seconds=180,\n", + " max_concurrent_scenarios=3,\n", + ")\n", + "\n", + "runner = OnDemandEvaluationDatasetRunner(region=REGION)\n", + "runner._evaluator_level_cache.update(EVALUATOR_LEVELS)\n", + "\n", + "print(\"OnDemandEvaluationDatasetRunner configured. Starting evaluation...\")\n", + "print(f\" Scenarios : {len(dataset.scenarios)}\")\n", + "print(f\" Evaluators: {len(config.evaluator_config.evaluator_ids)} \"\n", + " f\"(5 built-in + 2 custom)\")\n", + "print(f\" Delay : {config.evaluation_delay_seconds}s (waiting for CloudWatch ingestion)\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-run", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T18:12:09.685148Z", + "iopub.status.busy": "2026-03-31T18:12:09.685047Z", + "iopub.status.idle": "2026-03-31T19:07:11.806861Z", + "shell.execute_reply": "2026-03-31T19:07:11.806186Z" + } + }, + "outputs": [], + "source": [ + "# Run the evaluation.\n", + "# OnDemandEvaluationDatasetRunner will:\n", + "# 1. Invoke agent_invoker for each turn in each scenario\n", + "# 2. Wait evaluation_delay_seconds for CloudWatch ingestion\n", + "# 3. Submit spans to the evaluation service\n", + "# 4. Return aggregated results\n", + "\n", + "eval_result = runner.run(\n", + " config=config,\n", + " dataset=dataset,\n", + " agent_invoker=agent_invoker,\n", + " span_collector=span_collector,\n", + ")\n", + "\n", + "completed = sum(1 for sr in eval_result.scenario_results if sr.status == \"COMPLETED\")\n", + "failed = sum(1 for sr in eval_result.scenario_results if sr.status == \"FAILED\")\n", + "print(f\"\\nEvaluation complete: {completed} completed, {failed} failed out of {len(eval_result.scenario_results)} scenarios.\")" + ] + }, + { + "cell_type": "markdown", + "id": "runner-results-md", + "metadata": {}, + "source": [ + "### 6c. 
Inspect Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-results", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T19:07:11.818123Z", + "iopub.status.busy": "2026-03-31T19:07:11.817895Z", + "iopub.status.idle": "2026-03-31T19:07:11.835130Z", + "shell.execute_reply": "2026-03-31T19:07:11.834370Z" + } + }, + "outputs": [], + "source": [ + "def display_runner_results(eval_result) -> None:\n", + " \"\"\"Display OnDemandEvaluationDatasetRunner results as a markdown table per scenario.\"\"\"\n", + " for sr in eval_result.scenario_results:\n", + " if sr.status == \"FAILED\":\n", + " display(Markdown(f\"**Scenario `{sr.scenario_id}`** — FAILED: {sr.error}\"))\n", + " continue\n", + "\n", + " rows = [\"| Evaluator | Value | Label | Explanation |\",\n", + " \"|---|---|---|---|\"]\n", + " for er in sr.evaluator_results:\n", + " for res in er.results:\n", + " value = str(res.get(\"value\", res.get(\"score\", \"N/A\")))\n", + " lbl = str(res.get(\"label\", res.get(\"rating\", \"\")))\n", + " explanation = (res.get(\"explanation\", \"\") or \"\")[:130].replace(\"\\n\", \" \")\n", + " error_code = res.get(\"errorCode\")\n", + " if error_code:\n", + " lbl = f\"ERR:{error_code}\"\n", + " explanation = (res.get(\"errorMessage\", \"\") or \"\")[:130]\n", + " rows.append(f\"| `{er.evaluator_id[:40]}` | {value} | {lbl} | {explanation} |\")\n", + "\n", + " md = f\"### Scenario: `{sr.scenario_id}`\\n\\n\" + \"\\n\".join(rows)\n", + " display(Markdown(md))\n", + "\n", + "\n", + "display_runner_results(eval_result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "runner-summary", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T19:07:11.837290Z", + "iopub.status.busy": "2026-03-31T19:07:11.837118Z", + "iopub.status.idle": "2026-03-31T19:07:11.840760Z", + "shell.execute_reply": "2026-03-31T19:07:11.840171Z" + } + }, + "outputs": [], + "source": [ + "# Aggregate summary: average score 
per evaluator across all scenarios\n", + "from collections import defaultdict\n", + "\n", + "scores_by_evaluator = defaultdict(list)\n", + "for sr in eval_result.scenario_results:\n", + " if sr.status != \"COMPLETED\":\n", + " continue\n", + " for er in sr.evaluator_results:\n", + " for res in er.results:\n", + " if \"value\" in res and res[\"value\"] is not None and not res.get(\"errorCode\"):\n", + " scores_by_evaluator[er.evaluator_id].append(float(res[\"value\"]))\n", + "\n", + "print(\"\\nEvaluator Summary (average score across all scenarios)\")\n", + "print(\"=\" * 60)\n", + "for evaluator_id, scores in sorted(scores_by_evaluator.items()):\n", + " avg = sum(scores) / len(scores)\n", + " print(f\" {evaluator_id:<45} avg={avg:.2f} (n={len(scores)})\")" + ] + }, + { + "cell_type": "markdown", + "id": "save-results-md", + "metadata": {}, + "source": [ + "### 6d. Save Results to File" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "save-results", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T19:07:11.842441Z", + "iopub.status.busy": "2026-03-31T19:07:11.842317Z", + "iopub.status.idle": "2026-03-31T19:07:11.851164Z", + "shell.execute_reply": "2026-03-31T19:07:11.850695Z" + } + }, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "from datetime import datetime, timezone\n", + "\n", + "os.makedirs(\"results\", exist_ok=True)\n", + "timestamp = datetime.now(timezone.utc).strftime(\"%Y%m%d_%H%M%S\")\n", + "results_path = f\"results/groundtruth_eval_{timestamp}.json\"\n", + "\n", + "with open(results_path, \"w\") as f:\n", + " json.dump(eval_result.model_dump(), f, indent=2, default=str)\n", + "\n", + "print(f\"Results saved to: {results_path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cleanup-md", + "metadata": {}, + "source": [ + "## Step 7: Cleanup\n", + "\n", + "Delete the agent runtime endpoint when you're done to avoid ongoing costs."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cleanup", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-31T19:07:11.852819Z", + "iopub.status.busy": "2026-03-31T19:07:11.852701Z", + "iopub.status.idle": "2026-03-31T19:07:11.855335Z", + "shell.execute_reply": "2026-03-31T19:07:11.854895Z" + } + }, + "outputs": [], + "source": [ + "# Uncomment to delete the agent runtime\n", + "# agent_runtime.delete()\n", + "# print(\"Agent runtime deleted.\")\n", + "\n", + "print(\"Cleanup skipped. Uncomment the cell above to delete the agent runtime.\")" + ] + }, + { + "cell_type": "markdown", + "id": "next-steps-md", + "metadata": {}, + "source": [ + "### Key takeaways\n", + "\n", + "| | EvaluationClient | OnDemandEvaluationDatasetRunner |\n", + "|---|---|---|\n", + "| **When to use** | You have existing sessions | You have a test dataset |\n", + "| **Best for** | Post-hoc analysis, debugging | Regression testing, CI/CD |\n", + "| **Input** | session_id | Dataset of scenarios |\n", + "\n", + "### Built-in evaluator reference\n", + "\n", + "| Evaluator | Level | Ground truth required |\n", + "|---|---|---|\n", + "| `Builtin.Correctness` | TRACE | `expected_response` |\n", + "| `Builtin.GoalSuccessRate` | SESSION | `assertions` |\n", + "| `Builtin.TrajectoryExactOrderMatch` | SESSION | `expected_trajectory` |\n", + "| `Builtin.TrajectoryInOrderMatch` | SESSION | `expected_trajectory` |\n", + "| `Builtin.TrajectoryAnyOrderMatch` | SESSION | `expected_trajectory` |\n", + "\n", + "\n", + "### Custom evaluator ground truth placeholders\n", + "\n", + "Custom (LLM-as-a-judge) evaluators reference ground truth via placeholders in their `instructions`.\n", + "\n", + "| Level | Placeholder | Filled from |\n", + "|---|---|---|\n", + "| TRACE | `{assistant_turn}` | Agent's actual response |\n", + "| TRACE | `{expected_response}` | `ReferenceInputs.expected_response` |\n", + "| TRACE | `{context}` | Session context |\n", + "| SESSION | 
`{actual_tool_trajectory}` | Tools called by the agent |\n", + "| SESSION | `{expected_tool_trajectory}` | `ReferenceInputs.expected_trajectory` |\n", + "| SESSION | `{assertions}` | `ReferenceInputs.assertions` |\n", + "| SESSION | `{available_tools}` | Tools available to the agent |" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/requirements.txt b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/requirements.txt new file mode 100644 index 000000000..39492b397 --- /dev/null +++ b/01-tutorials/07-AgentCore-evaluations/05-groundtruth-based-evalautions/requirements.txt @@ -0,0 +1,6 @@ +bedrock-agentcore>=1.5.0 +bedrock-agentcore-starter-toolkit>=0.3.0 +boto3>=1.42.0 +strands-agents +strands-agents-tools +aws-opentelemetry-distro