A 3-hour LLM-powered chatbot exercise for customer service bots: track orders, enforce 10-day cancellations, and evaluate with AISI's inspect.
This repository contains a fully generative customer service chatbot built as a learning exercise. The bot uses Large Language Models (LLMs) via the OpenRouter API to handle customer requests while enforcing business policies, specifically a 10-day cancellation window for orders.
This project demonstrates how to build a production-ready chatbot that:
- Integrates with LLMs via OpenRouter to generate natural language responses
- Enforces business policies automatically (e.g., only canceling orders placed within 10 days)
- Connects to APIs using structured tool/function calling to fetch order information and execute actions
- Evaluates behavior using the AISI inspect framework to verify policy adherence and measure performance
The bot handles two primary use cases:
- Order Tracking: Retrieve tracking information for customer orders
- Order Cancellation: Process cancellation requests while enforcing the 10-day policy window
This was built as a 3-hour exercise, so AI tools were used extensively during development. As a result, some areas may be rough around the edges, but the core functionality is complete and tested.
- Generative Bot: Uses LLM to generate natural, contextual responses
- Tool-Based Architecture: Structured API access via function calling
- Policy Enforcement: Automatic validation of business rules (10-day cancellation window)
- Comprehensive Testing: Unit tests with mocked APIs and live integration tests with real LLM
- Evaluation Framework: Scripted experiments using AISI inspect to verify bot behavior
- Performance Analysis: Report generation from experiment results with token usage and success metrics
- Interactive Chat: Live chat interface for manual testing and exploration
- Python 3.12+ and conda (or compatible environment manager)
- OpenRouter API Key (for LLM integration)
- Get one at openrouter.ai
- Set as environment variable:
export OPENROUTER_API_KEY='your-key-here'
The interactive chat interface lets you test the bot with real LLM calls in a conversational setting. It loads mock order data and provides helpful commands to inspect the database.
Start the chat:
```bash
export OPENROUTER_API_KEY='your-key-here'
./scripts/live_chat.sh
```

Available Commands:
- `/database` - Display all orders with cancellation eligibility
- `/orders <customer_id>` - Show all orders for a customer
- `/order <order_id>` - Show detailed info for a specific order
- `/metrics` - Display current session metrics (tokens, tool calls)
- `/reset` - Reset conversation history and order state
- `/export` - Export conversation to timestamped JSON file
- `/help` - Show help message
- `/exit` or `/quit` - Exit the chat session
Example Usage:
```
You: /database    # See all available orders
You: I want to cancel ORD-001
Bot: [Checks policy, validates ownership, cancels if eligible]
You: /metrics     # Check token usage
```

The chat uses real OpenRouter API calls, so responses are authentic but will consume API tokens.
The project includes comprehensive test suites for verifying bot behavior:
Unit Tests (Default) Run fast unit tests with fully mocked APIs (no external dependencies):
```bash
./run_tests.sh
```

These tests:
- Mock API calls
- Run quickly without external dependencies
- Test bot logic, policy enforcement, and tool execution
- Are suitable for CI/CD pipelines
Live Integration Tests Run tests with real OpenRouter API calls to verify actual LLM integration:
```bash
export OPENROUTER_API_KEY='your-key-here'
./run_tests.sh --live
```

Live tests:
- Make real API calls to OpenRouter (consumes tokens)
- Verify tool calling works with actual LLM
- Automatically skip if `OPENROUTER_API_KEY` is not set
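One common way to implement that guard is pytest's `skipif` marker; the sketch below shows the pattern, though the project's live suite may gate this differently:

```python
import os

import pytest

# Skip every test in this module when no API key is configured, so CI stays green.
pytestmark = pytest.mark.skipif(
    not os.environ.get("OPENROUTER_API_KEY"),
    reason="OPENROUTER_API_KEY not set; skipping live OpenRouter tests",
)

def test_live_llm_answers_tracking_question():
    ...  # would make a real OpenRouter call in the actual suite
```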
Test Coverage:
- `tests/test_api.py` - Mock API endpoint tests
- `tests/test_bot.py` - Bot logic tests (mocked LLM)
- `tests/test_live_integration.py` - Live API integration tests
The experiment framework uses AISI inspect to evaluate bot behavior across multiple test scenarios. Experiments run scripted test cases with playbook-driven assertions.
Quick Start:
```bash
export OPENROUTER_API_KEY='your-key-here'
./scripts/start_experiment.sh
```

This runs two experiment tasks:
- Tracking Task - Evaluates order tracking requests (3 test cases)
- Cancellation Task - Evaluates cancellation requests (4 test cases)
What Gets Tested:
- Tool calling correctness (verifies correct tools are invoked)
- Policy enforcement (bot correctly enforces 10-day cancellation window)
- Response patterns (regex assertions for expected bot responses)
- Token efficiency and tool usage metrics
Experiment Structure: Each test case uses a playbook with steps:
- `user` - User message that triggers the bot
- `inject_bot_response` - Injects bot context
- `inject_tool` - Injects tool results
- `expect_tool` - Asserts a specific tool was called
- `expect_bot_regex` - Asserts the bot response matches a pattern
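To make the step types concrete, a cancellation case could be expressed as data roughly like this (an illustrative sketch; the field names and result shapes are assumptions, not the exact schema used in `scripted_scenarios.py`):

```python
# Hypothetical playbook for "cancel a recent order"; keys are illustrative only.
cancel_within_window_case = {
    "name": "cancel_within_window",
    "playbook": [
        {"step": "user", "text": "Please cancel order ORD-001"},
        {"step": "expect_tool", "tool": "check_cancellation_policy"},
        {"step": "inject_tool", "tool": "check_cancellation_policy",
         "result": {"eligible": True, "reason": "Order is 3 days old"}},
        {"step": "expect_tool", "tool": "cancel_order"},
        {"step": "expect_bot_regex", "pattern": r"(?i)cancell?ed"},
    ],
}
```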
Output:
Logs are saved to `logs/experiments/` in JSON format for programmatic analysis.
After running experiments, generate a consolidated markdown report:
```bash
./scripts/analyse_experiments.sh
```

This script:
- Finds the latest tracking and cancellation experiment logs
- Parses JSON results and extracts metrics
- Generates `logs/experiments/REPORT.md` with:
  - Overall summary (pass rates, token usage)
  - Per-test-case results tables
  - Token efficiency metrics
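Conceptually, the analysis step boils down to reading the experiment JSON and aggregating a few fields. A rough sketch of that idea follows; the keys are placeholders, since the real logs follow the inspect framework's schema:

```python
import json
from pathlib import Path

def summarise(log_path: Path) -> dict:
    """Aggregate pass rate and token usage from one experiment log (placeholder keys)."""
    data = json.loads(log_path.read_text())
    cases = data.get("results", [])  # placeholder key, not the real schema
    passed = sum(1 for case in cases if case.get("score") == 1.0)
    tokens = sum(case.get("total_tokens", 0) for case in cases)
    count = len(cases) or 1
    return {"tests": len(cases), "pass_rate": passed / count, "avg_tokens": tokens / count}

latest = max(Path("logs/experiments").glob("*.json"), default=None)
if latest:
    print(summarise(latest))
```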
Report Contents:
- Summary Statistics: Total tests, pass rates, average tokens per test
- Tracking Results: Table of tracking test cases with pass/fail status
- Cancellation Results: Table of cancellation test cases with pass/fail status
- Token Metrics: Total and average token consumption per test type
The report helps you quickly assess bot performance and identify any failure patterns.
View Individual Test Cases: To inspect detailed results for each test case, use AISI's interactive viewer:
```bash
conda activate customerservice
aisi view --log-dir logs/
```

The project follows a layered architecture with clear separation of concerns:
```
┌──────────────────────────────────────────────────────────┐
│                   User Interface Layer                   │
│  - Interactive Chat (scripts/live_chat.py)               │
│  - Experiment Runner (scripts/start_experiment.sh)       │
└─────────────────────────────┬────────────────────────────┘
                              │
┌─────────────────────────────▼────────────────────────────┐
│                     Bot Layer (bot/)                     │
│  - CustomerServiceBot (orchestrates LLM + tools)         │
│  - LLMClient (OpenRouter API wrapper)                    │
│  - ToolExecutor (executes API calls via tools)           │
│  - PolicyChecker (enforces business rules)               │
└──────────────┬────────────────────────────┬──────────────┘
               │                            │
┌──────────────▼───────────┐  ┌─────────────▼────────────┐
│     API Layer (api/)     │  │    Testing/Evaluation    │
│  - FastAPI endpoints     │  │  - Unit tests            │
│  - Order models          │  │  - Live tests            │
│  - Mock data             │  │  - Experiments           │
└──────────────────────────┘  └──────────────────────────┘
```
The bot uses a tool-based architecture where the LLM decides which tools to call based on user requests. Tools provide structured access to the order API, while policies ensure business rules are enforced before actions are taken.
The API layer provides FastAPI endpoints that the bot calls via tools:
- `models.py`: Data models (`Order`, `OrderStatus`, etc.)
- `order_api.py`: FastAPI application with endpoints:
  - `GET /orders/{order_id}` - Retrieve order details
  - `GET /orders/{order_id}/tracking` - Get tracking information
  - `POST /orders/{order_id}/cancel` - Cancel an order
  - `GET /customers/{customer_id}/orders` - List customer orders
The API enforces the 10-day cancellation policy and validates order states before allowing cancellations. In production, this would connect to a real database, but here it uses mock data from tests/mock_data.py.
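As a hedged illustration of how such an endpoint can enforce the window, the sketch below uses a simplified in-memory order and made-up field names rather than the project's actual models:

```python
from datetime import datetime, timezone

from fastapi import FastAPI, HTTPException

app = FastAPI()
CANCELLATION_WINDOW_DAYS = 10

# Illustrative in-memory store; the project serves mock orders from tests/mock_data.py.
ORDERS = {
    "ORD-001": {"status": "processing", "placed_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
}

@app.post("/orders/{order_id}/cancel")
def cancel_order(order_id: str):
    order = ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order["status"] in ("cancelled", "delivered"):
        raise HTTPException(status_code=409, detail=f"Cannot cancel a {order['status']} order")
    age_days = (datetime.now(timezone.utc) - order["placed_at"]).days
    if age_days > CANCELLATION_WINDOW_DAYS:
        raise HTTPException(status_code=409, detail="Outside the 10-day cancellation window")
    order["status"] = "cancelled"
    return {"order_id": order_id, "status": "cancelled"}
```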
The bot layer orchestrates LLM interactions with structured tool access:
- `bot.py`: Main `CustomerServiceBot` class
  - Manages conversation history
  - Implements multi-round tool calling loop
  - Tracks session metrics (tokens, tool calls)
  - Maintains tool call log for evaluation
- `llm_client.py`: Wraps OpenRouter API
  - Handles OpenAI-compatible API calls
  - Formats tool definitions for function calling
  - Manages API key and model configuration
- `tools.py`: Tool definitions and execution
  - Defines tool schemas (`track_order`, `cancel_order`, etc.)
  - `ToolExecutor` executes tools by calling the order API
  - Formats tool results for LLM consumption
- `policies.py`: Centralized policy enforcement
  - `PolicyChecker` class with 10-day cancellation window
  - Validates customer ownership
  - Provides policy summaries for system prompts
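As a rough sketch of how a policy layer like this can be structured (method names follow the description above, but the return shape and details are assumptions, not the project's exact implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CANCELLATION_WINDOW_DAYS = 10

@dataclass
class PolicyDecision:
    allowed: bool
    reason: str

class PolicyChecker:
    """Centralized business-rule checks (illustrative sketch only)."""

    def can_cancel_order(self, status: str, placed_at: datetime) -> PolicyDecision:
        if status == "cancelled":
            return PolicyDecision(False, "Order is already cancelled")
        if status == "delivered":
            return PolicyDecision(False, "Delivered orders cannot be cancelled")
        age_days = (datetime.now(timezone.utc) - placed_at).days
        if age_days > CANCELLATION_WINDOW_DAYS:
            return PolicyDecision(False, f"Order is {age_days} days old; window is {CANCELLATION_WINDOW_DAYS} days")
        return PolicyDecision(True, f"Eligible ({age_days} days old)")

    def validate_customer_ownership(self, order_customer_id: str, requester_id: str) -> PolicyDecision:
        if order_customer_id != requester_id:
            return PolicyDecision(False, "Order belongs to a different customer")
        return PolicyDecision(True, "Customer verified")
```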
Comprehensive test coverage with multiple strategies:
- `test_api.py`: FastAPI endpoint tests
- `test_bot.py`: Bot logic tests with mocked LLM (36 unit tests)
- `test_live_integration.py`: Live API tests with real OpenRouter
- `api_mocking.py`: Shared mocking utilities for consistent test setup
- `mock_data.py`: 15 sample orders with varied ages and statuses
Tests use `unittest.mock` to isolate components and ensure fast, reliable execution without external dependencies.
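For instance, a bot-level test can stub the LLM client so no network call happens. This is only a sketch of the pattern; the import path, constructor, and method names here are assumptions rather than the real `CustomerServiceBot` interface:

```python
from unittest.mock import MagicMock

from bot.bot import CustomerServiceBot  # assumed import path

def test_greeting_does_not_trigger_tools():
    # Stand-in for the LLM client: returns a canned, tool-free reply.
    fake_llm = MagicMock()
    fake_llm.chat.return_value = {"content": "Hello! How can I help?", "tool_calls": []}

    # Stand-in for the tool executor: must never run for a simple greeting.
    fake_tools = MagicMock()

    # Hypothetical wiring; the real bot may be constructed differently.
    bot = CustomerServiceBot(llm_client=fake_llm, tool_executor=fake_tools)
    reply = bot.handle_message("Hi there")

    assert "Hello" in reply
    fake_tools.execute.assert_not_called()
```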
A minimal set of tests checks essential behaviour, using scripted evaluation with the AISI inspect framework:
- `scripted_scenarios.py`: Playbook-driven test cases
  - `tracking_task()` - 3 test cases for order tracking
  - `cancellation_task()` - 4 test cases for cancellations
  - Each case uses a playbook with user steps and assertions
- Test Structure: Each test case defines:
  - Input scenario (user message, order context)
  - Expected behavior (tool calls, responses)
  - Assertions (tool was called, regex pattern matched)
Experiments generate structured JSON logs for programmatic analysis and reporting.
Utility scripts for setup and execution:
- `env_setup.sh`: Shared conda environment setup
  - Creates the `customerservice` conda environment
  - Installs dependencies from `requirements.txt`
  - Handles environment activation
- `live_chat.sh`: Launcher for interactive chat
- `start_experiment.sh`: Runs tracking and cancellation experiments
- `analyse_experiments.sh`: Generates markdown report from experiment logs
- `live_chat.py`: Interactive REPL with database inspection commands
When a user sends a message, here's what happens:
- User Input: User message added to conversation history
- LLM Processing: LLM analyzes message and conversation context
- Tool Decision: LLM decides if tools are needed and which ones
- Tool Execution: If tools requested:
  - `ToolExecutor` calls order API endpoints
  - Results formatted and added to conversation
  - Process repeats (multi-round calling)
- Final Response: LLM produces text-only response when task complete
- Metrics Tracking: Tokens and tool calls recorded
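In code, a multi-round loop of this shape can be written against OpenRouter's OpenAI-compatible API roughly as follows (a condensed sketch using the `openai` SDK; the project's own `llm_client.py`/`bot.py` split, model choice, and round limit may differ):

```python
import json

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible endpoint

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="your-key-here")

def run_turn(messages, tools, execute_tool, model="openai/gpt-4o-mini", max_rounds=5):
    """Let the model call tools repeatedly until it replies in plain text."""
    for _ in range(max_rounds):
        response = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final, text-only answer
        messages.append(msg)  # keep the assistant's tool-call turn in the history
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Sorry, I couldn't finish that request within the round limit."
```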
Example Flow (Cancellation):
User: "I want to cancel order ORD-001"
β
Bot: [LLM decides to call validate_customer_ownership]
β
Tool: validate_customer_ownership(ORD-001) β "Customer verified"
β
Bot: [LLM decides to call check_cancellation_policy]
β
Tool: check_cancellation_policy(ORD-001) β "Eligible (3 days old)"
β
Bot: [LLM decides to call cancel_order]
β
Tool: cancel_order(ORD-001) β "Cancelled successfully"
β
Bot: "I've successfully cancelled your order ORD-001..."
The bot continues until it reaches a natural stopping point or hits the round limit.
Policies are enforced at multiple layers for security and correctness:
1. API Level (api/order_api.py)
- FastAPI endpoints validate order state before cancellation
- Checks age, status, and ownership before processing
2. Policy Layer (bot/policies.py)
- `PolicyChecker` provides centralized policy logic
- Methods: `can_cancel_order()`, `validate_customer_ownership()`
- Returns clear reasons for policy violations
3. Bot Level (bot/bot.py)
- System prompt instructs LLM to use policy checking tools
- Bot is instructed to always check policy before canceling
- Tool execution ensures policies are enforced even if LLM forgets
4. Tool Level (bot/tools.py)
- `check_cancellation_policy` tool explicitly validates eligibility
- Tool results inform LLM about policy compliance
- Forces LLM to check policy rather than guess
Current Policy: 10-Day Cancellation Window
- Orders placed ≤10 days ago: Eligible for cancellation
- Orders >10 days old: Not eligible (policy violation)
- Already cancelled: Cannot cancel again
- Delivered orders: Cannot cancel
The evaluation strategy uses scripted scenarios with single-assertion checks:
Test Structure:
- Each test case has a playbook defining the interaction flow
- Assertions check specific events (tool calls, response patterns)
- Scoring verifies assertions passed (1.0) or failed (0.0)
Metrics Collected:
- Success Rate: Percentage of test cases passing
- Token Usage: Total and average tokens per test
- Tool Calls: Number of tool invocations per test
- Response Patterns: Regex matches for expected bot behavior
Evaluation Types:
- Tracking Evaluation: Verifies bot correctly retrieves tracking info
- Cancellation Evaluation: Verifies bot enforces policies correctly
Scoring Logic:
- Checks if asserted event occurred (tool was called, regex matched)
- No partial credit - test either passes or fails
- Focuses on behavioral correctness rather than response quality
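In spirit, the scorer reduces to a binary check along these lines (an illustrative sketch, not the exact scorer wired into the inspect tasks):

```python
import re

def score_case(tool_calls: list[str], final_response: str,
               expected_tool: str | None = None,
               expected_regex: str | None = None) -> float:
    """Return 1.0 only if every configured assertion holds, otherwise 0.0."""
    if expected_tool is not None and expected_tool not in tool_calls:
        return 0.0
    if expected_regex is not None and not re.search(expected_regex, final_response):
        return 0.0
    return 1.0
```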
Report Generation:
- Consolidates results from multiple experiment runs
- Provides human-readable markdown report
- Includes interpretation of success rates and efficiency metrics
This strategy prioritizes reliability and policy adherence over response quality, ensuring the bot behaves correctly in production scenarios.