A 3-hour LLM-powered chatbot exercise for customer service bots: track orders, enforce 10-day cancellations, and evaluate with AISI's inspect.
This repository contains a fully generative customer service chatbot built as a learning exercise. The bot uses Large Language Models (LLMs) via the OpenRouter API to handle customer requests while enforcing business policies, specifically a 10-day cancellation window for orders.
This project demonstrates how to build a production-ready chatbot that:
- Integrates with LLMs via OpenRouter to generate natural language responses
- Enforces business policies automatically (e.g., only canceling orders placed within 10 days)
- Connects to APIs using structured tool/function calling to fetch order information and execute actions
- Evaluates behavior using the AISI inspect framework to verify policy adherence and measure performance
The bot handles two primary use cases:
- Order Tracking: Retrieve tracking information for customer orders
- Order Cancellation: Process cancellation requests while enforcing the 10-day policy window
This was built as a 3-hour exercise, so AI tools were used extensively during development. As a result, some areas may be rough around the edges, but the core functionality is complete and tested.
- Generative Bot: Uses LLM to generate natural, contextual responses
- Tool-Based Architecture: Structured API access via function calling
- Policy Enforcement: Automatic validation of business rules (10-day cancellation window)
- Comprehensive Testing: Unit tests with mocked APIs and live integration tests with real LLM
- Evaluation Framework: Scripted experiments using AISI inspect to verify bot behavior
- Performance Analysis: Report generation from experiment results with token usage and success metrics
- Interactive Chat: Live chat interface for manual testing and exploration
- Python 3.12+ and conda (or compatible environment manager)
- OpenRouter API Key (for LLM integration)
- Get one at openrouter.ai
- Set as environment variable:
export OPENROUTER_API_KEY='your-key-here'
The interactive chat interface lets you test the bot with real LLM calls in a conversational setting. It loads mock order data and provides helpful commands to inspect the database.
Start the chat:
```bash
export OPENROUTER_API_KEY='your-key-here'
./scripts/live_chat.sh
```

Available Commands:
- `/database` - Display all orders with cancellation eligibility
- `/orders <customer_id>` - Show all orders for a customer
- `/order <order_id>` - Show detailed info for a specific order
- `/metrics` - Display current session metrics (tokens, tool calls)
- `/reset` - Reset conversation history and order state
- `/export` - Export conversation to timestamped JSON file
- `/help` - Show help message
- `/exit` or `/quit` - Exit the chat session
Example Usage:
```
You: /database    # See all available orders
You: I want to cancel ORD-001
Bot: [Checks policy, validates ownership, cancels if eligible]
You: /metrics     # Check token usage
```

The chat uses real OpenRouter API calls, so responses are authentic but will consume API tokens.
The project includes comprehensive test suites for verifying bot behavior:
Unit Tests (Default) Run fast unit tests with fully mocked APIs (no external dependencies):
```bash
./run_tests.sh
```

These tests:
- Mock API calls
- Run quickly without external dependencies
- Test bot logic, policy enforcement, and tool execution
- Are suitable for CI/CD pipelines
Live Integration Tests Run tests with real OpenRouter API calls to verify actual LLM integration:
```bash
export OPENROUTER_API_KEY='your-key-here'
./run_tests.sh --live
```

Live tests:
- Make real API calls to OpenRouter (consumes tokens)
- Verify tool calling works with actual LLM
- Automatically skip if `OPENROUTER_API_KEY` is not set
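One common way to implement that guard is pytest's `skipif` marker; the sketch below shows the pattern, though the project's live suite may gate this differently:

```python
import os

import pytest

# Skip every test in this module when no API key is configured, so CI stays green.
pytestmark = pytest.mark.skipif(
    not os.environ.get("OPENROUTER_API_KEY"),
    reason="OPENROUTER_API_KEY not set; skipping live OpenRouter tests",
)

def test_live_llm_answers_tracking_question():
    ...  # would make a real OpenRouter call in the actual suite
```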
Test Coverage:
- `tests/test_api.py` - Mock API endpoint tests
- `tests/test_bot.py` - Bot logic tests (mocked LLM)
- `tests/test_live_integration.py` - Live API integration tests
The experiment framework uses AISI inspect to evaluate bot behavior across multiple test scenarios. Experiments run scripted test cases with playbook-driven assertions.
Quick Start:
```bash
export OPENROUTER_API_KEY='your-key-here'
./scripts/start_experiment.sh
```

This runs two experiment tasks:
- Tracking Task - Evaluates order tracking requests (3 test cases)
- Cancellation Task - Evaluates cancellation requests (4 test cases)
What Gets Tested:
- Tool calling correctness (verifies correct tools are invoked)
- Policy enforcement (bot correctly enforces 10-day cancellation window)
- Response patterns (regex assertions for expected bot responses)
- Token efficiency and tool usage metrics
Experiment Structure: Each test case uses a playbook with steps:
- `user` - User message that triggers the bot
- `inject_bot_response` - Injects bot context
- `inject_tool` - Injects tool results
- `expect_tool` - Asserts a specific tool was called
- `expect_bot_regex` - Asserts the bot response matches a pattern
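To make the step types concrete, a cancellation case could be expressed as data roughly like this (an illustrative sketch; the field names and result shapes are assumptions, not the exact schema used in `scripted_scenarios.py`):

```python
# Hypothetical playbook for "cancel a recent order"; keys are illustrative only.
cancel_within_window_case = {
    "name": "cancel_within_window",
    "playbook": [
        {"step": "user", "text": "Please cancel order ORD-001"},
        {"step": "expect_tool", "tool": "check_cancellation_policy"},
        {"step": "inject_tool", "tool": "check_cancellation_policy",
         "result": {"eligible": True, "reason": "Order is 3 days old"}},
        {"step": "expect_tool", "tool": "cancel_order"},
        {"step": "expect_bot_regex", "pattern": r"(?i)cancell?ed"},
    ],
}
```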
Output:
Logs are saved to `logs/experiments/` in JSON format for programmatic analysis.
After running experiments, generate a consolidated markdown report:
```bash
./scripts/analyse_experiments.sh
```

This script:
- Finds the latest tracking and cancellation experiment logs
- Parses JSON results and extracts metrics
- Generates `logs/experiments/REPORT.md` with:
  - Overall summary (pass rates, token usage)
  - Per-test-case results tables
  - Token efficiency metrics
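Conceptually, the analysis step boils down to reading the experiment JSON and aggregating a few fields. A rough sketch of that idea follows; the keys are placeholders, since the real logs follow the inspect framework's schema:

```python
import json
from pathlib import Path

def summarise(log_path: Path) -> dict:
    """Aggregate pass rate and token usage from one experiment log (placeholder keys)."""
    data = json.loads(log_path.read_text())
    cases = data.get("results", [])  # placeholder key, not the real schema
    passed = sum(1 for case in cases if case.get("score") == 1.0)
    tokens = sum(case.get("total_tokens", 0) for case in cases)
    count = len(cases) or 1
    return {"tests": len(cases), "pass_rate": passed / count, "avg_tokens": tokens / count}

latest = max(Path("logs/experiments").glob("*.json"), default=None)
if latest:
    print(summarise(latest))
```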
Report Contents:
- Summary Statistics: Total tests, pass rates, average tokens per test
- Tracking Results: Table of tracking test cases with pass/fail status
- Cancellation Results: Table of cancellation test cases with pass/fail status
- Token Metrics: Total and average token consumption per test type
The report helps you quickly assess bot performance and identify any failure patterns.
View Individual Test Cases: To inspect detailed results for each test case, use AISI's interactive viewer:
```bash
conda activate customerservice
aisi view --log-dir logs/
```

The project follows a layered architecture with clear separation of concerns:
```
┌──────────────────────────────────────────────────────────┐
│                   User Interface Layer                   │
│  - Interactive Chat (scripts/live_chat.py)               │
│  - Experiment Runner (scripts/start_experiment.sh)       │
└─────────────────────────────┬────────────────────────────┘
                              │
┌─────────────────────────────▼────────────────────────────┐
│                     Bot Layer (bot/)                     │
│  - CustomerServiceBot (orchestrates LLM + tools)         │
│  - LLMClient (OpenRouter API wrapper)                    │
│  - ToolExecutor (executes API calls via tools)           │
│  - PolicyChecker (enforces business rules)               │
└──────────────┬────────────────────────────┬──────────────┘
               │                            │
┌──────────────▼───────────┐  ┌─────────────▼────────────┐
│     API Layer (api/)     │  │    Testing/Evaluation    │
│  - FastAPI endpoints     │  │  - Unit tests            │
│  - Order models          │  │  - Live tests            │
│  - Mock data             │  │  - Experiments           │
└──────────────────────────┘  └──────────────────────────┘
```
The bot uses a tool-based architecture where the LLM decides which tools to call based on user requests. Tools provide structured access to the order API, while policies ensure business rules are enforced before actions are taken.
The API layer provides FastAPI endpoints that the bot calls via tools:
- `models.py`: Data models (`Order`, `OrderStatus`, etc.)
- `order_api.py`: FastAPI application with endpoints:
  - `GET /orders/{order_id}` - Retrieve order details
  - `GET /orders/{order_id}/tracking` - Get tracking information
  - `POST /orders/{order_id}/cancel` - Cancel an order
  - `GET /customers/{customer_id}/orders` - List customer orders
The API enforces the 10-day cancellation policy and validates order states before allowing cancellations. In production, this would connect to a real database, but here it uses mock data from tests/mock_data.py.
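As a hedged illustration of how such an endpoint can enforce the window, the sketch below uses a simplified in-memory order and made-up field names rather than the project's actual models:

```python
from datetime import datetime, timezone

from fastapi import FastAPI, HTTPException

app = FastAPI()
CANCELLATION_WINDOW_DAYS = 10

# Illustrative in-memory store; the project serves mock orders from tests/mock_data.py.
ORDERS = {
    "ORD-001": {"status": "processing", "placed_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
}

@app.post("/orders/{order_id}/cancel")
def cancel_order(order_id: str):
    order = ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order["status"] in ("cancelled", "delivered"):
        raise HTTPException(status_code=409, detail=f"Cannot cancel a {order['status']} order")
    age_days = (datetime.now(timezone.utc) - order["placed_at"]).days
    if age_days > CANCELLATION_WINDOW_DAYS:
        raise HTTPException(status_code=409, detail="Outside the 10-day cancellation window")
    order["status"] = "cancelled"
    return {"order_id": order_id, "status": "cancelled"}
```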
The bot layer orchestrates LLM interactions with structured tool access:
- `bot.py`: Main `CustomerServiceBot` class
  - Manages conversation history
  - Implements multi-round tool calling loop
  - Tracks session metrics (tokens, tool calls)
  - Maintains tool call log for evaluation
- `llm_client.py`: Wraps OpenRouter API
  - Handles OpenAI-compatible API calls
  - Formats tool definitions for function calling
  - Manages API key and model configuration
- `tools.py`: Tool definitions and execution
  - Defines tool schemas (`track_order`, `cancel_order`, etc.)
  - `ToolExecutor` executes tools by calling the order API
  - Formats tool results for LLM consumption
- `policies.py`: Centralized policy enforcement
  - `PolicyChecker` class with 10-day cancellation window
  - Validates customer ownership
  - Provides policy summaries for system prompts
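As a rough sketch of how a policy layer like this can be structured (method names follow the description above, but the return shape and details are assumptions, not the project's exact implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CANCELLATION_WINDOW_DAYS = 10

@dataclass
class PolicyDecision:
    allowed: bool
    reason: str

class PolicyChecker:
    """Centralized business-rule checks (illustrative sketch only)."""

    def can_cancel_order(self, status: str, placed_at: datetime) -> PolicyDecision:
        if status == "cancelled":
            return PolicyDecision(False, "Order is already cancelled")
        if status == "delivered":
            return PolicyDecision(False, "Delivered orders cannot be cancelled")
        age_days = (datetime.now(timezone.utc) - placed_at).days
        if age_days > CANCELLATION_WINDOW_DAYS:
            return PolicyDecision(False, f"Order is {age_days} days old; window is {CANCELLATION_WINDOW_DAYS} days")
        return PolicyDecision(True, f"Eligible ({age_days} days old)")

    def validate_customer_ownership(self, order_customer_id: str, requester_id: str) -> PolicyDecision:
        if order_customer_id != requester_id:
            return PolicyDecision(False, "Order belongs to a different customer")
        return PolicyDecision(True, "Customer verified")
```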
Comprehensive test coverage with multiple strategies:
- `test_api.py`: FastAPI endpoint tests
- `test_bot.py`: Bot logic tests with mocked LLM (36 unit tests)
- `test_live_integration.py`: Live API tests with real OpenRouter
- `api_mocking.py`: Shared mocking utilities for consistent test setup
- `mock_data.py`: 15 sample orders with varied ages and statuses
Tests use `unittest.mock` to isolate components and ensure fast, reliable execution without external dependencies.
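For instance, a bot-level test can stub the LLM client so no network call happens. This is only a sketch of the pattern; the import path, constructor, and method names here are assumptions rather than the real `CustomerServiceBot` interface:

```python
from unittest.mock import MagicMock

from bot.bot import CustomerServiceBot  # assumed import path

def test_greeting_does_not_trigger_tools():
    # Stand-in for the LLM client: returns a canned, tool-free reply.
    fake_llm = MagicMock()
    fake_llm.chat.return_value = {"content": "Hello! How can I help?", "tool_calls": []}

    # Stand-in for the tool executor: must never run for a simple greeting.
    fake_tools = MagicMock()

    # Hypothetical wiring; the real bot may be constructed differently.
    bot = CustomerServiceBot(llm_client=fake_llm, tool_executor=fake_tools)
    reply = bot.handle_message("Hi there")

    assert "Hello" in reply
    fake_tools.execute.assert_not_called()
```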
A minimal set of tests checks essential behaviour, using scripted evaluation with the AISI inspect framework:
- `scripted_scenarios.py`: Playbook-driven test cases
  - `tracking_task()` - 3 test cases for order tracking
  - `cancellation_task()` - 4 test cases for cancellations
  - Each case uses a playbook with user steps and assertions
- Test Structure: Each test case defines:
  - Input scenario (user message, order context)
  - Expected behavior (tool calls, responses)
  - Assertions (tool was called, regex pattern matched)
Experiments generate structured JSON logs for programmatic analysis and reporting.
Utility scripts for setup and execution:
- `env_setup.sh`: Shared conda environment setup
  - Creates the `customerservice` conda environment
  - Installs dependencies from `requirements.txt`
  - Handles environment activation
- `live_chat.sh`: Launcher for interactive chat
- `start_experiment.sh`: Runs tracking and cancellation experiments
- `analyse_experiments.sh`: Generates markdown report from experiment logs
- `live_chat.py`: Interactive REPL with database inspection commands
When a user sends a message, here's what happens:
- User Input: User message added to conversation history
- LLM Processing: LLM analyzes message and conversation context
- Tool Decision: LLM decides if tools are needed and which ones
- Tool Execution: If tools requested:
  - `ToolExecutor` calls order API endpoints
  - Results formatted and added to conversation
  - Process repeats (multi-round calling)
- Final Response: LLM produces text-only response when task complete
- Metrics Tracking: Tokens and tool calls recorded
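In code, a multi-round loop of this shape can be written against OpenRouter's OpenAI-compatible API roughly as follows (a condensed sketch using the `openai` SDK; the project's own `llm_client.py`/`bot.py` split, model choice, and round limit may differ):

```python
import json

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible endpoint

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="your-key-here")

def run_turn(messages, tools, execute_tool, model="openai/gpt-4o-mini", max_rounds=5):
    """Let the model call tools repeatedly until it replies in plain text."""
    for _ in range(max_rounds):
        response = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final, text-only answer
        messages.append(msg)  # keep the assistant's tool-call turn in the history
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Sorry, I couldn't finish that request within the round limit."
```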
Example Flow (Cancellation):
User: "I want to cancel order ORD-001"
β
Bot: [LLM decides to call validate_customer_ownership]
β
Tool: validate_customer_ownership(ORD-001) β "Customer verified"
β
Bot: [LLM decides to call check_cancellation_policy]
β
Tool: check_cancellation_policy(ORD-001) β "Eligible (3 days old)"
β
Bot: [LLM decides to call cancel_order]
β
Tool: cancel_order(ORD-001) β "Cancelled successfully"
β
Bot: "I've successfully cancelled your order ORD-001..."
The bot continues until it reaches a natural stopping point or hits the round limit.
Policies are enforced at multiple layers for security and correctness:
1. API Level (api/order_api.py)
- FastAPI endpoints validate order state before cancellation
- Checks age, status, and ownership before processing
2. Policy Layer (bot/policies.py)
- `PolicyChecker` provides centralized policy logic
- Methods: `can_cancel_order()`, `validate_customer_ownership()`
- Returns clear reasons for policy violations
3. Bot Level (bot/bot.py)
- System prompt instructs LLM to use policy checking tools
- Bot is instructed to always check policy before canceling
- Tool execution ensures policies are enforced even if LLM forgets
4. Tool Level (bot/tools.py)
- `check_cancellation_policy` tool explicitly validates eligibility
- Tool results inform LLM about policy compliance
- Forces LLM to check policy rather than guess
Current Policy: 10-Day Cancellation Window
- Orders placed ≤10 days ago: Eligible for cancellation
- Orders >10 days old: Not eligible (policy violation)
- Already cancelled: Cannot cancel again
- Delivered orders: Cannot cancel
The evaluation strategy uses scripted scenarios with single-assertion checks:
Test Structure:
- Each test case has a playbook defining the interaction flow
- Assertions check specific events (tool calls, response patterns)
- Scoring verifies assertions passed (1.0) or failed (0.0)
Metrics Collected:
- Success Rate: Percentage of test cases passing
- Token Usage: Total and average tokens per test
- Tool Calls: Number of tool invocations per test
- Response Patterns: Regex matches for expected bot behavior
Evaluation Types:
- Tracking Evaluation: Verifies bot correctly retrieves tracking info
- Cancellation Evaluation: Verifies bot enforces policies correctly
Scoring Logic:
- Checks if asserted event occurred (tool was called, regex matched)
- No partial credit - test either passes or fails
- Focuses on behavioral correctness rather than response quality
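In spirit, the scorer reduces to a binary check along these lines (an illustrative sketch, not the exact scorer wired into the inspect tasks):

```python
import re

def score_case(tool_calls: list[str], final_response: str,
               expected_tool: str | None = None,
               expected_regex: str | None = None) -> float:
    """Return 1.0 only if every configured assertion holds, otherwise 0.0."""
    if expected_tool is not None and expected_tool not in tool_calls:
        return 0.0
    if expected_regex is not None and not re.search(expected_regex, final_response):
        return 0.0
    return 1.0
```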
Report Generation:
- Consolidates results from multiple experiment runs
- Provides human-readable markdown report
- Includes interpretation of success rates and efficiency metrics
This strategy prioritizes reliability and policy adherence over response quality, ensuring the bot behaves correctly in production scenarios.