CustomerBotBlitz

A 3-hour LLM-powered chatbot exercise for customer service bots: track orders, enforce 10-day cancellations, and evaluate with AISI's inspect.

Introduction

This repository contains a fully generative customer service chatbot built as a learning exercise. The bot uses Large Language Models (LLMs) via the OpenRouter API to handle customer requests while enforcing business policies, specifically a 10-day cancellation window for orders.

What This Repo Is

This project demonstrates how to build an LLM-powered, policy-aware chatbot that:

  • Integrates with LLMs via OpenRouter to generate natural language responses
  • Enforces business policies automatically (e.g., only canceling orders placed within 10 days)
  • Connects to APIs using structured tool/function calling to fetch order information and execute actions
  • Evaluates behavior using the AISI inspect framework to verify policy adherence and measure performance

The bot handles two primary use cases:

  1. Order Tracking: Retrieve tracking information for customer orders
  2. Order Cancellation: Process cancellation requests while enforcing the 10-day policy window

Project Context

This was built as a 3-hour exercise, so AI tools were used extensively during development. As a result, some areas may be rough around the edges, but the core functionality is complete and tested.

Key Features

  • 🤖 Generative Bot: Uses an LLM to generate natural, contextual responses
  • 🔧 Tool-Based Architecture: Structured API access via function calling
  • 📋 Policy Enforcement: Automatic validation of business rules (10-day cancellation window)
  • 🧪 Comprehensive Testing: Unit tests with mocked APIs and live integration tests against a real LLM
  • 📊 Evaluation Framework: Scripted experiments using AISI inspect to verify bot behavior
  • 📈 Performance Analysis: Report generation from experiment results with token usage and success metrics
  • 💬 Interactive Chat: Live chat interface for manual testing and exploration

How to Use Each Feature

Prerequisites

  • Python 3.12+ and conda (or compatible environment manager)
  • OpenRouter API Key (for LLM integration)
    • Get one at openrouter.ai
    • Set as environment variable: export OPENROUTER_API_KEY='your-key-here'

Interactive Live Chat

The interactive chat interface lets you test the bot with real LLM calls in a conversational setting. It loads mock order data and provides helpful commands to inspect the database.

Start the chat:

export OPENROUTER_API_KEY='your-key-here'
./scripts/live_chat.sh

Available Commands:

  • /database - Display all orders with cancellation eligibility
  • /orders <customer_id> - Show all orders for a customer
  • /order <order_id> - Show detailed info for a specific order
  • /metrics - Display current session metrics (tokens, tool calls)
  • /reset - Reset conversation history and order state
  • /export - Export conversation to timestamped JSON file
  • /help - Show help message
  • /exit or /quit - Exit the chat session

Example Usage:

You: /database              # See all available orders
You: I want to cancel ORD-001
Bot: [Checks policy, validates ownership, cancels if eligible]
You: /metrics               # Check token usage

The chat uses real OpenRouter API calls, so responses are authentic but will consume API tokens.
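Under the hood, the chat talks to OpenRouter through its OpenAI-compatible endpoint. As a rough sketch of that kind of call (the model name here is an assumption; the repo's default may differ):

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # hypothetical model choice
    messages=[
        {"role": "system", "content": "You are a customer service bot."},
        {"role": "user", "content": "Where is my order ORD-001?"},
    ],
)
print(response.choices[0].message.content)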

Running Tests

The project includes comprehensive test suites for verifying bot behavior:

Unit Tests (Default): Run fast unit tests with fully mocked APIs (no external dependencies):

./run_tests.sh

These tests:

  • Mock API calls
  • Run quickly without external dependencies
  • Test bot logic, policy enforcement, and tool execution
  • Are suitable for CI/CD pipelines
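As a sketch of this testing style (the helper function and its names are illustrative, not the repo's actual code), a mocked-API test looks roughly like:

import unittest
from unittest.mock import MagicMock

def cancel_via_tool(api, order_id):
    # Stand-in for ToolExecutor behavior: call the API, format the result.
    response = api.cancel_order(order_id)
    return f"Order {order_id}: {response['status']}"

class TestToolExecution(unittest.TestCase):
    def test_cancel_tool_formats_api_result(self):
        api = MagicMock()  # fully mocked API, no external dependencies
        api.cancel_order.return_value = {"status": "cancelled"}

        self.assertEqual(cancel_via_tool(api, "ORD-001"), "Order ORD-001: cancelled")
        api.cancel_order.assert_called_once_with("ORD-001")

if __name__ == "__main__":
    unittest.main()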

Live Integration Tests: Run tests with real OpenRouter API calls to verify the actual LLM integration:

export OPENROUTER_API_KEY='your-key-here'
./run_tests.sh --live

Live tests:

  • Make real API calls to OpenRouter (consumes tokens)
  • Verify tool calling works with actual LLM
  • Automatically skip if OPENROUTER_API_KEY is not set (see the sketch below)
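The auto-skip can be expressed with a pytest marker along these lines (a sketch; the repo may wire this differently):

import os
import pytest

requires_api_key = pytest.mark.skipif(
    not os.environ.get("OPENROUTER_API_KEY"),
    reason="OPENROUTER_API_KEY not set; skipping live tests",
)

@requires_api_key
def test_live_tracking_request():
    ...  # would hit the real OpenRouter API here (consumes tokens)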

Test Coverage:

  • tests/test_api.py - Mock API endpoint tests
  • tests/test_bot.py - Bot logic tests (mocked LLM)
  • tests/test_live_integration.py - Live API integration tests

Running Experiments

The experiment framework uses AISI inspect to evaluate bot behavior across multiple test scenarios. Experiments run scripted test cases with playbook-driven assertions.

Quick Start:

export OPENROUTER_API_KEY='your-key-here'
./scripts/start_experiment.sh

This runs two experiment tasks:

  1. Tracking Task - Evaluates order tracking requests (3 test cases)
  2. Cancellation Task - Evaluates cancellation requests (4 test cases)

What Gets Tested:

  • Tool calling correctness (verifies correct tools are invoked)
  • Policy enforcement (bot correctly enforces 10-day cancellation window)
  • Response patterns (regex assertions for expected bot responses)
  • Token efficiency and tool usage metrics

Experiment Structure: Each test case uses a playbook with steps (a sketch follows the list):

  • user - User message that triggers bot
  • inject_bot_response - Injects bot context
  • inject_tool - Injects tool results
  • expect_tool - Asserts specific tool was called
  • expect_bot_regex - Asserts bot response matches pattern

Output: Logs are saved to logs/experiments/ with JSON format for programmatic analysis.

Generating Reports

After running experiments, generate a consolidated markdown report:

./scripts/analyse_experiments.sh

This script:

  • Finds the latest tracking and cancellation experiment logs
  • Parses JSON results and extracts metrics
  • Generates logs/experiments/REPORT.md with:
    • Overall summary (pass rates, token usage)
    • Per-test-case results tables
    • Token efficiency metrics

Report Contents:

  • Summary Statistics: Total tests, pass rates, average tokens per test
  • Tracking Results: Table of tracking test cases with pass/fail status
  • Cancellation Results: Table of cancellation test cases with pass/fail status
  • Token Metrics: Total and average token consumption per test type

The report helps you quickly assess bot performance and identify any failure patterns.
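Because the logs are plain JSON, you can also compute metrics yourself. A minimal sketch, assuming a simple layout with a results list and per-test score fields (the real log schema may differ):

import json
from pathlib import Path

log_dir = Path("logs/experiments")
# Pick the most recently written experiment log.
latest = max(log_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

data = json.loads(latest.read_text())
results = data.get("results", [])
passed = sum(1 for r in results if r.get("score") == 1.0)
print(f"{latest.name}: {passed}/{len(results)} test cases passed")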

View Individual Test Cases: To inspect detailed results for each test case, use the inspect framework's interactive viewer:

conda activate customerservice
inspect view --log-dir logs/

Design of the Repo

Architecture Overview

The project follows a layered architecture with clear separation of concerns:

┌────────────────────────────────────────────────────────┐
│  User Interface Layer                                  │
│  - Interactive Chat (scripts/live_chat.py)             │
│  - Experiment Runner (scripts/start_experiment.sh)     │
└────────────────────┬───────────────────────────────────┘
                     │
┌────────────────────▼───────────────────────────────────┐
│  Bot Layer (bot/)                                      │
│  - CustomerServiceBot (orchestrates LLM + tools)       │
│  - LLMClient (OpenRouter API wrapper)                  │
│  - ToolExecutor (executes API calls via tools)         │
│  - PolicyChecker (enforces business rules)             │
└────────────┬───────────────────────┬───────────────────┘
             │                       │
┌────────────▼──────────┐  ┌─────────▼──────────────┐
│  API Layer (api/)     │  │  Testing/Evaluation    │
│  - FastAPI endpoints  │  │  - Unit tests          │
│  - Order models       │  │  - Live tests          │
│  - Mock data          │  │  - Experiments         │
└───────────────────────┘  └────────────────────────┘

The bot uses a tool-based architecture where the LLM decides which tools to call based on user requests. Tools provide structured access to the order API, while policies ensure business rules are enforced before actions are taken.
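Tools are advertised to the LLM as JSON schemas in the OpenAI function-calling format. A sketch of what the cancel_order tool definition could look like (parameter details are assumptions; see bot/tools.py for the real definitions):

cancel_order_tool = {
    "type": "function",
    "function": {
        "name": "cancel_order",
        "description": "Cancel an order if it is within the 10-day window.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "e.g. ORD-001"},
            },
            "required": ["order_id"],
        },
    },
}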

Core Components

API Layer (api/)

The API layer provides FastAPI endpoints that the bot calls via tools:

  • models.py: Data models (Order, OrderStatus, etc.)
  • order_api.py: FastAPI application with endpoints:
    • GET /orders/{order_id} - Retrieve order details
    • GET /orders/{order_id}/tracking - Get tracking information
    • POST /orders/{order_id}/cancel - Cancel an order
    • GET /customers/{customer_id}/orders - List customer orders

The API enforces the 10-day cancellation policy and validates order states before allowing cancellations. In production, this would connect to a real database, but here it uses mock data from tests/mock_data.py.
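A condensed sketch of how such an endpoint can enforce the window (the real order_api.py uses proper data models and richer error handling):

from datetime import date
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical in-memory store standing in for tests/mock_data.py.
ORDERS = {"ORD-001": {"placed_on": date(2024, 6, 12), "status": "processing"}}

@app.post("/orders/{order_id}/cancel")
def cancel_order(order_id: str):
    order = ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order["status"] in ("cancelled", "delivered"):
        raise HTTPException(status_code=400, detail=f"Cannot cancel a {order['status']} order")
    if (date.today() - order["placed_on"]).days > 10:
        raise HTTPException(status_code=400, detail="Outside 10-day cancellation window")
    order["status"] = "cancelled"
    return {"order_id": order_id, "status": "cancelled"}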

Bot Layer (bot/)

The bot layer orchestrates LLM interactions with structured tool access:

  • bot.py: Main CustomerServiceBot class

    • Manages conversation history
    • Implements multi-round tool calling loop
    • Tracks session metrics (tokens, tool calls)
    • Maintains tool call log for evaluation
  • llm_client.py: Wraps OpenRouter API

    • Handles OpenAI-compatible API calls
    • Formats tool definitions for function calling
    • Manages API key and model configuration
  • tools.py: Tool definitions and execution

    • Defines tool schemas (track_order, cancel_order, etc.)
    • ToolExecutor executes tools by calling order API
    • Formats tool results for LLM consumption
  • policies.py: Centralized policy enforcement (sketched after this list)

    • PolicyChecker class with 10-day cancellation window
    • Validates customer ownership
    • Provides policy summaries for system prompts
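A compact sketch of the window logic (a hypothetical reconstruction, not the repo's actual code):

from datetime import date

class PolicyChecker:
    """Hypothetical reconstruction of the 10-day cancellation policy."""

    CANCELLATION_WINDOW_DAYS = 10

    def can_cancel_order(self, placed_on: date, status: str, today: date | None = None):
        today = today or date.today()
        if status in ("cancelled", "delivered"):
            return False, f"Order is already {status}"
        age_days = (today - placed_on).days
        if age_days > self.CANCELLATION_WINDOW_DAYS:
            return False, f"Order is {age_days} days old (window is {self.CANCELLATION_WINDOW_DAYS} days)"
        return True, f"Eligible ({age_days} days old)"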

Testing Infrastructure (tests/)

Comprehensive test coverage with multiple strategies:

  • test_api.py: FastAPI endpoint tests
  • test_bot.py: Bot logic tests with mocked LLM (36 unit tests)
  • test_live_integration.py: Live API tests with real OpenRouter
  • api_mocking.py: Shared mocking utilities for consistent test setup
  • mock_data.py: 15 sample orders with varied ages and statuses

Tests use unittest.mock to isolate components and ensure fast, reliable execution without external dependencies.

Experiment Framework (experiment/)

A minimal set of scripted tests checks essential behavior, evaluated with the AISI inspect framework:

  • scripted_scenarios.py: Playbook-driven test cases

    • tracking_task() - 3 test cases for order tracking
    • cancellation_task() - 4 test cases for cancellations
    • Each case uses a playbook with user steps and assertions
  • Test Structure: Each test case defines:

    • Input scenario (user message, order context)
    • Expected behavior (tool calls, responses)
    • Assertions (tool was called, regex pattern matched)

Experiments generate structured JSON logs for programmatic analysis and reporting.

Scripts (scripts/)

Utility scripts for setup and execution:

  • env_setup.sh: Shared conda environment setup

    • Creates customerservice conda environment
    • Installs dependencies from requirements.txt
    • Handles environment activation
  • live_chat.sh: Launcher for interactive chat

  • start_experiment.sh: Runs tracking and cancellation experiments

  • analyse_experiments.sh: Generates markdown report from experiment logs

  • live_chat.py: Interactive REPL with database inspection commands

Tool Calling Flow

When a user sends a message, here's what happens:

  1. User Input: User message added to conversation history
  2. LLM Processing: LLM analyzes message and conversation context
  3. Tool Decision: LLM decides if tools are needed and which ones
  4. Tool Execution: If tools requested:
    • ToolExecutor calls order API endpoints
    • Results formatted and added to conversation
    • Process repeats (multi-round calling)
  5. Final Response: LLM produces text-only response when task complete
  6. Metrics Tracking: Tokens and tool calls recorded

Example Flow (Cancellation):

User: "I want to cancel order ORD-001"
  ↓
Bot: [LLM decides to call validate_customer_ownership]
  ↓
Tool: validate_customer_ownership(ORD-001) β†’ "Customer verified"
  ↓
Bot: [LLM decides to call check_cancellation_policy]
  ↓
Tool: check_cancellation_policy(ORD-001) β†’ "Eligible (3 days old)"
  ↓
Bot: [LLM decides to call cancel_order]
  ↓
Tool: cancel_order(ORD-001) β†’ "Cancelled successfully"
  ↓
Bot: "I've successfully cancelled your order ORD-001..."

The bot continues until it reaches a natural stopping point or hits the round limit.
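In code, the loop has roughly this shape (a sketch with hypothetical method names like llm.chat and executor.run; the real bot/bot.py differs in detail):

MAX_ROUNDS = 5  # round limit; an assumed value

def handle_message(llm, executor, messages, tools):
    for _ in range(MAX_ROUNDS):
        reply = llm.chat(messages=messages, tools=tools)
        messages.append(reply)
        if not reply.get("tool_calls"):  # text-only reply: task is complete
            return reply["content"]
        for call in reply["tool_calls"]:  # execute each requested tool
            result = executor.run(call["name"], call["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": result,
            })
    return "Sorry, I couldn't complete that within the round limit."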

Policy Enforcement

Policies are enforced at multiple layers for security and correctness:

1. API Level (api/order_api.py)

  • FastAPI endpoints validate order state before cancellation
  • Checks age, status, and ownership before processing

2. Policy Layer (bot/policies.py)

  • PolicyChecker provides centralized policy logic
  • Methods: can_cancel_order(), validate_customer_ownership()
  • Returns clear reasons for policy violations

3. Bot Level (bot/bot.py)

  • System prompt instructs LLM to use policy checking tools
  • Bot is instructed to always check policy before canceling
  • Tool execution ensures policies are enforced even if LLM forgets

4. Tool Level (bot/tools.py)

  • check_cancellation_policy tool explicitly validates eligibility
  • Tool results inform LLM about policy compliance
  • Forces LLM to check policy rather than guess

Current Policy: 10-Day Cancellation Window

  • Orders placed ≤10 days ago: Eligible for cancellation
  • Orders >10 days old: Not eligible (policy violation)
  • Already cancelled: Cannot cancel again
  • Delivered orders: Cannot cancel

Evaluation Strategy

The evaluation strategy uses scripted scenarios with single-assertion checks:

Test Structure:

  • Each test case has a playbook defining the interaction flow
  • Assertions check specific events (tool calls, response patterns)
  • Scoring verifies assertions passed (1.0) or failed (0.0)

Metrics Collected:

  • Success Rate: Percentage of test cases passing
  • Token Usage: Total and average tokens per test
  • Tool Calls: Number of tool invocations per test
  • Response Patterns: Regex matches for expected bot behavior

Evaluation Types:

  1. Tracking Evaluation: Verifies bot correctly retrieves tracking info
  2. Cancellation Evaluation: Verifies bot enforces policies correctly

Scoring Logic (sketched after this list):

  • Checks if asserted event occurred (tool was called, regex matched)
  • No partial credit - test either passes or fails
  • Focuses on behavioral correctness rather than response quality
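In essence the scoring reduces to a few lines (a sketch of the logic, not inspect's actual API):

def score_test_case(assertions_passed: list[bool]) -> float:
    # All-or-nothing: 1.0 only if every assertion in the playbook held.
    return 1.0 if all(assertions_passed) else 0.0

def success_rate(scores: list[float]) -> float:
    return sum(scores) / len(scores) if scores else 0.0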

Report Generation:

  • Consolidates results from multiple experiment runs
  • Provides human-readable markdown report
  • Includes interpretation of success rates and efficiency metrics

This strategy prioritizes reliability and policy adherence over response quality, ensuring the bot behaves correctly in production scenarios.
