Operational LLMOps workshop using Microsoft Foundry — focused on operationalizing an existing RAG chatbot, not building one from scratch. Covers automated evaluation workflows, model swap + versioning, CI/CD promotion gates (Azure DevOps), and MLflow GenAI integration.
Audience: Teams that already have a RAG chatbot and need to understand how to operationalize it — the "Ops" in LLMOps.
```mermaid
flowchart TB
subgraph Client["Client"]
UI["Web Chat UI<br/>index.html"]
end
subgraph Backend["Flask Backend"]
APP["app.py<br/>RBAC Auth"]
end
subgraph Azure["Azure Cloud"]
subgraph Foundry["Microsoft Foundry"]
PROJECT["proj-llmops-demo"]
INFERENCE["Inference API"]
GPT["gpt-4o<br/>Chat Completion"]
EMB["text-embedding-3-large<br/>Embeddings"]
SAFETY["Guardrails + Controls"]
EVAL["Evaluation"]
TRACE["Tracing"]
end
subgraph Search["Azure AI Search"]
INDEX["walle-products<br/>Vector Index"]
DOCS["8 Documents<br/>txt, md"]
end
end
subgraph LLMOps["LLMOps Modules"]
EVALMOD["02-evaluation/<br/>Groundedness, Fluency"]
SAFETYMOD["03-content-safety/<br/>Jailbreak Testing"]
end
subgraph DataFolder["data/"]
TXT["*.txt<br/>Product Specs"]
MD["*.md<br/>Policies"]
end
UI -->|"1. User Question"| APP
APP -->|"2. Embed Query"| INFERENCE
INFERENCE --> EMB
EMB -->|"3. Vector"| INDEX
INDEX -->|"4. Top 3 Docs"| APP
APP -->|"5. Context + Question"| INFERENCE
INFERENCE --> GPT
GPT -->|"6. Answer"| APP
APP -->|"7. Response"| UI
TXT --> INDEX
MD --> INDEX
PROJECT --> INFERENCE
APP -.->|"Telemetry"| TRACE
EVALMOD -.->|"Quality Metrics"| EVAL
SAFETYMOD -.->|"Test Filters"| SAFETY
```
```mermaid
sequenceDiagram
participant U as User
participant F as Flask App
participant P as Foundry Project
participant E as Embeddings
participant S as AI Search
participant G as GPT-4o
U->>F: "What's the return policy?"
F->>P: AIProjectClient
P->>E: Generate embedding
E-->>F: [0.123, -0.456, ...]
F->>S: Vector search (top 3)
S-->>F: Return Policy, Warranty, FAQ
F->>P: Chat completion
P->>G: System + Context + Question
G-->>F: "Wall-E offers 30-day returns..."
F-->>U: Formatted response
```
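Conceptually, steps 3–4 of the sequence above reduce to nearest-neighbour search over embedding vectors. A minimal, self-contained sketch of that idea in plain Python (toy 3-dimensional vectors stand in for the real 3,072-dimensional text-embedding-3-large output; the document names and vector values are illustrative, not taken from the workshop index):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], docs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k document names most similar to the query vector."""
    ranked = sorted(docs, key=lambda name: cosine_similarity(query_vec, docs[name]), reverse=True)
    return ranked[:k]

# Toy index: in the workshop these would be embeddings stored in Azure AI Search.
index = {
    "return-policy.md":   [0.9, 0.1, 0.0],
    "warranty-policy.md": [0.8, 0.3, 0.1],
    "laptop-pro-15.txt":  [0.1, 0.9, 0.2],
    "shipping-policy.md": [0.7, 0.2, 0.6],
}
print(top_k([1.0, 0.1, 0.0], index))  # the three policy docs outrank the product spec
```

Azure AI Search does this at scale with approximate nearest-neighbour indexes rather than a brute-force sort, but the ranking principle is the same.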
Operationalize a RAG chatbot ("Wall-E Electronics") deployed on Microsoft Foundry:
| Module | Topic | Key Concepts | Azure Services | Time |
|---|---|---|---|---|
| 1 | Orientation & Architecture | Pre-built RAG chatbot walkthrough | Microsoft Foundry, AI Search | 10 min |
| 2 | Automated Evaluation Workflows | Groundedness, relevance, similarity, fluency with pass/fail gates | Azure AI Evaluation SDK | 25 min |
| 3 | Model Swap + Re-Evaluation | Safely replace a model, auto-compare evaluations side-by-side | Foundry Model Catalog | 25 min |
| 4 | CI/CD with Promotion Gates | Azure DevOps pipeline with eval + safety gates blocking merges | Azure DevOps | 25 min |
| 5 | MLflow for GenAI Ops | Tracing, prompt versioning, mlflow.evaluate() | MLflow 2.18+ | 25 min |
| 6 | Q&A / Apply to Your System | Map patterns to your existing chatbot | — | 10 min |
Total Duration: ~120 minutes
- Run automated quality evaluations (4 metrics) with promotion gates
- Safely swap models and compare quality before promoting
- Build CI/CD pipelines (Azure DevOps) with evaluation and content safety gates
- Use MLflow tracing, prompt versioning, and evaluate() alongside Foundry
- Test content safety with jailbreak detection
This workshop uses RBAC (Role-Based Access Control) — no API keys required.
Your Azure CLI credentials are used automatically via DefaultAzureCredential:
- `Cognitive Services OpenAI User` — Call Foundry inference APIs
- `Search Index Data Contributor` — Read/write search indices
- `Search Service Contributor` — Manage search service
- Azure subscription with Contributor access
- Azure CLI v2.50+
- Python 3.10+
- VS Code with Python extension
If your customer wants to run this workshop in their existing Azure environment (and does not want to deploy via infra/), follow the runbook:
The runbook covers both:
- Scenario A: run locally only
- Scenario B: deploy the frontend to Azure App Service (Managed Identity + RBAC)
```powershell
# Clone the repository
git clone https://github.com/ritwickmicrosoft/llmops-workshop-demo-foundry.git
cd llmops-workshop-demo-foundry

# Create Python virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

# Login to Azure
az login
```

- Go to the Microsoft Foundry Portal
- Create a new project (e.g., `proj-llmops-demo`)
- Deploy models from the Model Catalog: `gpt-4o` for chat completions, `text-embedding-3-large` for embeddings
- Note your Foundry endpoint: `https://<your-resource>.services.ai.azure.com`
```powershell
# Set variables
$env:AZURE_RESOURCE_GROUP = "rg-llmops-canadaeast"
$env:AZURE_LOCATION = "canadaeast"

# Create resource group
az group create --name $env:AZURE_RESOURCE_GROUP --location $env:AZURE_LOCATION

# Create Azure AI Search
az search service create `
    --name "search-llmops-canadaeast" `
    --resource-group $env:AZURE_RESOURCE_GROUP `
    --location $env:AZURE_LOCATION `
    --sku Basic

# Assign RBAC roles (replace with your Foundry resource)
$myId = (az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee $myId --role "Search Index Data Contributor" `
    --scope $(az search service show --name search-llmops-canadaeast --resource-group $env:AZURE_RESOURCE_GROUP --query id -o tsv)

# Update .env with your resource endpoints
Copy-Item .env.example .env
# Edit .env with your endpoints
# Note: scripts automatically load .env if present

# Create search index with sample documents
cd 01-rag-chatbot
python create_search_index.py

# Run the frontend
cd ../04-frontend
python app.py
# Open http://localhost:5000
```

Open LLMOps_Workshop_Playbook.html in your browser for detailed step-by-step instructions.
```
llmops-workshop/
├── data/                              # Sample documents (txt, md)
│   ├── laptop-pro-15.txt              # Product specs
│   ├── smartwatch-x200.txt            # Product specs
│   ├── nc500-headphones.txt           # Product specs
│   ├── tablet-s10.txt                 # Product specs
│   ├── return-policy.md               # Policy document
│   ├── warranty-policy.md             # Policy document
│   ├── shipping-policy.md             # Policy document
│   └── troubleshooting-guide.md       # Support document
├── 01-rag-chatbot/                    # RAG Chatbot (pre-built, orientation only)
│   └── create_search_index.py         # Reads data/ folder, vectorizes, indexes
├── 02-evaluation/                     # ★ Automated Evaluation Workflows
│   ├── eval_dataset.jsonl             # Test dataset (Q&A pairs)
│   ├── run_evaluation.py              # 4 metrics + promotion gates (+ optional portal upload)
│   └── eval_results/                  # Generated reports (HTML + JSON)
├── 03-content-safety/                 # Content Safety Module
│   ├── content_filter_config.json    # Filter configuration
│   ├── test_content_safety.py         # Test content filters
│   └── test_results/                  # Generated reports (HTML + JSON)
├── 04-frontend/                       # Web Chat Interface
│   ├── app.py                         # Flask backend (RBAC + Tracing)
│   ├── index.html                     # Dark-themed chat UI
│   ├── deploy-to-azure.ps1            # Azure App Service deployment script
│   └── requirements.txt               # Frontend dependencies
├── 05-model-swap/                     # ★ Model Swap + Re-Evaluation
│   ├── model_swap_eval.py             # Compare models side-by-side
│   └── comparison_results/            # Generated comparison reports
├── 06-cicd/                           # ★ CI/CD with Promotion Gates
│   ├── azure-pipelines.yml            # Azure DevOps pipeline (4 stages)
│   ├── promotion_gate.py              # Gate checker (eval, safety, comparison)
│   └── README.txt                     # Setup instructions
├── 07-mlflow/                         # ★ MLflow GenAI Integration
│   ├── mlflow_tracing_demo.py         # Tracing + prompt versioning
│   ├── mlflow_eval_demo.py            # mlflow.evaluate() vs Azure AI Eval
│   └── mlflow_eval_results/           # Generated results
├── docs/                              # Customer documentation
│   └── run-with-existing-azure-resources.md  # BYOA runbook
├── infra/                             # Infrastructure as Code
│   ├── main.bicep                     # Main Bicep template
│   └── modules/core.bicep             # Core resources
├── .env.example                       # Environment template
├── requirements.txt                   # Python dependencies
├── LLMOps_Workshop_Playbook.html      # Interactive step-by-step guide
└── README.md                          # This file
```
```mermaid
graph LR
subgraph RG["Resource Group: rg-llmops-canadaeast"]
A["Microsoft Foundry<br/>foundry-llmops-canadaeast"]
B["Foundry Project<br/>proj-llmops-demo"]
C["Azure AI Search<br/>search-llmops-canadaeast"]
end
A -->|"Contains"| B
B -->|"RBAC"| C
```
| Resource | Name | Purpose |
|---|---|---|
| Microsoft Foundry | `foundry-llmops-canadaeast` | Unified AI platform with inference API |
| Foundry Project | `proj-llmops-demo` | Models: gpt-4o, text-embedding-3-large |
| Azure AI Search | `search-llmops-canadaeast` | Vector store for RAG |
The data/ folder contains 8 Wall-E Electronics documents:
| Format | Files | Description |
|---|---|---|
| `.txt` | 4 files | Product specifications (Laptop, Watch, Headphones, Tablet) |
| `.md` | 4 files | Policies & support (Returns, Warranty, Shipping, Troubleshooting) |
The create_search_index.py script automatically:
- Reads all files from the `data/` folder
- Extracts text from `.txt` and `.md` files
- Generates vector embeddings using the Microsoft Foundry inference API
- Uploads to Azure AI Search with semantic and vector search
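The file-reading step can be sketched as follows (an illustrative simplification: `load_documents` is a hypothetical helper, and the embedding and upload stages of create_search_index.py are omitted):

```python
from pathlib import Path

def load_documents(data_dir: str) -> dict[str, str]:
    """Read every .txt and .md file under data_dir into {filename: text}."""
    docs = {}
    for path in sorted(Path(data_dir).glob("*")):
        if path.suffix in (".txt", ".md"):  # skip anything else, e.g. .json configs
            docs[path.name] = path.read_text(encoding="utf-8")
    return docs

# e.g. load_documents("data/") -> {"laptop-pro-15.txt": "...", "return-policy.md": "...", ...}
```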
The evaluation script (02-evaluation/run_evaluation.py) tests RAG quality using 4 metrics with pass/fail promotion gates:
| Metric | Description | Default Gate |
|---|---|---|
| Groundedness | Is the response supported by the retrieved context? | ≥4.0 |
| Relevance | Does the response address the user's question? | ≥4.0 |
| Similarity | How close is the response to the expected answer? | ≥4.0 |
| Fluency | Is the response grammatically correct and natural? | ≥4.0 |
| Score | Rating | Action |
|---|---|---|
| 4.0-5.0 | ✓ Excellent | Production-ready |
| 3.0-4.0 | ~ Good | Minor improvements needed |
| 2.0-3.0 | ⚠ Needs Work | Improve prompts or retrieval |
| 1.0-2.0 | ✗ Poor | Major rework required |
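The gate check itself is simple to express. A sketch of the pass/fail logic under the default ≥4.0 thresholds (`check_promotion_gate` is a hypothetical name; the real implementation lives in run_evaluation.py):

```python
# Default promotion gates: every metric must score at least 4.0 out of 5.0.
DEFAULT_GATES = {"groundedness": 4.0, "relevance": 4.0, "similarity": 4.0, "fluency": 4.0}

def check_promotion_gate(scores: dict, gates: dict = DEFAULT_GATES) -> tuple[bool, list[str]]:
    """Return (passed, failed_metrics); a metric fails when its score is below its gate."""
    failed = [m for m, gate in gates.items() if scores.get(m, 0.0) < gate]
    return (not failed, failed)

# Scores from the sample run below the gate:
scores = {"groundedness": 2.6, "relevance": 3.2, "similarity": 3.1, "fluency": 3.0}
passed, failed = check_promotion_gate(scores)
print(passed, failed)  # False, with all four metrics listed as failed
```

In CI mode the script turns this boolean into an exit code so the pipeline can block the merge.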
```
============================================================
Evaluation Results
============================================================
Aggregate Metrics:
----------------------------------------
✗ Groundedness 2.60/5.0
~ Relevance    3.20/5.0
~ Similarity   3.10/5.0
~ Fluency      3.00/5.0
----------------------------------------
❌ PROMOTION GATE: FAILED
→ Do NOT promote — improve metrics first
→ Failed metrics: groundedness, relevance, similarity, fluency

📊 Recommendations:
- Consider improving groundedness: current score 2.60 (gate ≥4.0)
- Consider improving relevance: current score 3.20 (gate ≥4.0)
- Consider improving similarity: current score 3.10 (gate ≥4.0)
- Consider improving fluency: current score 3.00 (gate ≥4.0)
```

Note: Low groundedness scores in the demo are expected because the `context` field in `eval_dataset.jsonl` contains only document titles, not full text. In production, with actual RAG retrieval, scores improve significantly.
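For reference, a record in the eval_dataset.jsonl style might look like this (the field names follow common Azure AI Evaluation SDK conventions; check the actual file for the exact schema this workshop uses):

```python
import json

# One illustrative JSONL record; values are made up for demonstration.
record = {
    "query": "What's the return policy?",
    "context": "Return Policy",  # demo dataset stores titles only, hence low groundedness
    "response": "Wall-E offers 30-day returns on all products.",
    "ground_truth": "Products can be returned within 30 days of purchase.",
}
line = json.dumps(record)  # one record per line in a .jsonl file
assert json.loads(line)["query"] == "What's the return policy?"
```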
```powershell
# Evaluation with promotion gates (4 metrics)
python 02-evaluation/run_evaluation.py

# CI mode (exit code 1 if gate fails — for pipelines)
python 02-evaluation/run_evaluation.py --ci

# Custom thresholds
python 02-evaluation/run_evaluation.py --threshold 3.5

# Evaluate a specific model
python 02-evaluation/run_evaluation.py --model gpt-4o-mini --ci

# (Optional) Upload results to the Foundry portal
# Use --upload-to-portal or set EVAL_UPLOAD_TO_PORTAL=true
python 02-evaluation/run_evaluation.py --upload-to-portal
```

In production LLM applications, you frequently need to change models:
| Trigger | Example |
|---|---|
| Cost optimization | gpt-4o → gpt-4o-mini (~15x cheaper per token) while maintaining quality |
| Latency | Smaller models respond faster for real-time customer support |
| New model version | Azure ships gpt-4o 2024-11-20 with improved reasoning — validate before adopting |
| Model deprecation | Azure retires older model versions on published dates, forcing migration |
| Capacity/availability | Quota limits or regional availability require switching model families |
Risk: Swapping blindly can degrade answer quality — customers notice before your team does. This workflow prevents that by auto-evaluating both models on the same test set and rejecting if quality drops.
The model swap script (05-model-swap/model_swap_eval.py) safely compares two models before swapping:
- Evaluates the current model (baseline)
- Evaluates the candidate model
- Compares side-by-side with regression detection
- Produces a recommendation: swap or don't swap
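The comparison can be sketched as a pure function over the two score sets (`recommend_swap` and the 0.2 regression tolerance are illustrative assumptions, not the exact rule in model_swap_eval.py):

```python
DEFAULT_GATES = {"groundedness": 4.0, "relevance": 4.0, "similarity": 4.0, "fluency": 4.0}

def recommend_swap(current: dict, candidate: dict,
                   gates: dict = DEFAULT_GATES, tolerance: float = 0.2) -> bool:
    """Recommend the swap only if the candidate meets every gate AND does not
    regress more than `tolerance` below the current model on any metric."""
    for metric, gate in gates.items():
        if candidate.get(metric, 0.0) < gate:
            return False  # candidate fails an absolute quality gate
        if candidate.get(metric, 0.0) < current.get(metric, 0.0) - tolerance:
            return False  # candidate regresses too far versus the baseline
    return True

# Illustrative scores: the cheaper model is slightly worse but within tolerance.
current = {"groundedness": 4.6, "relevance": 4.5, "similarity": 4.4, "fluency": 4.8}
cheaper = {"groundedness": 4.5, "relevance": 4.4, "similarity": 4.3, "fluency": 4.7}
print(recommend_swap(current, cheaper))  # True
```

Checking both an absolute gate and a relative regression bound matters: a candidate can clear the 4.0 floor yet still be a noticeable step down from the baseline.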
```powershell
# Default: compare gpt-4o vs gpt-4o-mini
python 05-model-swap/model_swap_eval.py

# Custom models
python 05-model-swap/model_swap_eval.py --current gpt-4o --candidate gpt-4o-mini

# CI mode (exit code 1 if swap not recommended)
python 05-model-swap/model_swap_eval.py --ci
```

The pipeline (06-cicd/azure-pipelines.yml) implements promotion gates:
```
┌─────────────────┐    ┌──────────────────┐    ┌────────────────┐    ┌──────────┐
│ EvaluationGate  │───>│ ContentSafetyGate│───>│ ModelSwapGate  │───>│ Deploy   │
│ 4 metrics ≥ 4.0 │    │ Pass rate ≥ 90%  │    │ No regression  │    │ (main)   │
└─────────────────┘    └──────────────────┘    └────────────────┘    └──────────┘
  FAIL → Block PR        FAIL → Block PR         (optional)          After all pass
```
| Stage | Trigger | Gate Logic |
|---|---|---|
| EvaluationGate | Every PR | All 4 metrics must meet thresholds |
| ContentSafetyGate | Every PR | ≥90% content safety test pass rate |
| ModelSwapGate | `[model-swap]` in commit msg | Candidate meets thresholds + no regression |
| Deploy | Main branch only | After all gates pass |
- Create a Service Connection named `llmops-service-connection`
- Grant the service principal `Cognitive Services OpenAI User` on `foundry-llmops-canadaeast`
- Create a pipeline pointing to `06-cicd/azure-pipelines.yml`
```powershell
# Test gate logic locally
python 06-cicd/promotion_gate.py --check-eval --results-dir 02-evaluation/eval_results
python 06-cicd/promotion_gate.py --check-content-safety --results-dir 03-content-safety/test_results
```

The MLflow modules (07-mlflow/) show how MLflow's newer GenAI features complement Foundry:
- Auto-tracing: `mlflow.openai.autolog()` captures all LLM calls
- Custom spans: RAG pipeline traced as `[Retrieval] → [Generation]`
- Prompt versioning: System prompts logged as versioned artifacts
- App versioning: Full RAG config (model + prompt + retrieval params) logged
- mlflow.evaluate(): Built-in QA metrics (accuracy, ROUGE, etc.)
- Custom evaluators: Domain-specific checks (length, citations, hallucination phrases)
- Comparison: When to use Azure AI Evaluation vs MLflow evaluate()
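Such domain-specific checks are often just small predicate functions. An illustrative sketch (function names, markers, and the length threshold are hypothetical; wiring them into mlflow.evaluate() as custom metrics is omitted here):

```python
# Canned phrases that often signal an evasive or hallucination-prone answer.
HALLUCINATION_PHRASES = ("as an ai language model", "i don't have access")

def length_ok(response: str, max_chars: int = 1200) -> bool:
    """Non-empty and short enough for a chat UI."""
    return 0 < len(response) <= max_chars

def cites_source(response: str, sources: list[str]) -> bool:
    """Pass if the response mentions at least one retrieved document title."""
    return any(s.lower() in response.lower() for s in sources)

def no_hallucination_phrases(response: str) -> bool:
    return not any(p in response.lower() for p in HALLUCINATION_PHRASES)

answer = "Per the Return Policy, Wall-E offers 30-day returns."
print(length_ok(answer), cites_source(answer, ["Return Policy"]), no_hallucination_phrases(answer))
```

Because these are plain functions, the same checks can run inside an MLflow evaluation, a pytest suite, or a CI gate script.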
```powershell
# Run tracing demo
python 07-mlflow/mlflow_tracing_demo.py

# Run evaluation demo
python 07-mlflow/mlflow_eval_demo.py

# View results in MLflow UI
mlflow ui --port 5001
# Open http://localhost:5001
```

| Scenario | Tool |
|---|---|
| Development iteration (prompt tweaks, A/B testing) | MLflow |
| Production quality gates (CI/CD pipeline) | Azure AI Evaluation |
| Production monitoring | Foundry Tracing (App Insights) |
| Experiment tracking + comparison | MLflow |
| Compliance reporting | Azure AI Evaluation → Foundry Portal |
The content safety script (03-content-safety/test_content_safety.py) tests protection against harmful content and prompt injection using Microsoft Foundry.
| Category | Default Behavior | Description |
|---|---|---|
| Hate Speech | Filtered | Blocked automatically |
| Sexual Content | Filtered | Blocked automatically |
| Violence | Filtered | Blocked automatically |
| Self-Harm | Filtered | Blocked automatically |
| Jailbreak/Prompt Injection | Configurable | Enable via Guardrails + Controls |
| Category | Tests | Description |
|---|---|---|
| `baseline` | 2 | Normal product queries |
| `prompt_injection` | 3 | Jailbreak attempts (DAN, role-play) |
| `boundary` | 3 | Off-topic, competitor, PII requests |
```
============================================================
Content Safety Testing Complete!
============================================================
Total Tests: 8
Passed: 8
Failed: 0
Pass Rate: 100.0%
Filter Blocked: 0
Model Refused: 8 (handled via system prompt)
```

Note: Jailbreak attempts are handled by the system prompt, not the default content filters. The model correctly refuses malicious requests. For production, configure Guardrails + Controls in the Foundry portal.

```powershell
python 03-content-safety/test_content_safety.py
```

Generates an HTML report in `03-content-safety/test_results/`.
Delete all resources when done:

```powershell
az group delete --name rg-llmops-canadaeast --yes --no-wait
```

MIT License

LLMOps Workshop — Microsoft Foundry — March 2026