A hands-on framework for detecting and visualizing behavioral drift in Large Language Models (LLMs) across versions and providers.
Model updates often happen silently. Prompt behavior subtly shifts. Outputs change tone, verbosity, or factuality, without a version bump or changelog.
If you're building real-world systems on top of LLMs, this is not an edge case. It's your prod environment.
This repo helps you:
- Track instruction-following degradation
- Compare hallucination control across models
- Evaluate tone and verbosity drift
- Visualize changes over time or across vendors
Prompt suites organized by category (see the loading sketch below):
- `instruction-following`
- `tone-style-consistency`
- `hallucination-control`
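How the prompts are stored is up to you. A minimal loading sketch, assuming one plain-text prompt per file inside each category folder (the `.txt`-per-file convention and the `load_prompts` helper are assumptions for illustration, not a repo requirement):

```python
from pathlib import Path

# Hypothetical layout: one plain-text prompt per file under each category folder.
SUITE_ROOT = Path("drift-tests")

def load_prompts(category: str) -> dict[str, str]:
    """Return {prompt_id: prompt_text} for one category, e.g. 'instruction-following'."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted((SUITE_ROOT / category).glob("*.txt"))
    }

prompts = load_prompts("hallucination-control")
```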
Define your metrics: factuality, clarity, verbosity, alignment, etc.
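Some of these metrics (factuality, alignment) usually need human or model-graded evaluation; others can start as cheap heuristics. A sketch of two such heuristics, where `verbosity` and `hedging_rate` are illustrative names rather than functions shipped with the repo:

```python
import re

def verbosity(completion: str) -> int:
    """Crude verbosity signal: word count of the completion."""
    return len(completion.split())

def hedging_rate(completion: str) -> float:
    """Fraction of sentences containing hedge phrases; a rough tone/alignment proxy."""
    hedges = re.compile(r"\b(might|perhaps|possibly|i think|as an ai)\b", re.IGNORECASE)
    sentences = [s for s in re.split(r"[.!?]+", completion) if s.strip()]
    if not sentences:
        return 0.0
    return sum(bool(hedges.search(s)) for s in sentences) / len(sentences)
```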
Raw completions (stored in `outputs/`) from models such as GPT-4, GPT-4.1, Claude, and Mistral
Includes `drift_analysis.ipynb` for analyzing and visualizing drift across versions
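A sketch of the kind of comparison the notebook can make, assuming each completion is saved as a small JSON file under `outputs/<model>/` with a `"completion"` field (that layout is an assumption, matching the workflow sketch later in this README):

```python
import json
from pathlib import Path
from statistics import mean

OUTPUTS = Path("outputs")

def mean_verbosity(model: str) -> float:
    """Average completion length (in words) across all saved outputs for one model."""
    lengths = [
        len(json.loads(f.read_text(encoding="utf-8"))["completion"].split())
        for f in (OUTPUTS / model).glob("*.json")
    ]
    return mean(lengths) if lengths else 0.0

baseline, candidate = "gpt-4", "gpt-4.1"
delta = mean_verbosity(candidate) - mean_verbosity(baseline)
print(f"Verbosity drift ({baseline} -> {candidate}): {delta:+.1f} words on average")
```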
Use cases:
- Choosing between models for product integration
- Verifying that model upgrades don't silently break UX
- Building trust in AI-powered systems through stability
- Equipping PMs and engineers with repeatable drift detection
Roadmap:
- Session-based behavioral fingerprinting
- Streaming output drift (for GPT-4o)
- Regression alerts via GitHub Actions
Workflow:
- Drop your prompts into `drift-tests/`
- Run completions via API or platform (a sketch follows this list)
- Save raw outputs in `outputs/`
- Analyze drift in `notebooks/`
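A sketch of steps 2 and 3, assuming the official OpenAI Python SDK (`pip install openai`) with an `OPENAI_API_KEY` in the environment; any other provider's client slots in the same way. The per-file `.txt` prompts and the JSON output layout match the hypothetical conventions used in the sketches above:

```python
import json
from pathlib import Path
from openai import OpenAI  # official OpenAI SDK; swap in any other provider's client

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4.1"
OUT_DIR = Path("outputs") / MODEL
OUT_DIR.mkdir(parents=True, exist_ok=True)

for prompt_path in sorted(Path("drift-tests/instruction-following").glob("*.txt")):
    prompt = prompt_path.read_text(encoding="utf-8")
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # pin sampling so run-to-run noise doesn't masquerade as drift
    )
    record = {
        "model": MODEL,
        "prompt_id": prompt_path.stem,
        "prompt": prompt,
        "completion": response.choices[0].message.content,
    }
    (OUT_DIR / f"{prompt_path.stem}.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8"
    )
```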
Evaluation isn't just about a "score".
It's about knowing when your model has changed, before your users tell you.