An interactive Streamlit app demonstrating how iterative evaluation drives better AI responses through systematic prompt improvement.
Instead of guessing why AI responses fail, this app shows how to:
- Run evaluations with test questions
- Identify failures through automated checks
- Improve system prompt with targeted instructions
- Re-evaluate to confirm improvement
- Repeat until all evals pass
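The loop itself is simple enough to sketch in code. Below is a minimal, hypothetical Python version of that cycle; `run_eval` and the interactive prompt-editing step are placeholders, not the app's actual API.

```python
from typing import Callable

# Hypothetical sketch of the eval-driven improvement loop (not the app's real code).
def improvement_loop(
    system_prompt: str,
    eval_cases: list[dict],
    run_eval: Callable[[str, dict], dict],  # returns e.g. {"passed": bool, "reason": str}
) -> str:
    """Iterate on the system prompt until every eval case passes."""
    while True:
        results = [run_eval(system_prompt, case) for case in eval_cases]  # run evaluations
        failures = [r for r in results if not r["passed"]]                # identify failures
        if not failures:                                                  # all evals pass
            return system_prompt
        for failure in failures:
            print(f"FAILED: {failure.get('reason')}")
        # improve the system prompt with a targeted instruction, then re-evaluate
        system_prompt += "\n" + input("Add an instruction to address the failure(s): ")
```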
Install dependencies with `pip install -r requirements.txt`, then launch the app with `streamlit run app.py`. You'll need an Anthropic API key from console.anthropic.com.
Meet Sarah, the demo's customer persona:
- A 32-year-old professional
- Needs professional clothing for upcoming work presentations
- Needs it in 2 weeks
- Usually between sizes (struggles with fit)
- Anxious about online shopping
- Hates returns - wants to get it right the first time
- Budget: $150
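In the app this profile lives inside the prompt context; purely for illustration, it could be represented as data like so (field names are assumptions, not the app's schema):

```python
# Illustrative only: Sarah's profile as a plain dict (field names are assumptions).
SARAH = {
    "name": "Sarah",
    "age": 32,
    "occupation": "professional",
    "need": "professional clothing for upcoming work presentations",
    "timeline_weeks": 2,
    "sizing": "usually between sizes 8-10, struggles with fit",
    "concerns": ["anxious about online shopping", "hates returns, wants it right the first time"],
    "budget_usd": 150,
}
```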
Sarah asks three test questions:
- Purchase Decision: "I need professional clothing for work presentations. Should I order clothing ID 1094?"
- Quality Assessment: "Does clothing ID 829 have quality issues?"
- Sizing Guidance: "I'm between sizes (usually 8-10). Which size should I order for clothing ID 1094?"
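Treated as an eval set, those three questions might look like this as data (a sketch; field names are illustrative):

```python
# Illustrative eval cases mirroring the three questions above (field names are assumptions).
EVAL_CASES = [
    {
        "name": "Purchase Decision",
        "clothing_id": 1094,
        "question": "I need professional clothing for work presentations. Should I order clothing ID 1094?",
    },
    {
        "name": "Quality Assessment",
        "clothing_id": 829,
        "question": "Does clothing ID 829 have quality issues?",
    },
    {
        "name": "Sizing Guidance",
        "clothing_id": 1094,
        "question": "I'm between sizes (usually 8-10). Which size should I order for clothing ID 1094?",
    },
]
```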
Question 1 → Fails
- AI gives generic advice
- Doesn't include buy link
- Doesn't personalize to Sarah
- Add instruction: "Include buy links and mention Sarah by name with her specific concerns"
- Re-run → Passes! ✅
Question 2 → Fails
- Missing personalization to Sarah's professional needs
- Add instruction: "Connect recommendations to Sarah's work presentation context"
- Re-run → Passes! ✅
...and so on for all 3 questions!
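In code terms, each fix is just an instruction appended to the system prompt. A hedged sketch of how the prompt might evolve across iterations (the base prompt and exact wording are illustrative, not the app's defaults):

```python
# Illustrative evolution of the system prompt across iterations (not the app's exact text).
system_prompt = (
    "You are a shopping assistant for santra.com. Answer customer questions about "
    "clothing items using the review data provided."
)

# After Question 1 fails (generic advice, no buy link, no personalization):
system_prompt += (
    "\nAlways include a buy link in the format https://santra.com/clothing/{id}, and "
    "mention Sarah by name along with her specific concerns."
)

# After Question 2 fails (missing her professional context):
system_prompt += "\nConnect every recommendation to Sarah's upcoming work presentations."
```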
- Start with Question 1
  - Review Sarah's question
  - Enter your API key in the sidebar
  - Click a question button to populate the chat
  - Click "Send"
- Check Results
  - See the AI's response
  - Automatic evaluation shows pass/fail
  - Both assertions must pass: Commercial Behavior (buy link) + Personalization (mentions Sarah + her concerns)
- Improve When a Check Fails
  - Review the failure reason and tip
  - Update your system prompt
  - Click the question again and "Send"
  - See the improvement!
- Progress Through All 3 Questions
  - Each failure teaches a lesson
  - The system prompt gets better incrementally
  - The final prompt is production-ready!
For each response, we check 2 assertions:
- Commercial Behavior: includes a buy link in the format `https://santra.com/clothing/{id}`
- Personalization: mentions Sarah by name and references her specific concerns (sizing struggles, return aversion, anxiety, presentation needs, or budget)
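As a rough sketch, such checks could be implemented like this (the app's actual logic lives in `claude_api.py` and may differ):

```python
import re

# Hypothetical assertion checks (the real logic in claude_api.py may differ).
BUY_LINK_RE = re.compile(r"https://santra\.com/clothing/\d+")
CONCERN_KEYWORDS = ("size", "sizing", "fit", "return", "anxious", "presentation", "budget")

def check_commercial_behavior(response: str) -> bool:
    """Pass if the response contains a buy link in the expected format."""
    return bool(BUY_LINK_RE.search(response))

def check_personalization(response: str) -> bool:
    """Pass if the response names Sarah and touches at least one of her concerns."""
    text = response.lower()
    return "sarah" in text and any(keyword in text for keyword in CONCERN_KEYWORDS)

def evaluate(response: str) -> dict:
    """Both assertions must pass for the case to pass."""
    commercial = check_commercial_behavior(response)
    personalization = check_personalization(response)
    return {
        "passed": commercial and personalization,
        "commercial_behavior": commercial,
        "personalization": personalization,
    }
```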
evals-demo/
├── app.py # Main Streamlit app
├── claude_api.py # API integration & evaluation logic
├── requirements.txt # Dependencies
├── data/
│ └── evals_demo.db # SQLite database with reviews
└── README.md # This file
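For orientation, `claude_api.py` handles the Anthropic API call. A minimal sketch of what that call could look like with the official `anthropic` Python SDK (the model name and parameters are assumptions, not necessarily what the app uses):

```python
import anthropic

def ask_claude(api_key: str, system_prompt: str, question: str) -> str:
    """Send one question to Claude under the current system prompt (illustrative sketch)."""
    client = anthropic.Anthropic(api_key=api_key)
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumption: any current Claude model works
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text
```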
- Evals aren't just pass/fail - they teach you what to improve
- System prompts are critical - small changes → big impact
- Iterate systematically - fix one failure mode at a time
- Use real scenarios - Sarah's questions expose real gaps
- Measure improvement - quantify before/after
Hosted at: evals-cases-ms.streamlit.app
Want to add more assertions, new eval cases, or fix bugs?
PMs are welcome to vibe code and raise PRs at: github.com/Mitalee/evals-cases
MIT License
Built with ☕ + 🤖 by mmulpuru