A simple, text-only playground for evaluating reasoning model outputs.
Local, lightweight, and perfect for PMs exploring AI reliability.
python eval_metrics_lab.py

Evaluating 3 mock responses...
Accuracy: 0.87
Hallucination: 0.11
Trust Score: 76.0
✅ Model reliability acceptable.
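The script behind the output above can be sketched as follows. This is a hypothetical, minimal implementation: the mock response data, the mean-based aggregation, and the trust-score formula (accuracy minus hallucination rate, scaled to 0-100) are illustrative assumptions chosen to be consistent with the sample output, not the actual contents of `eval_metrics_lab.py`.

```python
# Hypothetical sketch of eval_metrics_lab.py.
# The per-response scores and the trust-score heuristic are assumptions.

MOCK_RESPONSES = [
    {"accuracy": 0.90, "hallucination": 0.10},
    {"accuracy": 0.86, "hallucination": 0.12},
    {"accuracy": 0.85, "hallucination": 0.11},
]

def mean(values):
    return sum(values) / len(values)

def trust_score(accuracy, hallucination):
    # Simple heuristic: penalize accuracy by the hallucination rate,
    # then scale to a 0-100 range.
    return round((accuracy - hallucination) * 100, 1)

def main():
    print(f"Evaluating {len(MOCK_RESPONSES)} mock responses...")
    acc = mean([r["accuracy"] for r in MOCK_RESPONSES])
    hal = mean([r["hallucination"] for r in MOCK_RESPONSES])
    score = trust_score(acc, hal)
    print(f"Accuracy: {acc:.2f}")
    print(f"Hallucination: {hal:.2f}")
    print(f"Trust Score: {score}")
    # Assumed acceptance threshold of 70.
    if score >= 70:
        print("✅ Model reliability acceptable.")
    else:
        print("⚠️ Model reliability below threshold.")

if __name__ == "__main__":
    main()
```

Running it reproduces the sample output shown above; swap in your own scored responses to experiment with different thresholds.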