-
Notifications
You must be signed in to change notification settings - Fork 7
Open
Description
Problem
The system only supports single-shot prompt-response evaluation, missing critical real-world reasoning capabilities.
Basis of issue
- Multi-turn conversation evaluation
- Compositional task design (chained reasoning)
- Logical consistency / trace quality metrics
- Stateful prompt handling
Importance
- Real-world AI usage is multi-turn
- Compositional reasoning is a core capability
- Modern benchmarks already evaluate this
Current implementation gap
- Single prompt → single response only
Implementation checklist
- Support for multi-step prompt sequences
- Scoring based on reasoning consistency across turns
- Optional evaluation of intermediate reasoning quality
- Backward compatibility with single-shot prompts
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels