Skip to content

New idea: Multi-turn & Compositional Reasoning Evaluation #62

@sharmaanchita

Description

@sharmaanchita

Problem
The system only supports single-shot prompt-response evaluation, missing critical real-world reasoning capabilities.

Basis of issue

  1. Multi-turn conversation evaluation
  2. Compositional task design (chained reasoning)
  3. Logical consistency / trace quality metrics
  4. Stateful prompt handling

Importance

  1. Real-world AI usage is multi-turn
  2. Compositional reasoning is a core capability
  3. Modern benchmarks already evaluate this

Current implementation gap

  1. Single prompt → single response only

Implementation checklist

  1. Support for multi-step prompt sequences
  2. Scoring based on reasoning consistency across turns
  3. Optional evaluation of intermediate reasoning quality
  4. Backward compatibility with single-shot prompts

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions