
feat: Add Model Comparison Framework#5

Open
iltempo wants to merge 5 commits into main from 003-model-comparison

Conversation


iltempo commented Dec 6, 2025

Summary

  • Add model comparison framework for systematically comparing AI model outputs (Haiku, Sonnet, Opus)
  • Execute same prompt against multiple models and save results for later review
  • Support blind evaluation mode for unbiased assessment
  • Allow rating patterns on musical criteria (rhythmic, dynamics, genre, overall)

New Commands

Model Selection:

  • --model haiku|sonnet|opus - CLI flag for model selection at startup
  • models - List available AI models with [active] marker
  • model <id> - Switch AI model at runtime

Comparison Testing:

  • compare <prompt> - Run prompt against all models, save results
  • compare-list - List saved comparisons
  • compare-view <id> - View comparison details
  • compare-load <id> <model> - Load pattern from comparison
  • compare-delete <id> - Delete a comparison

Evaluation:

  • compare-rate <id> <model> <criteria> <score> - Rate a model's output (1-5)
  • blind <id> - Enter blind evaluation mode with randomized labels (A, B, C)

Implementation

  • New comparison/ module with models, comparison, rating, and blind session logic
  • Extended ai/ module with model switching (SetModel/GetModel)
  • JSON persistence in comparisons/ directory
  • Progress feedback during comparison execution
  • Partial failure handling (continues with available models)

Test Plan

  • Build passes (go build ./...)
  • Tests pass (go test ./...)
  • Manual testing with MIDI device connected
  • Verify comparison saves to comparisons/ directory
  • Verify blind mode randomizes labels correctly

🤖 Generated with Claude Code

iltempo and others added 5 commits December 6, 2025 16:18
- Feature specification with 5 user stories (run comparison, review,
  model selection, rating, blind evaluation mode)
- Implementation plan with technical context and constitution check
- Research decisions on model switching, storage format, blind sessions
- Data model with Comparison, ModelResult, Rating, BlindSession entities
- CLI command contracts for 10 new commands
- Quickstart guide for typical workflows
- Task list with 59 tasks organized by user story

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement foundational model switching and comparison test execution:

- Add comparison/ module with model registry (Haiku, Sonnet, Opus)
- Add --model flag for CLI model selection
- Add compare command to run prompts against all models
- Save comparison results to comparisons/ directory as JSON
- Include progress feedback and partial failure handling

Phase 2 (Foundational) and Phase 3 (User Story 1 - MVP) complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 - Review Saved Comparisons:
- compare-list: List all saved comparisons with summary
- compare-view <id>: View detailed comparison results
- compare-load <id> <model>: Load pattern from comparison
- compare-delete <id>: Delete saved comparison

Phase 5 - Select Model for Session:
- models: List available AI models with [active] marker
- model <id>: Switch AI model at runtime

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 6 - Blind Evaluation Mode:
- blind <id>: Enter blind evaluation with randomized labels
- Load patterns anonymously (A, B, C), rate 1-5, then reveal
- Ratings saved to comparison on reveal

Phase 7 - Rate Comparison Results:
- compare-rate <id> <model> <criteria> <score>: Rate patterns
- Supports rhythmic, dynamics, genre, overall, or all criteria
- Validates score range (1-5)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reorganize help into logical sections
- Add MODEL COMPARISON section with all new commands
- Add AI FEATURES section with models/model commands
- Simplify format for better readability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude bot commented Dec 6, 2025

Code Review: Model Comparison Framework

I've reviewed PR #5 and have detailed feedback on code quality, architecture, and potential improvements.


Strengths

1. Excellent Architecture & Design

  • Clean separation of concerns: The comparison/ module is well-organized with distinct responsibilities (models, comparisons, ratings, blind sessions)
  • Interface-based design: CommandGenerator interface properly decouples comparison logic from AI client implementation
  • Extensible model registry: Easy to add new models via AvailableModels slice
  • Blind evaluation innovation: Creative use of Fisher-Yates shuffle with crypto/rand for unbiased testing

2. Good Go Idioms

  • Proper use of constants for enums (ComparisonStatus, ResultStatus)
  • Clear struct definitions with appropriate JSON tags
  • Receiver methods on structs follow Go conventions
  • Uses filepath.Join for cross-platform path handling

3. User Experience

  • Progress callbacks during comparison execution
  • Helpful error messages with usage examples
  • Interactive blind mode with clear instructions
  • Comprehensive help text organization

⚠️ Issues & Recommendations

1. Missing Test Coverage ⚠️ (Critical)

Problem: No test files exist for the comparison/ package.

Impact:

  • New features are untested and may have hidden bugs
  • Refactoring is risky without test safety net
  • Edge cases (empty comparisons, invalid ratings) are unchecked

Recommendation: Add comprehensive tests:

```go
// comparison/comparison_test.go
func TestRunComparison(t *testing.T) { ... }
func TestSaveLoadComparison(t *testing.T) { ... }
func TestExecutePatternCommand(t *testing.T) { ... }

// comparison/blind_test.go
func TestNewBlindSession(t *testing.T) { ... }
func TestShuffleStrings(t *testing.T) { ... }

// comparison/rating_test.go
func TestRatingValidation(t *testing.T) { ... }
```

Priority: High - This should block merging into main


2. Command Parsing Vulnerability 🔒 (Security)

Location: comparison/comparison.go:206-305 (executePatternCommand)

Problem: Uses fmt.Sscanf without checking its error return. On malformed input the target variable silently keeps its zero value, which can then trigger out-of-range behavior downstream (e.g. SetNote called with step 0).

```go
// Current code - unsafe
var step int
fmt.Sscanf(parts[1], "%d", &step)  // No error check!
p.SetNote(step, midiNote)
```

Recommendation: Add proper error handling and bounds validation:

```go
var step int
n, err := fmt.Sscanf(parts[1], "%d", &step)
if err != nil || n != 1 {
    return // silently skip malformed commands
}
if step < 1 || step > p.Length() {
    return // bounds check
}
p.SetNote(step, midiNote)
```

Priority: Medium - Could cause crashes with AI-generated malformed commands


3. Duplicate Code in Command Handler 📋 (Maintainability)

Location: commands/commands.go and comparison/comparison.go

Problem: executePatternCommand duplicates command parsing logic that already exists in Handler.ProcessCommand. This violates DRY principle.

Impact:

  • Bug fixes must be applied in two places
  • Behavior can drift between comparison mode and normal mode
  • Harder to maintain long-term

Recommendation: Extract command execution to shared module:

```go
// sequence/commands.go (new file)
func ExecuteCommand(p *Pattern, cmdLine string) error {
    // Shared command execution logic
}
```

Then use in both Handler.ProcessCommand and executePatternCommand.

Priority: Medium - Technical debt that will worsen over time


4. Error Handling Issues ⚠️

a) Silent failures in executePatternCommand

Location: comparison/comparison.go:206-305

Commands silently fail when parsing errors occur. This makes debugging AI output difficult.

Recommendation: Return errors or log warnings:

```go
func executePatternCommand(p *sequence.Pattern, cmd string) error {
    // ... parsing logic ...
    if err != nil {
        return fmt.Errorf("invalid command %q: %w", cmd, err)
    }
}
```

b) Missing context in error messages

Location: Multiple handlers in commands/commands.go

```go
// Current
return fmt.Errorf("comparison '%s' not found", id)

// Better - add context
return fmt.Errorf("failed to load comparison '%s': file not found in %s/", id, comparison.ComparisonsDir)
```

Priority: Low-Medium


5. Resource Cleanup 💾 (Performance)

Location: commands/commands.go:1008-1015 (blindModeLoop)

Problem: If readline initialization fails, the function returns before the defer is registered, so rl is never closed. That is correct here (a nil rl has nothing to close), but the pattern is easy to break if the error check and the defer are ever reordered.

```go
rl, err := readline.New("blind> ")
if err != nil {
    return fmt.Errorf("failed to initialize readline: %w", err)
}
defer rl.Close()  // ✓ OK in this case, but pattern is fragile
```

Recommendation: This specific case is safe, but consider using early return pattern for consistency.

Priority: Low


6. API Design Considerations 🎯

a) Confusing model switching behavior

Location: ai/ai.go:323-331 (GenerateCommandsWithModel)

The method returns (commands, rawResponse, error) but the regular GenerateCommands discards rawResponse. This is inconsistent.

Recommendation:

  • Keep GenerateCommands as is for backward compatibility
  • Document that GenerateCommandsWithModel is for comparison testing
  • Consider renaming to GenerateCommandsForComparison for clarity

b) Global state in command handler

Location: commands/commands.go:27-30

currentModel is stored in handler but could drift from aiClient.GetModel().

Recommendation: Remove currentModel field and call aiClient.GetModel() when needed. Single source of truth.

Priority: Low-Medium


7. Missing Input Validation ⚠️

Location: Various command handlers

Examples:

  • handleCompare: No max length check on prompt (could be extremely long)
  • handleBlind: No validation that comparison has successful results before creating session
  • handleCompareRate: Model ID validation happens late in the flow

Recommendation: Add validation guards at the start of each handler.

Priority: Medium


🔍 Minor Issues

Code Style

  1. Inconsistent error messages: Some use quotes, some don't

    • fmt.Errorf("comparison '%s' not found", id) vs fmt.Errorf("unknown model: %s", modelID)
  2. Magic numbers: truncateString(result.RawResponse, 200) - define as constant

  3. Undefined helper: comparison/comparison.go:198 - min() is used but not defined in the package; this is a build error unless the module targets Go 1.21+, where min is a builtin

Documentation

  1. Missing package-level documentation for comparison/ package
  2. Complex functions lack comments (e.g., executePatternCommand, blindModeLoop)
  3. No examples in command help for new features

📊 Performance Considerations

1. Sequential model execution

Location: comparison/comparison.go:127-142

Models run sequentially, not in parallel. For 3 models, this is 3x slower.

Recommendation: Consider parallel execution with goroutines:

```go
var wg sync.WaitGroup
results := make([]ModelResult, len(AvailableModels))

for i, modelConfig := range AvailableModels {
    wg.Add(1)
    go func(idx int, cfg ModelConfig) {
        defer wg.Done()
        results[idx] = executePromptForModel(ctx, prompt, generator, cfg)
    }(i, modelConfig)
}
wg.Wait()
```

Trade-off: Increases API cost (parallel requests) but improves UX.

Priority: Low - Nice to have for future optimization


🎯 Recommendations Summary

Must Fix (Before Merge)

  1. ✅ Add test coverage for comparison/ package
  2. ✅ Fix command parsing error handling in executePatternCommand

Should Fix (Near Term)

  1. Deduplicate command execution logic
  2. Improve error messages with context
  3. Validate inputs more rigorously

Nice to Have (Future)

  1. Parallel model execution
  2. Better API consistency
  3. Enhanced documentation

🎉 Overall Assessment

This is a well-designed feature with clean architecture and thoughtful UX. The blind evaluation mode is particularly innovative. However, test coverage is critical before merging to main.

Recommendation: Request changes for test coverage, then approve once tests are added.

Estimated effort to address critical issues: 2-4 hours for comprehensive tests


Great work on this feature! The model comparison framework will be very useful for evaluating AI performance. Let me know if you'd like help writing the tests.


Review generated by Claude Code

