
feat: Add Model Comparison Framework#5

Open
iltempo wants to merge 5 commits into main from 003-model-comparison

Conversation


iltempo commented Dec 6, 2025

Summary

  • Add model comparison framework for systematically comparing AI model outputs (Haiku, Sonnet, Opus)
  • Execute same prompt against multiple models and save results for later review
  • Support blind evaluation mode for unbiased assessment
  • Allow rating patterns on musical criteria (rhythmic, dynamics, genre, overall)

New Commands

Model Selection:

  • --model haiku|sonnet|opus - CLI flag for model selection at startup
  • models - List available AI models with [active] marker
  • model <id> - Switch AI model at runtime

Comparison Testing:

  • compare <prompt> - Run prompt against all models, save results
  • compare-list - List saved comparisons
  • compare-view <id> - View comparison details
  • compare-load <id> <model> - Load pattern from comparison
  • compare-delete <id> - Delete a comparison

Evaluation:

  • compare-rate <id> <model> <criteria> <score> - Rate a model's output (1-5)
  • blind <id> - Enter blind evaluation mode with randomized labels (A, B, C)

Implementation

  • New comparison/ module with models, comparison, rating, and blind session logic
  • Extended ai/ module with model switching (SetModel/GetModel)
  • JSON persistence in comparisons/ directory
  • Progress feedback during comparison execution
  • Partial failure handling (continues with available models)

Test Plan

  • Build passes (go build ./...)
  • Tests pass (go test ./...)
  • Manual testing with MIDI device connected
  • Verify comparison saves to comparisons/ directory
  • Verify blind mode randomizes labels correctly

🤖 Generated with Claude Code

iltempo and others added 5 commits December 6, 2025 16:18
- Feature specification with 5 user stories (run comparison, review,
  model selection, rating, blind evaluation mode)
- Implementation plan with technical context and constitution check
- Research decisions on model switching, storage format, blind sessions
- Data model with Comparison, ModelResult, Rating, BlindSession entities
- CLI command contracts for 10 new commands
- Quickstart guide for typical workflows
- Task list with 59 tasks organized by user story

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement foundational model switching and comparison test execution:

- Add comparison/ module with model registry (Haiku, Sonnet, Opus)
- Add --model flag for CLI model selection
- Add compare command to run prompts against all models
- Save comparison results to comparisons/ directory as JSON
- Include progress feedback and partial failure handling

Phase 2 (Foundational) and Phase 3 (User Story 1 - MVP) complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 - Review Saved Comparisons:
- compare-list: List all saved comparisons with summary
- compare-view <id>: View detailed comparison results
- compare-load <id> <model>: Load pattern from comparison
- compare-delete <id>: Delete saved comparison

Phase 5 - Select Model for Session:
- models: List available AI models with [active] marker
- model <id>: Switch AI model at runtime

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 6 - Blind Evaluation Mode:
- blind <id>: Enter blind evaluation with randomized labels
- Load patterns anonymously (A, B, C), rate 1-5, then reveal
- Ratings saved to comparison on reveal

Phase 7 - Rate Comparison Results:
- compare-rate <id> <model> <criteria> <score>: Rate patterns
- Supports rhythmic, dynamics, genre, overall, or all criteria
- Validates score range (1-5)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reorganize help into logical sections
- Add MODEL COMPARISON section with all new commands
- Add AI FEATURES section with models/model commands
- Simplify format for better readability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude bot commented Dec 6, 2025

Code Review: Model Comparison Framework

I've reviewed PR #5 and have detailed feedback on code quality, architecture, and potential improvements.


Strengths

1. Excellent Architecture & Design

  • Clean separation of concerns: The comparison/ module is well-organized with distinct responsibilities (models, comparisons, ratings, blind sessions)
  • Interface-based design: CommandGenerator interface properly decouples comparison logic from AI client implementation
  • Extensible model registry: Easy to add new models via AvailableModels slice
  • Blind evaluation innovation: Creative use of Fisher-Yates shuffle with crypto/rand for unbiased testing

2. Good Go Idioms

  • Proper use of constants for enums (ComparisonStatus, ResultStatus)
  • Clear struct definitions with appropriate JSON tags
  • Receiver methods on structs follow Go conventions
  • Uses filepath.Join for cross-platform path handling

3. User Experience

  • Progress callbacks during comparison execution
  • Helpful error messages with usage examples
  • Interactive blind mode with clear instructions
  • Comprehensive help text organization

⚠️ Issues & Recommendations

1. Missing Test Coverage ⚠️ (Critical)

Problem: No test files exist for the comparison/ package.

Impact:

  • New features are untested and may have hidden bugs
  • Refactoring is risky without test safety net
  • Edge cases (empty comparisons, invalid ratings) are unchecked

Recommendation: Add comprehensive tests:

```go
// comparison/comparison_test.go
func TestRunComparison(t *testing.T) { ... }
func TestSaveLoadComparison(t *testing.T) { ... }
func TestExecutePatternCommand(t *testing.T) { ... }

// comparison/blind_test.go
func TestNewBlindSession(t *testing.T) { ... }
func TestShuffleStrings(t *testing.T) { ... }

// comparison/rating_test.go
func TestRatingValidation(t *testing.T) { ... }
```

Priority: High - This should block merging into main


2. Command Parsing Vulnerability 🔒 (Security)

Location: comparison/comparison.go:206-305 (executePatternCommand)

Problem: Uses fmt.Sscanf without checking its error return. On malformed input the target variable silently keeps its zero value, which can then trigger out-of-range behavior downstream (e.g. SetNote called with step 0).

```go
// Current code - unsafe
var step int
fmt.Sscanf(parts[1], "%d", &step)  // No error check!
p.SetNote(step, midiNote)
```

Recommendation: Add proper error handling and bounds validation:

```go
var step int
n, err := fmt.Sscanf(parts[1], "%d", &step)
if err != nil || n != 1 {
    return // silently skip malformed commands
}
if step < 1 || step > p.Length() {
    return // bounds check
}
p.SetNote(step, midiNote)
```

Priority: Medium - Could cause crashes with AI-generated malformed commands


3. Duplicate Code in Command Handler 📋 (Maintainability)

Location: commands/commands.go and comparison/comparison.go

Problem: executePatternCommand duplicates command parsing logic that already exists in Handler.ProcessCommand. This violates DRY principle.

Impact:

  • Bug fixes must be applied in two places
  • Behavior can drift between comparison mode and normal mode
  • Harder to maintain long-term

Recommendation: Extract command execution to shared module:

```go
// sequence/commands.go (new file)
func ExecuteCommand(p *Pattern, cmdLine string) error {
    // Shared command execution logic
}
```

Then use in both Handler.ProcessCommand and executePatternCommand.

Priority: Medium - Technical debt that will worsen over time


4. Error Handling Issues ⚠️

a) Silent failures in executePatternCommand

Location: comparison/comparison.go:206-305

Commands silently fail when parsing errors occur. This makes debugging AI output difficult.

Recommendation: Return errors or log warnings:

```go
func executePatternCommand(p *sequence.Pattern, cmd string) error {
    // ... parsing logic ...
    if err != nil {
        return fmt.Errorf("invalid command %q: %w", cmd, err)
    }
}
```

b) Missing context in error messages

Location: Multiple handlers in commands/commands.go

```go
// Current
return fmt.Errorf("comparison '%s' not found", id)

// Better - add context
return fmt.Errorf("failed to load comparison '%s': file not found in %s/", id, comparison.ComparisonsDir)
```

Priority: Low-Medium


5. Resource Cleanup 💾 (Performance)

Location: commands/commands.go:1008-1015 (blindModeLoop)

Problem: If readline initialization fails, the function returns before the defer is registered, so rl is never closed. That is correct here (a nil rl has nothing to close), but the pattern is easy to break if the error check and the defer are ever reordered.

```go
rl, err := readline.New("blind> ")
if err != nil {
    return fmt.Errorf("failed to initialize readline: %w", err)
}
defer rl.Close()  // ✓ OK in this case, but pattern is fragile
```

Recommendation: This specific case is safe, but consider using early return pattern for consistency.

Priority: Low


6. API Design Considerations 🎯

a) Confusing model switching behavior

Location: ai/ai.go:323-331 (GenerateCommandsWithModel)

The method returns (commands, rawResponse, error) but the regular GenerateCommands discards rawResponse. This is inconsistent.

Recommendation:

  • Keep GenerateCommands as is for backward compatibility
  • Document that GenerateCommandsWithModel is for comparison testing
  • Consider renaming to GenerateCommandsForComparison for clarity

b) Global state in command handler

Location: commands/commands.go:27-30

currentModel is stored in handler but could drift from aiClient.GetModel().

Recommendation: Remove currentModel field and call aiClient.GetModel() when needed. Single source of truth.

Priority: Low-Medium


7. Missing Input Validation ⚠️

Location: Various command handlers

Examples:

  • handleCompare: No max length check on prompt (could be extremely long)
  • handleBlind: No validation that comparison has successful results before creating session
  • handleCompareRate: Model ID validation happens late in the flow

Recommendation: Add validation guards at the start of each handler.

Priority: Medium


🔍 Minor Issues

Code Style

  1. Inconsistent error messages: Some use quotes, some don't

    • fmt.Errorf("comparison '%s' not found", id) vs fmt.Errorf("unknown model: %s", modelID)
  2. Magic numbers: truncateString(result.RawResponse, 200) - define as constant

  3. Undefined helper: comparison/comparison.go:198 - min() is used but not defined in the package; this is a build error unless the module targets Go 1.21+, where min is a builtin

Documentation

  1. Missing package-level documentation for comparison/ package
  2. Complex functions lack comments (e.g., executePatternCommand, blindModeLoop)
  3. No examples in command help for new features

📊 Performance Considerations

1. Sequential model execution

Location: comparison/comparison.go:127-142

Models run sequentially, not in parallel. For 3 models, this is 3x slower.

Recommendation: Consider parallel execution with goroutines:

```go
var wg sync.WaitGroup
results := make([]ModelResult, len(AvailableModels))

for i, modelConfig := range AvailableModels {
    wg.Add(1)
    go func(idx int, cfg ModelConfig) {
        defer wg.Done()
        results[idx] = executePromptForModel(ctx, prompt, generator, cfg)
    }(i, modelConfig)
}
wg.Wait()
```

Trade-off: Increases API cost (parallel requests) but improves UX.

Priority: Low - Nice to have for future optimization


🎯 Recommendations Summary

Must Fix (Before Merge)

  1. ✅ Add test coverage for comparison/ package
  2. ✅ Fix command parsing error handling in executePatternCommand

Should Fix (Near Term)

  1. Deduplicate command execution logic
  2. Improve error messages with context
  3. Validate inputs more rigorously

Nice to Have (Future)

  1. Parallel model execution
  2. Better API consistency
  3. Enhanced documentation

🎉 Overall Assessment

This is a well-designed feature with clean architecture and thoughtful UX. The blind evaluation mode is particularly innovative. However, test coverage is critical before merging to main.

Recommendation: Request changes for test coverage, then approve once tests are added.

Estimated effort to address critical issues: 2-4 hours for comprehensive tests


Great work on this feature! The model comparison framework will be very useful for evaluating AI performance. Let me know if you'd like help writing the tests.


Review generated by Claude Code

