
[WIP] eval: adding full prompt eval harness#16

Draft
ishandhanani wants to merge 12 commits into main from eval

Conversation


ishandhanani (Collaborator) commented Nov 9, 2024

Add Prompt Evaluation Pipeline for Agent Service

Overview

This PR introduces a comprehensive prompt evaluation pipeline for the Agent Service, enabling systematic testing and validation of our LLM prompts. The system is designed to ensure prompt reliability and consistency across different models and use cases. We use promptfoo as our eval harness. Note that promptfoo does not support a single config file in which prompts can be tied to specific use cases, which is why we devised this structure.

Key Components

1. Structured Testing Framework

  • Implemented a stage-based testing pipeline that follows our podcast generation workflow (see README.md lines 23-33)
  • Each stage corresponds to a specific transformation in our pipeline:
    • Raw outline generation
    • Structured outline conversion
    • Segment transcript generation
    • Dialogue creation and optimization

2. Provider Architecture

Created three distinct NIM providers to handle different model capabilities:

  • nim-8b.py: Optimized for JSON schema validation and structured outputs
  • nim-405b.py: Handles complex reasoning and content generation
  • nim-70b.py: Specialized for dialogue generation and natural language tasks

Separate providers were required because NIMs do not fully implement the OpenAI API spec that promptfoo expects.
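The provider sources aren't shown in this description, but a custom promptfoo Python provider generally exposes a `call_api(prompt, options, context)` function that returns a dict with an `output` key. The sketch below illustrates the shape; the endpoint URL, model name, and helper functions are assumptions, not values from this PR.

```python
# Illustrative sketch of a promptfoo custom Python provider for a NIM
# endpoint. The URL and model id below are assumptions, not this PR's
# actual values.
import json
import urllib.request

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local NIM
MODEL = "meta/llama-3.1-8b-instruct"                   # assumed model id


def build_payload(prompt: str, temperature: float = 0.0) -> dict:
    """Build an OpenAI-style chat-completion payload for the NIM."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def extract_output(body: dict) -> dict:
    """Normalize a chat-completion response to promptfoo's expected shape."""
    return {"output": body["choices"][0]["message"]["content"]}


def call_api(prompt: str, options: dict, context: dict) -> dict:
    """Entry point promptfoo invokes once per test case."""
    cfg = (options or {}).get("config", {})
    payload = build_payload(prompt, cfg.get("temperature", 0.0))
    req = urllib.request.Request(
        NIM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_output(json.load(resp))
```

Splitting payload construction and response normalization out of `call_api` keeps the network-free parts unit-testable.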

3. Configuration Management

Implemented a YAML-based configuration system for test stages. Each stage has its own configuration file, and a helper function retrieves and processes outputs from previous stages to emulate prompt chaining.

  • Configs specify:
    • Input/output relationships
    • Model selection
    • Validation criteria
    • Schema enforcement
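A hypothetical stage config along these lines is sketched below; the file names and variable names are illustrative assumptions, not the PR's actual files.

```yaml
# Hypothetical stage config (paths and var names are illustrative).
description: "Stage 2: structured outline conversion"
prompts:
  - file://prompts/structured_outline.txt
providers:
  - file://providers/nim-8b.py
tests:
  - vars:
      raw_outline: file://outputs/stage1/raw_outline.txt
    assert:
      - type: is-json
```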

4. Test Runner Infrastructure

Created a flexible test runner that supports:

  • Sequential stage execution
  • Partial pipeline testing (--up-to flag)
  • Detailed logging and error reporting
  • Output persistence for analysis
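The `--up-to` behavior can be sketched as a small stage selector; the stage names and the `promptfoo eval` invocation in the comment are assumptions about this PR's runner, not its actual code.

```python
# Minimal sketch of the runner's --up-to stage selection (stage names
# and config paths are assumptions about this PR, not its actual code).
import argparse

STAGES = [
    "raw_outline",
    "structured_outline",
    "segment_transcripts",
    "dialogue",
]


def select_stages(up_to=None):
    """Return all stages, or only the first `up_to` of them."""
    return STAGES[:up_to] if up_to is not None else list(STAGES)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run prompt eval stages")
    parser.add_argument("--up-to", type=int, default=None,
                        help="run only the first N stages")
    args = parser.parse_args(argv)
    stages = select_stages(args.up_to)
    for stage in stages:
        # The real runner would shell out to something like
        # `promptfoo eval -c configs/<stage>.yaml` and persist outputs here.
        print(f"running stage: {stage}")
    return stages
```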

5. Schema Generation

Automated schema generation from our Pydantic models:

  • Ensures type safety across the pipeline
  • Maintains consistency between runtime and test environments
  • Reduces manual schema maintenance
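The general pattern looks like the sketch below, which uses Pydantic v2's `model_json_schema()`; the model classes and output directory are hypothetical stand-ins for the service's actual models.

```python
# Sketch of schema generation from Pydantic models. The model classes
# here are hypothetical; the PR derives schemas from the service's
# actual models. Uses Pydantic v2's model_json_schema().
import json
from pathlib import Path

from pydantic import BaseModel


class Segment(BaseModel):
    title: str
    talking_points: list[str]


class PodcastOutline(BaseModel):
    title: str
    segments: list[Segment]


def write_schema(model: type[BaseModel], out_dir: str = "schemas") -> dict:
    """Dump a model's JSON schema so eval asserts can enforce it."""
    schema = model.model_json_schema()
    Path(out_dir).mkdir(exist_ok=True)
    out_path = Path(out_dir, f"{model.__name__}.json")
    out_path.write_text(json.dumps(schema, indent=2))
    return schema
```

Because the schema is derived from the same models the runtime uses, the test harness and production code cannot silently drift apart.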

Development Workflow

Added a comprehensive Makefile to streamline the development process:

  • make test-prompts: Run full evaluation pipeline
  • make test-upto stage=N: Test specific stages
  • make test-list: View available test stages
  • make clean: Reset test environment
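A Makefile wiring up those targets might look like the following; the recipe bodies and the `run_tests.py` script name are assumptions, only the target names come from this PR.

```make
# Hypothetical sketch of the targets above (recipes are assumptions).
test-prompts:
	python run_tests.py

test-upto:
	python run_tests.py --up-to $(stage)

test-list:
	python run_tests.py --list

clean:
	rm -rf outputs/
```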

Design Decisions

  1. Stage Isolation: Each transformation step is isolated in its own configuration file, making it easier to:

    • Debug specific pipeline stages
    • Modify individual transformations
    • Add new capabilities without affecting existing ones
  2. Provider Specialization: Different models are used for different tasks based on their strengths:

    • 8B model for structured outputs
    • 405B model for reasoning
    • 70B model for natural language generation
  3. Schema-First Approach: By generating schemas from our Pydantic models, we ensure:

    • Type safety throughout the pipeline
    • Consistent validation between testing and production
    • Early detection of breaking changes
  4. Automated Testing: The pipeline is designed to be fully automated, enabling:

    • CI/CD integration
    • Regression testing
    • Performance monitoring across model versions

Testing

To test the changes:

  1. Install dependencies: make setup-test
  2. Run all tests: make test-prompts
  3. Test specific stages: make test-upto stage=2

Future Improvements

  1. A lot

ishandhanani (Collaborator, author) commented:

Pausing this right now because we're reworking our prompts and flow. But will come back to this when we finalize

ishandhanani changed the title from "eval: adding full prompt eval harness" to "[WIP] eval: adding full prompt eval harness" on Nov 11, 2024