[WIP] eval: adding full prompt eval harness #16
Draft
ishandhanani wants to merge 12 commits into main from
Pausing this right now because we're reworking our prompts and flow. But will come back to this when we finalize.
Add Prompt Evaluation Pipeline for Agent Service
Overview
This PR introduces a comprehensive prompt evaluation pipeline for the Agent Service, enabling systematic testing and validation of our LLM prompts. The system is designed to ensure prompt reliability and consistency across different models and use cases. We use promptfoo as our eval harness. Note that promptfoo does not allow for a single config file where prompts can be tied to specific use cases, which is why we devised this structure.

Key Components
1. Structured Testing Framework
2. Provider Architecture
Created three distinct NIM providers to handle different model capabilities:
- `nim-8b.py`: Optimized for JSON schema validation and structured outputs
- `nim-405b.py`: Handles complex reasoning and content generation
- `nim-70b.py`: Specialized for dialogue generation and natural language tasks

This was required for promptfoo, as NIMs do not fully implement the OpenAI API spec. A sketch of the provider shape is below.
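promptfoo's custom Python provider interface expects a `call_api(prompt, options, context)` function that returns a dict with an `output` key. Here is a minimal sketch of what `nim-8b.py` might look like, assuming an OpenAI-style chat completions endpoint; the URL, env var, and model id below are placeholders, not the actual values in this PR:

```python
# Sketch of a promptfoo custom Python provider for a NIM endpoint.
# NIM_8B_URL and the model id are assumptions, not values from this PR.
import os

import requests

NIM_URL = os.environ.get("NIM_8B_URL", "http://localhost:8000/v1/chat/completions")
MODEL = "meta/llama-3.1-8b-instruct"  # hypothetical model id


def call_api(prompt, options, context):
    """Entry point promptfoo calls once per test case."""
    resp = requests.post(
        NIM_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            # the 8B provider targets structured output, so keep it deterministic
            "temperature": 0.0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    return {"output": body["choices"][0]["message"]["content"]}
```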
3. Configuration Management
Implemented a YAML-based configuration system for test stages; each stage has its own configuration file. There is also a helper function that retrieves and processes previous prompts to emulate prompt-chaining. A hypothetical stage config is sketched below.
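For illustration, a per-stage config might look roughly like this. The `prompts`/`providers`/`tests` keys are standard promptfoo config; the paths, stage name, and assertion are made up:

```yaml
# evals/stages/stage_2.yaml -- hypothetical path and contents
description: "Stage 2: dialogue generation"
prompts:
  - file://prompts/stage_2.txt
providers:
  - file://providers/nim-70b.py
tests:
  - vars:
      # output of the previous stage, wired in by the prompt-chaining helper
      previous_output: file://outputs/stage_1.json
    assert:
      - type: contains
        value: "expected phrase"
```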
4. Test Runner Infrastructure
Created a flexible test runner that supports running a single stage, all stages up to a given one (see `make test-upto` below), or the full pipeline. A minimal sketch of its core loop follows.
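Assuming stage configs live under a directory like `evals/stages/` (the layout here is a guess), the runner can shell out to promptfoo's `eval -c` command per stage:

```python
# Sketch of a stage-aware runner; directory layout and CLI invocation via npx
# are assumptions about this repo, not confirmed details.
import subprocess
import sys
from pathlib import Path

STAGE_DIR = Path("evals/stages")  # hypothetical layout


def run_stages(upto=None):
    """Run every stage config in order, optionally stopping after stage `upto`."""
    configs = sorted(STAGE_DIR.glob("stage_*.yaml"))
    if upto is not None:
        configs = configs[:upto]
    for cfg in configs:
        print(f"== running {cfg.name} ==")
        # promptfoo's actual CLI: `promptfoo eval -c <config>`
        subprocess.run(["npx", "promptfoo", "eval", "-c", str(cfg)], check=True)


if __name__ == "__main__":
    run_stages(int(sys.argv[1]) if len(sys.argv) > 1 else None)
```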
5. Schema Generation
Automated schema generation from our Pydantic models, so eval assertions stay in sync with the types the service actually produces. A sketch is below.
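A minimal sketch of the generation step, assuming Pydantic v2 (where `model_json_schema()` is the real API); `DialogueTurn` is a hypothetical stand-in for one of our actual models:

```python
# Sketch: dump a JSON Schema per Pydantic model for the eval assertions to use.
import json
from pathlib import Path

from pydantic import BaseModel


class DialogueTurn(BaseModel):  # hypothetical stand-in model
    speaker: str
    text: str


def export_schema(model, out_dir=Path("evals/schemas")):
    """Write the model's JSON Schema to disk and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{model.__name__}.json"
    path.write_text(json.dumps(model.model_json_schema(), indent=2))
    return path


if __name__ == "__main__":
    print(export_schema(DialogueTurn))
```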
Development Workflow
Added a comprehensive Makefile to streamline the development process:
- `make test-prompts`: Run full evaluation pipeline
- `make test-upto stage=N`: Test specific stages
- `make test-list`: View available test stages
- `make clean`: Reset test environment

A rough sketch of these targets is below.
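For illustration, the targets might be wired up roughly like this; the script path and directories are assumptions:

```makefile
# Hypothetical shape of the Makefile targets described above.
STAGE_DIR := evals/stages

test-prompts:
	python scripts/run_evals.py

test-upto:
	python scripts/run_evals.py $(stage)

test-list:
	@ls $(STAGE_DIR)

clean:
	rm -rf outputs/
```

Design Decisions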
- Stage Isolation: Each transformation step is isolated in its own configuration file, making it easier to test, debug, and update individual stages independently.
- Provider Specialization: Different models are used for different tasks based on their strengths, as described under Provider Architecture above.
- Schema-First Approach: By generating schemas from our Pydantic models, we ensure the eval assertions stay in sync with the types the service actually produces.
- Automated Testing: The pipeline is designed to be fully automated, enabling it to run unattended, e.g. in CI.
Testing
To test the changes:
1. `make setup-test`
2. `make test-prompts`
3. `make test-upto stage=2`

Future Improvements