diff --git a/docs/codex-vs-gemini-comparison.md b/docs/codex-vs-gemini-comparison.md new file mode 100644 index 0000000..19ad828 --- /dev/null +++ b/docs/codex-vs-gemini-comparison.md @@ -0,0 +1,344 @@ +# Codex vs Gemini CLI: Customization and Extensibility Comparison + +A comprehensive analysis of two leading AI-powered CLI tools, focusing on their customization capabilities and extensibility features. + +## Executive Summary + +**Codex CLI** (OpenAI) and **Gemini CLI** (Google) represent two distinct approaches to AI-powered command-line development tools. While both prioritize customization and extensibility, they employ fundamentally different architectural philosophies: + +- **Codex** emphasizes a **minimalist, protocol-driven architecture** with strong security guarantees +- **Gemini** focuses on a **feature-rich, user-friendly experience** with comprehensive built-in capabilities + +## Project Overview + +### Codex CLI (OpenAI) + +- **Language**: Rust +- **Philosophy**: Lightweight coding agent with security-first design +- **Target**: Terminal-focused developers who prioritize performance and security +- **Architecture**: Protocol-driven client-server with multiple UI frontends + +### Gemini CLI (Google) + +- **Language**: TypeScript/Node.js +- **Philosophy**: Comprehensive AI workflow tool with rich feature set +- **Target**: Developers working with large codebases and complex workflows +- **Architecture**: Package-based modular system with React-powered terminal UI + +## Customization Capabilities + +### Configuration Systems + +| Aspect | Codex | Gemini | +| ------------------------- | -------------------------------- | --------------------------------------------------- | +| **Format** | TOML with comments | JSON with environment variable expansion | +| **Locations** | `~/.codex/config.toml` | `~/.gemini/settings.json` + `.gemini/settings.json` | +| **Environment Variables** | Limited (`CODEX_HOME`, API keys) | Extensive (`$VAR` expansion, hierarchical `.env`) | +| **Profiles** | ✅ Named profile system | ❌ No profiles (workspace settings instead) | +| **Runtime Changes** | ❌ Requires restart | ✅ Dynamic updates supported | + +#### Codex Configuration Strengths + +- **Profile System**: Switch between complete configuration sets +- **TOML Format**: Human-readable with comment support +- **Strong Typing**: Rust-based validation prevents configuration errors +- **Hierarchical Precedence**: CLI flags → profiles → base config → defaults + +#### Gemini Configuration Strengths + +- **Two-Tier Settings**: User and workspace-specific configurations +- **Dynamic Updates**: Settings can be modified during runtime +- **Environment Integration**: Extensive variable expansion and `.env` support +- **Hierarchical Context**: Multi-level context file inheritance + +### Model and Provider Customization + +#### Codex + +```toml +[model_providers.custom-llm] +name = "Custom LLM Provider" +base_url = "https://api.custom.com/v1" +env_key = "CUSTOM_API_KEY" +wire_api = "chat" +query_params = { api-version = "2025-04-01" } + +[profiles.development] +model = "gpt-4" +approval_policy = "untrusted" +sandbox = "workspace-write" +``` + +#### Gemini + +```json +{ + "theme": "atom-one", + "coreTools": ["file_system", "shell", "web"], + "mcpServers": { + "custom-server": { + "command": "python", + "args": ["server.py"], + "trust": true + } + } +} +``` + +## Extensibility Features + +### Extension Architecture Comparison + +| Feature | Codex | Gemini | +| ----------------------- | ----------------------------- 
| --------------------------------- | +| **Plugin System** | ❌ Configuration-based only | ✅ Directory-based extensions | +| **MCP Support** | ✅ Native Rust implementation | ✅ Official SDK integration | +| **Custom Tools** | Via MCP servers only | MCP + custom tool discovery | +| **Extension Discovery** | Configuration files | `.gemini/extensions/` directories | +| **API Access** | Protocol-based communication | Direct TypeScript interfaces | + +### Model Context Protocol (MCP) Integration + +Both projects heavily leverage MCP for extensibility, but with different implementations: + +#### Codex MCP Features + +- **Native Rust Implementation**: Custom MCP client/server in Rust +- **Connection Manager**: Sophisticated multi-server management +- **Tool Namespacing**: Prevents conflicts with `server__OAI_CODEX_MCP__tool` format +- **Security**: Sandboxed server execution with timeout management +- **Performance**: Concurrent server spawning and tool aggregation + +#### Gemini MCP Features + +- **Official SDK**: Uses standard MCP TypeScript SDK +- **Multiple Transports**: Stdio, SSE, and HTTP support +- **Trust Model**: Trust-based confirmation bypass for known servers +- **Extension Integration**: MCP servers can be defined per-extension +- **Tool Registry**: Centralized tool discovery and registration + +### Custom Tool Development + +#### Codex Approach + +```rust +// Tools must be implemented as MCP servers +// Example: Custom file processor MCP server +pub struct CustomTool { + name: String, + description: String, +} + +impl McpTool for CustomTool { + async fn execute(&self, params: serde_json::Value) -> Result { + // Custom tool implementation + } +} +``` + +#### Gemini Approach + +```typescript +// Built-in tool interface +export class CustomFileTool implements Tool { + name = "custom_file_tool"; + displayName = "Custom File Tool"; + description = "Processes files with custom logic"; + + async execute(params: CustomParams): Promise { + // Direct tool implementation + } +} + +// Or via tool discovery command +{ + "toolDiscoveryCommand": "python discover_tools.py", + "toolCallCommand": "python execute_tool.py" +} +``` + +### Extension Examples + +#### Codex Extensions (via MCP) + +- **Custom Model Providers**: Add any OpenAI-compatible API +- **External Tools**: Database clients, cloud providers, monitoring systems +- **Notification Integrations**: Slack, Discord, email notifications +- **Development Tools**: Linters, formatters, test runners + +#### Gemini Extensions + +```json +{ + "name": "development-tools", + "version": "1.0.0", + "mcpServers": { + "linter": { "command": "eslint-mcp-server" }, + "formatter": { "command": "prettier-mcp-server" } + }, + "excludeTools": ["web_search"], + "contextFileName": "DEV_CONTEXT.md" +} +``` + +## Security and Sandboxing + +### Codex Security Model + +- **Multi-layered Approval**: Separate workflows for commands and file patches +- **Native Sandboxing**: Landlock (Linux) and Seatbelt (macOS) integration +- **Execution Policies**: Configurable safety checks and command restrictions +- **MCP Isolation**: Server process isolation with resource limits + +### Gemini Security Model + +- **Confirmation-based**: User prompts for destructive operations +- **Trust System**: Bypass confirmations for trusted MCP servers +- **Docker Sandboxing**: Optional containerized execution environment +- **Allowlisting**: Server and tool-level access controls + +## Performance and Architecture + +### System Architecture + +#### Codex: Protocol-Driven Design + +``` 
+┌─────────────┐ Protocol ┌──────────────┐ +│ TUI/CLI/Exec│ ◄──────────── │ Core Library │ +└─────────────┘ └──────────────┘ + │ + ┌──────▼──────┐ + │ MCP Manager │ + └─────────────┘ +``` + +**Benefits:** + +- Transport-agnostic communication +- Multiple UI implementations possible +- Clear separation of concerns +- Native performance with zero-cost abstractions + +#### Gemini: Package-Based Modularity + +``` +┌─────────────┐ Direct Calls ┌──────────────┐ +│ CLI Package │ ◄────────────────► │ Core Package │ +└─────────────┘ └──────────────┘ + │ + ┌──────▼──────┐ + │ Tool Registry│ + └─────────────┘ +``` + +**Benefits:** + +- Flexible tool registration +- Modern TypeScript ecosystem +- React-based terminal UI +- Comprehensive built-in toolset + +### Performance Characteristics + +| Metric | Codex | Gemini | +| ------------------------- | ----------------------- | --------------------------- | +| **Startup Time** | Fast (native binary) | Moderate (Node.js runtime) | +| **Memory Usage** | Low (Rust efficiency) | Higher (JavaScript runtime) | +| **Tool Execution** | Fast (native + MCP) | Good (async/await patterns) | +| **Concurrent Operations** | Excellent (Rust async) | Good (Node.js event loop) | +| **File Operations** | Moderate (MCP overhead) | Excellent (built-in tools) | + +## Use Case Recommendations + +### Choose Codex If You Need: + +- **Maximum Security**: Native sandboxing and strict approval workflows +- **Performance**: Native execution with minimal resource usage +- **Flexibility**: Protocol-driven architecture supporting multiple frontends +- **Extensibility via MCP**: Rich ecosystem of external MCP servers +- **Terminal-First**: Optimized for terminal-based development workflows + +### Choose Gemini If You Need: + +- **Rich Built-in Tools**: Comprehensive file system and web capabilities +- **User-Friendly UX**: Modern terminal UI with React components +- **Large Codebase Support**: 1M+ token context windows +- **Rapid Prototyping**: Extensive built-in functionality reduces setup time +- **Google Ecosystem**: Integration with Google services and authentication + +## Customization Examples + +### Codex: Creating a Development Profile + +```toml +# ~/.codex/config.toml +[profiles.development] +model = "gpt-4" +approval_policy = "untrusted" +sandbox = "workspace-write" +notify = ["osascript", "-e", "display notification \"Task completed\""] + +[profiles.production] +model = "o1" +approval_policy = "on-failure" +sandbox = "read-only" + +[mcp_servers.database] +command = "python" +args = ["-m", "database_mcp_server"] +env = { DB_URL = "$DATABASE_URL" } +``` + +### Gemini: Project-Specific Extension + +```json +# .gemini/settings.json +{ + "theme": "github", + "mcpServers": { + "project-tools": { + "command": "./scripts/mcp-server.js", + "trust": true + } + }, + "excludeTools": ["web_search"], + "contextFileName": "PROJECT_CONTEXT.md" +} +``` + +## Future Extensibility + +### Codex Roadmap Indicators + +- **Custom Trusted Commands**: User-defined safe commands +- **Enhanced Protocol**: More sophisticated communication patterns +- **Multi-model Support**: Improved model provider ecosystem +- **Plugin API**: Potential for more formal plugin architecture + +### Gemini Current Trajectory + +- **Extension Registry**: Centralized extension discovery +- **Advanced Telemetry**: Enhanced monitoring and analytics +- **Tool Ecosystem**: Growing library of community tools +- **API Improvements**: More powerful tool development APIs + +## Conclusion + +Both Codex and Gemini offer robust 
customization and extensibility, but serve different developer needs: + +**Codex excels for developers who prioritize**: + +- Security and sandboxing +- Performance and resource efficiency +- Protocol-driven architecture +- MCP-based extensibility + +**Gemini excels for developers who prioritize**: + +- Rich built-in functionality +- User experience and interface design +- Rapid development workflows +- Comprehensive tool ecosystem + +The choice between them depends on your specific requirements for security, performance, built-in capabilities, and architectural preferences. Both represent mature, well-designed approaches to AI-powered development tools with strong extensibility stories through MCP and their respective configuration systems. diff --git a/docs/eval-framework-cli-transformation-plan.md b/docs/eval-framework-cli-transformation-plan.md new file mode 100644 index 0000000..6d78b2f --- /dev/null +++ b/docs/eval-framework-cli-transformation-plan.md @@ -0,0 +1,308 @@ +# Plan: Transform Gemini CLI into an AI Evaluation Recommendation Tool + +## **Concept Overview** + +Transform Gemini CLI into a specialized tool that analyzes project codebases and recommends tailored evaluation strategies using your comprehensive eval-framework. + +## **Why This Could Work Exceptionally Well** + +Your eval-framework is incredibly comprehensive and structured - it's essentially a decision tree system that maps: + +- Task types → Quality dimensions → Metrics → Implementation approaches +- Budget constraints → Resource allocation → Tool selection +- Evaluation context → Approach mix (automated/LLM-judge/human) + +This structured knowledge is perfect for automation. Instead of developers manually navigating decision trees, the tool would: + +1. **Analyze their codebase** to understand what they're building +2. **Apply the decision trees automatically** based on detected patterns +3. **Generate concrete recommendations** with implementation code + +## **Why Gemini CLI Over Codex** + +- **Superior Code Analysis**: Built-in `read_many_files`, `glob`, `grep` tools for codebase analysis +- **Flexible TypeScript Architecture**: Easier to implement complex recommendation logic +- **Rich Extension System**: Can add custom evaluation-specific tools +- **Better File Operations**: Native support for analyzing large codebases +- **1M+ token context**: Can analyze entire projects at once +- **Web Search Integration**: Built-in `web_search` tool for real-time research and validation + +## **Core Functionality** + +### 1. **Project Analysis Engine** + +- Scan codebase to identify AI/ML patterns (model calls, embeddings, RAG systems) +- Detect task types (Q&A, code generation, creative writing, summarization) +- Analyze existing evaluation code/tests +- Identify tech stack and deployment patterns +- Extract business context from documentation + +### 2. **Evaluation Strategy Recommender** + +- Map detected patterns to eval-framework decision trees +- Recommend priority metrics based on task type and context +- Suggest evaluation approaches (automated/LLM-judge/human mix) +- Provide budget allocation guidance +- Consider existing infrastructure and team capabilities +- **Real-time Research**: Use web search to validate recommendations against latest best practices +- **Cost Validation**: Search for current pricing of evaluation tools and services + +### 3. 
**Implementation Generator** + +- Generate evaluation code templates tailored to the project +- Create monitoring dashboards configs +- Suggest tool integrations (RAGAS, TruLens, etc.) +- Output complete evaluation pipelines +- Generate documentation and setup guides + +## **Implementation Plan** + +### **Phase 1: Core Analysis Tools** (2-3 weeks) + +- Create `analyze_ai_project` tool to scan codebases for AI patterns +- Implement decision tree logic from eval-framework as structured data +- Build metric recommendation engine +- Add quality dimension mapping functionality + +### **Phase 2: Recommendation System** (2-3 weeks) + +- Integrate eval-framework decision trees as structured data +- Build budget optimization logic based on cost-benefit calculator +- Create evaluation approach selector (automated/LLM-judge/human mix) +- Implement project-specific customization logic + +### **Phase 3: Code Generation** (2-3 weeks) + +- Template engine for evaluation code generation +- Pipeline configuration generators +- Integration templates for popular eval tools +- Monitoring setup automation +- Dashboard and alerting configurations + +### **Phase 4: Integration & Polish** (1-2 weeks) + +- Custom tools integration and testing +- MCP server for evaluation tools +- Documentation and comprehensive examples +- Validation against real projects + +## **Web Search Integration Strategy** + +### **Real-time Validation & Research** + +#### **1. Evaluation Methodology Updates** + +```typescript +// Search for latest research on specific evaluation approaches +await web_search("RAG evaluation best practices 2025 latest research papers"); +await web_search("LLM-as-judge evaluation reliability studies 2024-2025"); +await web_search("automated evaluation metrics correlation human judgment"); +``` + +#### **2. Tool & Service Pricing Validation** + +```typescript +// Validate current pricing and availability +await web_search("RAGAS evaluation framework pricing cost 2025"); +await web_search("OpenAI API pricing evaluation costs per token"); +await web_search("Anthropic Claude API pricing evaluation workloads"); +await web_search("LangSmith evaluation platform pricing plans"); +``` + +#### **3. 
Market Research for Tool Recommendations** + +```typescript +// Research latest tools and comparisons +await web_search("best LLM evaluation tools 2025 comparison matrix"); +await web_search( + "TruLens vs RAGAS vs LangSmith evaluation platform comparison", +); +await web_search("open source LLM evaluation frameworks 2025"); +``` + +### **Dynamic Recommendation Enhancement** + +#### **Web-Enhanced Decision Making** + +- **Base Recommendations**: Start with eval-framework decision trees +- **Web Validation**: Search for recent developments that might change recommendations +- **Cost Updates**: Get current pricing to refine budget estimates +- **Tool Availability**: Verify recommended tools are still maintained and available + +#### **Search Query Templates** + +```typescript +const searchQueries = { + taskSpecific: (taskType) => + `${taskType} evaluation best practices 2025 latest research`, + toolPricing: (tool) => `${tool} pricing cost evaluation workloads 2025`, + toolComparison: (tools) => + `${tools.join(" vs ")} comparison evaluation framework`, + industryTrends: (domain) => `${domain} AI evaluation trends 2025 metrics`, + budgetBenchmarks: (taskType, scale) => + `${taskType} evaluation budget ${scale} team cost analysis`, +}; +``` + +### **Enhanced Usage Examples with Web Search** + +#### **Live Market Research** + +```bash +gemini "analyze this customer support chatbot project, research current RAG evaluation trends, and recommend an evaluation strategy with current tool pricing" +``` + +#### **Competitive Analysis** + +```bash +gemini "what evaluation approach are leading companies using for code generation in 2025, and how does it compare to my current setup?" +``` + +#### **Budget Optimization with Real Data** + +```bash +gemini "research current evaluation service pricing and optimize my $10k evaluation budget for maximum ROI" +``` + +### **Web Search Integration Benefits** + +#### **Always Current Recommendations** + +- Recommendations reflect latest research and industry trends +- Tool suggestions based on current availability and community adoption +- Pricing estimates use real-time market data + +#### **Market Intelligence** + +- Understand how evaluation practices are evolving +- Identify emerging tools and methodologies +- Benchmark against industry standards + +#### **Risk Mitigation** + +- Avoid recommending discontinued or deprecated tools +- Validate that suggested approaches are still considered best practice +- Ensure cost estimates reflect current market rates + +## **Technical Architecture** + +### **New Custom Tools** + +```typescript +- analyze_ai_project: Detect AI patterns, task types, existing eval code +- recommend_metrics: Apply decision tree logic from eval-framework +- generate_eval_plan: Create comprehensive evaluation strategy +- generate_eval_code: Output implementation templates +- estimate_eval_costs: Budget and resource planning with live pricing +- setup_monitoring: Configure production monitoring +- research_eval_trends: Web search for latest evaluation methodologies +- validate_tool_recommendations: Check current availability and pricing of eval tools +``` + +### **Extension Structure** + +``` +.gemini/extensions/eval-framework/ +├── index.json (extension manifest) +├── decision-trees/ (structured eval-framework data) +├── quality-dimensions/ (metric definitions and thresholds) +├── templates/ (code generation templates) +├── cost-models/ (budget optimization logic) +└── tools/ (custom evaluation tools) +``` + +### **Data Integration** + +Convert 
eval-framework markdown into structured JSON: + +- Decision trees → Executable logic +- Quality dimensions → Metric definitions +- Cost models → Budget calculators +- Implementation guides → Code templates + +## **Usage Examples** + +### **Basic Analysis** + +```bash +gemini "analyze this AI project and recommend an evaluation strategy" +``` + +### **Specific Task Focus** + +```bash +gemini "this is a RAG system for customer support - what evaluation approach should I use?" +``` + +### **Implementation Generation** + +```bash +gemini "generate evaluation code for the recommended metrics with monitoring setup" +``` + +### **Budget Planning** + +```bash +gemini "estimate evaluation costs for this project with a $5k monthly budget" +``` + +### **Real-time Research & Validation** + +```bash +gemini "what are the latest evaluation trends for RAG systems in 2025?" +``` + +```bash +gemini "validate that RAGAS is still the best tool for RAG evaluation and check current pricing" +``` + +## **Key Benefits** + +### **For Developers** + +- **Automated Expertise**: Apply eval-framework knowledge without manual navigation +- **Project-Specific**: Tailored recommendations vs generic evaluation advice +- **Implementation Ready**: Generate actual code, not just recommendations +- **Cost Optimized**: Budget-aware metric selection with live pricing data +- **Always Current**: Real-time research ensures recommendations reflect latest best practices + +### **For Organizations** + +- **Consistency**: Standardized evaluation approaches across teams +- **Time Savings**: Weeks to hours for evaluation setup +- **Quality Improvement**: Evidence-based metric selection +- **ROI Optimization**: Better resource allocation through systematic planning + +## **Technical Feasibility** + +### **High Confidence Areas** + +- **Codebase Analysis**: Gemini's file tools excel at pattern detection +- **Decision Tree Logic**: Straightforward to implement as code +- **Template Generation**: Well-established patterns in CLI tools +- **Integration**: MCP provides clean extension points + +### **Medium Confidence Areas** + +- **Quality of Recommendations**: Depends on pattern detection accuracy +- **Code Generation Quality**: Templates need extensive testing +- **Web Search Quality**: Effectiveness depends on search result relevance and accuracy + +## **Success Metrics** + +- **Time Reduction**: Evaluation setup time reduced from weeks to hours +- **Quality Improvement**: Higher quality metric selection vs manual approaches +- **Cost Optimization**: Improved evaluation ROI through better resource allocation +- **Adoption**: Increased use of systematic evaluation practices + +## **Competitive Advantage** + +This would be the **first AI-powered evaluation consultant** that: + +- Understands both your specific project AND evaluation best practices +- Provides concrete implementation, not just advice +- Optimizes for real-world constraints (budget, team size, timeline) +- Stays current with latest evaluation research via web search integration +- Validates recommendations against real-time market data and pricing + +The combination of your comprehensive eval-framework with Gemini's codebase analysis capabilities could create a genuinely transformative tool for AI evaluation. 
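+
+## **Appendix: Decision Trees as Code (Sketch)**
+
+As a rough illustration of the "Decision trees → Executable logic" conversion described under Data Integration, the sketch below shows one possible shape for a converted decision-tree entry and a minimal selector over it. The field names (`taskType`, `priorityMetrics`, `approachMix`) are assumptions for illustration, not part of the existing eval-framework.
+
+```typescript
+// Hypothetical shape for one converted decision-tree entry (decision-trees/*.json)
+interface DecisionNode {
+  taskType: string; // e.g. "rag", "qa", "code_generation"
+  priorityMetrics: string[]; // metrics the framework ranks highest for this task type
+  approachMix: { automated: number; llmJudge: number; human: number };
+}
+
+// Minimal selector: map detected task types onto matching decision-tree entries
+function recommendMetrics(
+  detectedTaskTypes: string[],
+  tree: DecisionNode[],
+): DecisionNode[] {
+  return tree.filter((node) => detectedTaskTypes.includes(node.taskType));
+}
+
+// Placeholder data for illustration only
+const tree: DecisionNode[] = [
+  {
+    taskType: "rag",
+    priorityMetrics: ["answer_faithfulness", "context_relevance"],
+    approachMix: { automated: 0.6, llmJudge: 0.3, human: 0.1 },
+  },
+];
+
+console.log(recommendMetrics(["rag", "qa"], tree));
+```
+
+In the full tool, the selector would presumably also weigh budget constraints, team expertise, and deployment context before producing the final strategy, as outlined in the Evaluation Strategy Recommender section above.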
diff --git a/docs/eval-framework-mcp-server-spec.md b/docs/eval-framework-mcp-server-spec.md new file mode 100644 index 0000000..b313577 --- /dev/null +++ b/docs/eval-framework-mcp-server-spec.md @@ -0,0 +1,426 @@ +# Eval Framework MCP Server Specification + +## **Overview** + +A specialized MCP server that provides AI evaluation strategy recommendations by combining structured evaluation knowledge with real-time research capabilities. The server leverages existing MCP tools (web search, file operations) through prompts while providing specialized evaluation tools. + +## **Architecture Philosophy** + +### **MCP Server Tools**: Specialized evaluation logic only + +- Project analysis and pattern detection +- Evaluation strategy recommendation +- Code generation and templates +- Cost estimation and budget optimization + +### **Built-in Tool Integration**: Leverage existing capabilities via prompts + +- Web search for real-time validation +- File system operations for codebase analysis +- Code generation assistance + +## **Core MCP Server Tools** + +### **1. `analyze_ai_project`** + +**Purpose**: Detect AI patterns and classify project characteristics + +```typescript +{ + "name": "analyze_ai_project", + "description": "Analyze project structure to identify AI/ML patterns, task types, and existing evaluation approaches", + "inputSchema": { + "type": "object", + "properties": { + "project_summary": { + "type": "string", + "description": "Summary of project files and code patterns (obtained via file reading tools)" + }, + "business_context": { + "type": "string", + "description": "Optional business context about the project goals" + }, + "existing_eval_code": { + "type": "string", + "description": "Any existing evaluation code found in the project" + } + }, + "required": ["project_summary"] + } +} +``` + +**Output**: + +```json +{ + "task_types": ["rag", "qa", "code_generation"], + "ai_frameworks": ["langchain", "openai"], + "deployment_context": "production", + "team_size_estimate": "small", + "existing_evaluation": { + "has_tests": true, + "frameworks_used": ["pytest"], + "coverage": "basic" + }, + "complexity_score": 7, + "risk_factors": ["customer_facing", "high_volume"] +} +``` + +### **2. 
`recommend_evaluation_strategy`** + +**Purpose**: Apply decision tree logic to recommend evaluation approach + +```typescript +{ + "name": "recommend_evaluation_strategy", + "description": "Generate comprehensive evaluation strategy based on project analysis and constraints", + "inputSchema": { + "type": "object", + "properties": { + "project_analysis": { + "type": "object", + "description": "Output from analyze_ai_project" + }, + "budget_constraint": { + "type": "number", + "description": "Monthly evaluation budget in USD" + }, + "timeline": { + "type": "string", + "enum": ["immediate", "1-month", "3-months", "6-months"] + }, + "team_expertise": { + "type": "string", + "enum": ["beginner", "intermediate", "advanced"] + }, + "quality_requirements": { + "type": "string", + "enum": ["basic", "high", "critical"] + } + }, + "required": ["project_analysis"] + } +} +``` + +**Output**: + +```json +{ + "recommended_metrics": [ + { + "name": "answer_faithfulness", + "priority": "high", + "weight": 0.25, + "measurement_approach": "ragas + llm_judge", + "target_threshold": 0.95, + "estimated_cost_monthly": 800 + } + ], + "evaluation_mix": { + "automated": 0.6, + "llm_judge": 0.3, + "human": 0.1 + }, + "implementation_phases": [ + { + "phase": "immediate", + "metrics": ["basic_accuracy", "response_time"], + "effort_estimate": "1-2 days" + } + ], + "budget_allocation": { + "tooling": 1200, + "compute": 800, + "human_annotation": 400 + }, + "risk_assessment": "medium" +} +``` + +### **3. `generate_evaluation_implementation`** + +**Purpose**: Generate concrete implementation code and configurations + +```typescript +{ + "name": "generate_evaluation_implementation", + "description": "Generate evaluation code, configs, and monitoring setup based on strategy", + "inputSchema": { + "type": "object", + "properties": { + "evaluation_strategy": { + "type": "object", + "description": "Output from recommend_evaluation_strategy" + }, + "tech_stack": { + "type": "object", + "properties": { + "language": {"type": "string"}, + "frameworks": {"type": "array", "items": {"type": "string"}}, + "deployment": {"type": "string"} + } + }, + "output_format": { + "type": "string", + "enum": ["complete_implementation", "getting_started", "integration_only"] + } + }, + "required": ["evaluation_strategy", "tech_stack"] + } +} +``` + +**Output**: + +```json +{ + "implementation_files": [ + { + "filename": "evaluation/rag_evaluator.py", + "content": "# Generated evaluation code...", + "description": "Main RAG evaluation implementation using RAGAS" + }, + { + "filename": "monitoring/dashboard_config.yaml", + "content": "# Grafana dashboard config...", + "description": "Monitoring dashboard configuration" + } + ], + "setup_instructions": "Step-by-step setup guide...", + "dependencies": { + "python": ["ragas>=0.1.0", "langchain", "openai"], + "infrastructure": ["prometheus", "grafana"] + }, + "next_steps": [ + "Run initial evaluation on sample data", + "Set up monitoring dashboard", + "Configure alerting thresholds" + ] +} +``` + +### **4. 
`estimate_evaluation_costs`** + +**Purpose**: Provide detailed cost breakdown and ROI analysis + +```typescript +{ + "name": "estimate_evaluation_costs", + "description": "Calculate detailed cost estimates and ROI projections for evaluation strategy", + "inputSchema": { + "type": "object", + "properties": { + "evaluation_strategy": { + "type": "object", + "description": "Output from recommend_evaluation_strategy" + }, + "scale_parameters": { + "type": "object", + "properties": { + "requests_per_month": {"type": "number"}, + "evaluation_frequency": {"type": "string"}, + "team_size": {"type": "number"} + } + }, + "current_costs": { + "type": "object", + "description": "Optional current evaluation costs for comparison" + } + }, + "required": ["evaluation_strategy", "scale_parameters"] + } +} +``` + +## **Integration with Built-in Tools via Prompts** + +### **1. Project Analysis Prompt Pattern** + +Instead of building file reading into the MCP server, use prompts that leverage existing tools: + +```typescript +// MCP Server returns a suggested prompt for the AI assistant: +{ + "analysis_prompt": `Please analyze this project for AI evaluation planning: + +1. Use read_many_files to scan the codebase for: + - Model API calls (OpenAI, Anthropic, etc.) + - RAG implementations (vector stores, retrieval) + - Evaluation or testing code + - Documentation about AI features + +2. Use glob to find relevant files: + - "**/*.py" for Python AI code + - "**/requirements.txt" for dependencies + - "**/README.md" for project documentation + - "**/*test*.py" for existing evaluation code + +3. Summarize findings and call analyze_ai_project with the summary.`, + + "file_patterns": [ + "**/*.py", + "**/requirements.txt", + "**/package.json", + "**/README.md", + "**/*test*.py", + "**/*eval*.py" + ] +} +``` + +### **2. Research Validation Prompt Pattern** + +Leverage web search for real-time validation: + +```typescript +{ + "research_prompt": `Validate and enhance these evaluation recommendations: + +Current Recommendations: ${JSON.stringify(recommendations)} + +Please web_search for: +1. "${task_type} evaluation best practices 2025 latest research" +2. "${recommended_tools.join(' vs ')} comparison evaluation frameworks" +3. "evaluation budget ${budget_range} ${task_type} ROI analysis" + +Update recommendations based on findings and note any changes from current best practices.`, + + "search_queries": [ + "RAG evaluation best practices 2025 latest research", + "RAGAS vs TruLens vs LangSmith evaluation platform comparison", + "evaluation budget $5000 RAG system ROI analysis" + ] +} +``` + +### **3. Implementation Generation Prompt Pattern** + +Use AI assistant for code generation with templates: + +```typescript +{ + "implementation_prompt": `Generate evaluation implementation based on this strategy: + +Strategy: ${JSON.stringify(evaluation_strategy)} +Tech Stack: ${JSON.stringify(tech_stack)} + +Please: +1. Create evaluation code using the recommended metrics and frameworks +2. Include proper error handling and logging +3. Add monitoring and alerting configurations +4. Generate setup documentation +5. Include sample usage examples + +Use the implementation templates from the MCP server as starting points.`, + + "templates": { + "python_ragas": "# Template code for RAGAS evaluation...", + "monitoring_config": "# Template monitoring configuration...", + "setup_guide": "# Template setup instructions..." 
+ } +} +``` + +## **Server Implementation Structure** + +``` +eval-framework-mcp-server/ +├── package.json +├── src/ +│ ├── server.ts # MCP server entry point +│ ├── tools/ +│ │ ├── analyze-project.ts # Project analysis logic +│ │ ├── recommend-strategy.ts # Decision tree application +│ │ ├── generate-impl.ts # Code generation +│ │ └── estimate-costs.ts # Cost calculation +│ ├── data/ +│ │ ├── decision-trees.json # Eval framework decision logic +│ │ ├── quality-dimensions.json +│ │ ├── cost-models.json # Pricing data and formulas +│ │ └── templates/ # Implementation templates +│ │ ├── python/ +│ │ ├── typescript/ +│ │ └── monitoring/ +│ ├── prompts/ +│ │ ├── analysis-prompts.ts # Prompt templates for built-in tools +│ │ ├── research-prompts.ts +│ │ └── implementation-prompts.ts +│ └── utils/ +│ ├── pattern-detector.ts # AI pattern recognition +│ ├── metric-selector.ts # Decision tree logic +│ └── cost-calculator.ts # Budget optimization +├── README.md +└── examples/ + ├── basic-usage.md + └── integration-examples/ +``` + +## **Usage Flow** + +### **1. Initial Analysis** + +```bash +# User: "Analyze this project and recommend evaluation approach" + +# AI Assistant workflow: +1. Uses built-in tools to scan codebase (read_many_files, glob) +2. Calls analyze_ai_project with file summary +3. Calls recommend_evaluation_strategy with analysis + user constraints +4. Uses web_search to validate recommendations against latest trends +5. Presents comprehensive evaluation strategy +``` + +### **2. Implementation Generation** + +```bash +# User: "Generate the evaluation code for this strategy" + +# AI Assistant workflow: +1. Calls generate_evaluation_implementation with strategy +2. Uses returned templates and guidance to create complete implementation +3. Generates setup documentation and deployment configs +4. Provides step-by-step implementation guide +``` + +### **3. Cost Analysis** + +```bash +# User: "What will this evaluation approach cost with 10k requests/month?" + +# AI Assistant workflow: +1. Calls estimate_evaluation_costs with scale parameters +2. Uses web_search to validate current tool pricing +3. Presents detailed cost breakdown and ROI analysis +4. Suggests optimizations for budget constraints +``` + +## **Key Benefits of This Architecture** + +### **1. Focused Responsibility** + +- MCP server handles evaluation-specific logic only +- Leverages existing tools for general operations +- Clean separation of concerns + +### **2. Flexibility** + +- Works with any MCP-compatible client +- Can adapt to different built-in tool capabilities +- Easy to extend with new evaluation methodologies + +### **3. Maintainability** + +- Evaluation logic isolated and versioned separately +- No dependency on specific CLI implementations +- Community can contribute evaluation knowledge + +### **4. Real-time Adaptation** + +- Web search integration keeps recommendations current +- Can adapt to new tools and pricing +- Validates against latest research + +This architecture provides the best of both worlds: specialized evaluation expertise through the MCP server, combined with the full power of general-purpose AI tools for research, file operations, and implementation assistance.
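+
+## **Appendix: Server Entry Point Sketch**
+
+For orientation, here is a minimal sketch of what `src/server.ts` could look like when registering `analyze_ai_project`, assuming the official MCP TypeScript SDK (`@modelcontextprotocol/sdk`). The `analyzeProject` import is a hypothetical stand-in for the logic in `tools/analyze-project.ts`; a real entry point would register all four tools with their full input schemas.
+
+```typescript
+import { Server } from "@modelcontextprotocol/sdk/server/index.js";
+import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
+import {
+  CallToolRequestSchema,
+  ListToolsRequestSchema,
+} from "@modelcontextprotocol/sdk/types.js";
+import { analyzeProject } from "./tools/analyze-project.js"; // hypothetical helper
+
+const server = new Server(
+  { name: "eval-framework", version: "0.1.0" },
+  { capabilities: { tools: {} } },
+);
+
+// Advertise the specialized evaluation tools defined in this spec
+server.setRequestHandler(ListToolsRequestSchema, async () => ({
+  tools: [
+    {
+      name: "analyze_ai_project",
+      description:
+        "Analyze project structure to identify AI/ML patterns, task types, and existing evaluation approaches",
+      inputSchema: {
+        type: "object",
+        properties: { project_summary: { type: "string" } },
+        required: ["project_summary"],
+      },
+    },
+  ],
+}));
+
+// Dispatch tool calls to the evaluation-specific logic
+server.setRequestHandler(CallToolRequestSchema, async (request) => {
+  if (request.params.name === "analyze_ai_project") {
+    const analysis = analyzeProject(request.params.arguments ?? {});
+    return {
+      content: [{ type: "text", text: JSON.stringify(analysis, null, 2) }],
+    };
+  }
+  throw new Error(`Unknown tool: ${request.params.name}`);
+});
+
+await server.connect(new StdioServerTransport());
+```
+
+An MCP-compatible client (for example, Gemini CLI via its `mcpServers` setting) would then launch this entry point over stdio and combine its specialized tools with the built-in file and web-search tools described in the usage flows above.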