diff --git a/docs/codex-vs-gemini-comparison.md b/docs/codex-vs-gemini-comparison.md new file mode 100644 index 0000000..19ad828 --- /dev/null +++ b/docs/codex-vs-gemini-comparison.md @@ -0,0 +1,344 @@ +# Codex vs Gemini CLI: Customization and Extensibility Comparison + +A comprehensive analysis of two leading AI-powered CLI tools, focusing on their customization capabilities and extensibility features. + +## Executive Summary + +**Codex CLI** (OpenAI) and **Gemini CLI** (Google) represent two distinct approaches to AI-powered command-line development tools. While both prioritize customization and extensibility, they employ fundamentally different architectural philosophies: + +- **Codex** emphasizes a **minimalist, protocol-driven architecture** with strong security guarantees +- **Gemini** focuses on a **feature-rich, user-friendly experience** with comprehensive built-in capabilities + +## Project Overview + +### Codex CLI (OpenAI) + +- **Language**: Rust +- **Philosophy**: Lightweight coding agent with security-first design +- **Target**: Terminal-focused developers who prioritize performance and security +- **Architecture**: Protocol-driven client-server with multiple UI frontends + +### Gemini CLI (Google) + +- **Language**: TypeScript/Node.js +- **Philosophy**: Comprehensive AI workflow tool with rich feature set +- **Target**: Developers working with large codebases and complex workflows +- **Architecture**: Package-based modular system with React-powered terminal UI + +## Customization Capabilities + +### Configuration Systems + +| Aspect | Codex | Gemini | +| ------------------------- | -------------------------------- | --------------------------------------------------- | +| **Format** | TOML with comments | JSON with environment variable expansion | +| **Locations** | `~/.codex/config.toml` | `~/.gemini/settings.json` + `.gemini/settings.json` | +| **Environment Variables** | Limited (`CODEX_HOME`, API keys) | Extensive (`$VAR` expansion, hierarchical `.env`) | +| **Profiles** | ✅ Named profile system | ❌ No profiles (workspace settings instead) | +| **Runtime Changes** | ❌ Requires restart | ✅ Dynamic updates supported | + +#### Codex Configuration Strengths + +- **Profile System**: Switch between complete configuration sets +- **TOML Format**: Human-readable with comment support +- **Strong Typing**: Rust-based validation prevents configuration errors +- **Hierarchical Precedence**: CLI flags → profiles → base config → defaults + +#### Gemini Configuration Strengths + +- **Two-Tier Settings**: User and workspace-specific configurations +- **Dynamic Updates**: Settings can be modified during runtime +- **Environment Integration**: Extensive variable expansion and `.env` support +- **Hierarchical Context**: Multi-level context file inheritance + +### Model and Provider Customization + +#### Codex + +```toml +[model_providers.custom-llm] +name = "Custom LLM Provider" +base_url = "https://api.custom.com/v1" +env_key = "CUSTOM_API_KEY" +wire_api = "chat" +query_params = { api-version = "2025-04-01" } + +[profiles.development] +model = "gpt-4" +approval_policy = "untrusted" +sandbox = "workspace-write" +``` + +#### Gemini + +```json +{ + "theme": "atom-one", + "coreTools": ["file_system", "shell", "web"], + "mcpServers": { + "custom-server": { + "command": "python", + "args": ["server.py"], + "trust": true + } + } +} +``` + +## Extensibility Features + +### Extension Architecture Comparison + +| Feature | Codex | Gemini | +| ----------------------- | ----------------------------- 
| --------------------------------- | +| **Plugin System** | ❌ Configuration-based only | ✅ Directory-based extensions | +| **MCP Support** | ✅ Native Rust implementation | ✅ Official SDK integration | +| **Custom Tools** | Via MCP servers only | MCP + custom tool discovery | +| **Extension Discovery** | Configuration files | `.gemini/extensions/` directories | +| **API Access** | Protocol-based communication | Direct TypeScript interfaces | + +### Model Context Protocol (MCP) Integration + +Both projects heavily leverage MCP for extensibility, but with different implementations: + +#### Codex MCP Features + +- **Native Rust Implementation**: Custom MCP client/server in Rust +- **Connection Manager**: Sophisticated multi-server management +- **Tool Namespacing**: Prevents conflicts with `server__OAI_CODEX_MCP__tool` format +- **Security**: Sandboxed server execution with timeout management +- **Performance**: Concurrent server spawning and tool aggregation + +#### Gemini MCP Features + +- **Official SDK**: Uses standard MCP TypeScript SDK +- **Multiple Transports**: Stdio, SSE, and HTTP support +- **Trust Model**: Trust-based confirmation bypass for known servers +- **Extension Integration**: MCP servers can be defined per-extension +- **Tool Registry**: Centralized tool discovery and registration + +### Custom Tool Development + +#### Codex Approach + +```rust +// Tools must be implemented as MCP servers +// Example: Custom file processor MCP server +pub struct CustomTool { + name: String, + description: String, +} + +impl McpTool for CustomTool { + async fn execute(&self, params: serde_json::Value) -> Result { + // Custom tool implementation + } +} +``` + +#### Gemini Approach + +```typescript +// Built-in tool interface +export class CustomFileTool implements Tool { + name = "custom_file_tool"; + displayName = "Custom File Tool"; + description = "Processes files with custom logic"; + + async execute(params: CustomParams): Promise { + // Direct tool implementation + } +} + +// Or via tool discovery command +{ + "toolDiscoveryCommand": "python discover_tools.py", + "toolCallCommand": "python execute_tool.py" +} +``` + +### Extension Examples + +#### Codex Extensions (via MCP) + +- **Custom Model Providers**: Add any OpenAI-compatible API +- **External Tools**: Database clients, cloud providers, monitoring systems +- **Notification Integrations**: Slack, Discord, email notifications +- **Development Tools**: Linters, formatters, test runners + +#### Gemini Extensions + +```json +{ + "name": "development-tools", + "version": "1.0.0", + "mcpServers": { + "linter": { "command": "eslint-mcp-server" }, + "formatter": { "command": "prettier-mcp-server" } + }, + "excludeTools": ["web_search"], + "contextFileName": "DEV_CONTEXT.md" +} +``` + +## Security and Sandboxing + +### Codex Security Model + +- **Multi-layered Approval**: Separate workflows for commands and file patches +- **Native Sandboxing**: Landlock (Linux) and Seatbelt (macOS) integration +- **Execution Policies**: Configurable safety checks and command restrictions +- **MCP Isolation**: Server process isolation with resource limits + +### Gemini Security Model + +- **Confirmation-based**: User prompts for destructive operations +- **Trust System**: Bypass confirmations for trusted MCP servers +- **Docker Sandboxing**: Optional containerized execution environment +- **Allowlisting**: Server and tool-level access controls + +## Performance and Architecture + +### System Architecture + +#### Codex: Protocol-Driven Design + +``` 
+┌─────────────┐ Protocol ┌──────────────┐ +│ TUI/CLI/Exec│ ◄──────────── │ Core Library │ +└─────────────┘ └──────────────┘ + │ + ┌──────▼──────┐ + │ MCP Manager │ + └─────────────┘ +``` + +**Benefits:** + +- Transport-agnostic communication +- Multiple UI implementations possible +- Clear separation of concerns +- Native performance with zero-cost abstractions + +#### Gemini: Package-Based Modularity + +``` +┌─────────────┐ Direct Calls ┌──────────────┐ +│ CLI Package │ ◄────────────────► │ Core Package │ +└─────────────┘ └──────────────┘ + │ + ┌──────▼──────┐ + │ Tool Registry│ + └─────────────┘ +``` + +**Benefits:** + +- Flexible tool registration +- Modern TypeScript ecosystem +- React-based terminal UI +- Comprehensive built-in toolset + +### Performance Characteristics + +| Metric | Codex | Gemini | +| ------------------------- | ----------------------- | --------------------------- | +| **Startup Time** | Fast (native binary) | Moderate (Node.js runtime) | +| **Memory Usage** | Low (Rust efficiency) | Higher (JavaScript runtime) | +| **Tool Execution** | Fast (native + MCP) | Good (async/await patterns) | +| **Concurrent Operations** | Excellent (Rust async) | Good (Node.js event loop) | +| **File Operations** | Moderate (MCP overhead) | Excellent (built-in tools) | + +## Use Case Recommendations + +### Choose Codex If You Need: + +- **Maximum Security**: Native sandboxing and strict approval workflows +- **Performance**: Native execution with minimal resource usage +- **Flexibility**: Protocol-driven architecture supporting multiple frontends +- **Extensibility via MCP**: Rich ecosystem of external MCP servers +- **Terminal-First**: Optimized for terminal-based development workflows + +### Choose Gemini If You Need: + +- **Rich Built-in Tools**: Comprehensive file system and web capabilities +- **User-Friendly UX**: Modern terminal UI with React components +- **Large Codebase Support**: 1M+ token context windows +- **Rapid Prototyping**: Extensive built-in functionality reduces setup time +- **Google Ecosystem**: Integration with Google services and authentication + +## Customization Examples + +### Codex: Creating a Development Profile + +```toml +# ~/.codex/config.toml +[profiles.development] +model = "gpt-4" +approval_policy = "untrusted" +sandbox = "workspace-write" +notify = ["osascript", "-e", "display notification \"Task completed\""] + +[profiles.production] +model = "o1" +approval_policy = "on-failure" +sandbox = "read-only" + +[mcp_servers.database] +command = "python" +args = ["-m", "database_mcp_server"] +env = { DB_URL = "$DATABASE_URL" } +``` + +### Gemini: Project-Specific Extension + +```json +# .gemini/settings.json +{ + "theme": "github", + "mcpServers": { + "project-tools": { + "command": "./scripts/mcp-server.js", + "trust": true + } + }, + "excludeTools": ["web_search"], + "contextFileName": "PROJECT_CONTEXT.md" +} +``` + +## Future Extensibility + +### Codex Roadmap Indicators + +- **Custom Trusted Commands**: User-defined safe commands +- **Enhanced Protocol**: More sophisticated communication patterns +- **Multi-model Support**: Improved model provider ecosystem +- **Plugin API**: Potential for more formal plugin architecture + +### Gemini Current Trajectory + +- **Extension Registry**: Centralized extension discovery +- **Advanced Telemetry**: Enhanced monitoring and analytics +- **Tool Ecosystem**: Growing library of community tools +- **API Improvements**: More powerful tool development APIs + +## Conclusion + +Both Codex and Gemini offer robust 
customization and extensibility, but serve different developer needs: + +**Codex excels for developers who prioritize**: + +- Security and sandboxing +- Performance and resource efficiency +- Protocol-driven architecture +- MCP-based extensibility + +**Gemini excels for developers who prioritize**: + +- Rich built-in functionality +- User experience and interface design +- Rapid development workflows +- Comprehensive tool ecosystem + +The choice between them depends on your specific requirements for security, performance, built-in capabilities, and architectural preferences. Both represent mature, well-designed approaches to AI-powered development tools with strong extensibility stories through MCP and their respective configuration systems. diff --git a/docs/eval-framework-cli-transformation-plan.md b/docs/eval-framework-cli-transformation-plan.md new file mode 100644 index 0000000..6d78b2f --- /dev/null +++ b/docs/eval-framework-cli-transformation-plan.md @@ -0,0 +1,308 @@ +# Plan: Transform Gemini CLI into an AI Evaluation Recommendation Tool + +## **Concept Overview** + +Transform Gemini CLI into a specialized tool that analyzes project codebases and recommends tailored evaluation strategies using your comprehensive eval-framework. + +## **Why This Could Work Exceptionally Well** + +Your eval-framework is incredibly comprehensive and structured - it's essentially a decision tree system that maps: + +- Task types → Quality dimensions → Metrics → Implementation approaches +- Budget constraints → Resource allocation → Tool selection +- Evaluation context → Approach mix (automated/LLM-judge/human) + +This structured knowledge is perfect for automation. Instead of developers manually navigating decision trees, the tool would: + +1. **Analyze their codebase** to understand what they're building +2. **Apply the decision trees automatically** based on detected patterns +3. **Generate concrete recommendations** with implementation code + +## **Why Gemini CLI Over Codex** + +- **Superior Code Analysis**: Built-in `read_many_files`, `glob`, `grep` tools for codebase analysis +- **Flexible TypeScript Architecture**: Easier to implement complex recommendation logic +- **Rich Extension System**: Can add custom evaluation-specific tools +- **Better File Operations**: Native support for analyzing large codebases +- **1M+ token context**: Can analyze entire projects at once +- **Web Search Integration**: Built-in `web_search` tool for real-time research and validation + +## **Core Functionality** + +### 1. **Project Analysis Engine** + +- Scan codebase to identify AI/ML patterns (model calls, embeddings, RAG systems) +- Detect task types (Q&A, code generation, creative writing, summarization) +- Analyze existing evaluation code/tests +- Identify tech stack and deployment patterns +- Extract business context from documentation + +### 2. **Evaluation Strategy Recommender** + +- Map detected patterns to eval-framework decision trees +- Recommend priority metrics based on task type and context +- Suggest evaluation approaches (automated/LLM-judge/human mix) +- Provide budget allocation guidance +- Consider existing infrastructure and team capabilities +- **Real-time Research**: Use web search to validate recommendations against latest best practices +- **Cost Validation**: Search for current pricing of evaluation tools and services + +### 3. 
**Implementation Generator** + +- Generate evaluation code templates tailored to the project +- Create monitoring dashboards configs +- Suggest tool integrations (RAGAS, TruLens, etc.) +- Output complete evaluation pipelines +- Generate documentation and setup guides + +## **Implementation Plan** + +### **Phase 1: Core Analysis Tools** (2-3 weeks) + +- Create `analyze_ai_project` tool to scan codebases for AI patterns +- Implement decision tree logic from eval-framework as structured data +- Build metric recommendation engine +- Add quality dimension mapping functionality + +### **Phase 2: Recommendation System** (2-3 weeks) + +- Integrate eval-framework decision trees as structured data +- Build budget optimization logic based on cost-benefit calculator +- Create evaluation approach selector (automated/LLM-judge/human mix) +- Implement project-specific customization logic + +### **Phase 3: Code Generation** (2-3 weeks) + +- Template engine for evaluation code generation +- Pipeline configuration generators +- Integration templates for popular eval tools +- Monitoring setup automation +- Dashboard and alerting configurations + +### **Phase 4: Integration & Polish** (1-2 weeks) + +- Custom tools integration and testing +- MCP server for evaluation tools +- Documentation and comprehensive examples +- Validation against real projects + +## **Web Search Integration Strategy** + +### **Real-time Validation & Research** + +#### **1. Evaluation Methodology Updates** + +```typescript +// Search for latest research on specific evaluation approaches +await web_search("RAG evaluation best practices 2025 latest research papers"); +await web_search("LLM-as-judge evaluation reliability studies 2024-2025"); +await web_search("automated evaluation metrics correlation human judgment"); +``` + +#### **2. Tool & Service Pricing Validation** + +```typescript +// Validate current pricing and availability +await web_search("RAGAS evaluation framework pricing cost 2025"); +await web_search("OpenAI API pricing evaluation costs per token"); +await web_search("Anthropic Claude API pricing evaluation workloads"); +await web_search("LangSmith evaluation platform pricing plans"); +``` + +#### **3. 
Market Research for Tool Recommendations** + +```typescript +// Research latest tools and comparisons +await web_search("best LLM evaluation tools 2025 comparison matrix"); +await web_search( + "TruLens vs RAGAS vs LangSmith evaluation platform comparison", +); +await web_search("open source LLM evaluation frameworks 2025"); +``` + +### **Dynamic Recommendation Enhancement** + +#### **Web-Enhanced Decision Making** + +- **Base Recommendations**: Start with eval-framework decision trees +- **Web Validation**: Search for recent developments that might change recommendations +- **Cost Updates**: Get current pricing to refine budget estimates +- **Tool Availability**: Verify recommended tools are still maintained and available + +#### **Search Query Templates** + +```typescript +const searchQueries = { + taskSpecific: (taskType) => + `${taskType} evaluation best practices 2025 latest research`, + toolPricing: (tool) => `${tool} pricing cost evaluation workloads 2025`, + toolComparison: (tools) => + `${tools.join(" vs ")} comparison evaluation framework`, + industryTrends: (domain) => `${domain} AI evaluation trends 2025 metrics`, + budgetBenchmarks: (taskType, scale) => + `${taskType} evaluation budget ${scale} team cost analysis`, +}; +``` + +### **Enhanced Usage Examples with Web Search** + +#### **Live Market Research** + +```bash +gemini "analyze this customer support chatbot project, research current RAG evaluation trends, and recommend an evaluation strategy with current tool pricing" +``` + +#### **Competitive Analysis** + +```bash +gemini "what evaluation approach are leading companies using for code generation in 2025, and how does it compare to my current setup?" +``` + +#### **Budget Optimization with Real Data** + +```bash +gemini "research current evaluation service pricing and optimize my $10k evaluation budget for maximum ROI" +``` + +### **Web Search Integration Benefits** + +#### **Always Current Recommendations** + +- Recommendations reflect latest research and industry trends +- Tool suggestions based on current availability and community adoption +- Pricing estimates use real-time market data + +#### **Market Intelligence** + +- Understand how evaluation practices are evolving +- Identify emerging tools and methodologies +- Benchmark against industry standards + +#### **Risk Mitigation** + +- Avoid recommending discontinued or deprecated tools +- Validate that suggested approaches are still considered best practice +- Ensure cost estimates reflect current market rates + +## **Technical Architecture** + +### **New Custom Tools** + +```typescript +- analyze_ai_project: Detect AI patterns, task types, existing eval code +- recommend_metrics: Apply decision tree logic from eval-framework +- generate_eval_plan: Create comprehensive evaluation strategy +- generate_eval_code: Output implementation templates +- estimate_eval_costs: Budget and resource planning with live pricing +- setup_monitoring: Configure production monitoring +- research_eval_trends: Web search for latest evaluation methodologies +- validate_tool_recommendations: Check current availability and pricing of eval tools +``` + +### **Extension Structure** + +``` +.gemini/extensions/eval-framework/ +├── index.json (extension manifest) +├── decision-trees/ (structured eval-framework data) +├── quality-dimensions/ (metric definitions and thresholds) +├── templates/ (code generation templates) +├── cost-models/ (budget optimization logic) +└── tools/ (custom evaluation tools) +``` + +### **Data Integration** + +Convert 
eval-framework markdown into structured JSON: + +- Decision trees → Executable logic +- Quality dimensions → Metric definitions +- Cost models → Budget calculators +- Implementation guides → Code templates + +## **Usage Examples** + +### **Basic Analysis** + +```bash +gemini "analyze this AI project and recommend an evaluation strategy" +``` + +### **Specific Task Focus** + +```bash +gemini "this is a RAG system for customer support - what evaluation approach should I use?" +``` + +### **Implementation Generation** + +```bash +gemini "generate evaluation code for the recommended metrics with monitoring setup" +``` + +### **Budget Planning** + +```bash +gemini "estimate evaluation costs for this project with a $5k monthly budget" +``` + +### **Real-time Research & Validation** + +```bash +gemini "what are the latest evaluation trends for RAG systems in 2025?" +``` + +```bash +gemini "validate that RAGAS is still the best tool for RAG evaluation and check current pricing" +``` + +## **Key Benefits** + +### **For Developers** + +- **Automated Expertise**: Apply eval-framework knowledge without manual navigation +- **Project-Specific**: Tailored recommendations vs generic evaluation advice +- **Implementation Ready**: Generate actual code, not just recommendations +- **Cost Optimized**: Budget-aware metric selection with live pricing data +- **Always Current**: Real-time research ensures recommendations reflect latest best practices + +### **For Organizations** + +- **Consistency**: Standardized evaluation approaches across teams +- **Time Savings**: Weeks to hours for evaluation setup +- **Quality Improvement**: Evidence-based metric selection +- **ROI Optimization**: Better resource allocation through systematic planning + +## **Technical Feasibility** + +### **High Confidence Areas** + +- **Codebase Analysis**: Gemini's file tools excel at pattern detection +- **Decision Tree Logic**: Straightforward to implement as code +- **Template Generation**: Well-established patterns in CLI tools +- **Integration**: MCP provides clean extension points + +### **Medium Confidence Areas** + +- **Quality of Recommendations**: Depends on pattern detection accuracy +- **Code Generation Quality**: Templates need extensive testing +- **Web Search Quality**: Effectiveness depends on search result relevance and accuracy + +## **Success Metrics** + +- **Time Reduction**: Evaluation setup time reduced from weeks to hours +- **Quality Improvement**: Higher quality metric selection vs manual approaches +- **Cost Optimization**: Improved evaluation ROI through better resource allocation +- **Adoption**: Increased use of systematic evaluation practices + +## **Competitive Advantage** + +This would be the **first AI-powered evaluation consultant** that: + +- Understands both your specific project AND evaluation best practices +- Provides concrete implementation, not just advice +- Optimizes for real-world constraints (budget, team size, timeline) +- Stays current with latest evaluation research via web search integration +- Validates recommendations against real-time market data and pricing + +The combination of your comprehensive eval-framework with Gemini's codebase analysis capabilities could create a genuinely transformative tool for AI evaluation. 
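+
+## **Appendix: Decision Trees as Code (Sketch)**
+
+As a rough illustration of the "Decision trees → Executable logic" conversion described under Data Integration, the sketch below shows one possible shape for a converted decision-tree entry and a minimal selector over it. The field names (`taskType`, `priorityMetrics`, `approachMix`) are assumptions for illustration, not part of the existing eval-framework.
+
+```typescript
+// Hypothetical shape for one converted decision-tree entry (decision-trees/*.json)
+interface DecisionNode {
+  taskType: string; // e.g. "rag", "qa", "code_generation"
+  priorityMetrics: string[]; // metrics the framework ranks highest for this task type
+  approachMix: { automated: number; llmJudge: number; human: number };
+}
+
+// Minimal selector: map detected task types onto matching decision-tree entries
+function recommendMetrics(
+  detectedTaskTypes: string[],
+  tree: DecisionNode[],
+): DecisionNode[] {
+  return tree.filter((node) => detectedTaskTypes.includes(node.taskType));
+}
+
+// Placeholder data for illustration only
+const tree: DecisionNode[] = [
+  {
+    taskType: "rag",
+    priorityMetrics: ["answer_faithfulness", "context_relevance"],
+    approachMix: { automated: 0.6, llmJudge: 0.3, human: 0.1 },
+  },
+];
+
+console.log(recommendMetrics(["rag", "qa"], tree));
+```
+
+In the full tool, the selector would presumably also weigh budget constraints, team expertise, and deployment context before producing the final strategy, as outlined in the Evaluation Strategy Recommender section above.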
diff --git a/docs/eval-framework-mcp-server-spec.md b/docs/eval-framework-mcp-server-spec.md new file mode 100644 index 0000000..b313577 --- /dev/null +++ b/docs/eval-framework-mcp-server-spec.md @@ -0,0 +1,426 @@ +# Eval Framework MCP Server Specification + +## **Overview** + +A specialized MCP server that provides AI evaluation strategy recommendations by combining structured evaluation knowledge with real-time research capabilities. The server leverages existing MCP tools (web search, file operations) through prompts while providing specialized evaluation tools. + +## **Architecture Philosophy** + +### **MCP Server Tools**: Specialized evaluation logic only + +- Project analysis and pattern detection +- Evaluation strategy recommendation +- Code generation and templates +- Cost estimation and budget optimization + +### **Built-in Tool Integration**: Leverage existing capabilities via prompts + +- Web search for real-time validation +- File system operations for codebase analysis +- Code generation assistance + +## **Core MCP Server Tools** + +### **1. `analyze_ai_project`** + +**Purpose**: Detect AI patterns and classify project characteristics + +```typescript +{ + "name": "analyze_ai_project", + "description": "Analyze project structure to identify AI/ML patterns, task types, and existing evaluation approaches", + "inputSchema": { + "type": "object", + "properties": { + "project_summary": { + "type": "string", + "description": "Summary of project files and code patterns (obtained via file reading tools)" + }, + "business_context": { + "type": "string", + "description": "Optional business context about the project goals" + }, + "existing_eval_code": { + "type": "string", + "description": "Any existing evaluation code found in the project" + } + }, + "required": ["project_summary"] + } +} +``` + +**Output**: + +```json +{ + "task_types": ["rag", "qa", "code_generation"], + "ai_frameworks": ["langchain", "openai"], + "deployment_context": "production", + "team_size_estimate": "small", + "existing_evaluation": { + "has_tests": true, + "frameworks_used": ["pytest"], + "coverage": "basic" + }, + "complexity_score": 7, + "risk_factors": ["customer_facing", "high_volume"] +} +``` + +### **2. 
`recommend_evaluation_strategy`** + +**Purpose**: Apply decision tree logic to recommend evaluation approach + +```typescript +{ + "name": "recommend_evaluation_strategy", + "description": "Generate comprehensive evaluation strategy based on project analysis and constraints", + "inputSchema": { + "type": "object", + "properties": { + "project_analysis": { + "type": "object", + "description": "Output from analyze_ai_project" + }, + "budget_constraint": { + "type": "number", + "description": "Monthly evaluation budget in USD" + }, + "timeline": { + "type": "string", + "enum": ["immediate", "1-month", "3-months", "6-months"] + }, + "team_expertise": { + "type": "string", + "enum": ["beginner", "intermediate", "advanced"] + }, + "quality_requirements": { + "type": "string", + "enum": ["basic", "high", "critical"] + } + }, + "required": ["project_analysis"] + } +} +``` + +**Output**: + +```json +{ + "recommended_metrics": [ + { + "name": "answer_faithfulness", + "priority": "high", + "weight": 0.25, + "measurement_approach": "ragas + llm_judge", + "target_threshold": 0.95, + "estimated_cost_monthly": 800 + } + ], + "evaluation_mix": { + "automated": 0.6, + "llm_judge": 0.3, + "human": 0.1 + }, + "implementation_phases": [ + { + "phase": "immediate", + "metrics": ["basic_accuracy", "response_time"], + "effort_estimate": "1-2 days" + } + ], + "budget_allocation": { + "tooling": 1200, + "compute": 800, + "human_annotation": 400 + }, + "risk_assessment": "medium" +} +``` + +### **3. `generate_evaluation_implementation`** + +**Purpose**: Generate concrete implementation code and configurations + +```typescript +{ + "name": "generate_evaluation_implementation", + "description": "Generate evaluation code, configs, and monitoring setup based on strategy", + "inputSchema": { + "type": "object", + "properties": { + "evaluation_strategy": { + "type": "object", + "description": "Output from recommend_evaluation_strategy" + }, + "tech_stack": { + "type": "object", + "properties": { + "language": {"type": "string"}, + "frameworks": {"type": "array", "items": {"type": "string"}}, + "deployment": {"type": "string"} + } + }, + "output_format": { + "type": "string", + "enum": ["complete_implementation", "getting_started", "integration_only"] + } + }, + "required": ["evaluation_strategy", "tech_stack"] + } +} +``` + +**Output**: + +```json +{ + "implementation_files": [ + { + "filename": "evaluation/rag_evaluator.py", + "content": "# Generated evaluation code...", + "description": "Main RAG evaluation implementation using RAGAS" + }, + { + "filename": "monitoring/dashboard_config.yaml", + "content": "# Grafana dashboard config...", + "description": "Monitoring dashboard configuration" + } + ], + "setup_instructions": "Step-by-step setup guide...", + "dependencies": { + "python": ["ragas>=0.1.0", "langchain", "openai"], + "infrastructure": ["prometheus", "grafana"] + }, + "next_steps": [ + "Run initial evaluation on sample data", + "Set up monitoring dashboard", + "Configure alerting thresholds" + ] +} +``` + +### **4. 
`estimate_evaluation_costs`** + +**Purpose**: Provide detailed cost breakdown and ROI analysis + +```typescript +{ + "name": "estimate_evaluation_costs", + "description": "Calculate detailed cost estimates and ROI projections for evaluation strategy", + "inputSchema": { + "type": "object", + "properties": { + "evaluation_strategy": { + "type": "object", + "description": "Output from recommend_evaluation_strategy" + }, + "scale_parameters": { + "type": "object", + "properties": { + "requests_per_month": {"type": "number"}, + "evaluation_frequency": {"type": "string"}, + "team_size": {"type": "number"} + } + }, + "current_costs": { + "type": "object", + "description": "Optional current evaluation costs for comparison" + } + }, + "required": ["evaluation_strategy", "scale_parameters"] + } +} +``` + +## **Integration with Built-in Tools via Prompts** + +### **1. Project Analysis Prompt Pattern** + +Instead of building file reading into the MCP server, use prompts that leverage existing tools: + +```typescript +// MCP Server returns a suggested prompt for the AI assistant: +{ + "analysis_prompt": `Please analyze this project for AI evaluation planning: + +1. Use read_many_files to scan the codebase for: + - Model API calls (OpenAI, Anthropic, etc.) + - RAG implementations (vector stores, retrieval) + - Evaluation or testing code + - Documentation about AI features + +2. Use glob to find relevant files: + - "**/*.py" for Python AI code + - "**/requirements.txt" for dependencies + - "**/README.md" for project documentation + - "**/*test*.py" for existing evaluation code + +3. Summarize findings and call analyze_ai_project with the summary.`, + + "file_patterns": [ + "**/*.py", + "**/requirements.txt", + "**/package.json", + "**/README.md", + "**/*test*.py", + "**/*eval*.py" + ] +} +``` + +### **2. Research Validation Prompt Pattern** + +Leverage web search for real-time validation: + +```typescript +{ + "research_prompt": `Validate and enhance these evaluation recommendations: + +Current Recommendations: ${JSON.stringify(recommendations)} + +Please web_search for: +1. "${task_type} evaluation best practices 2025 latest research" +2. "${recommended_tools.join(' vs ')} comparison evaluation frameworks" +3. "evaluation budget ${budget_range} ${task_type} ROI analysis" + +Update recommendations based on findings and note any changes from current best practices.`, + + "search_queries": [ + "RAG evaluation best practices 2025 latest research", + "RAGAS vs TruLens vs LangSmith evaluation platform comparison", + "evaluation budget $5000 RAG system ROI analysis" + ] +} +``` + +### **3. Implementation Generation Prompt Pattern** + +Use AI assistant for code generation with templates: + +```typescript +{ + "implementation_prompt": `Generate evaluation implementation based on this strategy: + +Strategy: ${JSON.stringify(evaluation_strategy)} +Tech Stack: ${JSON.stringify(tech_stack)} + +Please: +1. Create evaluation code using the recommended metrics and frameworks +2. Include proper error handling and logging +3. Add monitoring and alerting configurations +4. Generate setup documentation +5. Include sample usage examples + +Use the implementation templates from the MCP server as starting points.`, + + "templates": { + "python_ragas": "# Template code for RAGAS evaluation...", + "monitoring_config": "# Template monitoring configuration...", + "setup_guide": "# Template setup instructions..." 
+ } +} +``` + +## **Server Implementation Structure** + +``` +eval-framework-mcp-server/ +├── package.json +├── src/ +│ ├── server.ts # MCP server entry point +│ ├── tools/ +│ │ ├── analyze-project.ts # Project analysis logic +│ │ ├── recommend-strategy.ts # Decision tree application +│ │ ├── generate-impl.ts # Code generation +│ │ └── estimate-costs.ts # Cost calculation +│ ├── data/ +│ │ ├── decision-trees.json # Eval framework decision logic +│ │ ├── quality-dimensions.json +│ │ ├── cost-models.json # Pricing data and formulas +│ │ └── templates/ # Implementation templates +│ │ ├── python/ +│ │ ├── typescript/ +│ │ └── monitoring/ +│ ├── prompts/ +│ │ ├── analysis-prompts.ts # Prompt templates for built-in tools +│ │ ├── research-prompts.ts +│ │ └── implementation-prompts.ts +│ └── utils/ +│ ├── pattern-detector.ts # AI pattern recognition +│ ├── metric-selector.ts # Decision tree logic +│ └── cost-calculator.ts # Budget optimization +├── README.md +└── examples/ + ├── basic-usage.md + └── integration-examples/ +``` + +## **Usage Flow** + +### **1. Initial Analysis** + +```bash +# User: "Analyze this project and recommend evaluation approach" + +# AI Assistant workflow: +1. Uses built-in tools to scan codebase (read_many_files, glob) +2. Calls analyze_ai_project with file summary +3. Calls recommend_evaluation_strategy with analysis + user constraints +4. Uses web_search to validate recommendations against latest trends +5. Presents comprehensive evaluation strategy +``` + +### **2. Implementation Generation** + +```bash +# User: "Generate the evaluation code for this strategy" + +# AI Assistant workflow: +1. Calls generate_evaluation_implementation with strategy +2. Uses returned templates and guidance to create complete implementation +3. Generates setup documentation and deployment configs +4. Provides step-by-step implementation guide +``` + +### **3. Cost Analysis** + +```bash +# User: "What will this evaluation approach cost with 10k requests/month?" + +# AI Assistant workflow: +1. Calls estimate_evaluation_costs with scale parameters +2. Uses web_search to validate current tool pricing +3. Presents detailed cost breakdown and ROI analysis +4. Suggests optimizations for budget constraints +``` + +## **Key Benefits of This Architecture** + +### **1. Focused Responsibility** + +- MCP server handles evaluation-specific logic only +- Leverages existing tools for general operations +- Clean separation of concerns + +### **2. Flexibility** + +- Works with any MCP-compatible client +- Can adapt to different built-in tool capabilities +- Easy to extend with new evaluation methodologies + +### **3. Maintainability** + +- Evaluation logic isolated and versioned separately +- No dependency on specific CLI implementations +- Community can contribute evaluation knowledge + +### **4. Real-time Adaptation** + +- Web search integration keeps recommendations current +- Can adapt to new tools and pricing +- Validates against latest research + +This architecture provides the best of both worlds: specialized evaluation expertise through the MCP server, combined with the full power of general-purpose AI tools for research, file operations, and implementation assistance.
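+
+## **Appendix: Server Entry Point Sketch**
+
+For orientation, here is a minimal sketch of what `src/server.ts` could look like when registering `analyze_ai_project`, assuming the official MCP TypeScript SDK (`@modelcontextprotocol/sdk`). The `analyzeProject` import is a hypothetical stand-in for the logic in `tools/analyze-project.ts`; a real entry point would register all four tools with their full input schemas.
+
+```typescript
+import { Server } from "@modelcontextprotocol/sdk/server/index.js";
+import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
+import {
+  CallToolRequestSchema,
+  ListToolsRequestSchema,
+} from "@modelcontextprotocol/sdk/types.js";
+import { analyzeProject } from "./tools/analyze-project.js"; // hypothetical helper
+
+const server = new Server(
+  { name: "eval-framework", version: "0.1.0" },
+  { capabilities: { tools: {} } },
+);
+
+// Advertise the specialized evaluation tools defined in this spec
+server.setRequestHandler(ListToolsRequestSchema, async () => ({
+  tools: [
+    {
+      name: "analyze_ai_project",
+      description:
+        "Analyze project structure to identify AI/ML patterns, task types, and existing evaluation approaches",
+      inputSchema: {
+        type: "object",
+        properties: { project_summary: { type: "string" } },
+        required: ["project_summary"],
+      },
+    },
+  ],
+}));
+
+// Dispatch tool calls to the evaluation-specific logic
+server.setRequestHandler(CallToolRequestSchema, async (request) => {
+  if (request.params.name === "analyze_ai_project") {
+    const analysis = analyzeProject(request.params.arguments ?? {});
+    return {
+      content: [{ type: "text", text: JSON.stringify(analysis, null, 2) }],
+    };
+  }
+  throw new Error(`Unknown tool: ${request.params.name}`);
+});
+
+await server.connect(new StdioServerTransport());
+```
+
+An MCP-compatible client (for example, Gemini CLI via its `mcpServers` setting) would then launch this entry point over stdio and combine its specialized tools with the built-in file and web-search tools described in the usage flows above.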