| title | Test Suite |
|---|---|
| version | 1.0.0-draft |
| last_updated | 2026-01-11 |
| status | Draft |
| category | Testing & Validation |
This document defines the automated test suite for validating LEAF spec implementations. All implementations must pass these tests to be considered compliant with the specification.
Challenge: Unlike traditional REST APIs, AI applications are non-deterministic. This test suite balances strict functional testing with flexible AI quality assessment.
- Functional Tests - API behavior, data integrity, security
- AI Quality Tests - Response relevance, citation accuracy, hallucination detection
- Performance Tests - Response times, throughput
- Integration Tests - End-to-end workflows
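Each functional test below reduces to the same shape: send a request, then assert on the status code, required response fields, and forbidden fields. A minimal sketch of such an assertion helper, assuming responses are already parsed into dicts (the helper and its names are illustrative, not part of the spec; a real runner would add the HTTP plumbing):

```python
def _lookup(body: dict, path: str):
    """Resolve a dotted path like 'user.id' inside a nested dict, or None."""
    node = body
    for key in path.split("."):
        node = node.get(key) if isinstance(node, dict) else None
    return node

def check_response(resp: dict, status: int, required: list, forbidden: tuple = ()) -> list:
    """Return a list of failure messages; an empty list means the test passed."""
    failures = []
    if resp.get("status") != status:
        failures.append(f"expected status {status}, got {resp.get('status')}")
    for path in required:
        if _lookup(resp.get("body", {}), path) is None:
            failures.append(f"missing required field: {path}")
    for path in forbidden:
        if _lookup(resp.get("body", {}), path) is not None:
            failures.append(f"forbidden field present: {path}")
    return failures

# Example: the register test expects 201, user fields plus a token, and no password
resp = {"status": 201,
        "body": {"user": {"id": "u1", "email": "test@example.com"}, "token": "jwt"}}
print(check_response(resp, 201, ["user.id", "user.email", "token"], ("user.password",)))  # []
```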
POST /api/auth/register
Body: {
"email": "test@example.com",
"password": "SecurePass123!",
"displayName": "Test User" // optional
}
Expected:
- Status: 201
- Response includes: user.id, user.email, user.displayName, token
- Token is valid JWT or equivalent
- Password NOT in response

POST /api/auth/register (with same email)
Expected:
- Status: 400 or 409
- Error code: CONFLICT or VALIDATION_ERROR
- Message indicates duplicate email

POST /api/auth/login
Body: {
"email": "test@example.com",
"password": "SecurePass123!"
}
Expected:
- Status: 200
- Response includes: user, token
- Token works for authenticated endpoints

POST /api/auth/login
Body: {
"email": "test@example.com",
"password": "WrongPassword"
}
Expected:
- Status: 401
- Error code: UNAUTHORIZED

GET /api/user
Headers: Authorization: Bearer <token>
Expected:
- Status: 200
- Response includes: user.id, user.email, user.preferences, user.usage
- Password NOT in response

GET /api/user
Headers: Authorization: Bearer invalid_token
Expected:
- Status: 401
- Error code: UNAUTHORIZED

POST /api/documents
Content-Type: multipart/form-data
Body: file=test.pdf, title="Test Document", tags=["test"]
Expected:
- Status: 201
- Response includes: document.id, status="processing", title, contentType, size
- Document ID is unique

POST /api/documents
Body: {
"title": "Text Note",
"content": "This is test content about neural networks.",
"contentType": "text/markdown",
"tags": ["test", "notes"]
}
Expected:
- Status: 201
- Response includes: document.id, status="processing"

// Wait for processing (poll or webhook)
GET /api/documents/:id
Expected (within 30s):
- status transitions to "ready"
- chunkCount > 0
- processedAt is set

GET /api/documents?limit=10&offset=0
Expected:
- Status: 200
- Response includes: documents[], pagination.total, pagination.hasMore
- Documents sorted by createdAt desc

GET /api/documents?tag=test
Expected:
- Status: 200
- All returned documents have "test" tag

GET /api/documents/:id
Expected:
- Status: 200
- Response includes: all document fields, content (if text-based)

PUT /api/documents/:id
Body: {
"title": "Updated Title"
}
Expected:
- Status: 200
- Document title updated
- updatedAt changed

DELETE /api/documents/:id
Expected:
- Status: 204
- Document no longer retrievable
- Embeddings removed from vector DB

// User A creates document
// User B tries to access it
GET /api/documents/:userA_document_id
Headers: Authorization: Bearer <userB_token>
Expected:
- Status: 404 or 403
- Cannot access other user's data

POST /api/conversations
Body: {
"title": "Test Conversation",
"documentIds": ["doc_123"]
}
Expected:
- Status: 201
- Response includes: conversation.id, title, documentIds, messageCount=0

GET /api/conversations
Expected:
- Status: 200
- Response includes: conversations[], pagination

GET /api/conversations/:id
Expected:
- Status: 200
- Response includes: conversation, messages[]

PUT /api/conversations/:id
Body: {
"title": "New Title"
}
Expected:
- Status: 200
- Title updated

DELETE /api/conversations/:id
Expected:
- Status: 204
- Conversation and all messages deleted

POST /api/conversations/:id/messages
Body: {
"content": "What is a neural network?",
"stream": false
}
Expected:
- Status: 201
- Response includes: userMessage, assistantMessage
- assistantMessage has: content, citations[], tokenUsage
- Citations reference valid documents

POST /api/conversations/:id/messages
Body: {
"content": "What is a neural network?",
"stream": true
}
Expected:
- Status: 200
- Content-Type: text/event-stream
- Events received in order:
1. message_start
2. Multiple content_delta
3. citations (optional)
4. message_end
5. done
- Stream closes properly

// Client disconnects mid-stream
Expected:
- Server detects disconnect
- Cleanup resources
- No memory leaks

POST /api/search
Body: {
"query": "neural network architecture",
"limit": 5
}
Expected:
- Status: 200
- Response includes: results[] with documentId, content, relevanceScore
- Results ranked by relevance (descending)
- relevanceScore between 0.0 and 1.0

POST /api/search
Body: {
"query": "machine learning",
"documentIds": ["doc_123"],
"limit": 10
}
Expected:
- Status: 200
- All results from doc_123 only

POST /api/search
Body: {
"query": "neural networks",
"minRelevance": 0.8
}
Expected:
- Status: 200
- All results have relevanceScore >= 0.8

POST /api/documents/:id/summarize
Body: {
"length": "medium",
"focus": "key_points"
}
Expected:
- Status: 200
- Response includes: summary.content, tokenUsage
- Content length appropriate for "medium"

POST /api/summaries
Body: {
"documentIds": ["doc_123", "doc_456"],
"query": "What do these documents say about AI?",
"length": "long"
}
Expected:
- Status: 200
- Response includes: summary.content, citations[]
- Citations reference both documents

POST /api/documents
Body: { invalid: "data" }
Expected:
- Status: 400 or 422
- Error includes: code, message, details

POST /api/auth/register
Body: { "email": "test@example.com" } // missing password
Expected:
- Status: 400 or 422
- Error indicates missing field

POST /api/documents
Body: file=image.exe (invalid type)
Expected:
- Status: 400 or 415
- Error indicates unsupported file type

POST /api/documents
Body: file=huge_file.pdf (>50MB)
Expected:
- Status: 413
- Error code: PAYLOAD_TOO_LARGE

// Send 1001 requests rapidly
Expected:
- First 1000 succeed
- 1001st returns 429
- Response includes: X-RateLimit-* headers
- Error includes retry-after

These tests use standardized test documents and evaluate AI response quality.
Document A: "Neural Networks Basics"
- Content includes: definitions, architecture, training
- ~2000 words, 5 pages
Document B: "Machine Learning Overview"
- Content includes: supervised learning, unsupervised learning, ML types
- ~1500 words, 4 pages
Document C: "AI Ethics"
- Content includes: bias, fairness, accountability
- ~1800 words, 4 pages
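The citation checks in the tests below largely reduce to a containment test: does the cited excerpt actually appear in the claimed source document? A minimal sketch; the whitespace-normalized exact match is an assumption, and a real suite might fuzzy-match instead:

```python
import re

def excerpt_in_document(excerpt: str, document_text: str) -> bool:
    """True if the cited excerpt appears in the source, ignoring case and whitespace."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(excerpt) in norm(document_text)

doc_a = "A neural network is a model composed of layers of connected units."
print(excerpt_in_document("a neural  network is a model", doc_a))   # True
print(excerpt_in_document("quantum computers use qubits", doc_a))   # False
```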
POST /api/conversations/:id/messages
Body: {
"content": "What is a neural network?"
}
Evaluation:
✅ PASS: Response includes at least one citation
✅ PASS: Citation references Document A
✅ PASS: Citation excerpt found in Document A
❌ FAIL: No citations or incorrect document

POST /api/conversations/:id/messages
Body: {
"content": "How do neural networks and machine learning relate?"
}
Evaluation:
✅ PASS: Citations include both Document A and B
✅ PASS: Citations relevant to question
❌ FAIL: Missing one document or irrelevant citations

POST /api/conversations/:id/messages
Body: {
"content": "What does the document say about quantum computing?"
}
Evaluation:
✅ PASS: Response says "no information found" or similar
✅ PASS: No citations, or citations explicitly state uncertainty
❌ FAIL: Hallucinates quantum computing content
❌ FAIL: Citations don't match claimed source

POST /api/conversations/:id/messages
Body: {
"content": "Explain the backpropagation algorithm"
}
Evaluation (manual or LLM-as-judge):
✅ PASS: Response discusses backpropagation
✅ PASS: Information from documents
❌ FAIL: Off-topic response

// Message 1
POST /api/conversations/:id/messages
Body: { "content": "What is a neural network?" }
// Message 2 (follow-up)
POST /api/conversations/:id/messages
Body: { "content": "How is it trained?" }
Evaluation:
✅ PASS: Response understands "it" refers to neural network
✅ PASS: Discusses training methods
❌ FAIL: Loses context, asks "what is 'it'?"

POST /api/conversations/:id/messages
Body: {
"content": "What datasets are used in the documents?"
}
Evaluation:
✅ PASS: Only mentions datasets actually in documents
✅ PASS: Says "not mentioned" if not in docs
❌ FAIL: Invents datasets not in source

POST /api/documents/:id/summarize
Body: { "length": "short", "focus": "key_points" }
Evaluation:
✅ PASS: Summary captures main themes
✅ PASS: Length ~100-200 words
✅ PASS: No fabricated information
❌ FAIL: Missing key points or hallucinated content

POST /api/search
Body: {
"query": "training neural networks",
"limit": 5
}
Evaluation:
✅ PASS: Top results discuss neural network training
✅ PASS: Relevance scores > 0.6 for top 3
❌ FAIL: Irrelevant results ranked highly

Measure time to receive 201 response
Target: < 1 second
Acceptable: < 3 seconds

Measure time to first content_delta event
Target: < 500ms
Acceptable: < 2 seconds

Measure time to receive search results
Target: < 1 second
Acceptable: < 3 seconds

Send 10 concurrent message requests
Evaluation:
✅ PASS: All complete successfully
✅ PASS: Average response time < 5s
❌ FAIL: Timeouts or errors

1. Register new user
2. Upload 3 documents (PDF, TXT, MD)
3. Wait for processing
4. Create conversation
5. Send question
6. Verify response with citations
7. Send follow-up question
8. Verify context maintained
9. Search across documents
10. Generate summary
Expected:
✅ All steps succeed
✅ Data consistent across operations

1. Upload document
2. Wait for processing
3. Query document content
4. Update document content
5. Wait for reprocessing
6. Query again (should reflect update)
7. Delete document
8. Verify removed from search
Expected:
✅ All steps succeed
✅ Updates reflected correctly
✅ Deletion complete

// User A uploads document
// User B attempts access
Expected:
❌ FAIL if User B can access User A's document
✅ PASS if properly isolated

POST /api/conversations/:id/messages
Body: {
"content": "'; DROP TABLE users; --"
}
Expected:
✅ PASS: Treated as regular query
✅ PASS: No database corruption

POST /api/documents
Body: {
"title": "<script>alert('xss')</script>",
"content": "Test"
}
Expected:
✅ PASS: Script tags escaped in responses
✅ PASS: Not executed in frontend

These three scenarios test advanced architectural capabilities that reveal framework maturity. All implementations MUST handle these correctly to be considered production-ready.
Why "Turing Tests"? Just as the Turing Test evaluates AI intelligence, these tests evaluate an implementation's ability to handle AI edge cases intelligently - not just the happy path.
Scenario: User asks a question with no relevant documents in the knowledge base.
Purpose: Tests business logic intervention in the AI pipeline. Implementations must prevent the LLM from fabricating answers when there's no supporting context.
// 1. Upload test document
POST /api/documents
Body: file=lotr.pdf // The Lord of the Rings (full text)
// 2. Wait for processing
GET /api/documents/:id
Expected: status="ready"
// 3. Ask unrelated question
POST /api/conversations/:id/messages
Body: {
"content": "What is the capital of Mars?"
}

Retrieval Phase:
// Vector search runs
// Returns chunks about Middle-earth geography
// Highest similarity score: 0.35 (far below threshold)

Logic Check:
if (highestSimilarityScore < 0.75) {
// MUST NOT call LLM generation
return {
"content": "I cannot find this information in your knowledge base.",
"citations": [],
"confidence": "none"
}
}

❌ FAIL - Hallucinates an answer
{
"content": "The capital of Mars is Olympus City...",
"citations": [{ "documentId": "lotr.pdf" }] // Invalid citation
}

⚠️ PARTIAL - Better, but still an unnecessary LLM call
{
"content": "Based on the documents, I don't have information about Mars, but...",
"citations": [] // Better, but still unnecessary LLM call
}

✅ PASS - Returns canned "not found" response without LLM call
{
"content": "I cannot find this information in your knowledge base.",
"citations": [],
"confidence": "none",
"retrievedChunks": 5,
"maxSimilarity": 0.35 // Shows why it failed
}

Why this matters:
- Cost: Prevents expensive LLM calls for irrelevant queries
- Trust: Prevents hallucinated citations that break user confidence
- Architecture: Tests business logic layer between retrieval and generation
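The guard described above can be sketched as a pre-generation check. This is illustrative only: the 0.75 threshold comes from the Logic Check pseudocode, and `generate` stands in for whatever LLM call the implementation uses.

```python
def answer(retrieved_chunks: list, generate, threshold: float = 0.75) -> dict:
    """Return a canned response, without calling the LLM, when retrieval is too weak."""
    max_sim = max((c["similarity"] for c in retrieved_chunks), default=0.0)
    if max_sim < threshold:
        return {
            "content": "I cannot find this information in your knowledge base.",
            "citations": [],
            "confidence": "none",
            "retrievedChunks": len(retrieved_chunks),
            "maxSimilarity": max_sim,   # shows why the guard fired
        }
    return generate(retrieved_chunks)   # only reached with adequate context

# Mars question against LOTR chunks: best similarity 0.35, so no LLM call is made
chunks = [{"similarity": 0.35, "text": "Middle-earth geography..."}]
result = answer(chunks, generate=lambda c: {"content": "should never run"})
print(result["confidence"])  # none
```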
Scenario: User asks a broad question that retrieves more context than fits in the model's context window.
Purpose: Tests complex orchestration and context management strategies. Implementations must handle token limits gracefully without silently truncating.
// 1. Upload 20 PDF documents (total ~500 pages)
POST /api/documents (20 times with different files)
// Files: financial_reports_2020-2024.pdf (5 years × 4 quarters)
// 2. Wait for all processing
GET /api/documents
Expected: All 20 documents with status="ready"
// 3. Ask broad question
POST /api/conversations/:id/messages
Body: {
"content": "Summarize all financial projections across all quarterly reports."
}

Retrieval Phase:
// Vector search returns 50+ relevant chunks
// Total tokens: ~20,000 (retrieved chunks)
// Model limit: 8,000 tokens (context window)
// Problem: Cannot fit all context

Context Management Strategy (Implementation Must Choose One):
Option A: Map-Reduce
1. Summarize each document individually (5 passes)
2. Combine summaries into final answer (1 pass)
3. Total: 6 LLM calls, fits within limits
Response includes:
{
"content": "Across all quarterly reports...",
"citations": [...], // From all 20 documents
"strategy": "map-reduce",
"passes": 6
}

Option B: Hierarchical Ranking
1. Rank chunks by similarity
2. Select top N chunks that fit in context
3. Generate answer from selected chunks
4. Document which sources were prioritized
Response includes:
{
"content": "Based on the most relevant sections...",
"citations": [...], // From top 8 documents only
"strategy": "ranked-selection",
"selectedDocuments": 8,
"totalDocuments": 20
}

Option C: Iterative Refinement
1. Generate initial summary from top 10 chunks
2. Review remaining chunks against initial summary
3. Refine answer with new information
4. Multiple passes until convergence
Response includes:
{
"content": "Financial projections show...",
"citations": [...],
"strategy": "iterative-refinement",
"iterations": 3
}

❌ FAIL - Crashes or returns error
{
"error": "CONTEXT_LIMIT_EXCEEDED",
"message": "Too many results"
}

❌ FAIL - Silently truncates context
{
"content": "Based on Q1 2024 report...", // Only used first chunk
"citations": [/* only 1 document */] // Lost 19 documents
}
// No indication that context was limited

✅ PASS - Implements explicit context management
{
"content": "Comprehensive summary...",
"citations": [/* from multiple documents */],
"metadata": {
"strategy": "map-reduce", // or other strategy
"totalChunks": 50,
"processedChunks": 50,
"contextManagement": "applied"
}
}

Why this matters:
- Completeness: Users expect answers that consider all relevant documents
- Transparency: Shows how implementation handles real-world constraints
- Architecture: Reveals orchestration patterns for multi-step AI workflows
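Option B above (ranked selection) is the simplest strategy to sketch: greedily keep the highest-similarity chunks that fit the token budget and report what was dropped. Token counts are assumed to be precomputed per chunk; a real implementation would use the model's tokenizer.

```python
def select_chunks(chunks: list, budget_tokens: int) -> dict:
    """Greedily keep the highest-similarity chunks that fit the context window."""
    ranked = sorted(chunks, key=lambda c: c["similarity"], reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk["tokens"] <= budget_tokens:
            selected.append(chunk)
            used += chunk["tokens"]
    return {
        "strategy": "ranked-selection",
        "selected": selected,
        "usedTokens": used,
        "selectedDocuments": len({c["documentId"] for c in selected}),
        "totalDocuments": len({c["documentId"] for c in chunks}),
    }

# 50 retrieved chunks x 400 tokens = ~20,000 tokens vs an 8,000-token window
chunks = [{"documentId": f"doc_{i}", "similarity": 1 - i * 0.01, "tokens": 400}
          for i in range(50)]
plan = select_chunks(chunks, budget_tokens=8000)
print(plan["selectedDocuments"], "of", plan["totalDocuments"])  # 20 of 50
```

Note the response metadata: surfacing `selectedDocuments` vs `totalDocuments` is exactly what separates the PASS case from silent truncation.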
Scope: This test is for OPTIONAL EXTENSION implementations only.
- ✅ Core Certification: Does NOT require this test - text-only is fully compliant
- 🟢 Multimodal Certification: Requires passing this test for extended cert
Scenario: User uploads a non-text file (image, chart, diagram) that requires vision model processing.
Purpose: Tests chaining different model modalities and extracting knowledge from non-text formats.
// 1. Upload image file
POST /api/documents
Body: file=quarterly_revenue_chart.png
// PNG chart showing bar graph of Q1-Q4 revenue
Expected:
- Status: 202 Accepted
- Document created with mime_type="image/png"

Processing Pipeline:
1. Detect MIME type: image/png
2. Route to vision model processor
3. Call vision model (e.g., GPT-4o, Claude 3.5 with vision)
Prompt: "Describe this image in detail for retrieval purposes.
Extract all text, data, and visual information."
4. Store description as searchable text
5. Generate embedding from description
6. Mark document as status="ready"

Vision Model Output:
This is a bar chart showing quarterly revenue for 2024:
Q1: $2.3M (blue bar)
Q2: $2.8M (blue bar)
Q3: $3.1M (blue bar)
Q4: $3.5M (blue bar, projected)
The chart shows 35% year-over-year growth.
The title reads "2024 Revenue Projections".
Queryability Test:
// User should be able to query the image content
POST /api/conversations/:id/messages
Body: {
"content": "What was Q3 revenue in 2024?"
}
Expected:
{
"content": "According to the revenue chart, Q3 2024 revenue was $3.1M.",
"citations": [{
"documentId": "quarterly_revenue_chart.png",
"content": "Q3: $3.1M (blue bar)..."
}]
}

❌ FAIL - Rejects image files
{
"error": "UNSUPPORTED_FILE_TYPE",
"message": "Only text and PDF files supported"
}

❌ FAIL - Accepts the image but never processes it
// Document created but not processed
// User queries return no results
// No vision model integration

✅ PASS - Extracts and embeds image description
// Document processed with vision model
// Description stored and embedded
// Queries return relevant information
// Citations link back to image file

Tier 1 (Required for "Multimodal" certification):
- ✅ PNG, JPG image support
- ✅ Vision model integration
- ✅ Text extraction from images
- ✅ Searchable image descriptions
Tier 2 (Optional):
- 🔶 OCR for scanned documents
- 🔶 Chart/diagram understanding
- 🔶 Audio transcription
- 🔶 Video analysis
Note: Implementations CAN choose to be "text-only" and skip this test. Mark as:
- 🟡 "Text-Only Implementation" - Valid, but limited scope
- 🟢 "Multimodal Implementation" - Passes this test
Why this matters:
- Real-World Usage: Users upload screenshots, charts, whiteboards
- Model Chaining: Tests ability to orchestrate multiple AI models
- Format Flexibility: Shows implementation can handle diverse inputs
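The routing step in the pipeline above is a plain dispatch on MIME type. A sketch under stated assumptions: `vision_describe` and `extract_text` stand in for whatever vision-model and text-extraction calls an implementation wires up.

```python
def process_document(mime_type: str, payload, vision_describe=None, extract_text=None) -> dict:
    """Route an upload to the right processor and return indexable text."""
    if mime_type.startswith("image/"):
        if vision_describe is None:
            raise ValueError("UNSUPPORTED_FILE_TYPE: no vision model configured")
        text = vision_describe(payload)      # e.g. "bar chart ... Q3: $3.1M (blue bar)"
    elif mime_type in ("text/plain", "text/markdown", "application/pdf"):
        text = extract_text(payload)
    else:
        raise ValueError(f"UNSUPPORTED_FILE_TYPE: {mime_type}")
    # The description is embedded like any other chunk, so image content is queryable.
    return {"status": "ready", "searchable_text": text}

doc = process_document("image/png", b"...",
                       vision_describe=lambda p: "Q3: $3.1M (blue bar)")
print(doc["status"], "-", doc["searchable_text"])
```

A text-only implementation is simply this function with `vision_describe=None`, which is why rejecting images cleanly (rather than silently ignoring them) is still valid for the core certification.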
| Test | Critical? | Purpose | Passing Score |
|---|---|---|---|
| Hallucination Guard | 🔴 Required | Prevent fabricated answers | Must pass |
| Context Squeeze | 🟡 Important | Handle token limits gracefully | 80% score |
| Multimodal Ingestion | 🟢 Optional | Chain vision models (if supported) | N/A for text-only |
Certification Requirements:
- ✅ Text-Only Certification: Pass Hallucination Guard + Context Squeeze
- ✅ Multimodal Certification: Pass all three tests
# Install test runner
npm install -g ai-realworld-tests
# Configure API URL
export API_URL=http://localhost:3000/api
# Run full test suite
ai-realworld-tests run --all
# Run specific category
ai-realworld-tests run --functional
ai-realworld-tests run --ai-quality
ai-realworld-tests run --performance
# Run with test documents
ai-realworld-tests run --documents=./test-docs

LEAF Test Suite v1.0.0
======================
Functional Tests: 45/45 passed ✅
AI Quality Tests: 8/10 passed ⚠️
- Citation Exists: PASS
- Multiple Citations: PASS
- No Hallucination: FAIL (hallucinated quantum computing)
- Response Relevance: PASS
- Context Awareness: PASS
- Grounding: PASS
- Summary Accuracy: FAIL (too short)
- Search Relevance: PASS
Performance Tests: 5/6 passed ⚠️
- Upload Response: PASS (650ms)
- First Token: PASS (420ms)
- Search Response: FAIL (3.2s - exceeds 3s acceptable)
Integration Tests: 2/2 passed ✅
Security Tests: 5/5 passed ✅
Overall: 65/68 tests passed (95.6%)
Certification: COMPLIANT ✅
(Minimum 90% required, all critical tests passed)
To be certified compliant, implementations must:
- ✅ Pass 100% of functional tests (critical)
- ✅ Pass 80%+ of AI quality tests
- ✅ Pass 100% of security tests (critical)
- ✅ Pass 80%+ of integration tests
- 🔶 Pass 70%+ of performance tests (recommended)
Beyond pass/fail test results, reviewers and evaluators should assess implementations on these architectural dimensions. This rubric helps compare frameworks objectively and reveals hidden complexity.
Question: How well does the framework handle asynchronous AI operations?
Test Methodology:
- Upload 50MB PDF
- Measure API response time (should be < 200ms for 202 Accepted)
- Check if background job system is visible
- Monitor CPU/memory during processing
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Response < 200ms
- Native job queue support
- Observable job status
- No main thread blocking
- Example: FastAPI + Celery, Next.js + BullMQ
⭐⭐⭐⭐ Good (7/10 points)
- Response < 1s
- Job queue via library
- Some observability
- Example: Express + custom workers
⭐⭐ Poor (3/10 points)
- Response > 5s
- Synchronous processing
- Blocks main thread
- Example: Basic Flask without workers
Key Insight: Frameworks with native async primitives (Go, Rust, Node) vs. frameworks requiring external workers (Python WSGI)
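The pattern being graded here, return 202 immediately and do the heavy work off the request path, can be sketched framework-free with a queue and a worker thread. This is a toy stand-in: any real implementation would use a durable job system (Celery, BullMQ, etc.) rather than an in-process thread.

```python
import queue
import threading
import time

jobs = queue.Queue()
status = {}

def worker() -> None:
    """Background worker: pulls document IDs and 'processes' them."""
    while True:
        doc_id = jobs.get()
        time.sleep(0.1)              # stands in for chunking + embedding
        status[doc_id] = "ready"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def upload(doc_id: str) -> dict:
    """Handler returns in microseconds; processing happens on the worker."""
    status[doc_id] = "processing"
    jobs.put(doc_id)
    return {"status_code": 202, "document": {"id": doc_id, "status": "processing"}}

print(upload("doc_1")["status_code"])  # 202
jobs.join()                            # in a test, wait for the background job
print(status["doc_1"])                 # ready
```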
Question: How verbose and complex is vector similarity code?
Test Methodology:
- Count lines of code for implementing similarity search
- Check if ORM has native vector support
- Evaluate if raw SQL is required
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Native ORM vector support (< 10 lines)
- Example: Supabase JS client with pgvector, Prisma with vector extension
const results = await db.chunks.findMany({
orderBy: vector.l2Distance(embedding, 'asc'),
take: 5
})

⭐⭐⭐⭐ Good (7/10 points)
- Library abstraction (< 50 lines)
- Example: LangChain vector stores
vectorstore = Chroma(...)
results = vectorstore.similarity_search(query, k=5)

⭐⭐ Poor (3/10 points)
- Raw SQL required (> 100 lines)
SELECT id, content,
1 - (embedding <=> '[0.1,0.2,...]'::vector) AS similarity
FROM chunks
ORDER BY embedding <=> '[0.1,0.2,...]'::vector
LIMIT 5;

Key Insight: ORMs without vector support force SQL, revealing database abstraction limitations
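Whatever the abstraction level, the operation underneath is the same: score every chunk embedding against the query embedding and keep the top k. A dependency-free sketch, for orientation only (real systems push this into the database or a vector index rather than Python):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query: list, chunk_embeddings: dict, k: int = 5) -> list:
    """Rank chunk embeddings by cosine similarity to the query embedding."""
    scored = [(cid, cosine(query, emb)) for cid, emb in chunk_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

chunks = {"c1": [1.0, 0.0], "c2": [0.7, 0.7], "c3": [0.0, 1.0]}
print(top_k([1.0, 0.1], chunks, k=2))  # c1 first, then c2
```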
Question: Does the framework fight against or embrace streaming?
Test Methodology:
- Implement SSE endpoint for streaming chat
- Check for buffering issues
- Test with different HTTP servers
- Monitor for manual flush calls
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Native streaming support
- No configuration required
- No buffering issues
- Example: FastAPI with StreamingResponse, Next.js Route Handlers
async def stream():
for chunk in llm.stream():
yield chunk
return StreamingResponse(stream())

⭐⭐⭐⭐ Good (7/10 points)
- Requires middleware or configuration
- Works reliably after setup
- Example: Express with compression middleware disabled
⭐⭐ Poor (3/10 points)
- Buffering issues
- Requires manual flush calls
- Framework fights against streaming
- Example: Python WSGI servers (Gunicorn, uWSGI)
Key Insight: ASGI vs WSGI, frameworks designed for streaming vs retrofitted
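The event ordering required by the streaming test (message_start, one or more content_delta, optional citations, message_end, done) can be checked with a small validator. A sketch; event names follow the test above, and the relaxed position of `citations` is a simplification:

```python
def validate_event_order(events: list) -> bool:
    """Check an SSE sequence: message_start, content_delta+, [citations], message_end, done."""
    if not events or events[0] != "message_start" or events[-1] != "done":
        return False
    middle = events[1:-1]
    if not middle or middle[-1] != "message_end":
        return False
    body = middle[:-1]                 # everything between start and end markers
    has_deltas = any(e == "content_delta" for e in body)
    only_known = all(e in ("content_delta", "citations") for e in body)
    return has_deltas and only_known

ok = ["message_start", "content_delta", "content_delta", "citations", "message_end", "done"]
bad = ["message_start", "message_end", "done"]   # no content deltas at all
print(validate_event_order(ok), validate_event_order(bad))  # True False
```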
Question: How easy is it to swap AI providers (OpenAI → Anthropic)?
Test Methodology:
- Change from OpenAI to Anthropic
- Count files that need modification
- Evaluate if interface/adapter pattern is natural
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Interface-based design
- Change 1 config value or environment variable
- Example: Implementation uses provider interfaces, swap via DI
// .env change: LLM_PROVIDER=anthropic
// No code changes needed

⭐⭐⭐⭐ Good (7/10 points)
- Adapter pattern
- Change 1-2 classes
- Clear abstraction boundary
⭐⭐ Poor (3/10 points)
- Find/replace throughout codebase
- Provider-specific code scattered
- No abstraction layer
Key Insight: Languages with interfaces (TypeScript, Java, Go) vs dynamic languages (Python, Ruby)
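The interface-plus-registry pattern being graded can be sketched in a few lines. The `LLM_PROVIDER` variable and provider names come from the example above; the classes are illustrative stubs, not real API clients.

```python
import os
from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIProvider:
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"      # a real impl would call the OpenAI API

class AnthropicProvider:
    def complete(self, prompt: str) -> str:
        return f"[anthropic] {prompt}"   # a real impl would call the Anthropic API

PROVIDERS = {"openai": OpenAIProvider, "anthropic": AnthropicProvider}

def get_provider() -> LLMProvider:
    """Swap providers via one environment variable; no other code changes."""
    return PROVIDERS[os.environ.get("LLM_PROVIDER", "openai")]()

os.environ["LLM_PROVIDER"] = "anthropic"
print(get_provider().complete("hello"))  # [anthropic] hello
```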
Question: How are AI errors (rate limits, model failures) surfaced?
Test Methodology:
- Trigger rate limit error (too many requests)
- Trigger model error (invalid request)
- Check error messages returned to user
- Evaluate retry logic
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Typed error system (Result<T, E> or similar)
- Graceful degradation
- Retry logic with exponential backoff
- User-friendly error messages
return {
error: {
code: "RATE_LIMIT_EXCEEDED",
message: "Too many requests. Try again in 5 minutes.",
retryAfter: 300
}
}

⭐⭐⭐⭐ Good (7/10 points)
- Try/catch with custom messages
- Some retry logic
- Errors logged
⭐⭐ Poor (3/10 points)
- Silent failures
- Raw stack traces to user
- No retry mechanism
Key Insight: Type systems (Rust, TypeScript) vs runtime errors (Python, JavaScript)
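The retry-with-backoff plus typed-error behavior graded above can be sketched as a small wrapper. `RateLimitError` is a stand-in for a provider's 429 response; the error payload mirrors the example above.

```python
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 / rate-limit response."""

def with_retries(call, max_attempts: int = 3, base_delay: float = 0.01) -> dict:
    """Retry transient rate-limit failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "data": call()}
        except RateLimitError:
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)   # 10ms, 20ms, 40ms...
    return {"ok": False,
            "error": {"code": "RATE_LIMIT_EXCEEDED",
                      "message": "Too many requests. Try again later.",
                      "retryAfter": 300}}

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return "answer"

result = with_retries(flaky)
print(result)  # succeeds on the third attempt
```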
Question: Can users see token usage and costs?
Test Methodology:
- Check if token counts are returned in API
- Evaluate if costs are calculated
- Check for dashboard or tracking
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- Real-time cost dashboard
- Token counts in every response
- Historical cost tracking
- Alerts for budget limits
⭐⭐⭐⭐ Good (7/10 points)
- Token counts in API response metadata
- Basic cost calculation
{
"content": "...",
"usage": {
"promptTokens": 1234,
"completionTokens": 567,
"totalCost": 0.0234
}
}

⭐⭐ Poor (3/10 points)
- No visibility into usage
- No cost tracking
- Users blind to expenses
Key Insight: Production-ready vs proof-of-concept implementations
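Token-level cost tracking is plain arithmetic once per-token prices are known. A sketch; the model name and per-1K-token prices below are placeholders, not real provider rates.

```python
# Illustrative per-1K-token prices; look up the real rates for your provider/model.
PRICES = {"example-model": {"prompt": 0.01, "completion": 0.03}}

def usage_with_cost(model: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Build the usage block returned alongside each response."""
    price = PRICES[model]
    cost = (prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]) / 1000
    return {
        "promptTokens": prompt_tokens,
        "completionTokens": completion_tokens,
        "totalCost": round(cost, 4),
    }

usage = usage_with_cost("example-model", 1234, 567)
print(usage)
```

Accumulating these per-request blocks per user is all that historical tracking and budget alerts require.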
Question: How much boilerplate and complexity does the framework require?
Test Methodology:
- Count total lines of code
- Evaluate dependency count
- Measure cognitive complexity
- Check for framework-specific ceremony
Grading Scale:
⭐⭐⭐⭐⭐ Excellent (10/10 points)
- < 2000 LOC for complete implementation
- < 20 dependencies
- Clear separation of concerns
- Minimal boilerplate
⭐⭐⭐⭐ Good (7/10 points)
- 2000-5000 LOC
- 20-40 dependencies
- Some framework ceremony
⭐⭐ Poor (3/10 points)
- > 5000 LOC
- > 40 dependencies
- Heavy framework ceremony
- Excessive abstraction layers
Key Insight: Framework magic vs explicit code trade-offs
Total Points: Sum of all categories (Max 70 points)
Rating Tiers:
🥇 Outstanding (60-70 points)
- Production-ready reference implementation
- Framework strengths clearly demonstrated
- Few compromises or workarounds
🥈 Strong (45-59 points)
- Solid implementation with minor rough edges
- Framework mostly suitable for task
- Some workarounds needed
🥉 Acceptable (30-44 points)
- Functional but reveals framework limitations
- Significant compromises required
- Educational about trade-offs
⚠️ Needs Work (< 30 points)
- Framework poorly suited for AI workloads
- Major architectural issues
- Requires heavy customization
Next.js + Vercel AI SDK + Supabase:
- Async Architecture: 10/10 (native job support)
- Vector Integration: 10/10 (Supabase client)
- Streaming: 10/10 (Route Handlers)
- Provider Abstraction: 10/10 (Vercel AI SDK)
- Error Handling: 8/10 (good try/catch)
- Cost Tracking: 7/10 (manual implementation)
- Code Complexity: 9/10 (< 2000 LOC)
- Total: 64/70 (Outstanding)
Python FastAPI + Raw APIs + pgvector:
- Async Architecture: 10/10 (FastAPI + Celery)
- Vector Integration: 5/10 (raw SQL required)
- Streaming: 10/10 (StreamingResponse)
- Provider Abstraction: 7/10 (adapter pattern)
- Error Handling: 8/10 (custom exceptions)
- Cost Tracking: 9/10 (detailed tracking)
- Code Complexity: 6/10 (~4000 LOC)
- Total: 55/70 (Strong)
Flask + LangChain:
- Async Architecture: 3/10 (WSGI blocking)
- Vector Integration: 8/10 (LangChain abstraction)
- Streaming: 3/10 (buffering issues)
- Provider Abstraction: 9/10 (LangChain)
- Error Handling: 5/10 (basic try/catch)
- Cost Tracking: 4/10 (minimal)
- Code Complexity: 7/10 (~3000 LOC)
- Total: 39/70 (Acceptable)
Purpose of This Rubric:
This grading system helps:
- Learners understand which frameworks excel at AI workloads
- Framework creators identify areas for improvement
- Teams make informed technology decisions
- Community compare implementations objectively
The goal is not to crown a "winner" but to reveal architectural trade-offs - just like RealWorld revealed that Django has more boilerplate than Express, but better ORM than raw Node.
Standard test documents and fixtures available at:
https://github.com/ai-realworld/test-fixtures
Includes:
- Sample PDFs, text files, markdown docs
- Pre-computed embeddings (for validation)
- Expected responses (for quality checks)
- Test conversation transcripts
Next: See ui-requirements.md for frontend specifications.