Test and benchmark prompts across LLM providers and models
This tool targets agentic use cases in large production applications that require fast, reliable LLM calls, such as extracting sentiment from social media posts or converting a sentence into structured JSON.
- Multi-Provider Testing – OpenAI, Bedrock, DeepSeek, Gemini, Groq, OpenRouter
- Parallel Execution – Run tests concurrently across all configured LLMs (see the sketch after this list)
- Repeatability – Each test runs N times per model to measure consistency
- Version Control – Full prompt history with easy rollback
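Conceptually, a run fans out over every configured model and repeats each test case N times before aggregating results. Here is a minimal TypeScript sketch of that loop; `callModel`, `runOnce`, and the result shape are hypothetical, not the repo's actual API:

```ts
// Hypothetical sketch: fan one test case out across models, N repetitions each.
type RunResult = { model: string; output: string; latencyMs: number };

async function callModel(model: string, prompt: string, input: string): Promise<string> {
  // Placeholder for a real provider call (OpenAI, Bedrock, Groq, ...).
  throw new Error("wire up a provider client here");
}

async function runOnce(model: string, prompt: string, input: string): Promise<RunResult> {
  const start = performance.now();
  const output = await callModel(model, prompt, input);
  return { model, output, latencyMs: performance.now() - start };
}

// Run every model N times concurrently and collect all results.
async function runTest(models: string[], prompt: string, input: string, n: number): Promise<RunResult[]> {
  const jobs = models.flatMap((model) =>
    Array.from({ length: n }, () => runOnce(model, prompt, input)),
  );
  return Promise.all(jobs);
}
```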
# Install dependencies
bun install
# Start development server
bun dev
# Open http://localhost:3000

Configure API keys in the app's Configuration page. At least one provider is required.
- Prompts – Create and version your system prompts
- Test Cases – Add input/expected output pairs (JSON) for each prompt (example below)
- Test Runs – Execute tests and view per-model scores
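For illustration, a test case pairs a raw input with the JSON you expect the model to return. The field names here are hypothetical, not the app's actual schema:

```ts
// Hypothetical test case for a sentiment-extraction prompt.
// Field names are illustrative; check the Test Cases page for the real shape.
const testCase = {
  input: "Just got the new phone and the battery life is incredible!",
  expected: { sentiment: "positive", confidence: "high" },
};
```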
bun dev # Backend with hot reload
bun dev:frontend # Frontend dev server
bun run build # Build frontend + backend
bun run lint # Lint backend
bun run test # Unit tests
bun run test:e2e # E2E tests (Playwright)
bun run format # Format code
bun run db:studio      # Drizzle Studio

├── src/                  # Backend (Express + Bun)
│   ├── server.ts         # API routes
│   ├── db/               # Drizzle schema & init
│   ├── llm-clients/      # Provider clients
│   └── services/         # Test runner
├── frontend/             # SvelteKit app
│   └── src/
│       ├── lib/          # Components & stores
│       └── routes/       # Pages
├── drizzle/              # Database migrations
└── data/                 # SQLite database
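The `src/llm-clients/` directory holds one client per provider. The exact interface is not shown here; the sketch below only illustrates the pattern a new provider client might follow, with all names assumed:

```ts
// Hypothetical shape of a provider client in src/llm-clients/.
// The real interface may differ; this only illustrates the pattern.
export interface LLMClient {
  /** Provider identifier, e.g. "openai" or "groq". */
  readonly provider: string;
  /** Send a system prompt + input to one of the provider's models and return raw text. */
  complete(model: string, systemPrompt: string, userInput: string): Promise<string>;
}

// Example stub implementation for a new provider.
export class EchoClient implements LLMClient {
  readonly provider = "echo";
  async complete(_model: string, _systemPrompt: string, userInput: string): Promise<string> {
    return userInput; // replace with a real API call
  }
}
```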
MIT