A high-concurrency GUI for stress testing system performance and connectivity with local Ollama LLM/SLM models. Deploy multiple parallel agent workers, stream responses in real-time, inspect model specifications, and measure system throughput -- all from a premium, responsive web interface.
Built as a testing harness for Deep Researcher V2.
- Overview
- Features
- Architecture
- Prerequisites
- Installation
- Usage
- Configuration
- Project Structure
- Tech Stack
- License
Ollama exposes local LLMs through a REST API on localhost:11434. This application provides a graphical interface to push those models to their limits: spawn N concurrent chat sessions, stream tokens in parallel, compare response times, and inspect model metadata -- without writing a single line of code.
The GUI is intentionally built so that end users do not need to interact with the terminal or raw API calls to run performance tests.
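For context, Ollama's `/api/chat` endpoint streams newline-delimited JSON (NDJSON), one object per token chunk. The helper below is an illustrative sketch (not part of this project) showing how those chunks reassemble into a complete reply:

```typescript
// Sketch: each NDJSON line from /api/chat looks like
// {"message":{"content":"..."},"done":false}. This helper (hypothetical,
// for illustration) concatenates the streamed content fields.
interface ChatChunk {
  message?: { content: string };
  done: boolean;
}

export function parseChatChunks(ndjson: string): string {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as ChatChunk)
    .map((chunk) => chunk.message?.content ?? "")
    .join("");
}
```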
- Auto-discovers all locally pulled Ollama models on startup.
- Displays parameter count, quantization level, format, family, and disk footprint for each model.
- Filter models by parameters (< 5B, 5B-14B, > 14B), capabilities (chat, embedding, vision, tools, etc.), and disk size (< 2 GB, 2-10 GB, > 10 GB).
- Expand any model to view full technical specifications, capabilities badges, and architecture details.
- View the raw Modelfile and license information in a dedicated modal.
- Launch 1 to 15 parallel agent workers, each in its own Kanban-style card.
- Every session runs in its own Web Worker, ensuring that token streaming for one session never blocks another.
- Three execution modes per session:
  - Stream -- real-time token-by-token output via `/api/chat` with `stream: true`.
  - Generate -- single-shot response via `/api/generate`.
  - Structured -- forces JSON schema output for typed responses.
- Per-session controls: model selector, system prompt, personality, prompt input, image attachments (for vision-capable models), and abort.
- Global controls: Bulk Send (fire all sessions at once), Fill Empty (populate all prompts with randomized stress-test prompts).
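As a rough sketch, the three modes might map onto Ollama request bodies like this. This is illustrative only; the project's actual client lives in `src/lib/ollamaClient.ts` and uses the official SDK:

```typescript
// Illustrative mapping of the three per-session modes to Ollama request
// payloads. Endpoint/body shapes follow the public Ollama REST API; the
// buildRequest helper itself is hypothetical.
type Mode = "stream" | "generate" | "structured";

export function buildRequest(mode: Mode, model: string, prompt: string) {
  switch (mode) {
    case "stream": // /api/chat with stream: true -> token-by-token NDJSON
      return {
        endpoint: "/api/chat",
        body: { model, messages: [{ role: "user", content: prompt }], stream: true },
      };
    case "generate": // single-shot /api/generate
      return { endpoint: "/api/generate", body: { model, prompt, stream: false } };
    case "structured": // JSON-schema-constrained output via the `format` field
      return {
        endpoint: "/api/chat",
        body: {
          model,
          messages: [{ role: "user", content: prompt }],
          stream: false,
          format: { type: "object", properties: { answer: { type: "string" } } },
        },
      };
  }
}
```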
- View all models currently loaded in RAM/VRAM.
- Displays memory consumption (RAM and VRAM), parameter size, quantization, and expiry timer.
- Unload individual models from memory directly from the UI.
- Auto-refreshes on open to reflect the current server state.
- 10 system prompts included out of the box: General Assistant, Code Assistant, Creative Writer, Research Analyst, Tutor, Summarizer, Translator, DevOps Engineer, Benchmark Assistant, and a raw "No System Prompt" mode.
- 11 personality modifiers: Neutral, Friendly, Concise, Detailed, Socratic, Enthusiastic, Formal, Witty, ELI5, Pirate, and None.
- Default configuration modal lets you set the model, system prompt, personality, mode, and initial prompt for all new sessions at once. Persisted to `localStorage`.
- Global metrics bar in the Kanban header: total tokens generated, number of actively generating sessions (with live pulse indicator).
- Per-session stats rendered after each completion: total duration, load duration, prompt eval count, eval count, and tokens-per-second.
- Dark mode and light mode toggle.
- Glassmorphism, gradient backgrounds, and Framer Motion animations throughout.
- Responsive layout from mobile to ultrawide.
- Markdown rendering via Streamdown with syntax-highlighted code blocks (Shiki), math, and Mermaid diagram support.
- Vision model support: attach images to prompts when a vision-capable model is selected.
Browser (React SPA)
|
|-- Home Page
| |-- Model Library (listModels + showModel)
| |-- Quick Launch -> navigates to /chat?workers=N&model=X
|
|-- Chat Page (Concurrency Kanban)
| |-- SessionCard (x N)
| | |-- Web Worker (agentWorker.ts)
| | | |-- streamChat() / generate() / generateStructured()
| | | |-- Posts chunks back to main thread via postMessage
| | |-- rAF-batched state updates (zero jank)
| |
| |-- Active Models Modal (listRunningModels, unloadModel)
| |-- Default Config Modal (localStorage)
|
|-- Ollama Client (src/lib/ollamaClient.ts)
|-- Official Ollama JS SDK
|-- Connects to http://localhost:11434
|-- Concurrent-safe, streaming-first design
Each session card spawns a dedicated Web Worker (src/workers/agentWorker.ts). The worker performs the actual fetch and streaming against the Ollama API, then posts token chunks back to the main thread. The main thread batches UI updates via requestAnimationFrame to maintain 60fps even under heavy load (20+ simultaneous streams).
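The batching idea can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: worker messages accumulate in a buffer, and the UI commits at most one state update per scheduled flush (in the browser, one per animation frame).

```typescript
// Sketch of rAF-style batching: every postMessage chunk appends to a
// pending buffer, and a single flush commits the accumulated text in one
// state update. Hypothetical class, for illustration only.
class TokenBuffer {
  private pending = "";
  private scheduled = false;

  constructor(private commit: (text: string) => void) {}

  // Called once per chunk posted back by the Web Worker.
  push(chunk: string): void {
    this.pending += chunk;
    if (!this.scheduled) {
      this.scheduled = true;
      // In the browser: requestAnimationFrame(() => this.flush());
      queueMicrotask(() => this.flush());
    }
  }

  flush(): void {
    this.scheduled = false;
    if (this.pending) {
      this.commit(this.pending); // one React state update per flush
      this.pending = "";
    }
  }
}
```

However many chunks arrive between frames, the component re-renders only once per flush, which is what keeps 15+ concurrent streams smooth.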
- Node.js >= 18
- Ollama installed and running locally.
- Install: https://ollama.com
- Pull at least one model:
ollama pull gemma3
- Verify the server is running:
ollama list
The application connects to http://localhost:11434 by default.
By default, Ollama processes requests sequentially -- one prompt at a time. To take full advantage of parallel agent workers, you must configure two environment variables before starting the Ollama server:
| Variable | Purpose | Recommended Value |
|---|---|---|
| `OLLAMA_NUM_PARALLEL` | Number of requests a single loaded model can handle concurrently. | Match your worker count (e.g. 8) |
| `OLLAMA_MAX_LOADED_MODELS` | Maximum number of distinct models kept in memory simultaneously. | 2 or more if testing multiple models |
Setting the variables:
Linux / macOS:
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve

Windows (PowerShell):
$env:OLLAMA_NUM_PARALLEL = "8"
$env:OLLAMA_MAX_LOADED_MODELS = "2"
ollama serve

To persist these across reboots, add them to your shell profile (.bashrc, .zshrc) or set them as system environment variables on Windows.
What changes when these variables are set:
- Without them: Ollama queues incoming requests internally. If you launch 10 workers, only one actually generates at a time; the remaining 9 wait in a FIFO queue. Streaming appears sequential, and stress test results will not reflect true parallel throughput.
- With `OLLAMA_NUM_PARALLEL=N`: The model processes up to N requests simultaneously. All N workers stream tokens at the same time, giving you an accurate picture of how the model (and your hardware) handle concurrent load. Note that higher parallelism increases VRAM usage proportionally -- each parallel slot requires its own KV-cache allocation.
- With `OLLAMA_MAX_LOADED_MODELS=M`: Ollama keeps up to M models loaded in memory at once. Without this, switching between models in different sessions triggers repeated load/unload cycles, adding significant latency. Setting this higher lets you run heterogeneous tests (different models across sessions) without cold-start penalties.
Important: Setting OLLAMA_NUM_PARALLEL too high relative to your available VRAM will cause out-of-memory errors or force the model to fall back to CPU inference, which drastically reduces throughput. Start with a conservative value (4-6) and increase incrementally while monitoring GPU memory usage.
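A back-of-envelope estimate helps pick a starting value. The sketch below uses the common fp16 KV-cache approximation (2 tensors x layers x KV heads x head dim x context length x bytes per element); actual Ollama allocations vary with model architecture and quantization, and the example model dimensions are assumptions, not measurements:

```typescript
// Rough KV-cache sizing per parallel slot (illustrative approximation;
// real allocations depend on the model and quantization).
// bytes ~= 2 (K and V) * layers * kvHeads * headDim * ctxLen * bytesPerElem
export function kvCacheBytesPerSlot(
  layers: number,
  kvHeads: number,
  headDim: number,
  ctxLen: number,
  bytesPerElem = 2, // fp16
): number {
  return 2 * layers * kvHeads * headDim * ctxLen * bytesPerElem;
}

// Hypothetical 8B-class model: 32 layers, 8 KV heads, head dim 128,
// 4096-token context -> 512 MiB of KV cache per parallel slot, so
// OLLAMA_NUM_PARALLEL=8 would need roughly 4 GiB for KV cache alone.
const perSlot = kvCacheBytesPerSlot(32, 8, 128, 4096);
```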
# Clone the repository
git clone https://github.com/pixelThreader/Ollama-Model-GUI-Tests.git
cd Ollama-Model-GUI-Tests
# Install dependencies
npm install
# Start the development server
npm run dev

The application will be available at http://localhost:5173.
The landing page presents two main sections:
Quick Benchmark (left panel)
- Set the number of workers (1-15).
- Select a target model from the dropdown. Each option shows the model name, parameter count, and disk size.
- Click "Launch N Workers" to navigate to the Chat page with pre-configured sessions.
Alternatively, use the preset cards:
- Light Load -- 3 parallel agents.
- Stress Test -- 15 parallel agents.
Model Library (right panel)
- Browse all locally available models.
- Use the three filter dropdowns (Parameters, Capabilities, Disk Footprint) to narrow down the list.
- Expand any model to view detailed specifications.
- Click "View Modelfile and License" to inspect the raw configuration.
After launching workers (or navigating to /chat directly):
- Each session appears as an independent card.
- Type a prompt or click "Fill Empty" to populate all cards with randomized prompts (128 million permutations).
- Click the send button on individual cards, or use "Bulk Send" to fire all sessions simultaneously.
- Watch real-time token streaming across all sessions in parallel.
- After generation completes, the raw streamed text is formatted through Streamdown for full markdown rendering.
Session Controls:
- Change the model, system prompt, personality, or mode per session using the dropdown at the top of each card.
- Attach images for vision-capable models using the image icon in the composer.
- Abort a running generation with the stop button.
- Remove a session with the close button.
Global Controls:
- Bulk Send -- appears when multiple sessions have prompts ready; sends all at once.
- Fill Empty -- populates empty prompt fields with diverse, randomized stress-test prompts.
- Active Models -- opens a modal showing all models currently loaded in memory with options to unload them.
- Default Config -- opens a modal to set default model, system prompt, personality, mode, and prompt for new sessions.
- Add Session -- adds a new card to the Kanban board.
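The permutation count behind Fill Empty comes from simple combinatorics: picking one fragment per slot multiplies the slot sizes. The slots and counts below are hypothetical stand-ins, not the actual contents of `src/lib/dummy.ts`:

```typescript
// Illustrative combinatorial prompt generator. With four slots of sizes
// 4, 3, 3, 3 there are 4 * 3 * 3 * 3 = 108 distinct prompts; dummy.ts
// uses much larger fragment lists to reach ~128 million.
const slots: Record<string, string[]> = {
  task: ["Summarize", "Explain", "Compare", "Critique"],
  topic: ["quantum computing", "Roman history", "TCP congestion control"],
  audience: ["a 5-year-old", "a domain expert", "a skeptical investor"],
  format: ["in three bullet points", "as a haiku", "in one paragraph"],
};

export function randomPrompt(rand: () => number = Math.random): string {
  const pick = (xs: string[]) => xs[Math.floor(rand() * xs.length)];
  return `${pick(slots.task)} ${pick(slots.topic)} for ${pick(slots.audience)} ${pick(slots.format)}.`;
}

// Total permutations = product of slot sizes.
export const permutations = Object.values(slots).reduce((n, xs) => n * xs.length, 1);
```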
The following system prompts are available out of the box (defined in src/lib/prompts.ts):
| ID | Label | Description |
|---|---|---|
| `general` | General Assistant | Helpful, accurate, well-rounded assistant |
| `coder` | Code Assistant | Writing, reviewing, debugging, and explaining code |
| `writer` | Creative Writer | Storytelling, copywriting, and creative content |
| `analyst` | Research Analyst | Deep analysis with multiple perspectives |
| `tutor` | Tutor | Patient step-by-step teaching |
| `summarizer` | Summarizer | Distills content into concise summaries |
| `translator` | Translator | Natural translation between languages |
| `devops` | DevOps Engineer | Infrastructure, CI/CD, containers, sysadmin |
| `benchmark` | Benchmark Assistant | Designing and interpreting LLM performance tests |
| `none` | No System Prompt | Raw model behavior with no instructions |
| ID | Label | Description |
|---|---|---|
| `neutral` | Neutral | Professional, balanced, straightforward |
| `friendly` | Friendly | Warm, approachable, conversational |
| `concise` | Concise | Minimal words, maximum clarity |
| `detailed` | Detailed | In-depth explanations with examples |
| `socratic` | Socratic | Guides through questions |
| `enthusiastic` | Enthusiastic | Excited, passionate, motivating |
| `formal` | Formal | Academic, precise, structured |
| `witty` | Witty | Clever, sarcastic, but helpful |
| `eli5` | ELI5 | Simplest possible terms |
| `pirate` | Pirate | Swashbuckling pirate dialect |
| `none` | None | No personality modifier |
The Default Config modal saves the following to localStorage:
- Model -- which model new sessions default to.
- System Prompt -- which system prompt to apply.
- Personality -- which personality modifier to apply.
- Mode -- `stream`, `generate`, or `structured`.
- Prompt -- a default prompt to pre-fill in new sessions.
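Persistence is a plain JSON round-trip through `localStorage`. A minimal sketch, assuming a hypothetical storage key and config shape (the real modal defines its own):

```typescript
// Illustrative default-config persistence. The key name and DefaultConfig
// shape are assumptions for this sketch, not the project's actual schema.
interface DefaultConfig {
  model: string;
  systemPrompt: string;
  personality: string;
  mode: "stream" | "generate" | "structured";
  prompt: string;
}

// Minimal storage interface so the sketch also works outside the browser.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const KEY = "defaultSessionConfig"; // hypothetical storage key

export function saveDefaults(cfg: DefaultConfig, storage: KVStore): void {
  storage.setItem(KEY, JSON.stringify(cfg));
}

export function loadDefaults(storage: KVStore): DefaultConfig | null {
  const raw = storage.getItem(KEY);
  return raw ? (JSON.parse(raw) as DefaultConfig) : null;
}
```

In the app, `storage` would simply be `window.localStorage`, so defaults survive page reloads.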
.
├── public/
│ ├── logo.jpg # Application logo
│ └── favicon.ico # Browser favicon
├── src/
│ ├── Layout.tsx # Root layout with header and navigation
│ ├── main.tsx # Application entry point
│ ├── routes.ts # Route definitions (Home, Chat)
│ ├── index.css # Global styles and Tailwind config
│ ├── pages/
│ │ ├── Home.tsx # Landing page with model library and quick launch
│ │ └── Chat.tsx # Concurrency Kanban with parallel sessions
│ ├── components/
│ │ ├── SessionCard.tsx # Individual chat session (Web Worker integration)
│ │ ├── ActiveModelsModal.tsx # Modal for managing loaded models
│ │ ├── DefaultConfigModal.tsx # Modal for default session configuration
│ │ ├── ThemeToggle.tsx # Dark/light mode toggle
│ │ ├── ai-elements/ # Chat UI primitives (message, code-block, reasoning)
│ │ └── ui/ # Shadcn/ui components (button, card, select, etc.)
│ ├── lib/
│ │ ├── ollamaClient.ts # Ollama API client (streaming, generate, structured)
│ │ ├── prompts.ts # System prompts and personality definitions
│ │ ├── dummy.ts # Random prompt generator (128M permutations)
│ │ └── utils.ts # Utility functions
│ └── workers/
│ └── agentWorker.ts # Web Worker for concurrent streaming
├── index.html # HTML entry point
├── package.json # Dependencies and scripts
├── vite.config.ts # Vite configuration
├── tsconfig.json # TypeScript configuration
└── LICENSE # MIT License
| Layer | Technology |
|---|---|
| Framework | React 19, TypeScript 5.9 |
| Bundler | Vite 7 |
| Routing | TanStack Router |
| Styling | Tailwind CSS 4, Shadcn/ui |
| Animations | Framer Motion |
| Markdown | Streamdown (streaming markdown parser) |
| Syntax Highlighting | Shiki |
| LLM Client | Official Ollama JS SDK |
| Concurrency | Web Workers, requestAnimationFrame batching |
| State | React hooks, localStorage for persistence |
| Notifications | Sonner |
The following screenshots and recordings should be placed in the public/screenshots/ and public/recordings/ directories:
| File | Description |
|---|---|
| `screenshots/home-page.png` | Full Home page with hero, Quick Benchmark, and Model Library visible |
| `screenshots/model-library.png` | Model Library section with filters active and a model expanded |
| `screenshots/concurrency-kanban.png` | Chat page with 5+ sessions streaming responses simultaneously |
| `screenshots/active-models-modal.png` | Active Models modal showing loaded models with memory statistics |
| `screenshots/quick-benchmark.png` | Quick Benchmark panel with model selected and workers configured |
| `screenshots/session-card-detail.png` | Single session card showing a completed response with performance stats |
This project is licensed under the MIT License. See the LICENSE file for details.
Copyright (c) 2026 pixelThreader