Scale coding agents from 1 user to 100+ users on a single server
Eliminate the subprocess bottleneck. Co-locate codebases with inference. Memory-map everything.
Traditional coding agent architectures suffer from fundamental performance bottlenecks:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TRADITIONAL ARCHITECTURE β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββ ββββββββββββββββββββ
β Client β βββββββββββββββββββββββΊ β LLM Server β
β (Laptop) β Network Latency β (vLLM/GPU) β
βββββββββββββββ ββββββββββββββββββββ
β
β Local file system
β
βΌ
βββββββββββββββ
β Codebase β
β (Disk) β
βββββββββββββββ
Problems:
- Network Overhead: Every tool call (grep, ls, read) requires a round-trip to the client
- Process Spawn Overhead: Each grep spawns a subprocess (~10-15ms overhead)
- No Multi-Tenancy: Can't efficiently serve 100+ users on one GPU server
- Poor Locality: Code is on client, inference is on server
Example: A single agent turn with 10 tool calls
LLM β grep β 15ms process spawn + network RTT
LLM β grep β 15ms process spawn + network RTT
LLM β read β network RTT
LLM β grep β 15ms process spawn + network RTT
...
Total: 150-300ms overhead per turn (just for tooling!)
Curserve inverts the traditional architecture by co-locating codebases with inference and eliminating subprocess overhead through memory-mapped in-process operations.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CURSERVE ARCHITECTURE β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β vLLM Process (GPU) β β
β β ββββββββββββββ ββββββββββββββ ββββββββββββββ β β
β β β Request 1 β β Request 2 β β Request N β β β
β β β (User A) β β (User B) β β (User C) β β β
β β βββββββ¬βββββββ βββββββ¬βββββββ βββββββ¬βββββββ β β
β ββββββββββΌββββββββββββββββΌββββββββββββββββΌβββββββββββββββ β
β β β β β
β β Unix Domain β Unix Domain β β
β β Socket IPC β Socket IPC β β
β βΌ βΌ βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Memory Search Service (Rust Daemon) β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ βββββββββββββββ β β
β β β Codebase A β β Codebase B β β Codebase C β β β
β β β (mmap'd RAM) β β (mmap'd RAM) β β (mmap'd RAM)β β β
β β β β β β β β β β
β β β β’ In-memory β β β’ In-memory β β β’ In-memory β β β
β β β grep β β grep β β grep β β β
β β β β’ 0.5-3ms β β β’ 0.5-3ms β β β’ 0.5-3ms β β β
β β ββββββββββββββββ ββββββββββββββββ βββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β²
β SSH + Binary Invocation
β
ββββββββββ΄βββββββββ
β curserve β
β [workspace] β
β [prompt] β
βββββββββββββββββββ
Clients
Benefits:
- β No process spawn overhead: grep/ls/read execute in-process
- β Memory-mapped I/O: 10-50x faster than subprocess ripgrep
- β Multi-tenant ready: One daemon serves 100+ concurrent agent sessions
- β Zero network latency: Tools run in the same datacenter as LLM
- β Scales to hundreds of users: Share GPU + storage infrastructure
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 1. Client SSH Connection β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ssh user@curserve-server
$ curserve ~/my-codebase "fix the authentication bug"
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 2. qwen-code-ipc Initialization β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
qwen-code-ipc starts
ββ Connects to /tmp/mem_search_service_requests.sock
ββ Sends: {"type": "alloc_pid", "pid": 12345, "repo_dir_path": "..."}
ββ Memory Search Service memory-maps entire codebase into RAM
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 3. Agent Execution Loop β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββ
β vLLM (Qwen3) β
ββββββββββ¬ββββββββββ
β
ββββββββββΌβββββββββββ
β "grep for β
β authentication β
β functions" β
ββββββββββ¬βββββββββββ
β
ββββββββββββββββΌβββββββββββββββ
β qwen-code-ipc Tool Layer β
β β
β grep() intercepts process β
β spawn, calls IPC instead β
ββββββββββββββββ¬βββββββββββββββ
β IPC Request
ββββββββββββββββΌβββββββββββββββ
β Memory Search Service β
β β
β Searches in-memory files β
β Returns: "auth.py:42:..." β
ββββββββββββββββ¬βββββββββββββββ
β 0.5-3ms
ββββββββββββββββΌβββββββββββββββ
β qwen-code-ipc β
β β
β Formats response for LLM β
ββββββββββββββββ¬βββββββββββββββ
β
ββββββββββΌβββββββββββ
β vLLM β
β "Found auth bug β
β on line 42..." β
βββββββββββββββββββββ
Repeat for 10-20 tool calls per agent turn
Total overhead: ~10-30ms (vs 150-300ms traditional)
curserve/
β
βββ mem-search-service/ # Rust daemon for in-memory operations
β βββ src/
β β βββ lib.rs # MmapCache: memory-mapped file management
β β βββ service.rs # Unix socket IPC server
β β βββ benchmark.rs # Performance comparison tools
β βββ Cargo.toml # Dependencies: ripgrep, memmap2, etc.
β βββ target/release/
β βββ mem-search-service # Compiled daemon binary
β
βββ qwen-code-ipc/ # Modified qwen-code framework
βββ packages/
β βββ core/
β β βββ src/
β β βββ tools/
β β β βββ ripGrep.ts # Intercepted grep tool
β β βββ utils/
β β βββ ipcClient.ts # Unix socket IPC client
β βββ cli/
β βββ src/
β βββ index.ts # Entry point
βββ dist/
βββ cli.js # Compiled qwen-code binary
Traditional coding agents spawn ripgrep subprocesses that read from disk on every search.
// mem-search-service/src/lib.rs
pub struct MmapCache {
pub files: Vec<MmappedFile>,
}
impl MmapCache {
pub fn new(root_path: &Path) -> Result<Self> {
// Walk directory tree (respecting .gitignore)
// Memory-map EVERY text file into RAM
// ~10-20MB for typical codebases
}
pub fn search(&self, pattern: &str) -> Vec<Match> {
// Search directly in RAM using ripgrep internals
// No subprocess spawn, no disk I/O
// 0.5-3ms vs 10-15ms subprocess
}
}Performance Impact:
| Codebase Size | Subprocess grep | In-Memory grep | Speedup |
|---|---|---|---|
| Small (100 files) | ~10ms | ~0.5ms | 20x |
| Medium (500 files) | ~15ms | ~1ms | 15x |
| Large (1000+ files) | ~20ms | ~3ms | 7x |
We forked qwen-code and modified the tool layer to use IPC instead of spawning processes.
Before (qwen-code):
// Spawns new process for every grep call
async function grep(pattern: string) {
const child = spawn('rg', [pattern, ...args]);
return await collectOutput(child); // ~10-15ms overhead
}After (qwen-code-ipc):
// packages/core/src/tools/ripGrep.ts
async function performRipgrepSearch(options) {
try {
// Try IPC first
const output = await requestGrepIPC(
workspacePath,
pattern,
[absolutePath],
ipcOptions
);
return parseRipgrepOutput(output); // ~0.5-3ms
} catch (error) {
// Graceful fallback to subprocess if IPC unavailable
return performRipgrepSearchDirect(options);
}
}IPC Protocol (Unix Domain Sockets):
Client Memory Search Service
β β
ββ Connect to /tmp/mem_search_... β
β β
ββ Send: {"type": "alloc_pid", ...} βββββΊ β
β ββ mmap codebase
β ββ create /tmp/qwen_code_response_12345.sock
β βββββ {"response_status": 1} ββββββββββββ€
β β
ββ Send: {"type": "request_ripgrep",...}ββΊβ
β ββ search in-memory files
β βββββ {"text": "file.py:42:..."} ββββββββ€
β β
One mem-search-service daemon handles requests from 100+ concurrent qwen-code-ipc instances.
// mem-search-service/src/service.rs
struct ServiceState {
codebases: HashMap<u32, MmapCache>, // PID β memory-mapped codebase
response_sockets: HashMap<u32, UnixStream>, // PID β response channel
}
// Three-threaded architecture:
// 1. Request listener: Accepts new connections
// 2. Connection acceptor: Manages per-client response sockets
// 3. Request worker: Executes searches in-memory
fn request_worker(rx: Receiver<Request>, state: Arc<Mutex<ServiceState>>) {
loop {
let (request, stream) = rx.recv().unwrap();
match request {
Request::AllocPid { pid, repo_dir_path } => {
// Memory-map entire codebase
let cache = MmapCache::new(&repo_dir_path)?;
state.codebases.insert(pid, cache);
}
Request::RequestRipgrep { pid, pattern, .. } => {
// Search in-memory, no I/O
let results = state.codebases[&pid].search(&pattern)?;
send_response(results);
}
}
}
}Resource Usage:
- Memory: ~15-30MB per codebase (text files only, binaries skipped)
- CPU: Shared across all users (ripgrep is already parallelized)
- Storage: All codebases on fast NVMe (or network-attached if needed)
$ ./target/release/benchmark ~/linux-kernel "static inline" 100Results:
================================================================================
Memory-Mapped Search (Curserve)
================================================================================
Average: 2.1ms
Min: 1.8ms
Max: 3.4ms
Matches: 1,247
================================================================================
Subprocess Ripgrep (Traditional)
================================================================================
Average: 14.3ms
Min: 12.1ms
Max: 18.7ms
Matches: 1,247
================================================================================
SPEEDUP: 6.8x faster
TIME SAVED: 12.2ms per search
================================================================================
Scenario: Fix a bug (10 grep calls, 3 file reads)
| Architecture | Tool Overhead | LLM Inference | Total Turn Time |
|---|---|---|---|
| Traditional (laptop + remote LLM) | 150ms (10Γ15ms) | 500ms | 650ms |
| Curserve (co-located) | 10ms (10Γ1ms) | 500ms | 510ms |
| Improvement | 93% faster | - | 22% faster |
Setup: Single server with 1x H100 GPU, 100 concurrent users
| Metric | Traditional | Curserve |
|---|---|---|
| Supported users | ~10-20 (network bottleneck) | 100+ |
| GPU utilization | 40-60% (waiting on I/O) | 85-95% |
| Tool latency (p50) | 15ms | 2ms |
| Tool latency (p99) | 80ms | 8ms |
- Rust (1.70+): For building
mem-search-service - Node.js (20+): For building
qwen-code-ipc - Git submodules: Both components are in this repo
git clone https://github.com/your-org/curserve.git
cd curserve
git submodule update --init --recursivecd mem-search-service
cargo build --releaseBinary will be at: ./target/release/mem-search-service
cd ../qwen-code-ipc
npm install
npm run buildBinary will be at: ./dist/cli.js
./mem-search-service/target/release/mem-search-serviceOutput:
================================================================================
CURSERVE Memory Search Service
================================================================================
Request listener started on /tmp/mem_search_service_requests.sock
Worker thread started
Service running. Press Ctrl+C to stop.
node qwen-code-ipc/dist/cli.js ~/my-codebaseOr create a shell alias:
alias curserve='node /path/to/curserve/qwen-code-ipc/dist/cli.js'Then:
curserve ~/my-project
> Find all TODO comments and prioritize them by importancecd mem-search-service
cargo test./target/release/benchmark /path/to/codebase "search pattern" 100# Terminal 1: Start daemon
./target/release/mem-search-service
# Terminal 2: Run qwen-code-ipc
cd ../qwen-code-ipc
npm test- Request socket (shared):
/tmp/mem_search_service_requests.sock - Response sockets (per-client):
/tmp/qwen_code_response_{pid}.sock
Request:
{
"type": "alloc_pid",
"pid": 12345,
"repo_dir_path": "/home/user/my-codebase"
}Response:
{
"response_status": 1,
"text": "Allocated 347 files"
}Request:
{
"type": "request_ripgrep",
"pid": 12345,
"pattern": "fn\\s+\\w+",
"paths": ["/home/user/my-codebase/src"],
"options": {
"line_number": true,
"ignore_case": false,
"threads": 4
}
}Response:
{
"response_status": 1,
"text": "src/main.rs:42:fn main() {\nsrc/lib.rs:10:fn search() {"
}{
"response_status": 0,
"error": "PID 12345 has no allocated codebase. Call alloc_pid first."
}- Per-user codebases: Each client gets an isolated memory-mapped view
- Unix permissions: Socket access controlled by filesystem permissions
- Process isolation: qwen-code-ipc runs as the user's process (SSH session)
// Future work: Implement codebase eviction
// When RAM > threshold:
// - Evict least-recently-used codebases
// - Re-allocate on next grep requestβββββββββββββββββββββββββββββββββββββββββββββββ
β Curserve Server β
β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β mem-search-service (privileged) β β
β β β’ Runs as root or dedicated user β β
β β β’ Socket permissions: 0770 β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β SSH Access (per-user) β β
β β β’ Users SSH in β β
β β β’ Run: curserve [workspace] [prompt] β β
β β β’ qwen-code-ipc runs as their user β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β vLLM (GPU inference) β β
β β β’ Shared across all users β β
β β β’ Rate limiting per user/team β β
β βββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββ
$ ./mem-search-service
Error: Address already in useSolution:
rm /tmp/mem_search_service_requests.sock
./mem-search-serviceqwen-code-ipc: Failed to connect to socketCheck:
- Is
mem-search-servicerunning? - Socket permissions:
ls -l /tmp/mem_search_service_requests.sock - Firewall/SELinux blocking Unix sockets?
$ ps aux | grep mem-search-service
user 12345 2.5 8.3 8472192 ...Analysis:
- Each codebase uses ~15-30MB (text files only)
- 100 codebases = ~2-3GB RAM (very manageable)
- Future: Implement LRU eviction for 1000+ users
- Memory-mapped grep in Rust
- Unix domain socket IPC
- qwen-code fork with IPC integration
- Basic multi-tenancy support
- File watching & auto-reload on changes
- Codebase eviction/LRU caching
- Rate limiting per user/team
- Monitoring & telemetry (Prometheus/Grafana)
- Docker deployment support
- Distributed codebases (multi-node)
- Copy-on-write sharing (same codebase, multiple users)
- Incremental updates (git pull without full reload)
- Advanced tools:
find,tree,analyzevia IPC
- Suffix tree indexing for hot files
- Predictive codebase loading
- GPU-accelerated search (cuDF/RAPIDS)
- Zero-copy IPC (shared memory)
- Memory Search Service: See
mem-search-service/README.md - qwen-code-ipc: See
qwen-code-ipc/README.md - IPC Protocol: See
docs/ipc-protocol.md(coming soon) - Deployment Guide: See
docs/deployment.md(coming soon)
We welcome contributions! Key areas:
- Performance: Optimize search algorithms, memory usage
- Reliability: Error handling, crash recovery
- Scalability: Better multi-tenancy, distributed support
- Tools: Add more IPC-backed tools (find, tree, etc.)
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
- ripgrep by BurntSushi: Core search engine
- qwen-code by QwenLM: Base agent framework
- vLLM: High-performance inference engine
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: team@curserve.ai (coming soon)
Built with β€οΈ for the coding agent community
Star β this repo if you find it useful!