SiMM is a high-performance, scalable Key-Value (KV) cache engine designed for LLM inference workloads. It addresses the critical bottlenecks in long-context prompts and multi-turn agent interactions by providing a dedicated, fast memory pool for KV cache storage.
Seamlessly integrated with leading inference engines such as SGLang and vLLM, SiMM lets them offload KV caches and retrieve them later. By doing so, inference engines can bypass redundant prefill computation, significantly saving GPU cycles and drastically reducing Time to First Token (TTFT). SiMM has been proven and validated in production environments.
- Distributed Architecture: Provides a massive, multi-node KV cache storage capable of supporting extremely long-context demands
- Linear Scalability: Delivers linear growth in both capacity and throughput by horizontally scaling SiMM data servers via one-hop hash based addressing, meeting dynamic KV cache requirements
- Low Latency: Reduces KV cache I/O (PUT/GET) latency by up to 9x compared to industry-leading alternatives through an end-to-end zero-copy mechanism and a highly efficient software stack
- High Bandwidth: Maximizes I/O bandwidth by fully utilizing ALL RDMA NICs of client nodes (effectively eliminating the bottlenecks exposed in DualPath)
- Ease of Use: Offers seamless integration with popular inference engines, with deployment orchestrated via Kubernetes (K8s) for production-grade reliability
Under multi-turn long-context LLM workloads with significant KV cache reuse, SiMM drastically reduces prefill latency (TTFT) by transforming the prefill phase from a compute-heavy task into a high-speed I/O retrieval operation. At 32K context length, SiMM achieves a 3.1x speedup over the "No Cache" configuration, a 2.1x speedup over local CPU caching, and outperforms industry-leading alternatives by 1.2x [details].
| Core Modules | Description | Code Open Source |
|---|---|---|
| Cluster Manager | Controller of cluster management, ensures cluster scaling and optimal data distribution, minimum number: 1 | ✅ Yes |
| Data Server | Data nodes to host user KVCache and related metadata, minimum number: 1 | ✅ Yes |
| Client | Lightweight Python/C++ SDK integrated in LLM inference engine to offload and retrieve KV Caches from SiMM | ✅ Yes |
| Observability | Internal observability modules to collect metrics/io_traces and make them visible | ✅ Yes |
| SiCL | High-performance network transport library on RDMA (IB and RoCEv2) | ❌ No |
Cluster Manager
- Handles node lifecycle and health monitoring by running the Node-Join and Heartbeat protocols
- Maintains and propagates Shard-to-Data-Server routing table, allowing efficient data addressing
- Manages cluster scaling and migration tasks to ensure a balanced distribution of data and load (WIP)
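The one-hop addressing behind the Shard-to-Data-Server routing table can be sketched as below. This is an illustrative model only: `NUM_SHARDS`, the hash function, and the server addresses are all hypothetical, not SiMM's actual scheme.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count, fixed cluster-wide

def shard_of(key: str) -> int:
    """Map a cache key to a shard via a stable hash (no lookup chain needed)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "little") % NUM_SHARDS

# Routing table propagated by the Cluster Manager: shard id -> data server address.
# Here we fake three data servers; in SiMM this table is pushed to clients.
routing_table = {s: f"10.0.0.{s % 3 + 1}:7000" for s in range(NUM_SHARDS)}

def locate(key: str) -> str:
    """Resolve a key to its owning data server in a single hop."""
    return routing_table[shard_of(key)]
```

Because the shard count and table are known to every client, a key resolves to its data server with one local hash plus one table lookup, with no directory round trip on the I/O path.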
Data Server
- Manages host CPU memory using shared memory (shm) files
- Maintains per-shard KV tables to store data, utilizing key hash values for efficient indexing and querying
- Employs multi-slab designs with best-fit allocation to efficiently store variable-sized KV data
- Supports various KV eviction policies (WIP)
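The multi-slab, best-fit idea above can be sketched as follows. The size classes and `SlabPool` API are hypothetical, chosen only to illustrate how variable-sized KV data maps onto fixed slab classes.

```python
import bisect

class SlabPool:
    """Illustrative sketch: fixed slab size classes with best-fit selection."""

    SLAB_SIZES = [4096, 16384, 65536, 262144]  # bytes; classes are illustrative

    def __init__(self):
        # Count of slabs handed out per size class.
        self.used = {size: 0 for size in self.SLAB_SIZES}

    def best_fit(self, nbytes: int) -> int:
        """Pick the smallest slab class that can hold nbytes (best fit)."""
        i = bisect.bisect_left(self.SLAB_SIZES, nbytes)
        if i == len(self.SLAB_SIZES):
            raise ValueError(f"object of {nbytes} bytes exceeds largest slab")
        return self.SLAB_SIZES[i]

    def put(self, nbytes: int) -> int:
        """Allocate a slab for an object and return the chosen class size."""
        size = self.best_fit(nbytes)
        self.used[size] += 1
        return size
```

Best fit over a small set of size classes keeps internal fragmentation bounded while allocation stays a single binary search.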
Client
- Aggregates small KV cache I/O into batches to maximize transfer bandwidth utilization and minimize per-request overhead
- Enables zero-copy data path by leveraging user-managed Memory Region (MR) buffer
- Integrates fail-fast mechanisms to mitigate tail latency
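The client-side batching of small KV I/Os can be sketched like this. `PutBatch`, its threshold, and the flush behavior are assumptions for illustration, not the SDK's real interface.

```python
from dataclasses import dataclass, field

@dataclass
class PutBatch:
    """Illustrative sketch: coalesce small KV puts into one larger transfer."""

    max_bytes: int = 1 << 20            # flush threshold; value is illustrative
    items: list = field(default_factory=list)
    size: int = 0
    flushed: list = field(default_factory=list)

    def put(self, key: str, value: bytes):
        """Buffer a small put; flush once the batch reaches the threshold."""
        self.items.append((key, value))
        self.size += len(value)
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        """Ship the whole batch at once (in the real client: one RDMA transfer)."""
        if self.items:
            self.flushed.append(list(self.items))
            self.items.clear()
            self.size = 0
```

Amortizing per-request overhead over a batch is what lets many small KV blocks approach the NIC's line rate instead of being dominated by fixed costs.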
Observability
- Supports metrics collection such as throughput, latency, IOPS, and error stats; see usage in [Metrics]
- Supports trace point injection into any functions in I/O path to help performance profiling, see usage [I/O Trace]
SiCL Network Transport Library
⚠️ Not open-sourced; install it as a package and use it as a shared library.
- Supports RDMA-native RPC and one-sided RDMA semantics
- Zero-copy data transfer for both host and GPU memory
- Supports both busy-polling and event-driven execution modes
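The busy-polling versus event-driven trade-off can be illustrated with a generic completion-queue consumer. This is not SiCL's API; it is a minimal model of the two execution modes using a plain queue.

```python
import queue
import time

def busy_poll(cq: queue.Queue, timeout_s: float = 0.01):
    """Busy-polling mode: spin on the completion queue.

    Burns a CPU core but observes completions with minimal wakeup latency.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return cq.get_nowait()
        except queue.Empty:
            continue  # keep spinning until the deadline
    return None

def event_driven(cq: queue.Queue, timeout_s: float = 0.01):
    """Event-driven mode: block until woken.

    Frees the CPU while idle, at the cost of scheduler wakeup latency.
    """
    try:
        return cq.get(timeout=timeout_s)
    except queue.Empty:
        return None
```

Latency-critical I/O paths typically choose busy-polling, while event-driven mode suits deployments where client CPU must be shared with the inference engine.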
All tests are executed on identical nodes with the following configurations:
- LLM serving node : H200 x 8, 400Gbps x 8 RNICs, 192 CPU Cores, 2TB memory, Ubuntu 22.04, GPU Driver 570.86.15, CUDA 12.9
- SiMM cluster : Data Server x 3, Cluster Manager x 1, 3TB host memory in total
- Mooncake cluster : Store Server x 3, Master Server x 1, 3TB host memory in total
Test Settings: simm_kvio_bench or mooncake stress_workload_test; 32 threads with sync put/get interfaces for Throughput & IOPS cases
SiMM is a storage backend of LMCache in vLLM (a patch is being prepared for the LMCache project).
Test Settings: use vllm/benchmark/multi_turn, DeepSeek-R1, TP=8, Random dataset, 40 rounds
Test Settings: LLaMa3.3-70B, TP=4, Random dataset, 2 rounds (identical input prompts across rounds to ensure a theoretical full cache hit in the second round)
SiMM is a storage backend of HiCache in SGLang, #PR18016.
Test Settings: SGLang v0.5.8, with benchmark/hicache/bench_multiturn.py; rounds=10, parallel=32.
⚠️ IMPORTANT
- Strongly recommended to run SiMM modules on Ubuntu 22.04 or 24.04
- RDMA driver & SDK installed, such as Mellanox OFED
- Python 3.10; a virtual environment is recommended
The easiest way to deploy SiMM into your Kubernetes cluster is to use the Helm Chart. See SiMM helm chart.
# build and push SiMM image
bash ./build_docker.sh --registry docker.io --tag simm:latest --py_ver 3.10
# use the image to deploy SiMM
# replicaCount is the data server num, default is 3
helm install simm ./k8s/simm --set image.repository=docker.io/simm:latest --set replicaCount=3

# Clone SiMM codes
git clone https://github.com/Scitix/SiMM simm
# Clone thirdparty submodules
cd simm
git submodule update --init --recursive
# configure env dependency
bash ./configure.sh
# [optional] if the script fails, try the manual install commands below
apt update -y
cd third_party/folly/
sudo ./build/fbcode_builder/getdeps.py install-system-deps --recursive
sudo apt-get install -y libgflags-dev libgoogle-glog-dev libacl1-dev libprotobuf-dev protobuf-compiler libcurl4-openssl-dev libssl-dev
sudo apt-get install -y libboost-all-dev libdouble-conversion-dev
pip install nanobind
# Install SiCL Network Library
# SiCL is not open-sourced; install it as a package and use it as a shared
# library. Running ./configure.sh will wget and install it automatically.

Use build.sh to build SiMM. Script usage:
./build.sh
--mode : build mode, values including
release : release mode, with -O2 optimization
debug : debug mode, with -g
relwithdeb : release mode with debug info
minsizerel : creating smallest possible size binaries
--test : trigger test codes build under ./tests subdirectory
--metric : enable SiMM metrics collection, default off
--trace : enable SiMM IO latency tracing, default off
--clean : will cleanup binaries in build/{mode}/ subdirectory
--verbose : print more build logs

- Examples
# build release mode
./build.sh --mode release
# build debug mode
./build.sh --mode debug
# release clean build
./build.sh --mode release --clean
# build test codes under debug mode
./build.sh --mode debug --test
# print more build logs under debug mode
./build.sh --mode debug --verbose
# enable request metrics collection functionality
./build.sh --mode release --metric
# enable request trace functionality
./build.sh --mode release --trace

- Binaries Output Path
# release mode output
./build/release/bin
./build/release/lib
# debug mode output
./build/debug/bin
./build/debug/lib
# tests binaries output
./build/debug/bin/unit_tests/
./build/release/bin/unit_tests/
# tools binaries output
./build/debug/bin/tools/
./build/release/bin/tools/
# same for relwithdeb / minsizerel modes

Please refer to the Deploy Guide.
SiMM integrates with SGLang through HiCache. For detailed usage instructions, see SGLang HiCache documentation
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--trust-remote-code \
--tp 8 --mem-fraction-static 0.75 \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-ratio 1.1 \
--hicache-mem-layout page_first_direct \
--hicache-io-backend direct \
--hicache-storage-backend simm \
--hicache-write-policy write_through \
--hicache-storage-prefetch-policy timeout \
--hicache-storage-backend-extra-config '{"manager_address":"127.0.0.1:30001"}' \
--port 8080

SiMM integrates with vLLM through LMCache. For detailed usage instructions, see the LMCache documentation.
LMCACHE_CHUNK_SIZE=64 \
LMCACHE_LOCAL_CPU=True \
LMCACHE_MAX_LOCAL_CPU_SIZE=1024 \
LMCACHE_REMOTE_URL=simm://127.0.0.1:30001/?use_sync=1 \
LMCACHE_EXTRA_CONFIG={"save_chunk_meta":false} \
LMCACHE_NUMA_MODE=auto \
vllm serve Qwen/Qwen2.5-7B-Instruct/ \
--served-model-name "Qwen/Qwen2.5-7B-Instruct" \
--seed 42 \
--max-model-len 16384 \
--disable-log-requests \
--enforce-eager \
-tp 1 \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--gpu-memory-utilization 0.8

- Integrate with mainstream LLM inference platforms (SGLang, vLLM)
- Integrate with mainstream KV cache management systems (FlexKV PR)
- Support node-level GPU memory pool
- Support multiple storage tiers (SSD / Remote Filesystem)
- Support lossless KV compression and sparse KV retrieval
- Architecture
- Deploy Guide
- Admin Tool
- Inference Integration
- Production Limits
- Observability Functionality
- FAQ
Slack Channel: simm-open-source
SiMM is licensed under the Apache License v2.0; please see LICENSE for details.