Fusion is a cross-platform Go CLI for model and kernel optimization workflows. The long-term goal is one CLI that can plan GPU-specific optimizations, generate Triton or CUDA kernel candidates, run them on local or remote Linux GPU machines, benchmark before vs after, and keep the winning variants.
Today Fusion already gives you a useful foundation:
- ModelsLab-backed chat sessions and ModelsLab-only auth
- browser-based `fusion login` that hands off from modelslab.com back into the local CLI
- an embedded optimization knowledge base with GPU profiles, strategies, skills, examples, and source references
- a public Markdown-first `knowledgebase/` corpus that compiles into the shipped SQLite index
- a packed SQLite BM25 search index generated from the curated knowledge files
- host capability detection with explicit warnings on unsupported setups
- target management for `local`, `ssh`, and `sim` modes
- benchmark and profile execution against those targets
- persisted artifacts for before/after comparisons
- target-aware optimization planning
- optimization sessions that persist retrieved context, backend candidates, and stage artifacts
- CuTe DSL, Triton, and CUDA workspace scaffolding with build and verify flows
Fusion can run on macOS for planning, artifact management, ModelsLab setup, and SSH orchestration. Real CUDA compilation, profiling, and authoritative kernel performance validation still need a Linux machine with NVIDIA tooling.
What works today:
- `fusion` and `fusion chat` as a chat-first agent entry point
- `fusion env detect|doctor`
- `fusion generate keychain`
- `fusion optimize plan` with a curated GPU and optimization knowledge base
- `fusion kb list|search|show|context` backed by an embedded SQLite BM25 index
- `fusion update kb` to rebuild a local Markdown-backed knowledge snapshot and SQLite index
- `fusion optimize session create|list|show`
- `fusion optimize cute init|build|verify|benchmark`
- `fusion optimize triton init|build|verify|benchmark`
- `fusion optimize cuda init|build|verify|benchmark`
- `fusion target add|list|show|remove|default`
- `fusion target exec` and `fusion target copy`
- `fusion benchmark run` and `fusion benchmark compare`
- `fusion profile run`
- release packaging for Linux, macOS, and Windows
What is not implemented yet:
- ModelsLab-backed Triton/CUDA/CuTe code generation
- automatic optimization loops that generate, run, score, and retain winning kernels
Linux and macOS:

```sh
curl -fsSL https://raw.githubusercontent.com/ModelsLab/fusion/main/scripts/install.sh | sh
```

Pin a specific release or install into a custom directory:

```sh
curl -fsSL https://raw.githubusercontent.com/ModelsLab/fusion/main/scripts/install.sh | \
  FUSION_VERSION=v0.2.1 INSTALL_DIR="$HOME/.local/bin" sh
```

Windows PowerShell:

```powershell
irm https://raw.githubusercontent.com/ModelsLab/fusion/main/scripts/install.ps1 | iex
```

Or install with Go:

```sh
go install github.com/ModelsLab/fusion/cmd/fusion@latest
```

Or build from source:

```sh
make build
./bin/fusion version
```

Push a version tag and GitHub Actions will publish tar.gz and .zip assets for:
- Linux: `amd64`, `arm64`
- macOS: `amd64`, `arm64`
- Windows: `amd64`, `arm64`
```sh
git tag v0.2.1
git push origin v0.2.1
```

The release workflow uploads matching archives plus `checksums.txt`.
Connect Fusion to ModelsLab in the browser:
```sh
fusion login
```

Or configure it manually for CI or headless environments:
```sh
fusion auth set \
  --token "$MODELSLAB_API_KEY" \
  --model openai-gpt-5.4-pro
```

Store Hugging Face and GitHub tokens for model and private-repo workflows:
```sh
fusion hf login --token "$HF_TOKEN"
fusion github login --token "$GITHUB_TOKEN"
```

Validate them:
```sh
fusion hf whoami
fusion github whoami
```

Fusion shell commands automatically expose:
- `HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`
- `GITHUB_TOKEN`, `GH_TOKEN`
That lets the agent download models from Hugging Face and work against private GitHub repos. For private HTTPS git operations, prefer `gh` commands or `git` with an Authorization header using `$GITHUB_TOKEN` instead of embedding secrets into URLs.
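As a sketch of that header pattern (not code from the Fusion repo; the function name and repo URL below are hypothetical), the token can ride in git's `http.extraHeader` config the same way `actions/checkout` authenticates, so it never appears in the clone URL:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"
	"os/exec"
)

// authenticatedClone builds a git invocation that sends the token as a
// Basic Authorization header via -c http.extraHeader, keeping the URL clean.
func authenticatedClone(repoURL, token string) *exec.Cmd {
	basic := base64.StdEncoding.EncodeToString([]byte("x-access-token:" + token))
	header := fmt.Sprintf("http.extraHeader=Authorization: Basic %s", basic)
	return exec.Command("git", "-c", header, "clone", repoURL)
}

func main() {
	cmd := authenticatedClone("https://github.com/org/private-repo.git", os.Getenv("GITHUB_TOKEN"))
	// The final argument carries no credentials, so it is safe to log.
	fmt.Println(cmd.Args[len(cmd.Args)-1])
}
```

Because the header lives in the process arguments rather than the remote URL, the token does not get persisted into `.git/config` remotes or leak through URL-based logging.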
Start the interactive agent shell:
```sh
fusion
```

By default, `fusion` resumes the latest chat session for the current working directory. Start a fresh one explicitly when you want a clean thread:
```sh
fusion chat --new
```

Resume a saved project session directly:
```sh
fusion chat --session latest
fusion chat --session 20260307-120501-my-project
```

Run a single natural-language turn:
```sh
fusion chat "optimize qwen2.5-72b for 4090 decode latency and compare AWQ vs Triton"
```

Inside chat, Fusion can use tools for:
- listing, reading, writing, replacing, and deleting files
- running shell commands locally or on configured targets
- searching the knowledge base
- creating optimization sessions and retrieving skill/context packets
- building optimization plans
- running benchmark and profile workflows
- scaffolding and running CuTe DSL, Triton, and CUDA workspaces
Chat-local commands:
```text
/help
/history 12
/sessions
/resume latest
/new
/model gpt-5
/cd ~/projects/my-model
/save
/tools
/session
/exit
```
The local slash commands are for session control only. The model still does the real engineering work through Fusion tools.
```sh
fusion -h
fusion version
fusion login
fusion hf login --token "$HF_TOKEN"
fusion github login --token "$GITHUB_TOKEN"
fusion env doctor --backend all --fix-script
fusion kb search "blackwell attention"
fusion optimize plan --gpu h100 --workload decode --operator attention
```

Inspect the current host:
```sh
fusion env detect
fusion gpu detect
```

Search the embedded optimization corpus:
```sh
fusion kb search "paged attention"
fusion kb show --kind gpu --id rtx4090
fusion kb context --gpu b200 --workload decode --operators attention,kv-cache --precision fp8 --runtime vllm
```

Rebuild a private local knowledge base from Markdown docs:
```sh
fusion update kb
```

This bootstraps `~/.config/fusion/knowledgebase/` if needed, rebuilds the SQLite index under `~/.config/fusion/knowledge/`, and makes future Fusion runs prefer that rebuilt local knowledge base.
Fusion chat sessions are stored under `~/.config/fusion/sessions/`.
- `fusion` auto-resumes the latest session for the current working directory.
- `fusion chat --new` starts a clean thread in the same directory.
- `fusion chat --session latest` resumes the newest session for the current directory.
- `/sessions` lists recent sessions and marks the current one with `*`.
- `/resume <id>` switches sessions without leaving the shell.
See what the current machine can and cannot do:
```sh
fusion env detect
fusion env doctor --backend all --fix-script
```

Register a remote Ubuntu target over SSH:
```sh
fusion generate keychain --name gpulab
```

Paste the printed public key into your GPU provider, then register the target with the generated private key path:
```sh
fusion target add \
  --name lab-4090 \
  --mode ssh \
  --host 203.0.113.10 \
  --user ubuntu \
  --gpu rtx4090 \
  --key ~/.ssh/id_ed25519 \
  --remote-dir ~/fusion \
  --default
```

Register a non-authoritative proxy/sim target:
```sh
fusion target add \
  --name sim-h100-on-4090 \
  --mode sim \
  --gpu h100 \
  --proxy-gpu rtx4090
```

List configured targets:
```sh
fusion target list
fusion target show --name lab-4090
```

Run a command directly on a target:
```sh
fusion target exec --name lab-4090 --command "nvidia-smi"
```

Copy files to a remote target:
```sh
fusion target copy \
  --name lab-4090 \
  --src ./kernels \
  --dst ~/fusion/kernels \
  --recursive
```

Plan optimizations for a configured target:
```sh
fusion optimize plan \
  --target lab-4090 \
  --model llama-3.1-8b \
  --workload decode \
  --operator attention \
  --operator kv-cache \
  --precision bf16
```

Create a CuTe DSL workspace and compile or verify it on a target:
```sh
fusion optimize cute init \
  --name cute-add-one \
  --output ./cute-add-one \
  --gpu-arch sm90

fusion optimize cute build \
  --workspace ./cute-add-one \
  --target lab-4090 \
  --gpu-arch sm89

fusion optimize cute verify \
  --workspace ./cute-add-one \
  --target lab-4090 \
  --gpu-arch sm89

fusion optimize cute benchmark \
  --workspace ./cute-add-one \
  --target lab-4090 \
  --gpu-arch sm89
```

Create a session-backed Triton or CUDA candidate loop:
```sh
fusion optimize session create \
  --name qwen-b200 \
  --gpu b200 \
  --model qwen2.5-72b \
  --workload decode \
  --operator attention \
  --operator kv-cache \
  --precision fp8 \
  --runtime vllm \
  --query "optimize qwen decode attention on b200"

fusion optimize triton init \
  --session <session-id> \
  --name attention-triton

fusion optimize cuda init \
  --session <session-id> \
  --name attention-cuda

fusion optimize triton build \
  --session <session-id> \
  --candidate triton-attention-triton

fusion optimize triton verify \
  --session <session-id> \
  --candidate triton-attention-triton

fusion optimize session show --id <session-id>
```

The same session flow now works for CuTe candidates:
```sh
fusion optimize cute init \
  --session <session-id> \
  --name attention-cute

fusion optimize cute benchmark \
  --session <session-id> \
  --candidate cute-dsl-attention-cute
```

Or plan for a GPU directly:
```sh
fusion optimize plan \
  --gpu rtx4090 \
  --model llama-3.1-8b \
  --workload decode \
  --operator attention \
  --operator kv-cache \
  --precision bf16
```

Run a benchmark and compare before/after artifacts:
```sh
fusion benchmark run \
  --target lab-4090 \
  --name before \
  --command "python benchmark.py"

fusion benchmark run \
  --target lab-4090 \
  --name after \
  --command "python benchmark_optimized.py"

fusion benchmark compare \
  --before ~/Library/Application\ Support/fusion/artifacts/benchmarks/<before>.json \
  --after ~/Library/Application\ Support/fusion/artifacts/benchmarks/<after>.json
```

Pass metrics explicitly when your benchmark command does not print them:
```sh
fusion benchmark run \
  --target lab-4090 \
  --name before \
  --command "python benchmark.py >/tmp/bench.log" \
  --metrics "tokens_per_sec=142.5 latency_ms=7.9"
```

Run a profile command on a remote or local target:
```sh
fusion profile run \
  --target lab-4090 \
  --tool ncu \
  --command "ncu --set full python benchmark.py"
```

Fusion supports three execution modes:
- `local`: run on the current machine
- `ssh`: run on a remote Linux machine over SSH
- `sim`: use a proxy machine or proxy GPU while targeting another GPU profile
Recommended usage:
- use `local` when the current machine actually has the intended NVIDIA stack
- use `ssh` for real Ubuntu GPU boxes
- use `sim` for rough iteration, compatibility work, and non-authoritative proxy runs
`sim` mode is intentionally explicit about its limitations. It does not emulate an H100, B200, or any other GPU with performance fidelity on top of a different GPU. It is useful for:
- iterating against a target GPU profile
- validating command and artifact flow
- rough proxy benchmarking with warnings
Authoritative performance numbers still require the real target GPU.
Fusion reports host limitations with `fusion env detect`.
On macOS, expect:
- planning and artifact workflows to work
- SSH orchestration to work
- local CUDA compilation to be unavailable unless the host actually has a supported NVIDIA stack
- local Nsight profiling to be unavailable in normal modern macOS setups
In practice, macOS is best treated as a control plane for:
- planning
- ModelsLab login and session setup
- target registration
- remote execution over SSH
- comparing benchmark and profile artifacts
Fusion stores artifacts under the user config directory. On macOS this is typically:

```text
~/Library/Application Support/fusion/artifacts
```

Current artifact types:
- `benchmarks/*.json`
- `profiles/*.json`
Benchmark metrics are parsed from:
- JSON printed to stdout, for example `{"tokens_per_sec": 125, "latency_ms": 8}`
- key/value lines, for example `tokens_per_sec=125`
- the optional `--metrics` flag
`fusion benchmark compare` compares wall time plus any common metric keys found in both artifacts.
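The parsing and comparison rules above can be sketched as follows (a simplified illustration, not the actual `internal/artifacts` code; the function names are made up):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics reduces benchmark output in either supported shape
// (a JSON object on stdout, or key=value tokens) to a metric map.
func parseMetrics(out string) map[string]float64 {
	metrics := map[string]float64{}
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		// A JSON object printed to stdout, e.g. {"tokens_per_sec": 125}.
		if strings.HasPrefix(line, "{") {
			var obj map[string]float64
			if json.Unmarshal([]byte(line), &obj) == nil {
				for k, v := range obj {
					metrics[k] = v
				}
			}
			continue
		}
		// key=value tokens, e.g. "tokens_per_sec=142.5 latency_ms=7.9".
		for _, tok := range strings.Fields(line) {
			if k, v, ok := strings.Cut(tok, "="); ok {
				if f, err := strconv.ParseFloat(v, 64); err == nil {
					metrics[k] = f
				}
			}
		}
	}
	return metrics
}

// compareCommon reports after/before ratios for keys present in both runs,
// mirroring the "common metric keys" rule of the compare command.
func compareCommon(before, after map[string]float64) map[string]float64 {
	ratios := map[string]float64{}
	for k, b := range before {
		if a, ok := after[k]; ok && b != 0 {
			ratios[k] = a / b
		}
	}
	return ratios
}

func main() {
	before := parseMetrics(`{"tokens_per_sec": 125, "latency_ms": 8}`)
	after := parseMetrics("tokens_per_sec=142.5 latency_ms=7.9")
	ratios := compareCommon(before, after)
	fmt.Printf("tokens_per_sec ratio: %.2f\n", ratios["tokens_per_sec"]) // → 1.14
}
```

Metrics missing from either artifact simply drop out of the comparison, which is why the `--metrics` escape hatch matters for commands that print nothing parseable.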
Core commands:
- `fusion login`
- `fusion auth login|show|set|logout`
- `fusion env detect`
- `fusion gpu detect|normalize`
- `fusion kb list|search|show|context`
- `fusion optimize plan`
- `fusion target add|list|show|remove|default|exec|copy`
- `fusion benchmark run|compare`
- `fusion profile run`
Run the full test suite:
```sh
go test ./...
```

Current tests cover:
- knowledge-base loading and search
- optimization planner scoring
- artifact metric parsing
- target validation
- target resolution
- local and sim execution behavior
- local file and directory copy behavior
- benchmark comparison helper logic
Run formatters before opening a PR:
```sh
gofmt -w $(find . -name '*.go' -print)
```

Repository layout:

- `cmd/fusion`: CLI entrypoint
- `internal/config`: local config and ModelsLab token storage
- `internal/modelslab`: ModelsLab API and browser-login constants
- `internal/system`: host and toolchain detection
- `internal/targets`: target validation and execution semantics
- `internal/runner`: local and SSH command/copy execution
- `internal/artifacts`: benchmark and profile artifact storage
- `internal/kb`: embedded knowledge base loader, SQLite BM25 search, and context packet compiler
- `internal/optimize`: optimization planner and recommendation engine
- `knowledge`: source-backed GPU, strategy, skill, example, and search-index assets embedded into the binary
- `scripts`: install helpers and knowledge-index generation
- `.github/workflows`: CI and tagged release pipelines
- `.goreleaser.yaml`: cross-platform packaging config
- `docs`: architecture and roadmap notes
Fusion is not yet a full autonomous kernel writer. The missing pieces are important:
- ModelsLab-backed Triton/CUDA generation
- kernel correctness verification
- Triton/CUDA compile pipelines
- session-oriented optimization loops
- promotion logic for winning kernels per GPU family and workload shape
Those pieces should build on the current target, benchmark, profile, and artifact foundation instead of bypassing it.
Planned next:

- ModelsLab-backed Triton/CUDA kernel generation inside the CLI
- correctness verification commands for generated kernels
- first-class compile commands for Triton and CUDA C++
- structured optimization sessions that chain plan, generate, benchmark, profile, and compare