No Python. No terminal. No config files.
Download the DMG, drag to Applications, and run AI models locally in seconds.
Features • Screenshots • API Server • Image Generation • JANG Quantization • Requirements • Build • 한국어
MLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on mlx-community -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. Built on vMLX Engine and Apple's MLX framework.
JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:
| Quantization | MMLU (200q) | Size |
|---|---|---|
| JANG_2L (2-bit) | 74% | 89 GB |
| MLX 4-bit | 26.5% | 120 GB |
| MLX 3-bit | 24.5% | 93 GB |
| MLX 2-bit | 25% | 68 GB |

Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at jangq.ai. Models at JANGQ-AI.
Download the latest DMG -- one file, ready to go.
- Download `vMLX-X.Y.Z-arm64.dmg`
- Open the DMG and drag to Applications
- Launch -- that's it
All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.
The vMLX inference engine is published on PyPI as `vmlx` -- the same engine that powers the desktop app, available as a standalone CLI and backed by 1,894+ tests.
```shell
# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```

Note: On macOS 14+, `pip install vmlx` without a venv will fail with "externally-managed-environment". Use `uv`, `pipx`, or create a venv first.
Once running, your local OpenAI-compatible API server is live at http://localhost:8000. Point any OpenAI or Anthropic SDK client at it.
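As a dependency-free sketch of talking to that endpoint: the `build_chat_request` helper below is illustrative, not part of vmlx, and any OpenAI SDK works the same way by changing its `base_url`.

```python
import json
import urllib.request

# Hypothetical helper (not part of vmlx): build a Chat Completions request
# for the local server. Actually sending it requires a running session.
def build_chat_request(model, prompt, base_url="http://localhost:8000"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("mlx-community/Qwen3-8B-4bit", "Hello!")
# With a session running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```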
- Launch MLX Studio from Applications
- Pick a model -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., `mlx-community/Qwen3-8B-4bit`)
- Start the session -- the model downloads automatically and the server starts
- Chat -- switch to the Chat tab and start talking
That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.
Run any MLX model from HuggingFace -- thousands of models, zero configuration:
- Text LLMs -- Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Gemma 2/3, Phi-3/4, DeepSeek V2/V3/R1, GLM-4/4.7, Nemotron, MiniMax, Kimi, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
- Vision LLMs (VL) -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n, Phi-3-Vision -- send images and video directly in chat
- Mixture-of-Experts -- Qwen 3.5 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3, MiniMax M2.5, Llama 4 Scout/Maverick
- Hybrid SSM Models -- Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention architectures with dedicated hybrid cache)
- Image Generation -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
- Image Editing -- Qwen Image Edit (instruction-based editing, full precision)
- Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- JANG Models -- adaptive mixed-precision quantized models from JANGQ-AI; they stay quantized in GPU memory via native `QuantizedLinear`
- GGUF Import -- convert GGUF models to MLX format directly in-app
Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:
- `POST /v1/chat/completions` -- Chat Completions API with streaming, tool calling, vision, structured output
- `POST /v1/responses` -- OpenAI Responses API (agentic format) with streaming
- `POST /v1/completions` -- Text completions
- `POST /v1/images/generations` -- Image generation (Flux/Z-Image models, OpenAI format with `usage` field)
- `POST /v1/images/edits` -- Image editing (Qwen Image Edit, instruction-based)
- `POST /v1/embeddings` -- Text embeddings with dimension control and batch processing
- `POST /v1/rerank` -- Document reranking
- `POST /v1/audio/speech` -- Text-to-speech (Kokoro TTS)
- `POST /v1/audio/transcriptions` -- Speech-to-text (Whisper)
- `GET /v1/models` -- List loaded models
- `GET /health` -- Server health with VRAM usage, queue length, load times
Drop-in replacement for the Anthropic Claude API:
- `POST /v1/messages` -- Anthropic Messages API format
- Anthropic SDK tool calling format (auto-translated to internal format)
- Vision/multimodal support via Anthropic content blocks
- Use the Anthropic Python/TypeScript SDK -- just change the `base_url` to your local server
- Copy-paste code snippets in the API tab for curl, Python, and JavaScript
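A minimal stdlib sketch of the Messages request shape (the helper name is ours, not part of vmlx; the official Anthropic SDKs need only their `base_url` changed to `http://localhost:8000`):

```python
import json
import urllib.request

# Illustrative helper: an Anthropic-format Messages request against the
# local /v1/messages endpoint. Sending it requires a running session.
def build_messages_request(model, prompt, max_tokens=256,
                           base_url="http://localhost:8000"):
    body = json.dumps({
        "model": model,
        "max_tokens": max_tokens,  # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_messages_request("mlx-community/Qwen3-8B-4bit", "Ping?")
```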
Auto-detected tool call parsers for every major model family:
- Qwen (qwen3, qwen2.5) -- `<tool_call>` XML format
- Llama 3 -- `<function=name>` format
- Mistral -- `[TOOL_CALLS]` format
- Hermes -- `<tool_call>` JSON format
- DeepSeek -- function call blocks
- GLM-4.7 -- GLM tool format
- MiniMax -- MiniMax function calling
- Nemotron -- NVIDIA Nemotron tool format
- Granite -- IBM Granite format
- Functionary -- Functionary v3 format
- XLAM -- Salesforce xLAM format
- Kimi -- Moonshot Kimi format
- Step-3.5 -- StepFun format
- Auto-detection from `model_type` in config.json with regex name fallback
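To illustrate what a parser for the Qwen/Hermes-style `<tool_call>` format has to handle, here is a toy extractor -- a sketch only, not the app's actual parser, which also covers the other formats above:

```python
import json
import re

# Match <tool_call> ... </tool_call> blocks containing a JSON payload.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return the JSON payloads of every <tool_call> block in model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

output = (
    "Let me check that file.\n"
    '<tool_call>\n{"name": "read_file", "arguments": {"path": "notes.txt"}}\n</tool_call>'
)
calls = extract_tool_calls(output)
```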
26+ Built-in Tools:
- File I/O -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
- Search -- ripgrep file search with regex and glob, glob file finder, unified diff
- Execution -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
- Web -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
- Developer -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
- Interactive -- `ask_user` tool for human-in-the-loop interrupts
- Per-category toggles: enable/disable file, search, shell, web tools independently
- Auto-continue agent loops (up to 10 tool iterations per request)
- MCP (Model Context Protocol) -- connect external tool servers, merge tool definitions, execute MCP tools via API
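The auto-continue loop can be sketched as follows; `chat` and `run_tool` are hypothetical stand-ins for the server round-trip and the built-in tool dispatcher, not vmlx APIs:

```python
# Schematic agent loop: keep executing tool calls the model requests until
# it answers in plain text or hits the iteration cap.
MAX_TOOL_ITERATIONS = 10

def agent_loop(chat, run_tool, messages):
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = chat(messages)                 # one model round-trip
        calls = reply.get("tool_calls", [])
        if not calls:                          # plain answer: done
            return reply["content"]
        messages.append(reply)
        for call in calls:                     # execute each requested tool
            messages.append({
                "role": "tool",
                "name": call["name"],
                "content": run_tool(call["name"], call["arguments"]),
            })
    return "[stopped: tool iteration limit reached]"

# Tiny stub model: asks for the time once, then answers.
def stub_chat(messages):
    if any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": "It is noon."}
    return {"role": "assistant", "content": "",
            "tool_calls": [{"name": "get_time", "arguments": {}}]}

answer = agent_loop(stub_chat, lambda name, args: "12:00",
                    [{"role": "user", "content": "Time?"}])
```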
Collapsible thinking blocks with dedicated parsing for reasoning models:
- Qwen3 / Qwen3.5 -- `<think>...</think>` blocks
- DeepSeek-R1 -- DeepSeek reasoning format
- OpenAI GPT-OSS / GLM-4.7 -- GPT-OSS thinking format
- Phi-4-reasoning -- reasoning content extraction
- Enable/disable thinking per request
- Reasoning effort control (low/medium/high)
- Streaming reasoning content with proper tokenization
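A minimal sketch of the `<think>` parsing idea for Qwen-style output (the app's parsers also handle the DeepSeek-R1 and GPT-OSS variants):

```python
import re

# Split reasoning content from the visible answer in Qwen-style output.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Return (reasoning, answer) extracted from model output."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
```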
Full multimodal input support for vision-language models:
- Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
- Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
- Audio -- Base64 or URL audio input (Qwen3-Audio)
- Image detail levels: auto, low, high
- Dedicated MLLM cache for image/video embeddings (separate from KV cache)
- Send images directly in chat to any VL model
Production-grade multi-user serving:
- Continuous batching -- handle 32+ concurrent requests with dynamic slot allocation
- Prefill batching -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
- Completion batching -- batch token generation across sequences
- Stream interval control -- configure streaming frequency
- Request pooling -- efficiently share GPU memory across concurrent sequences
- Rate limiting -- optional per-client request limits
- API key authentication -- optional `--api-key` flag for secured access
Multi-tier caching for maximum throughput and memory efficiency:
- L1: Memory-Aware Prefix Cache -- token-level semantic caching with LRU eviction, configurable memory allocation
- L1 alt: Paged KV Cache -- block-aware cache with reduced fragmentation for long contexts
- L2: Disk Cache -- persistent spillover to disk for large context windows
- L2 alt: Block Disk Store -- block-level disk persistence
- KV Quantization -- q4/q8 quantized KV cache at the storage boundary (2-4x memory savings with negligible accuracy loss)
- Hybrid SSM Cache -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
- Automatic cache type selection based on model architecture
- Cache warming API (`POST /v1/cache/warm`) for pre-loading common prompts
- Cache stats API (`GET /v1/cache/stats`) for monitoring hit rates and memory usage
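The core idea behind prefix caching can be sketched as a longest-prefix lookup: find the longest fully cached token prefix so prefill can resume after it instead of starting over. This toy linear scan is illustrative only; the real cache works at the token-block level with LRU eviction:

```python
# Toy prefix-cache lookup (illustrative, not the engine's implementation).
def longest_cached_prefix(prompt_tokens, cache):
    """Return (cached_entry, matched_length) for the best full-entry hit."""
    best, best_len = None, 0
    for cached_tokens, entry in cache.items():
        n = 0
        for a, b in zip(cached_tokens, prompt_tokens):
            if a != b:
                break
            n += 1
        # Only a fully matched cache entry can be reused as-is.
        if n == len(cached_tokens) and n > best_len:
            best, best_len = entry, n
    return best, best_len

cache = {(1, 2, 3): "kv-A", (1, 2): "kv-B"}
entry, hit = longest_cached_prefix([1, 2, 3, 4], cache)
```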
Full control over text generation:
- Temperature (0.0 - 2.0) -- creativity control
- Top-P (0.0 - 1.0) -- nucleus sampling
- Top-K (integer) -- top-K token filtering
- Min-P (0.0 - 1.0) -- minimum probability threshold
- Repetition Penalty -- penalize repeated tokens
- Stop Sequences -- custom stopping strings
- Max Tokens -- output length limit (up to 131072)
- Request Timeout -- per-request timeout override
- Structured Output -- `response_format` with `json_object` or `json_schema` modes for guaranteed valid JSON
- Streaming with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
- Usage stats in streaming responses (`stream_options.include_usage`)
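As a didactic sketch of one of these knobs, nucleus (top-p) filtering keeps the smallest high-probability token set whose cumulative mass reaches `top_p`. This is the textbook idea, not the engine's sampler:

```python
# Toy top-p (nucleus) filter over a token probability distribution.
def top_p_filter(probs, top_p):
    """Keep the smallest token set whose cumulative probability >= top_p."""
    kept, cum = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())                 # renormalize survivors
    return {tok: p / total for tok, p in kept.items()}

filtered = top_p_filter({"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05},
                        top_p=0.9)
```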
Convert models directly in-app via the Tools tab:
- 16-bit to MLX -- convert HuggingFace safetensors to MLX format
- 16-bit to quantized -- quantize to 2-bit, 4-bit, or 8-bit MLX
- GGUF to MLX -- import GGUF models into MLX safetensors format
- MLX to JANG -- adaptive mixed-precision quantization (different bits per layer type)
- Model Inspector -- view config.json, architecture, layer structure
- Model Doctor -- diagnostic checks (load test, token count, memory estimation)
- Progress tracking with real-time status
Generate images locally with Flux and Z-Image models:
- Flux Schnell -- 4-step fast generation
- Flux Dev -- 20-step high-quality generation
- Z-Image Turbo -- fast turbo generation (4-bit and 8-bit)
- Flux Klein -- lightweight 4B parameter model
- Flux Kontext -- subject-consistent editing
- Flux Krea -- aesthetic fine-tuned generation
- Configurable steps, guidance scale, height, width, seed, sampler
- Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
- Quantized model support (2-bit to 8-bit)
- Image gallery with generation history, save, and settings persistence
- OpenAI-compatible `/v1/images/generations` endpoint with `usage` field
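A stdlib sketch of the request shape for that endpoint (the helper name is ours, not part of vmlx; assumes a session serving a Flux model):

```python
import json
import urllib.request

# Illustrative OpenAI-style image generation request against the local server.
def build_image_request(prompt, size="1024x1024",
                        base_url="http://localhost:8000"):
    body = json.dumps({"prompt": prompt, "n": 1, "size": size}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_image_request("a watercolor fox")
```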
Full-featured conversation UI:
- Persistent history -- SQLite (WAL mode) with full message, metrics, and tool call history
- Markdown rendering -- GitHub-flavored markdown with syntax highlighting
- Reasoning display -- collapsible thinking sections for reasoning models
- Tool call display -- inline tool execution with status and results
- Streaming metrics -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
- System prompts -- per-chat custom system message
- Chat settings -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
- Chat folders -- hierarchical organization
- Message search -- full-text search across chat history
- Export/Import -- ShareGPT format
- Voice chat -- STT + TTS integration
- HuggingFace browser -- search, filter by text/image, and download models directly in-app
- Download queue -- multiple concurrent downloads with real-time progress bars and cancel support
- Model size display -- file sizes from safetensors metadata before downloading
- Local model discovery -- auto-scan `~/.mlxstudio/models`, `~/.cache/huggingface/hub`, `~/.exo/models`, and custom directories
- Deduplication -- strict format detection prevents false-positive model matches
- Zero-config detection -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
- 65+ model families in the auto-detection registry with two-tier detection (config.json `model_type` primary, name regex fallback)
- 5 app modes -- Chat, Server, Image, Tools, API
- Menu bar tray -- live server status, GPU memory, running models, quick controls
- Multi-session -- run multiple models simultaneously on different ports
- Dock icon -- restore on click, close-to-tray support
- Dark and light themes -- system-respecting
- Keyboard shortcuts -- common actions
- Toast notifications -- user feedback
- Update banner -- new version detection
MLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as JANG adaptive mixed-precision -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.
- Convert in-app via the Tools tab, or via CLI: `vmlx convert model --jang-profile JANG_3M`
- Pre-quantized models available at JANGQ-AI on HuggingFace
- Stays quantized in GPU memory -- native MLX `QuantizedLinear` + `quantized_matmul`
- Compatible with all caching layers (prefix, paged, disk, KV quant)
See the vMLX source repo for profiles and conversion details.
| Requirement | Minimum |
|---|---|
| macOS | 14.0 Sonoma or later |
| Chip | Apple Silicon (M1 / M2 / M3 / M4) |
| RAM | 8 GB (16 GB+ recommended for larger models) |
| Disk | ~500 MB for app; models range from 1-50 GB each |
```shell
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx

# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir   # .app bundle
npx electron-builder --mac dmg     # DMG installer
```

| Resource | Link |
|---|---|
| Source Code | github.com/jjang-ai/vmlx |
| PyPI | pypi.org/project/vmlx |
| MLX Models | huggingface.co/mlx-community |
| JANG Models | huggingface.co/JANGQ-AI |
| Website | vmlx.net |
Apache License 2.0
Built by Jinho Jang • eric@jangq.ai • JANGQ AI • Support on Ko-fi
Run LLM, VLM, and image generation and editing models fully locally on your Mac.
JANG 2-bit outperforms MLX 4/3/2-bit -- adaptive mixed-precision quantization (JANG_2S, JANG_2.6) beats standard MLX quantization on MiniMax M2.5, Qwen3, and more. Check benchmarks at jangq.ai. Download pre-quantized models from JANGQ-AI.
Install: download the latest DMG -- drag-and-drop installation.
| Feature | Description |
|---|---|
| Chat | Conversation interface, tool calling, agentic coding |
| Image Generation | Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein |
| Image Editing | Qwen Image Edit (text-instruction-based editing) |
| 5-Tier Caching | Prefix, paged, KV quantization, disk cache |
| API Server | OpenAI + Anthropic compatible API |
| 30 Tools | Built-in file, web search, Git, and terminal tools |
Developer: Jinho Jang (eric@jangq.ai)
JANGQ AI •
Support on Ko-fi







