A custom CLI tool for running local Large Language Models (LLMs) on Apple Silicon using Apple's MLX framework.
- CLI-Only Interface: No server required, pure command-line operation
- Multiple Model Support: Qwen, Mixtral, Llama, Phi-3, DeepSeek Coder, and more
- Apple Silicon Optimized: Leverages MLX for maximum performance on M1/M2/M3 chips
- Configurable: YAML-based configuration for models and inference settings
- Interactive Chat: Full conversational interface with history management
- Built-in Benchmarking: Performance testing and optimization tools
- Streaming: Real-time token-by-token output
- Extensive CLI Options: Rich set of parameters for fine-tuned control
- File I/O Support: Read prompts from files, save outputs and chat history
- Formatted Output: Rich formatting with progress bars and colored output
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.11 or later
- uv (recommended) or pip
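Before installing, it can help to confirm the machine actually meets these requirements. A quick check using only standard macOS/shell commands (nothing here is specific to this tool):

```bash
# Should print "arm64" on Apple Silicon
uname -m
# Python 3.11 or later is required
python3 --version
# Check whether uv is available (optional but recommended); otherwise fall back to pip
command -v uv || echo "uv not found - install it or use pip"
```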
# Clone the repository
git clone <your-repo-url>
cd mlx
# Install dependencies with uv (recommended)
make install
# Or install with pip
pip install -e .
# Verify installation
mlx info

# 1. List available models
mlx models
# 2. Download a model
mlx download qwen-32b
# 3. Ask a single question (Q&A mode)
mlx chat "Explain quantum computing" --model qwen-32b
# 4. Start an interactive chat session
mlx stream --model qwen-32b
The CLI will automatically authenticate with Hugging Face. To authenticate manually:
brew install huggingface-cli
huggingface-cli login
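To confirm the login worked before pulling large model weights, you can ask the Hugging Face CLI which account it sees (this uses only the `huggingface-cli` installed above, nothing specific to this tool):

```bash
# Prints the account associated with the stored token, or an error if not logged in
huggingface-cli whoami
```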
The application uses YAML configuration files for flexible model and inference management. Define available models in `config/models.yaml` with their HuggingFace repositories, quantization settings, and memory configurations:
models:
  qwen-32b:
    name: "Qwen2.5-32B-Instruct"
    hf_repo: "Qwen/Qwen2.5-32B-Instruct"
    mlx_repo: "mlx-community/Qwen2.5-32B-Instruct-4bit"
    quantization:
      bits: 4
      group_size: 64
    context_length: 32768
    memory_fraction: 0.85
    description: "High-quality 32B parameter model with strong reasoning capabilities"
Control default generation parameters, performance settings, and memory management in `inference.yaml`:
inference:
  generation:
    max_tokens: 512
    temperature: 0.7
    top_p: 0.9
    top_k: 40
    repetition_penalty: 1.1
  performance:
    compile: true
    warmup: true
    use_metal: true
  memory:
    clear_cache_after: 5
    gc_threshold: 0.8
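Since both files are plain YAML, a quick way to catch indentation or syntax mistakes after editing them is to parse them with Python. This assumes PyYAML is available in the environment and that both files live under `config/`, as the models file referenced later in this README does:

```bash
# Fails with a traceback if either file has a YAML syntax or indentation error
python3 -c 'import yaml; yaml.safe_load(open("config/models.yaml")); yaml.safe_load(open("config/inference.yaml")); print("configs parse OK")'
```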
Display system information and current status.
mlx info # Pretty formatted output
mlx info --format json # JSON output
mlx info --format text # Plain text output
List all available models with their status.
mlx models # Basic model list
mlx models --detailed # Detailed information including sizes
mlx models --format json # JSON output for scripting
Load a model into memory.
mlx load qwen-32b # Load with default settings
mlx load mixtral-8x7b --no-compile # Load without compilation
mlx load phi-3-medium --preset fast # Load with performance preset
mlx load llama-8b --force # Force reload if already loaded
Ask a single question and get an answer (Q&A mode).
# Basic Q&A
mlx chat "What is machine learning?" --model qwen-32b
# With custom parameters
mlx chat "Explain Python decorators" --model qwen-32b --max-tokens 1024 --temperature 0.3
# With system prompt preset
mlx chat "How do I center a div?" --preset coding
# Save to custom file
mlx chat "Tell me about AI" --save my_qa.json
# Different output formats
mlx chat "Hello world" --format json
mlx chat "Hello world" --format textInteractive streaming chat session with real-time token generation.
Interactive streaming chat session with real-time token generation.
# Basic interactive chat
mlx stream --model qwen-32b
# With custom parameters
mlx stream --model qwen-32b --temperature 0.9 --max-tokens 2048
# Show individual tokens (for debugging)
mlx stream --show-tokens --model deepseek-coder
# With system prompt preset
mlx stream --preset creative --model qwen-32b
Download models without loading them.
mlx download qwen-32b # Download if not exists
mlx download mixtral-8x7b --force # Force re-download
Unload the current model from memory.
mlx unload
Check system health and memory usage.
mlx health
mlx health --format json
Run performance benchmarks.
mlx benchmark # Quick benchmark with default model
mlx benchmark --model qwen-32b --prompts 10 # Extended benchmark
mlx benchmark --output results.json # Save results to file
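To compare several models on the same machine, the benchmark command can simply be looped over model names using only the flags documented above (a sketch; adjust the list to the models you have downloaded):

```bash
# Benchmark each model and write a separate results file per model
for m in llama-8b phi-3-medium qwen-32b; do
  mlx benchmark --model "$m" --output "bench_${m}.json"
done
```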
The included Makefile provides convenient shortcuts:
make help # Show all available commands
make install # Install dependencies
make models # List models with details
make download MODEL=qwen-32b # Download specific model
make chat MODEL=qwen-32b PROMPT="What is AI?" # Ask single question
make stream MODEL=mixtral-8x7b # Start interactive chat
make benchmark MODEL=phi-3-medium # Run benchmark
make clean # Clean cache and temp files
Use performance presets for different use cases:
# Fast: No compilation, lower memory usage
mlx load qwen-32b --preset fast
# Balanced: Standard settings (default)
mlx load qwen-32b --preset balanced
# Maximum: Full optimization, highest memory usage
mlx load qwen-32b --preset maximum
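Whether a preset is worth its memory cost depends on the machine; one way to find out is to benchmark the same model under each preset, combining only the commands documented above (a sketch, not a guaranteed workflow):

```bash
# Reload the model under each preset, then benchmark it and keep the results per preset
for p in fast balanced maximum; do
  mlx load qwen-32b --preset "$p" --force
  mlx benchmark --model qwen-32b --output "bench_preset_${p}.json"
done
```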
Use system prompts for different scenarios. System prompts live in the /prompts directory; you can add new ones and include them in the chat settings in inference.yaml.
mlx chat --preset coding # Programming assistant
mlx chat --preset creative # Creative writing helper
mlx chat --preset academic # Academic research assistant
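New system prompts are just text files. As a sketch, you could drop one into the prompts directory (assumed here to sit at the repository root) and then register it in the chat settings of `inference.yaml`; the file name and exact registration key below are assumptions, not documented values:

```bash
# Create a hypothetical "reviewer" prompt; it still needs to be referenced in inference.yaml
cat > prompts/reviewer.txt <<'EOF'
You are a meticulous code reviewer. Point out bugs, unclear naming, and missing tests.
EOF
```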
Add to your shell profile for convenience:
# Add to ~/.zshrc or ~/.bashrc
chmod +x <full_path>/install_aliases.sh
./install_aliases.sh
M1 Max (32GB)
- Primary: `qwen-32b`, `mixtral-8x7b`, `deepseek-coder`
- Memory fraction: 0.85
M1 Pro (16GB)
- Primary: `llama-8b`, `phi-3-medium`, `gemma-7b`
- Memory fraction: 0.80
M2 Ultra (64GB+)
- Primary: `qwen-32b`, `mixtral-8x7b` with the maximum preset
- Memory fraction: 0.90
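To see which of these tiers a given machine falls into, the chip name and installed memory can be read with standard macOS commands (nothing tool-specific):

```bash
# Chip name, e.g. "Apple M1 Max"
sysctl -n machdep.cpu.brand_string
# Installed memory, converted from bytes to GB
echo "$(($(sysctl -n hw.memsize) / 1024 / 1024 / 1024)) GB"
```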
| Quantization Level | Memory Reduction | Speed Impact | Accuracy Impact | Best For |
|---|---|---|---|---|
| 4-bit | ~75% (32GB → 8GB) | Often faster due to reduced bandwidth | 1-3% degradation typical | Most use cases, limited memory |
| 8-bit | ~50% (32GB → 16GB) | Moderate improvement | <1% degradation | Balance of quality and efficiency |
| 16-bit (None) | Full size | Slower, higher bandwidth | Maximum quality | Accuracy-critical tasks, high memory |
| Use Case | Primary Models | Key Features | Memory Requirements |
|---|---|---|---|
| General Purpose | qwen-32b | • Best overall capability • Strong reasoning and knowledge • Multilingual support | 8-32GB depending on quantization |
| Coding | deepseek-coder | • Specialized for programming • Excellent code generation • Multiple language support | 6-24GB depending on quantization |
| Fast Responses | llama-8b | • Quick inference • Good quality for size • Efficient memory usage | 4-16GB depending on quantization |
| High Quality | qwen-32b-hq | • 8-bit quantization • Maximum accuracy • Research-grade outputs | 16-32GB |
| Reasoning | mixtral-8x7b | • Mixture of experts architecture • Strong analytical capabilities • Complex problem solving | 12-48GB depending on quantization |
| Parameter | Default | Range | Description | Impact |
|---|---|---|---|---|
| temperature | 0.7 | 0.0-2.0 | Controls randomness in generation | Higher = more creative, Lower = more focused |
| top_p | 0.9 | 0.0-1.0 | Nucleus sampling threshold | Controls diversity of token selection |
| top_k | 40 | 1-100 | Top-k sampling limit | Limits token selection to top k candidates |
| max_tokens | 512 | 1-4096 | Maximum tokens to generate | Longer responses vs. speed trade-off |
| repetition_penalty | 1.1 | 1.0-2.0 | Penalty for repeating tokens | Reduces repetitive text |
| memory_fraction | 0.85 | 0.5-0.95 | GPU memory allocation | Higher = more capacity, risk of OOM |
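To get a feel for `temperature` in particular, the same question can be asked at a low and a high setting with the documented `--temperature` flag:

```bash
# Low temperature: focused, near-deterministic answer
mlx chat "Name one use of a hash map" --model qwen-32b --temperature 0.1
# High temperature: more varied, more creative answer
mlx chat "Name one use of a hash map" --model qwen-32b --temperature 1.2
```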
| Preset | Compilation | Memory Usage | Speed | Quality | Best For |
|---|---|---|---|---|---|
| fast | Disabled | Low | Fastest | Good | Quick testing, development |
| balanced | Enabled | Medium | Fast | High | General use, production |
| maximum | Enabled + Optimizations | High | Fastest | Highest | Performance-critical tasks |
Out of Memory Error
# Reduce memory fraction in model config or use smaller model
mlx load llama-8b # Switch to smaller model
mlx unload # Unload current model first
Model Not Found
# Download the model first
mlx download qwen-32b
mlx models # Check available models
Slow Performance
# Ensure compilation is enabled
mlx load qwen-32b --compile
# Use maximum preset for best performance
mlx load qwen-32b --preset maximum
Generation Hangs
# Check memory usage and clear cache
mlx health
# Restart if needed
mlx unload && mlx load qwen-32b- Enable Compilation: Always use
--compilefor 10-20% performance boost - Memory Management: Adjust
memory_fractionin model config (0.7-0.9) - Batch Size: Keep at 1 for large models, 2-4 for smaller ones
- Context Length: Reduce if experiencing memory issues
- Quantization: Use 4-bit quantization for memory efficiency
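When tuning `memory_fraction`, it can help to watch memory headroom while a model is loaded and generating; a minimal loop using only the documented health command:

```bash
# Print health/memory status every 10 seconds until interrupted with Ctrl-C
while true; do mlx health; sleep 10; done
```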
mlx --help # General help
mlx chat --help # Q&A command help
mlx stream --help # Interactive chat help
mlx models --help # Model management help
To add a new model:
- Add model configuration to `config/models.yaml`
- Test download and loading
- Update documentation
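A quick smoke test for a newly added entry, using only commands documented above (`my-model` is the hypothetical key from the earlier configuration sketch):

```bash
# Download, load, generate a short reply, then free memory again
mlx download my-model
mlx load my-model
mlx chat "Reply with the single word: ready" --model my-model --max-tokens 16
mlx unload
```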
git clone <repo-url>
cd mlx-inference-cli
make dev # Install with development dependencies
make test # Run tests