MLX Inference CLI

A custom CLI tool for running local Large Language Models (LLMs) on Apple Silicon using Apple's MLX framework.

Features

  • CLI-Only Interface: No server required, pure command-line operation
  • Multiple Model Support: Qwen, Mixtral, Llama, Phi-3, DeepSeek Coder, and more
  • Apple Silicon Optimized: Leverages MLX for maximum performance on M1/M2/M3 chips
  • Configurable: YAML-based configuration for models and inference settings
  • Interactive Chat: Full conversational interface with history management
  • Built-in Benchmarking: Performance testing and optimization tools
  • Streaming: Real-time token-by-token output
  • Extensive CLI Options: Rich set of parameters for fine-tuned control
  • File I/O Support: Read prompts from files, save outputs and chat history
  • Formatted Output: Rich formatting with progress bars and colored output

Installation

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.11 or later
  • uv (recommended) or pip

Quick Setup

# Clone the repository
git clone <your-repo-url>
cd mlx

# Install dependencies with uv (recommended)
make install

# Or install with pip
pip install -e .

# Verify installation
mlx info

Quick Start

# 1. List available models
mlx models

# 2. Download a model
mlx download qwen-32b

# 3. Ask a single question (Q&A mode)
mlx chat "Explain quantum computing" --model qwen-32b

# 4. Start an interactive chat session
mlx stream --model qwen-32b

HuggingFace Authentication

The CLI authenticates with Hugging Face automatically.

Alternatively, authenticate manually:

brew install huggingface-cli
huggingface-cli login
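
For a non-interactive setup you can also export a Hugging Face access token before downloading. This assumes the CLI fetches models through the standard huggingface_hub tooling, which reads the HF_TOKEN environment variable:

# Assumption: downloads go through huggingface_hub, which honours HF_TOKEN
export HF_TOKEN=<your_hf_access_token>
mlx download qwen-32b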

Configuration

The application uses YAML configuration files for flexible model and inference management:

Model Configuration (config/models.yaml)

Define available models with their HuggingFace repositories, quantization settings, and memory configurations:

models:
  qwen-32b:
    name: "Qwen2.5-32B-Instruct"
    hf_repo: "Qwen/Qwen2.5-32B-Instruct"
    mlx_repo: "mlx-community/Qwen2.5-32B-Instruct-4bit"
    quantization:
      bits: 4
      group_size: 64
    context_length: 32768
    memory_fraction: 0.85
    description: "High-quality 32B parameter model with strong reasoning capabilities"

Inference Configuration (config/inference.yaml)

Control default generation parameters, performance settings, and memory management:

inference:
  generation:
    max_tokens: 512
    temperature: 0.7
    top_p: 0.9
    top_k: 40
    repetition_penalty: 1.1
  
  performance:
    compile: true
    warmup: true
    use_metal: true
  
  memory:
    clear_cache_after: 5
    gc_threshold: 0.8

CLI Commands

Core Commands

mlx info

Display system information and current status.

mlx info                    # Pretty formatted output
mlx info --format json     # JSON output
mlx info --format text     # Plain text output

mlx models

List all available models with their status.

mlx models                  # Basic model list
mlx models --detailed       # Detailed information including sizes
mlx models --format json   # JSON output for scripting

mlx load

Load a model into memory.

mlx load qwen-32b                    # Load with default settings
mlx load mixtral-8x7b --no-compile  # Load without compilation
mlx load phi-3-medium --preset fast # Load with performance preset
mlx load llama-8b --force           # Force reload if already loaded

mlx chat

Ask a single question and get an answer (Q&A mode).

# Basic Q&A
mlx chat "What is machine learning?" --model qwen-32b

# With custom parameters
mlx chat "Explain Python decorators" --model qwen-32b --max-tokens 1024 --temperature 0.3

# With system prompt preset
mlx chat "How do I center a div?" --preset coding

# Save to custom file
mlx chat "Tell me about AI" --save my_qa.json

# Different output formats
mlx chat "Hello world" --format json
mlx chat "Hello world" --format text

mlx stream

Interactive streaming chat session with real-time token generation.

# Basic interactive chat
mlx stream --model qwen-32b

# With custom parameters
mlx stream --model qwen-32b --temperature 0.9 --max-tokens 2048

# Show individual tokens (for debugging)
mlx stream --show-tokens --model deepseek-coder

# With system prompt preset
mlx stream --preset creative --model qwen-32b

Utility Commands

mlx download

Download models without loading them.

mlx download qwen-32b       # Download if not exists
mlx download mixtral-8x7b --force  # Force re-download

mlx unload

Unload the current model from memory.

mlx unload

mlx health

Check system health and memory usage.

mlx health
mlx health --format json

mlx benchmark

Run performance benchmarks.

mlx benchmark                           # Quick benchmark with default model
mlx benchmark --model qwen-32b --prompts 10  # Extended benchmark
mlx benchmark --output results.json    # Save results to file
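
Benchmark results written with --output appear to be JSON (the examples use a .json filename), so they can be inspected with standard tooling; the exact result schema is not documented here:

# Run an extended benchmark, save the results, and pretty-print them
mlx benchmark --model qwen-32b --prompts 10 --output results.json
python3 -m json.tool results.json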

Advanced Usage

Makefile Commands

The included Makefile provides convenient shortcuts:

make help                           # Show all available commands
make install                       # Install dependencies
make models                        # List models with details
make download MODEL=qwen-32b       # Download specific model
make chat MODEL=qwen-32b PROMPT="What is AI?"  # Ask single question
make stream MODEL=mixtral-8x7b     # Start interactive chat
make benchmark MODEL=phi-3-medium  # Run benchmark
make clean                         # Clean cache and temp files

Performance Presets

Use performance presets for different use cases:

# Fast: No compilation, lower memory usage
mlx load qwen-32b --preset fast

# Balanced: Standard settings (default)
mlx load qwen-32b --preset balanced

# Maximum: Full optimization, highest memory usage
mlx load qwen-32b --preset maximum

System Prompt Presets

Use system prompt presets for different scenarios. The prompt files live in the /prompts directory; you can add your own and reference them in the chat settings in inference.yaml (see the sketch after the examples below).

mlx chat --preset coding      # Programming assistant
mlx chat --preset creative    # Creative writing helper
mlx chat --preset academic    # Academic research assistant
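
As a rough sketch of adding your own preset: create a prompt file such as prompts/review.txt containing the system prompt text, then reference it from the chat settings in inference.yaml. The key names below are assumptions for illustration only; check your inference.yaml for the actual schema:

# config/inference.yaml -- hypothetical chat preset mapping
inference:
  chat:
    presets:
      coding: "prompts/coding.txt"
      review: "prompts/review.txt"   # newly added preset

The new preset would then be used like the built-in ones, e.g. mlx chat "Review this function" --preset review.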

Shell Aliases

Run the included install_aliases.sh script to add convenient aliases to your shell profile:

# Adds aliases to ~/.zshrc or ~/.bashrc
chmod +x <full_path>/install_aliases.sh
./install_aliases.sh
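
If you prefer to define aliases by hand rather than run the script, something like the following in ~/.zshrc or ~/.bashrc covers the common commands. These alias names are purely illustrative and not necessarily what install_aliases.sh creates:

# Example aliases -- illustrative only
alias mlxc='mlx chat --model qwen-32b'
alias mlxs='mlx stream --model qwen-32b'
alias mlxm='mlx models --detailed'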

Model Recommendations

Hardware-Specific Recommendations

M1 Max (32GB)

  • Primary: qwen-32b, mixtral-8x7b, deepseek-coder
  • Memory fraction: 0.85

M1 Pro (16GB)

  • Primary: llama-8b, phi-3-medium, gemma-7b
  • Memory fraction: 0.80

M2 Ultra (64GB+)

  • Primary: qwen-32b, mixtral-8x7b with maximum preset
  • Memory fraction: 0.90
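
The memory fractions above are set per model in config/models.yaml. As a sketch based on the schema shown earlier, a model tuned for a 16GB machine might look like this:

models:
  llama-8b:
    # ...other fields as in the models.yaml example above...
    memory_fraction: 0.80   # leave headroom for the OS on 16GB machines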

MLX Concepts Reference

Quantization Comparison

4-bit
  • Memory reduction: ~75% (32GB → 8GB)
  • Speed impact: often faster due to reduced memory bandwidth
  • Accuracy impact: 1-3% degradation typical
  • Best for: most use cases, limited memory

8-bit
  • Memory reduction: ~50% (32GB → 16GB)
  • Speed impact: moderate improvement
  • Accuracy impact: <1% degradation
  • Best for: balancing quality and efficiency

16-bit (no quantization)
  • Memory reduction: none (full size)
  • Speed impact: slower, higher memory bandwidth
  • Accuracy impact: none (maximum quality)
  • Best for: accuracy-critical tasks with ample memory
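
In this CLI the quantization level is part of the model entry in config/models.yaml. A hypothetical 8-bit variant of the earlier qwen-32b entry might look like the following; the mlx_repo value is illustrative, so substitute whichever 8-bit MLX conversion actually exists:

models:
  qwen-32b-hq:
    name: "Qwen2.5-32B-Instruct"
    hf_repo: "Qwen/Qwen2.5-32B-Instruct"
    mlx_repo: "mlx-community/Qwen2.5-32B-Instruct-8bit"   # illustrative repository name
    quantization:
      bits: 8
      group_size: 64
    context_length: 32768
    memory_fraction: 0.85
    description: "8-bit variant for accuracy-critical tasks"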

Model Recommendations by Use Case

General Purpose: qwen-32b (8-32GB depending on quantization)
  • Best overall capability
  • Strong reasoning and knowledge
  • Multilingual support

Coding: deepseek-coder (6-24GB depending on quantization)
  • Specialized for programming
  • Excellent code generation
  • Multiple language support

Fast Responses: llama-8b (4-16GB depending on quantization)
  • Quick inference
  • Good quality for size
  • Efficient memory usage

High Quality: qwen-32b-hq (16-32GB)
  • 8-bit quantization
  • Maximum accuracy
  • Research-grade outputs

Reasoning: mixtral-8x7b (12-48GB depending on quantization)
  • Mixture-of-experts architecture
  • Strong analytical capabilities
  • Complex problem solving
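
For example, a coding-focused session following the recommendations above could look like this, using only the commands and flags documented earlier:

# Fetch the coding model, then start a streaming session with the coding preset
mlx download deepseek-coder
mlx stream --model deepseek-coder --preset coding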

Configuration Options

  • temperature (default 0.7, range 0.0-2.0): controls randomness in generation; higher is more creative, lower is more focused
  • top_p (default 0.9, range 0.0-1.0): nucleus sampling threshold; controls the diversity of token selection
  • top_k (default 40, range 1-100): top-k sampling limit; restricts selection to the top k candidate tokens
  • max_tokens (default 512, range 1-4096): maximum tokens to generate; trades response length against speed
  • repetition_penalty (default 1.1, range 1.0-2.0): penalty for repeating tokens; reduces repetitive text
  • memory_fraction (default 0.85, range 0.5-0.95): GPU memory allocation; higher gives more capacity but risks out-of-memory errors
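
The defaults above come from config/inference.yaml. The generation parameters exposed as CLI flags in the examples earlier (at least --max-tokens and --temperature) can be overridden per invocation:

# Override the defaults for a single, more deterministic answer
mlx chat "Summarize the MLX framework" --model qwen-32b --temperature 0.2 --max-tokens 256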

Performance Presets

fast
  • Compilation: disabled
  • Memory usage: low
  • Speed: fastest
  • Quality: good
  • Best for: quick testing, development

balanced
  • Compilation: enabled
  • Memory usage: medium
  • Speed: fast
  • Quality: high
  • Best for: general use, production

maximum
  • Compilation: enabled, with additional optimizations
  • Memory usage: high
  • Speed: fastest
  • Quality: highest
  • Best for: performance-critical tasks

Troubleshooting

Common Issues

Out of Memory Error

# Reduce memory fraction in model config or use smaller model
mlx load llama-8b  # Switch to smaller model
mlx unload         # Unload current model first

Model Not Found

# Download the model first
mlx download qwen-32b
mlx models  # Check available models

Slow Performance

# Ensure compilation is enabled
mlx load qwen-32b --compile
# Use maximum preset for best performance
mlx load qwen-32b --preset maximum

Generation Hangs

# Check memory usage and clear cache
mlx health
# Restart if needed
mlx unload && mlx load qwen-32b

Performance Tips

  1. Enable Compilation: Always use --compile for a 10-20% performance boost
  2. Memory Management: Adjust memory_fraction in the model config (0.7-0.9)
  3. Batch Size: Keep it at 1 for large models, 2-4 for smaller ones
  4. Context Length: Reduce context_length if you run into memory issues (see the sketch after this list)
  5. Quantization: Use 4-bit quantization for memory efficiency
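
As a sketch of the relevant knobs, based on the configuration files shown earlier: memory_fraction and context_length are per-model settings in config/models.yaml, while cache clearing is controlled in config/inference.yaml. The specific values below are illustrative:

# config/models.yaml -- per-model memory settings
models:
  qwen-32b:
    context_length: 16384    # reduced from 32768 to ease memory pressure
    memory_fraction: 0.75    # lower fraction leaves headroom for the OS

# config/inference.yaml -- clear the MLX cache more often
inference:
  memory:
    clear_cache_after: 3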

Getting Help

mlx --help              # General help
mlx chat --help         # Q&A command help
mlx stream --help       # Interactive chat help
mlx models --help       # Model management help

Adding New Models

  1. Add the model configuration to config/models.yaml (see the sketch below)
  2. Test downloading and loading the model
  3. Update the documentation
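
A sketch of step 1, following the models.yaml schema from the Configuration section; the repository names and sizes here are placeholders, not a tested entry:

models:
  my-new-model:
    name: "My-New-Model-7B-Instruct"          # display name (placeholder)
    hf_repo: "<org>/<model-repo>"             # original HuggingFace repository
    mlx_repo: "mlx-community/<model>-4bit"    # pre-converted MLX weights, if available
    quantization:
      bits: 4
      group_size: 64
    context_length: 8192
    memory_fraction: 0.80
    description: "Short description of the model"

Then verify it with mlx download my-new-model followed by mlx models --detailed.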

Building from Source

git clone <repo-url>
cd mlx-inference-cli
make dev  # Install with development dependencies
make test # Run tests
