MLX Inference CLI

A custom CLI tool for running local Large Language Models (LLMs) on Apple Silicon using Apple's MLX framework.

Features

  • CLI-Only Interface: No server required, pure command-line operation
  • Multiple Model Support: Qwen, Mixtral, Llama, Phi-3, DeepSeek Coder, and more
  • Apple Silicon Optimized: Leverages MLX for maximum performance on M1/M2/M3 chips
  • Configurable: YAML-based configuration for models and inference settings
  • Interactive Chat: Full conversational interface with history management
  • Built-in Benchmarking: Performance testing and optimization tools
  • Streaming: Real-time token-by-token output
  • Extensive CLI Options: Rich set of parameters for fine-tuned control
  • File I/O Support: Read prompts from files, save outputs and chat history
  • Formatted Output: Rich formatting with progress bars and colored output

Installation

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.11 or later
  • uv (recommended) or pip

Quick Setup

# Clone the repository
git clone <your-repo-url>
cd mlx

# Install dependencies with uv (recommended)
make install

# Or install with pip
pip install -e .

# Verify installation
mlx info

Quick Start

# 1. List available models
mlx models

# 2. Download a model
mlx download qwen-32b

# 3. Ask a single question (Q&A mode)
mlx chat "Explain quantum computing" --model qwen-32b

# 4. Start an interactive chat session
mlx stream --model qwen-32b

HuggingFace Authentication

The CLI authenticates with Hugging Face automatically.

Alternatively, authenticate manually:

brew install huggingface-cli
huggingface-cli login
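
For a non-interactive setup you can also export a Hugging Face access token before downloading. This assumes the CLI fetches models through the standard huggingface_hub tooling, which reads the HF_TOKEN environment variable:

# Assumption: downloads go through huggingface_hub, which honours HF_TOKEN
export HF_TOKEN=<your_hf_access_token>
mlx download qwen-32b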

Configuration

The application uses YAML configuration files for flexible model and inference management:

Model Configuration (config/models.yaml)

Define available models with their HuggingFace repositories, quantization settings, and memory configurations:

models:
  qwen-32b:
    name: "Qwen2.5-32B-Instruct"
    hf_repo: "Qwen/Qwen2.5-32B-Instruct"
    mlx_repo: "mlx-community/Qwen2.5-32B-Instruct-4bit"
    quantization:
      bits: 4
      group_size: 64
    context_length: 32768
    memory_fraction: 0.85
    description: "High-quality 32B parameter model with strong reasoning capabilities"

Inference Configuration (config/inference.yaml)

Control default generation parameters, performance settings, and memory management:

inference:
  generation:
    max_tokens: 512
    temperature: 0.7
    top_p: 0.9
    top_k: 40
    repetition_penalty: 1.1
  
  performance:
    compile: true
    warmup: true
    use_metal: true
  
  memory:
    clear_cache_after: 5
    gc_threshold: 0.8

CLI Commands

Core Commands

mlx info

Display system information and current status.

mlx info                    # Pretty formatted output
mlx info --format json     # JSON output
mlx info --format text     # Plain text output

mlx models

List all available models with their status.

mlx models                  # Basic model list
mlx models --detailed       # Detailed information including sizes
mlx models --format json   # JSON output for scripting

mlx load

Load a model into memory.

mlx load qwen-32b                    # Load with default settings
mlx load mixtral-8x7b --no-compile  # Load without compilation
mlx load phi-3-medium --preset fast # Load with performance preset
mlx load llama-8b --force           # Force reload if already loaded

mlx chat

Ask a single question and get an answer (Q&A mode).

# Basic Q&A
mlx chat "What is machine learning?" --model qwen-32b

# With custom parameters
mlx chat "Explain Python decorators" --model qwen-32b --max-tokens 1024 --temperature 0.3

# With system prompt preset
mlx chat "How do I center a div?" --preset coding

# Save to custom file
mlx chat "Tell me about AI" --save my_qa.json

# Different output formats
mlx chat "Hello world" --format json
mlx chat "Hello world" --format text

mlx stream

Interactive streaming chat session with real-time token generation.

# Basic interactive chat
mlx stream --model qwen-32b

# With custom parameters
mlx stream --model qwen-32b --temperature 0.9 --max-tokens 2048

# Show individual tokens (for debugging)
mlx stream --show-tokens --model deepseek-coder

# With system prompt preset
mlx stream --preset creative --model qwen-32b

Utility Commands

mlx download

Download models without loading them.

mlx download qwen-32b       # Download if not exists
mlx download mixtral-8x7b --force  # Force re-download

mlx unload

Unload the current model from memory.

mlx unload

mlx health

Check system health and memory usage.

mlx health
mlx health --format json

mlx benchmark

Run performance benchmarks.

mlx benchmark                           # Quick benchmark with default model
mlx benchmark --model qwen-32b --prompts 10  # Extended benchmark
mlx benchmark --output results.json    # Save results to file
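
Benchmark results written with --output appear to be JSON (the examples use a .json filename), so they can be inspected with standard tooling; the exact result schema is not documented here:

# Run an extended benchmark, save the results, and pretty-print them
mlx benchmark --model qwen-32b --prompts 10 --output results.json
python3 -m json.tool results.json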

Advanced Usage

Makefile Commands

The included Makefile provides convenient shortcuts:

make help                           # Show all available commands
make install                       # Install dependencies
make models                        # List models with details
make download MODEL=qwen-32b       # Download specific model
make chat MODEL=qwen-32b PROMPT="What is AI?"  # Ask single question
make stream MODEL=mixtral-8x7b     # Start interactive chat
make benchmark MODEL=phi-3-medium  # Run benchmark
make clean                         # Clean cache and temp files

Performance Presets

Use performance presets for different use cases:

# Fast: No compilation, lower memory usage
mlx load qwen-32b --preset fast

# Balanced: Standard settings (default)
mlx load qwen-32b --preset balanced

# Maximum: Full optimization, highest memory usage
mlx load qwen-32b --preset maximum

System Prompt Presets

Use system prompt presets for different scenarios. The prompt files live in the /prompts directory; you can add your own and reference them in the chat settings in inference.yaml (see the sketch after the examples below).

mlx chat --preset coding      # Programming assistant
mlx chat --preset creative    # Creative writing helper
mlx chat --preset academic    # Academic research assistant
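
As a rough sketch of adding your own preset: create a prompt file such as prompts/review.txt containing the system prompt text, then reference it from the chat settings in inference.yaml. The key names below are assumptions for illustration only; check your inference.yaml for the actual schema:

# config/inference.yaml -- hypothetical chat preset mapping
inference:
  chat:
    presets:
      coding: "prompts/coding.txt"
      review: "prompts/review.txt"   # newly added preset

The new preset would then be used like the built-in ones, e.g. mlx chat "Review this function" --preset review.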

Shell Aliases

Run the included install_aliases.sh script to add convenient aliases to your shell profile:

# Adds aliases to ~/.zshrc or ~/.bashrc
chmod +x <full_path>/install_aliases.sh
./install_aliases.sh
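
If you prefer to define aliases by hand rather than run the script, something like the following in ~/.zshrc or ~/.bashrc covers the common commands. These alias names are purely illustrative and not necessarily what install_aliases.sh creates:

# Example aliases -- illustrative only
alias mlxc='mlx chat --model qwen-32b'
alias mlxs='mlx stream --model qwen-32b'
alias mlxm='mlx models --detailed'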

Model Recommendations

Hardware-Specific Recommendations

M1 Max (32GB)

  • Primary: qwen-32b, mixtral-8x7b, deepseek-coder
  • Memory fraction: 0.85

M1 Pro (16GB)

  • Primary: llama-8b, phi-3-medium, gemma-7b
  • Memory fraction: 0.80

M2 Ultra (64GB+)

  • Primary: qwen-32b, mixtral-8x7b with maximum preset
  • Memory fraction: 0.90
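
The memory fractions above are set per model in config/models.yaml. As a sketch based on the schema shown earlier, a model tuned for a 16GB machine might look like this:

models:
  llama-8b:
    # ...other fields as in the models.yaml example above...
    memory_fraction: 0.80   # leave headroom for the OS on 16GB machines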

MLX Concepts Reference

Quantization Comparison

4-bit
  • Memory reduction: ~75% (32GB → 8GB)
  • Speed impact: often faster due to reduced memory bandwidth
  • Accuracy impact: 1-3% degradation typical
  • Best for: most use cases, limited memory

8-bit
  • Memory reduction: ~50% (32GB → 16GB)
  • Speed impact: moderate improvement
  • Accuracy impact: <1% degradation
  • Best for: balancing quality and efficiency

16-bit (no quantization)
  • Memory reduction: none (full size)
  • Speed impact: slower, higher memory bandwidth
  • Accuracy impact: none (maximum quality)
  • Best for: accuracy-critical tasks with ample memory
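
In this CLI the quantization level is part of the model entry in config/models.yaml. A hypothetical 8-bit variant of the earlier qwen-32b entry might look like the following; the mlx_repo value is illustrative, so substitute whichever 8-bit MLX conversion actually exists:

models:
  qwen-32b-hq:
    name: "Qwen2.5-32B-Instruct"
    hf_repo: "Qwen/Qwen2.5-32B-Instruct"
    mlx_repo: "mlx-community/Qwen2.5-32B-Instruct-8bit"   # illustrative repository name
    quantization:
      bits: 8
      group_size: 64
    context_length: 32768
    memory_fraction: 0.85
    description: "8-bit variant for accuracy-critical tasks"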

Model Recommendations by Use Case

General Purpose: qwen-32b (8-32GB depending on quantization)
  • Best overall capability
  • Strong reasoning and knowledge
  • Multilingual support

Coding: deepseek-coder (6-24GB depending on quantization)
  • Specialized for programming
  • Excellent code generation
  • Multiple language support

Fast Responses: llama-8b (4-16GB depending on quantization)
  • Quick inference
  • Good quality for size
  • Efficient memory usage

High Quality: qwen-32b-hq (16-32GB)
  • 8-bit quantization
  • Maximum accuracy
  • Research-grade outputs

Reasoning: mixtral-8x7b (12-48GB depending on quantization)
  • Mixture-of-experts architecture
  • Strong analytical capabilities
  • Complex problem solving
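
For example, a coding-focused session following the recommendations above could look like this, using only the commands and flags documented earlier:

# Fetch the coding model, then start a streaming session with the coding preset
mlx download deepseek-coder
mlx stream --model deepseek-coder --preset coding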

Configuration Options

  • temperature (default 0.7, range 0.0-2.0): controls randomness in generation; higher is more creative, lower is more focused
  • top_p (default 0.9, range 0.0-1.0): nucleus sampling threshold; controls the diversity of token selection
  • top_k (default 40, range 1-100): top-k sampling limit; restricts selection to the top k candidate tokens
  • max_tokens (default 512, range 1-4096): maximum tokens to generate; trades response length against speed
  • repetition_penalty (default 1.1, range 1.0-2.0): penalty for repeating tokens; reduces repetitive text
  • memory_fraction (default 0.85, range 0.5-0.95): GPU memory allocation; higher gives more capacity but risks out-of-memory errors
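
The defaults above come from config/inference.yaml. The generation parameters exposed as CLI flags in the examples earlier (at least --max-tokens and --temperature) can be overridden per invocation:

# Override the defaults for a single, more deterministic answer
mlx chat "Summarize the MLX framework" --model qwen-32b --temperature 0.2 --max-tokens 256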

Performance Presets

fast
  • Compilation: disabled
  • Memory usage: low
  • Speed: fastest
  • Quality: good
  • Best for: quick testing, development

balanced
  • Compilation: enabled
  • Memory usage: medium
  • Speed: fast
  • Quality: high
  • Best for: general use, production

maximum
  • Compilation: enabled, with additional optimizations
  • Memory usage: high
  • Speed: fastest
  • Quality: highest
  • Best for: performance-critical tasks

Troubleshooting

Common Issues

Out of Memory Error

# Reduce memory fraction in model config or use smaller model
mlx load llama-8b  # Switch to smaller model
mlx unload         # Unload current model first

Model Not Found

# Download the model first
mlx download qwen-32b
mlx models  # Check available models

Slow Performance

# Ensure compilation is enabled
mlx load qwen-32b --compile
# Use maximum preset for best performance
mlx load qwen-32b --preset maximum

Generation Hangs

# Check memory usage and clear cache
mlx health
# Restart if needed
mlx unload && mlx load qwen-32b

Performance Tips

  1. Enable Compilation: Always use --compile for a 10-20% performance boost
  2. Memory Management: Adjust memory_fraction in the model config (0.7-0.9)
  3. Batch Size: Keep it at 1 for large models, 2-4 for smaller ones
  4. Context Length: Reduce context_length if you run into memory issues (see the sketch after this list)
  5. Quantization: Use 4-bit quantization for memory efficiency
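
As a sketch of the relevant knobs, based on the configuration files shown earlier: memory_fraction and context_length are per-model settings in config/models.yaml, while cache clearing is controlled in config/inference.yaml. The specific values below are illustrative:

# config/models.yaml -- per-model memory settings
models:
  qwen-32b:
    context_length: 16384    # reduced from 32768 to ease memory pressure
    memory_fraction: 0.75    # lower fraction leaves headroom for the OS

# config/inference.yaml -- clear the MLX cache more often
inference:
  memory:
    clear_cache_after: 3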

Getting Help

mlx --help              # General help
mlx chat --help         # Q&A command help
mlx stream --help       # Interactive chat help
mlx models --help       # Model management help

Adding New Models

  1. Add the model configuration to config/models.yaml (see the sketch below)
  2. Test downloading and loading the model
  3. Update the documentation
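
A sketch of step 1, following the models.yaml schema from the Configuration section; the repository names and sizes here are placeholders, not a tested entry:

models:
  my-new-model:
    name: "My-New-Model-7B-Instruct"          # display name (placeholder)
    hf_repo: "<org>/<model-repo>"             # original HuggingFace repository
    mlx_repo: "mlx-community/<model>-4bit"    # pre-converted MLX weights, if available
    quantization:
      bits: 4
      group_size: 64
    context_length: 8192
    memory_fraction: 0.80
    description: "Short description of the model"

Then verify it with mlx download my-new-model followed by mlx models --detailed.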

Building from Source

git clone <repo-url>
cd mlx-inference-cli
make dev  # Install with development dependencies
make test # Run tests
