AtlasInfer v2.0

Overview

AtlasInfer enables efficient LLM inference on consumer hardware using LADQ (Layer-Adaptive Dynamic Quantization), a novel approach that allocates precision per layer based on sensitivity analysis.

What Makes LADQ Novel?

Existing Methods                                                 | LADQ
AWQ: Uses activation statistics to optimize quantization scales | Uses sensitivity to allocate precision levels
GPTQ: One-time calibration, static precision                    | Dynamic per-layer precision (FP16/FP8/FP4)
bitsandbytes: Uniform bit-width for all layers                   | Budget-aware mixed precision

Key insight: Given a fixed memory budget, spend precision where it matters most.
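
A minimal sketch of that idea (not AtlasInfer's actual allocator, which is exposed later via PrecisionAllocator): start every layer at the cheapest precision, then upgrade the most sensitive layers first while the budget allows. The dictionaries and byte costs below are hypothetical.

# Illustrative greedy budget allocation; names and data layout are
# hypothetical, not AtlasInfer's internal representation.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "fp16": 2.0}

def greedy_allocate(sensitivities, layer_sizes, budget_bytes):
    # Start with the cheapest precision everywhere.
    alloc = {name: "fp4" for name in layer_sizes}
    used = sum(n * BYTES_PER_PARAM["fp4"] for n in layer_sizes.values())

    # Upgrade layers in order of decreasing sensitivity while the budget holds.
    for name in sorted(sensitivities, key=sensitivities.get, reverse=True):
        for target in ("fp8", "fp16"):
            extra = layer_sizes[name] * (BYTES_PER_PARAM[target] - BYTES_PER_PARAM[alloc[name]])
            if used + extra <= budget_bytes:
                used += extra
                alloc[name] = target
    return alloc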

Installation

# Clone
git clone https://github.com/nishantkluhera/AtlasInfer.git
cd AtlasInfer

# Install PyTorch (from pytorch.org for your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

Quick Start

Basic Inference (Uniform FP8)

python -m atlasinfer.inference --model facebook/opt-1.3b --prompt "Hello world"

LADQ with Memory Budget

# Allocate 2GB memory budget - LADQ decides precision per layer
python -m atlasinfer.inference --model facebook/opt-1.3b --prompt "Hello" --memory-budget 2.0

Python API

from atlasinfer import AtlasInference

# Basic
engine = AtlasInference("facebook/opt-1.3b")
print(engine.generate("The future of AI is"))

# With LADQ
engine = AtlasInference("facebook/opt-1.3b", memory_budget_gb=2.0)
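
Generation is the same call in both cases; with memory_budget_gb set, the engine presumably runs the sensitivity profiling and precision allocation described below before loading weights.

print(engine.generate("The future of AI is"))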

Advanced: Manual LADQ Pipeline

from transformers import AutoModelForCausalLM
from atlasinfer import (
    compute_sensitivity_scores,
    PrecisionAllocator,
    get_layer_sizes,
    quantize_model_ladq
)

# 1. Load model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto")

# 2. Profile sensitivities
sensitivities = compute_sensitivity_scores(model)

# 3. Allocate precision within budget
allocator = PrecisionAllocator()
layer_sizes = get_layer_sizes(model)
allocation = allocator.allocate(
    sensitivities,
    layer_sizes,
    memory_budget_bytes=2 * 1024**3  # 2 GB
)

# 4. Apply mixed-precision quantization
quantize_model_ladq(model, allocation.allocations)
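
To sanity-check the allocator's decision before applying it, something like the following works, assuming allocation.allocations maps layer names to precision labels (a guess at the structure, not a documented API):

from collections import Counter

# Count how many layers landed at each precision level.
for precision, count in Counter(allocation.allocations.values()).items():
    print(f"{precision}: {count} layers")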

Architecture

atlasinfer/
├── quantizer.py      # FP8 + FP4 quantization
├── linear.py         # QuantizedLinear, QuantizedLinear4bit
├── sensitivity.py    # Layer sensitivity profiler (NOVEL)
├── allocator.py      # Budget-aware precision allocator (NOVEL)
├── patcher.py        # Model patching (uniform + LADQ)
├── offload.py        # CPU↔GPU offload hooks
└── inference.py      # Main API + CLI
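
For intuition, per-tensor FP8 weight quantization typically amounts to scaling weights into the FP8 range and casting. The sketch below illustrates that idea only; it is not AtlasInfer's actual quantizer.py and needs PyTorch >= 2.1 for the float8_e4m3fn dtype.

import torch

def quantize_fp8(weight: torch.Tensor):
    # Per-tensor absmax scaling into the representable FP8 (e4m3) range.
    scale = weight.abs().max().clamp(min=1e-8) / torch.finfo(torch.float8_e4m3fn).max
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back up and undo the scaling before the matmul.
    return q.to(torch.float16) * scale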

CLI Arguments

Argument        | Description
--model, -m     | HuggingFace model name
--prompt, -p    | Input prompt
--memory-budget | VRAM budget in GB (enables LADQ)
--cpu-offload   | Enable CPU offloading
--no-quantize   | Disable quantization (FP16 baseline)
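
Flags can be combined; for example, a 2 GB budget with CPU offloading:

python -m atlasinfer.inference \
  --model facebook/opt-1.3b \
  --prompt "The future of AI is" \
  --memory-budget 2.0 \
  --cpu-offload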

License

MIT License
