An end-to-end LLM fine-tuning and inference platform for SQL generation built around Llama-3.1 8B-Instruct, using QLoRA fine-tuning, 4-bit quantization, and scalable multi-GPU inference with vLLM.
This project demonstrates production-grade infrastructure for fine-tuning and deploying large language models at scale, with a focus on memory efficiency, throughput optimization, and parallel inference across multiple GPUs.
infra/
├── main_fine_tuning_pipeline.ipynb # QLoRA fine-tuning pipeline (Colab A100/H100)
│ # Loads Gretel dataset, trains adapter, saves to GDrive
│
├── model_load_to_HF.ipynb # Model export workflow
│ # Loads fine-tuned model from GDrive, pushes to HuggingFace Hub
│ # Includes test inference from HF hub
│
├── Inference_Quantization_SQL.ipynb # Quantization and inference testing
│ # Quantizes model to 4-bit GGUF format
│ # Demonstrates inference from HF hub
│
├── tokenizer.py # Utility for prompt formatting
│ # Handles ChatML template application
│
├── requirements.txt # Python dependencies for local/RunPod environment
├── setup.sh # Local environment bootstrap (optional, not needed for Colab)
│
├── checkpoints/ # Training checkpoints and LoRA adapters (GDrive)
├── data/ # Training dataset
│ ├── unified/master_dataset # Preprocessed ChatML data
│ └── tokenized_master # Tokenized for training
│
└── venv/ # Python virtual environment (local only)
Workflow:
- Fine-tune on Colab with `main_fine_tuning_pipeline.ipynb` → adapter saved to GDrive
- Export GGUF with `model_load_to_HF.ipynb` → first artifact pushed to HuggingFace Hub
- Merge & quantize with `Inference_Quantization_SQL.ipynb` → NVFP4 Safetensors
- Deploy on RunPod (A5090 Blackwell) using the NVFP4 quantized model from the HF hub
- Open `main_fine_tuning_pipeline.ipynb` in Google Colab
- Set runtime: Runtime → Change runtime type → GPU → A100 or H100
- Add an `HF_TOKEN` secret (left sidebar) for the model push
- Run all cells top to bottom
- The fine-tuned adapter is saved to your GDrive
What happens:
- Loads Gretel SQL dataset (~100k samples)
- Converts to ChatML format for Llama-3
- Fine-tunes with QLoRA (4-bit) + Liger Kernels
- Evaluates on holdout SQL samples
- Saves checkpoint to `checkpoints/` in GDrive
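The ChatML conversion step above can be sketched as follows. This is an illustrative sketch only: the field names (`sql_context`, `sql_prompt`, `sql`) and the system-prompt wording are assumptions about the Gretel dataset schema, not taken from the notebook.

```python
# Hypothetical sketch of converting one Gretel-style text-to-SQL sample
# into Llama-3 chat messages; field names are assumed, not confirmed.
def to_chat_messages(sample: dict) -> list[dict]:
    """Convert one text-to-SQL sample into system/user/assistant messages."""
    return [
        {"role": "system",
         "content": f"You are a SQL assistant. Schema:\n{sample['sql_context']}"},
        {"role": "user", "content": sample["sql_prompt"]},
        {"role": "assistant", "content": sample["sql"]},
    ]

example = {
    "sql_context": "CREATE TABLE orders (id INT, amount DECIMAL);",
    "sql_prompt": "What is the total order amount?",
    "sql": "SELECT SUM(amount) FROM orders;",
}
messages = to_chat_messages(example)
```

Each converted sample is then rendered through the Llama-3.1 chat template before tokenization.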
Training time:
| GPU | VRAM | Runtime |
|---|---|---|
| A100 40GB | 39GB | ~4 hours |
| H100 95GB | 78GB | ~1.5 hours |
- Open `model_load_to_HF.ipynb` in Colab
- Loads the fine-tuned adapter from GDrive
- Exports to GGUF format
- Tests inference on sample queries
- Pushes the GGUF model to HuggingFace Hub
- Open `Inference_Quantization_SQL.ipynb` in Colab
- Loads the GGUF model from the HF hub
- Merges adapter weights with base model
- Applies NVFP4 quantization (4-bit)
- Tests inference performance
- Exports quantized model as Safetensors
- Pushes final quantized artifact to HuggingFace Hub
- Launch a RunPod GPU pod
- SSH in and install vLLM: `pip install vllm`
- Start the vLLM server (see Deploying on RunPod)
- Run Python inference client with your RunPod URL
- Get SQL generation results via API
Hardware Limitation: The A5000's Ada architecture does not support NVFP4 quantization. Deploy the full-precision fp16 model instead:
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--served-model-name sql-genie \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &

Configuration for A5000 (fp16):
- Full-precision model (~19 GB) required due to NVFP4 incompatibility
- `--max-model-len 32000` (reduced from 128000 to fit the KV-cache in 1-2 GB of headroom)
- `--gpu-memory-utilization 0.95` to use the available 24 GB
- Limited to a single concurrent request
- Expected throughput: 18-22 tokens/sec
Deploy two parallel vLLM instances on separate A5000 GPUs, each running the fp16 (non-quantized) model:
# GPU 0: Instance 1 on port 8000
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--served-model-name sql-genie-gpu0 \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &
# GPU 1: Instance 2 on port 8001
CUDA_VISIBLE_DEVICES=1 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--served-model-name sql-genie-gpu1 \
--port 8001 \
> /workspace/logs/vllm_gpu1.log 2>&1 &

Key configuration for A5000 dual deployment (Ada, fp16):
- `--max-model-len 32000` (tight memory constraints)
- `--gpu-memory-utilization 0.95` maximizes the available 24 GB per GPU
- `--kv-cache-dtype fp8` reduces KV-cache memory by 2×
- Two independent full-model instances (Ada lacks NVFP4 support)
- Each instance serves requests independently on ports 8000 and 8001
- Load balancer distributes requests round-robin across both instances
- Combined throughput: ~36-44 tokens/sec (2× single GPU)
- Model is fp16 (no Blackwell quantization available)
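The round-robin distribution described above can be sketched as a minimal router. This is an illustrative sketch, not the project's actual load balancer; in production you would typically front the two instances with nginx or a similar reverse proxy. The ports come from the serve commands above.

```python
# Minimal round-robin router across the two vLLM instances.
# Illustrative only; a real deployment would use nginx/HAProxy.
from itertools import cycle

class RoundRobinRouter:
    def __init__(self, backends: list[str]):
        self._backends = cycle(backends)  # endless rotation over backends

    def next_backend(self) -> str:
        return next(self._backends)

router = RoundRobinRouter(
    ["http://localhost:8000/v1", "http://localhost:8001/v1"]
)
picks = [router.next_backend() for _ in range(4)]
# Alternates between the port-8000 and port-8001 instances
```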
RunPod pods do not come with vLLM pre-installed. Set up the inference environment as follows:
# SSH into your RunPod instance
ssh root@<pod-ip>
# Install vLLM (required for inference)
pip install vllm
# Start vLLM server with your quantized model
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie \
--port 8000 \
> /workspace/logs/vllm_server.log 2>&1 &
# Check if server is running
tail -f /workspace/logs/vllm_server.log

The vLLM server will be accessible at your RunPod URL (e.g., http://<runpod-url>:8000/v1/completions).
Deploy two parallel vLLM instances on separate A5090 GPUs (Blackwell architecture), each running the NVFP4 quantized model independently:
# GPU 0: Instance 1 on port 8000
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie-gpu0 \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &
# GPU 1: Instance 2 on port 8001
CUDA_VISIBLE_DEVICES=1 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie-gpu1 \
--port 8001 \
> /workspace/logs/vllm_gpu1.log 2>&1 &

Key configuration for A5090 parallel deployment (Blackwell NVFP4):
- `--max-model-len 128000` (4× longer than the A5000's 32000)
- `--gpu-memory-utilization 0.90` leverages 28-29 GB per GPU
- `--kv-cache-dtype fp8` saves additional headroom for batch processing
- Two independent full-model instances maximize request concurrency
- Each instance serves requests independently on ports 8000 and 8001
- Load balancer distributes requests round-robin across both instances
- Model is NVFP4 quantized (Blackwell hardware acceleration required)
Dataset (Gretel SQL 100k)
↓
ChatML Preprocessing
↓
QLoRA Fine-tuning (Unsloth + Liger)
↓
GGUF Export + HuggingFace Push (from GDrive)
↓
Adapter Merge
↓
4-bit Quantization (NVFP4)
↓
Safetensors + HuggingFace Push
Workflow breakdown:
- Fine-tuning - QLoRA adapters trained on Gretel SQL dataset
- Checkpoint Save - Saved to GDrive for persistence
- GGUF Export - First export format pushed to HuggingFace Hub
- Adapter Merge - Merge LoRA weights with base model
- Quantization - Apply NVFP4 (Blackwell-optimized) quantization
- Final Push - Quantized model uploaded as Safetensors format to HF Hub
The quantized Safetensors artifact is what's deployed on Blackwell GPUs (A5090) via vLLM.
Client Request
↓
Load Balancer / Router
↓
├─→ GPU0: vLLM Server (Port 8000) - Full Model Instance
├─→ GPU1: vLLM Server (Port 8001) - Full Model Instance
└─→ GPU2: vLLM Server (Port 8002) - Full Model Instance [optional]
↓
SQL Generation Response (OpenAI-compatible API)
Each GPU runs a complete, independent model instance. Requests are load-balanced across instances for parallel processing.
Parallel vs. Distributed Inference:
This architecture uses parallel inference with multiple full model instances rather than distributed inference with model sharding. Each GPU has a complete copy of the model and operates independently. This approach is optimal for models that fit on individual GPUs, providing simpler deployment, better fault isolation, and easier horizontal scaling compared to tensor-parallelism-based distributed inference.
| Parameter | Value | Notes |
|---|---|---|
| Model | `meta-llama/Llama-3.1-8B-Instruct` | Base model from Meta |
| LoRA Rank (R) | 128 | Optimized for 8B model, balances capacity vs VRAM |
| LoRA Alpha | 128 | Scaling factor (R = Alpha for 1.0 scaling) |
| LoRA Dropout | 0.0 | Disabled (large batch size provides regularization) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | All attention + feedforward projections |
| Max Seq Length | 1024 | Reduced from 2048 (quadratic attention cost) |
| Batch Size | 128 | Linear VRAM cost = fast training |
| Gradient Accumulation | 2 | Effective batch size: 256 |
| Learning Rate | 8e-4 | Scaled with batch size |
| Warmup Ratio | 0.05 | 5% linear warmup |
| Num Epochs | 1 | Single pass over dataset |
| Save Interval | Every 200 steps | Checkpoint frequency |
Effective batch size: 128 × 2 = 256 per gradient update
Why MAX_SEQ_LENGTH=1024?
The SQL dataset's longest sequences are ~500 tokens. Using 1024 provides comfortable headroom without the quadratic overhead of 2048. Initial estimate with 2048 was 159+ hours; reducing to 1024 cut training time dramatically while maintaining full dataset coverage.
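The quadratic-cost argument above can be put in numbers. This is illustrative arithmetic only: attention FLOPs scale roughly with the square of sequence length, while other parts of training scale linearly, so the overall speedup is large but less than the attention-only ratio.

```python
# Attention cost scales ~O(n^2) in sequence length, so halving
# MAX_SEQ_LENGTH from 2048 to 1024 quarters the attention cost.
seq_long, seq_short = 2048, 1024
attention_ratio = (seq_long / seq_short) ** 2
print(attention_ratio)  # 4.0 -> 4x cheaper attention at 1024 tokens
```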
| Dataset | Samples | Purpose |
|---|---|---|
| Gretel SQL | 100,000 | SQL generation pairs |
| Finetome | 100,000 | General instruction following |
| UltraChat | 200,000 | Conversational diversity |
| Total | 400,000 | Mixed training corpus |
All datasets converted to Llama-3.1 chat template format before training.
| Artifact | Location | Size | Format | Purpose |
|---|---|---|---|---|
| LoRA Adapter | `sql-genie-lora/` | ~33 MB | PyTorch | Fine-tuned weights only |
| GGUF Model | HuggingFace Hub | ~9-10 GB | GGUF | Intermediate checkpoint |
| Merged + Quantized | HuggingFace Hub | ~4-5 GB | Safetensors (NVFP4) | Production deployment |
| Checkpoints | GDrive `/models/Infra/` | ~33 MB each | LoRA | Training saves (every 200 steps) |
Deployment Artifact: Use the Safetensors NVFP4 model from HuggingFace Hub. Requires Blackwell architecture (A5090) for hardware-accelerated inference.
The model uses a two-stage quantization approach optimized for Blackwell architecture inference:
NVFP4 is NVIDIA's proprietary 4-bit floating-point format native to Blackwell GPUs. It provides hardware-accelerated inference with minimal accuracy loss compared to fp16.
Compression achieved: ~16GB (fp16) → ~4-5GB NVFP4 (4× reduction)
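The compression figure can be sanity-checked with back-of-envelope arithmetic. The ~8.03B parameter count for Llama-3.1 8B is an approximation; the real NVFP4 artifact is somewhat larger than the raw weight bytes because of quantization scales and metadata.

```python
# Back-of-envelope check of the fp16 -> NVFP4 compression figures.
params = 8.03e9                 # approximate Llama-3.1 8B parameter count
fp16_gb = params * 2 / 1e9      # fp16 = 2 bytes per parameter
nvfp4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter
print(f"fp16: {fp16_gb:.1f} GB, NVFP4: {nvfp4_gb:.1f} GB")
# Raw weights: ~16.1 GB -> ~4.0 GB; scales/metadata push the
# shipped artifact toward the ~4-5 GB range quoted above.
```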
During token generation, attention key-value caches are stored in FP8 format with tensor-wise dynamic quantization. This further reduces memory consumption during inference without affecting output quality.
Configuration:
- Format: FP8 (8-bit floating point)
- Strategy: Tensor-wise (per-tensor quantization)
- Dynamic: Disabled (static quantization)
- Symmetric: True (symmetric range)
Quantization uses a calibration dataset of 128 representative SQL generation samples to determine optimal quantization parameters. Samples cover diverse SQL patterns (aggregations, joins, filtering, grouping) to ensure the quantized model generalizes well across different schemas.
Combined NVFP4 weights + FP8 KV-cache achieves:
- Model weights: 4× smaller (16GB → 4-5GB)
- Peak inference VRAM: 27-28GB → 14-16GB on A5090
- No degradation in SQL generation quality
- Hardware-accelerated inference on Blackwell GPUs
Hardware Limitation: The A5000's Ada architecture does not support the NVFP4 quantization format; the full-precision fp16 model is required.
Model Instance: 1× Llama-3.1 8B-Instruct (fp16, unquantized)
| Metric | Value | Notes |
|---|---|---|
| Base Load VRAM | 23-24 GB | Full precision weights |
| Peak VRAM (256 tokens) | 23.8 GB | Limited headroom for KV-cache |
| Throughput (batch=1) | 18-22 tokens/sec | Slower than A5090 due to precision |
| Latency (128 tokens) | 5.8-6.4 sec | Higher latency vs quantized |
| Latency (256 tokens) | 11.6-12.8 sec | Memory bandwidth limited |
| Max Batch Size | 1 | No room for batching |
| Max Seq Length | 32000 | Limited by tight memory margins |
Note: For A5000 deployment, use unquantized models or alternative quantization formats (8-bit) instead of NVFP4.
Hardware Advantage: A5090 uses Blackwell architecture with full support for NVFP4 quantization, enabling 3-4× better performance than A5000.
Model Configuration: 2× parallel instances of NVFP4 quantized models (full capacity)
| Metric | A5000 Single | A5000 Dual (fp16) | A5090 Dual (NVFP4) | A5090 Improvement |
|---|---|---|---|---|
| Quantization | Full precision | Full precision | NVFP4 (Blackwell) | Hardware-accelerated |
| Base Load VRAM/GPU | 23-24 GB | 23-24 GB | 4-5 GB | 4.8-5.5× smaller |
| Available for KV-cache | <1 GB | <1 GB | 27-28 GB | 27× more headroom |
| Throughput | 18-22 tok/sec | 36-44 tok/sec | 68-75 tok/sec | 1.9× vs A5000 dual |
| Latency P99 | 6.4 sec | 3.2 sec | 1.7 sec | 1.9× vs A5000 dual |
| Max Concurrent Requests | 1 | 2-4 | 20-24 | 5-10× vs A5000 dual |
| Max Seq Length | 32000 | 32000 | 128000 | 4× longer contexts |
*A5090 NVFP4 compression (4-5GB) vs A5000 fp16 (23-24GB) enables dramatically better scaling and concurrency
Model Load Phase (A5000 24GB - Full Precision):
Base Model Weights (fp16): ~19 GB
Attention Buffers: ~2.5 GB
CUDA Runtime Overhead: ~0.5-1 GB
─────────────────────────────────
Total: ~22-24 GB (max capacity)
Headroom for KV-cache: <1 GB
Model Load Phase (A5090 32GB - NVFP4 Quantized):
NVFP4 weights combined with FP8 KV-cache quantization:
Base Model Weights (NVFP4): ~4-5 GB (4× compression vs fp16)
Attention Buffers: ~1-1.5 GB
CUDA Runtime Overhead: ~0.5-1 GB
─────────────────────────────────
Total Model: ~6-7 GB
Available for KV-cache (FP8): ~25-26 GB
Generation Phase (KV-Cache Growth - A5090 NVFP4):
Model Load: ~6-7 GB
Prompt Processing (512 tok): ~10-12 GB (FP8 KV-cache)
Token-by-token Generation: ~12-24 GB (grows linearly, FP8)
Formula (with FP8): VRAM_gen ≈ base_load + (seq_len × layers × hidden_size / 8)
A5090 Blackwell advantage: 25GB+ available for KV-cache enables:
- Very long sequences (128k tokens)
- Higher batch sizes (4-6 concurrent requests)
- No memory spillover
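The KV-cache growth above can be estimated more explicitly. This sketch assumes Llama-3.1 8B architecture values (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and 1 byte per element for fp8; these numbers are assumptions for illustration, not measurements from the deployment.

```python
# Hedged KV-cache size estimator, assuming Llama-3.1 8B shapes:
# 32 layers, 8 KV heads (GQA), head_dim 128, fp8 = 1 byte/element.
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    # K and V each hold n_kv_heads * head_dim elements per token per layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers
    return seq_len * per_token / 1e9

print(kv_cache_gb(128_000))                    # ~8.4 GB: one 128k sequence, fp8
print(kv_cache_gb(128_000, bytes_per_elem=2))  # ~16.8 GB: same sequence in fp16
```

Under these assumptions, even a full 128k-token sequence in fp8 fits comfortably inside the ~25 GB of headroom left after loading NVFP4 weights, which is what enables the batch sizes quoted above.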
Architecture Requirement:
NVFP4 weights quantization only works on Blackwell architecture (A5090). Ada architecture (A5000) cannot load NVFP4 models. This is a hardware-level limitation due to Blackwell's native FP4 support.
All prompts must be formatted using the Llama-3.1 chat template before sending to vLLM. This is done on the client side, not the server.
Why: vLLM's completions endpoint receives pre-formatted strings and does not apply the chat template itself. The template is automatically pulled from the model's tokenizer_config.json when the tokenizer is loaded.
Llama-3.1 Chat Template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
This template is automatically applied when you load the tokenizer from HuggingFace and call apply_chat_template() with your messages. The add_generation_prompt=True parameter appends the assistant header to signal the model to start generating.
The formatted prompt string is then sent to vLLM's API endpoint.
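For illustration, the template above can be reproduced as a pure-string builder. This is a sketch only; in practice you should call `tokenizer.apply_chat_template()`, which reads the canonical template from `tokenizer_config.json`. The trailing assistant header here corresponds to `add_generation_prompt=True`.

```python
# String-level sketch of the Llama-3.1 chat template shown above.
# Prefer tokenizer.apply_chat_template() in real code.
def build_llama31_prompt(system_message: str, user_query: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_query}<|eot_id|>"
        # Assistant header signals the model to start generating:
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama31_prompt(
    "Schema: CREATE TABLE orders (id INT, amount DECIMAL);",
    "Total sales per customer over the last 30 days?",
)
```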
| Flag | Value | Purpose |
|---|---|---|
| `--tensor-parallel-size` | 1 | Single-GPU inference (no model sharding) |
| `--kv-cache-dtype` | fp8 | Reduces KV-cache memory by 2× |
| `--max-model-len` | 128000 (A5090) / 32000 (A5000) | Supports long context windows |
| `--dtype` | float16 | Fast mixed-precision compute |
| `--enable-prefix-caching` | enabled | Reuses KV-cache for repeated prompts |
| `--served-model-name` | custom | OpenAI-compatible model identifier |
Important: Prompts sent to vLLM must be pre-formatted on the client side using the Llama-3 tokenizer's apply_chat_template() method. vLLM's completions endpoint expects already-formatted prompt strings, not raw text.
The vLLM server exposes an OpenAI-compatible API. The inference workflow is:
- Load the tokenizer from the HuggingFace Hub
- Build messages with a system prompt (schema) and the user query
- Apply the template using `apply_chat_template(messages, add_generation_prompt=True)`
- POST the formatted prompt to the vLLM endpoint: `http://<pod-url>:8000/v1/completions`
- Receive the generated SQL in the response
Request format:
{
"model": "sql-genie",
"prompt": "[ChatML formatted prompt from tokenizer]",
"max_tokens": 200,
"temperature": 0.0
}

Response format:
{
"id": "cmpl-12345",
"object": "text_completion",
"created": 1710510234,
"model": "sql-genie",
"choices": [
{
"text": "SELECT customer_id, SUM(amount) FROM orders WHERE created_at >= NOW() - INTERVAL 30 DAY GROUP BY customer_id;",
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 45,
"completion_tokens": 32,
"total_tokens": 77
}
}

Key requirements:
- Format prompts locally using the Llama-3.1 tokenizer's `apply_chat_template()` method
- Follow the Llama-3.1 chat format (system + user roles, generation prompt appended)
- Send the formatted string (not raw text) to the vLLM API
- Use temperature 0.0 for deterministic SQL generation
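The requirements above can be sketched as a minimal stdlib-only client. The payload fields and model name come from the request format shown earlier; the base URL and the `generate_sql` helper name are placeholders for illustration.

```python
# Minimal client sketch for vLLM's OpenAI-compatible /v1/completions
# endpoint. The prompt is assumed to already be formatted with
# apply_chat_template(); base_url is a placeholder for your RunPod URL.
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    return {
        "model": "sql-genie",
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.0,  # deterministic SQL generation
    }

def generate_sql(base_url: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

payload = build_payload("[formatted prompt from apply_chat_template]")
```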
Primary Dataset: Gretel SQL
| Dataset | Samples | Format | Use Case |
|---|---|---|---|
| `gretelai/synthetic_text_to_sql` | ~100k | ChatML | SQL generation |
Data Pipeline:
Raw Dataset (JSON)
↓
ChatML Conversion
↓
Tokenization (Llama-3 template)
↓
Arrow/Parquet (vectorized training)
↓
Streaming to GPU during training
| Artifact | Location | Size | Use Case |
|---|---|---|---|
| LoRA Adapter | `sql-genie-lora/` | ~33 MB | Fine-tuned weights |
| Merged Model | HuggingFace Hub | ~16 GB | Full model (unquantized) |
| Quantized Model (GGUF) | `sql-genie-gguf/*.gguf` | ~4-8 GB | Q4_K_M deployment artifact |
| Checkpoints | `checkpoints/` | ~33 MB each | Mid-training saves (every 200 steps) |
For local development without Colab:
chmod +x setup.sh && ./setup.sh
source venv/bin/activate
pip install -r requirements.txt
# Run fine-tuning locally (requires NVIDIA GPU with 24GB+ VRAM)
jupyter notebook main_fine_tuning_pipeline.ipynb

| Component | Technology | Purpose |
|---|---|---|
| Model | Llama-3.1 8B-Instruct | Base LLM for SQL generation |
| Fine-tuning | Unsloth + Liger Kernels | 2-10× training speedup |
| Quantization | NVFP4 (Blackwell) | Hardware-accelerated 4-bit format |
| Inference | vLLM | High-throughput batch processing |
| Parallel Inference | Multi-GPU instances | Independent servers per GPU |
| Deployment | OpenAI-compatible API | Standardized client interface |
| Model Hub | HuggingFace | Model versioning & distribution |
NVFP4 Note: Quantization format requires Blackwell architecture for hardware-accelerated inference (A5090, newer H100 variants). Ada architecture (A5000, A6000) requires fp16 deployment.
- Increase batch size if VRAM allows
- Enable prefix caching for repeated prompts
- Use fp8 KV-cache instead of fp16
- Add more GPU instances for parallel request handling
- Reduce `max-model-len` if extended context windows are not required
- Use a smaller batch size (1-2) for single concurrent requests
- Pre-warm the model with dummy requests to stabilize execution
- Enable prefix caching for commonly repeated prompts
Issue: CUDA OOM error during fine-tuning
Resolution: Reduce BATCH_SIZE from 8 to 4, reduce MAX_SEQ_LENGTH from 2048 to 1024, or reduce LORA_R from 64 to 32. Gradient checkpointing is enabled by default in Unsloth and provides additional memory savings.
Issue: vLLM fails to load model on target GPU
Resolution: Verify --kv-cache-dtype is set to fp8. Consider reducing --max-model-len from 64000 to 32000 if extended context is not required. Lower batch size in inference client to reduce peak memory consumption. Switch to quantized GGUF artifact if available.
Issue: Inference throughput below 20 tokens/second
Resolution: Check GPU utilization with nvidia-smi to determine if memory bandwidth is the bottleneck. Verify vLLM process is bound to correct GPU via logs. Increase batch size if request latency allows. Profile CPU-GPU communication for potential bottlenecks.