An end-to-end LLM fine-tuning and inference platform for SQL generation built around Llama-3.1 8B-Instruct, using QLoRA fine-tuning, 4-bit quantization, and scalable multi-GPU inference with vLLM.
This project demonstrates production-grade infrastructure for fine-tuning and deploying large language models at scale, with a focus on memory efficiency, throughput optimization, and parallel inference across multiple GPUs.
infra/
├── main_fine_tuning_pipeline.ipynb # QLoRA fine-tuning pipeline (Colab A100/H100)
│ # Loads Gretel dataset, trains adapter, saves to GDrive
│
├── model_load_to_HF.ipynb # Model export workflow
│ # Loads fine-tuned model from GDrive, pushes to HuggingFace Hub
│ # Includes test inference from HF hub
│
├── Inference_Quantization_SQL.ipynb # Quantization and inference testing
│ # Quantizes model to 4-bit GGUF format
│ # Demonstrates inference from HF hub
│
├── tokenizer.py # Utility for prompt formatting
│ # Handles ChatML template application
│
├── requirements.txt # Python dependencies for local/RunPod environment
├── setup.sh # Local environment bootstrap (optional, not needed for Colab)
│
├── checkpoints/ # Training checkpoints and LoRA adapters (GDrive)
├── data/ # Training dataset
│ ├── unified/master_dataset # Preprocessed ChatML data
│ └── tokenized_master # Tokenized for training
│
└── venv/ # Python virtual environment (local only)
Workflow:
- Fine-tune on Colab with `main_fine_tuning_pipeline.ipynb` → adapter saved to GDrive
- Export GGUF with `model_load_to_HF.ipynb` → first artifact pushed to HuggingFace Hub
- Merge & quantize with `Inference_Quantization_SQL.ipynb` → NVFP4 Safetensors
- Deploy on RunPod (A5090 Blackwell) using the NVFP4 quantized model from the HF hub
- Open `main_fine_tuning_pipeline.ipynb` in Google Colab
- Set runtime: Runtime → Change runtime type → GPU → A100 or H100
- Add an `HF_TOKEN` secret (left sidebar) for the model push
- Run all cells top to bottom
- The fine-tuned adapter is saved to your GDrive
What happens:
- Loads Gretel SQL dataset (~100k samples)
- Converts to ChatML format for Llama-3
- Fine-tunes with QLoRA (4-bit) + Liger Kernels
- Evaluates on holdout SQL samples
- Saves checkpoint to `checkpoints/` in GDrive
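The ChatML conversion step above can be sketched as follows. This is an illustrative sketch only: the field names (`sql_context`, `sql_prompt`, `sql`) and the system-prompt wording are assumptions about the Gretel dataset schema, not taken from the notebook.

```python
# Hypothetical sketch of converting one Gretel-style text-to-SQL sample
# into Llama-3 chat messages; field names are assumed, not confirmed.
def to_chat_messages(sample: dict) -> list[dict]:
    """Convert one text-to-SQL sample into system/user/assistant messages."""
    return [
        {"role": "system",
         "content": f"You are a SQL assistant. Schema:\n{sample['sql_context']}"},
        {"role": "user", "content": sample["sql_prompt"]},
        {"role": "assistant", "content": sample["sql"]},
    ]

example = {
    "sql_context": "CREATE TABLE orders (id INT, amount DECIMAL);",
    "sql_prompt": "What is the total order amount?",
    "sql": "SELECT SUM(amount) FROM orders;",
}
messages = to_chat_messages(example)
```

Each converted sample is then rendered through the Llama-3.1 chat template before tokenization.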
Training time:
| GPU | VRAM | Runtime |
|---|---|---|
| A100 40GB | 39GB | ~4 hours |
| H100 95GB | 78GB | ~1.5 hours |
- Open `model_load_to_HF.ipynb` in Colab
- Loads the fine-tuned adapter from GDrive
- Exports to GGUF format
- Tests inference on sample queries
- Pushes the GGUF model to HuggingFace Hub
- Open `Inference_Quantization_SQL.ipynb` in Colab
- Loads the GGUF model from the HF hub
- Merges adapter weights with base model
- Applies NVFP4 quantization (4-bit)
- Tests inference performance
- Exports quantized model as Safetensors
- Pushes final quantized artifact to HuggingFace Hub
- Launch a RunPod GPU pod
- SSH in and install vLLM: `pip install vllm`
- Start the vLLM server (see Deploying on RunPod)
- Run Python inference client with your RunPod URL
- Get SQL generation results via API
Hardware Limitation: The A5000's Ada architecture does not support NVFP4 quantization. Deploy the full-precision fp16 model instead:
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--served-model-name sql-genie \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &

Configuration for A5000 (fp16):
- Full-precision model (~19 GB) required due to NVFP4 incompatibility
- `--max-model-len 32000` (reduced from 128000 to fit the KV-cache in 1-2 GB of headroom)
- `--gpu-memory-utilization 0.95` to use the available 24 GB
- Limited to a single concurrent request
- Expected throughput: 18-22 tokens/sec
Deploy two parallel vLLM instances on separate A5000 GPUs, each running the fp16 (non-quantized) model:
# GPU 0: Instance 1 on port 8000
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--served-model-name sql-genie-gpu0 \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &
# GPU 1: Instance 2 on port 8001
CUDA_VISIBLE_DEVICES=1 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model_fp16 \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 32000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--served-model-name sql-genie-gpu1 \
--port 8001 \
> /workspace/logs/vllm_gpu1.log 2>&1 &

Key configuration for A5000 dual deployment (Ada, fp16):
- `--max-model-len 32000` (tight memory constraints)
- `--gpu-memory-utilization 0.95` maximizes the available 24 GB per GPU
- `--kv-cache-dtype fp8` reduces KV-cache memory by 2×
- Two independent full-model instances (Ada lacks NVFP4 support)
- Each instance serves requests independently on ports 8000 and 8001
- Load balancer distributes requests round-robin across both instances
- Combined throughput: ~36-44 tokens/sec (2× single GPU)
- Model is fp16 (no Blackwell quantization available)
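The round-robin distribution described above can be sketched as a minimal router. This is an illustrative sketch, not the project's actual load balancer; in production you would typically front the two instances with nginx or a similar reverse proxy. The ports come from the serve commands above.

```python
# Minimal round-robin router across the two vLLM instances.
# Illustrative only; a real deployment would use nginx/HAProxy.
from itertools import cycle

class RoundRobinRouter:
    def __init__(self, backends: list[str]):
        self._backends = cycle(backends)  # endless rotation over backends

    def next_backend(self) -> str:
        return next(self._backends)

router = RoundRobinRouter(
    ["http://localhost:8000/v1", "http://localhost:8001/v1"]
)
picks = [router.next_backend() for _ in range(4)]
# Alternates between the port-8000 and port-8001 instances
```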
RunPod pods do not come with vLLM pre-installed. Set up the inference environment as follows:
# SSH into your RunPod instance
ssh root@<pod-ip>
# Install vLLM (required for inference)
pip install vllm
# Start vLLM server with your quantized model
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie \
--port 8000 \
> /workspace/logs/vllm_server.log 2>&1 &
# Check if server is running
tail -f /workspace/logs/vllm_server.log

The vLLM server will be accessible at your RunPod URL (e.g., http://<runpod-url>:8000/v1/completions).
Deploy two parallel vLLM instances on separate A5090 GPUs (Blackwell architecture), each running the NVFP4 quantized model independently:
# GPU 0: Instance 1 on port 8000
CUDA_VISIBLE_DEVICES=0 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie-gpu0 \
--port 8000 \
> /workspace/logs/vllm_gpu0.log 2>&1 &
# GPU 1: Instance 2 on port 8001
CUDA_VISIBLE_DEVICES=1 nohup vllm serve \
pshashid/llama3.1B_8B_SQL_Finetuned_model \
--tensor-parallel-size 1 \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--dtype float16 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--served-model-name sql-genie-gpu1 \
--port 8001 \
> /workspace/logs/vllm_gpu1.log 2>&1 &

Key configuration for A5090 parallel deployment (Blackwell NVFP4):
- `--max-model-len 128000` (4× longer than the A5000's 32000)
- `--gpu-memory-utilization 0.90` leverages 28-29 GB per GPU
- `--kv-cache-dtype fp8` saves additional headroom for batch processing
- Two independent full-model instances maximize request concurrency
- Each instance serves requests independently on ports 8000 and 8001
- Load balancer distributes requests round-robin across both instances
- Model is NVFP4 quantized (Blackwell hardware acceleration required)
Dataset (Gretel SQL 100k)
↓
ChatML Preprocessing
↓
QLoRA Fine-tuning (Unsloth + Liger)
↓
GGUF Export + HuggingFace Push (from GDrive)
↓
Adapter Merge
↓
4-bit Quantization (NVFP4)
↓
Safetensors + HuggingFace Push
Workflow breakdown:
- Fine-tuning - QLoRA adapters trained on Gretel SQL dataset
- Checkpoint Save - Saved to GDrive for persistence
- GGUF Export - First export format pushed to HuggingFace Hub
- Adapter Merge - Merge LoRA weights with base model
- Quantization - Apply NVFP4 (Blackwell-optimized) quantization
- Final Push - Quantized model uploaded as Safetensors format to HF Hub
The quantized Safetensors artifact is what's deployed on Blackwell GPUs (A5090) via vLLM.
Client Request
↓
Load Balancer / Router
↓
├─→ GPU0: vLLM Server (Port 8000) - Full Model Instance
├─→ GPU1: vLLM Server (Port 8001) - Full Model Instance
└─→ GPU2: vLLM Server (Port 8002) - Full Model Instance [optional]
↓
SQL Generation Response (OpenAI-compatible API)
Each GPU runs a complete, independent model instance. Requests are load-balanced across instances for parallel processing.
Parallel vs. Distributed Inference:
This architecture uses parallel inference with multiple full model instances rather than distributed inference with model sharding. Each GPU has a complete copy of the model and operates independently. This approach is optimal for models that fit on individual GPUs, providing simpler deployment, better fault isolation, and easier horizontal scaling compared to tensor-parallelism-based distributed inference.
| Parameter | Value | Notes |
|---|---|---|
| Model | `meta-llama/Llama-3.1-8B-Instruct` | Base model from Meta |
| LoRA Rank (R) | 128 | Optimized for 8B model, balances capacity vs VRAM |
| LoRA Alpha | 128 | Scaling factor (R = Alpha for 1.0 scaling) |
| LoRA Dropout | 0.0 | Disabled (large batch size provides regularization) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | All attention + feedforward projections |
| Max Seq Length | 1024 | Reduced from 2048 (quadratic attention cost) |
| Batch Size | 128 | Linear VRAM cost = fast training |
| Gradient Accumulation | 2 | Effective batch size: 256 |
| Learning Rate | 8e-4 | Scaled with batch size |
| Warmup Ratio | 0.05 | 5% linear warmup |
| Num Epochs | 1 | Single pass over dataset |
| Save Interval | Every 200 steps | Checkpoint frequency |
Effective batch size: 128 × 2 = 256 per gradient update
Why MAX_SEQ_LENGTH=1024?
The SQL dataset's longest sequences are ~500 tokens. Using 1024 provides comfortable headroom without the quadratic overhead of 2048. Initial estimate with 2048 was 159+ hours; reducing to 1024 cut training time dramatically while maintaining full dataset coverage.
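The quadratic-cost argument above can be put in numbers. This is illustrative arithmetic only: attention FLOPs scale roughly with the square of sequence length, while other parts of training scale linearly, so the overall speedup is large but less than the attention-only ratio.

```python
# Attention cost scales ~O(n^2) in sequence length, so halving
# MAX_SEQ_LENGTH from 2048 to 1024 quarters the attention cost.
seq_long, seq_short = 2048, 1024
attention_ratio = (seq_long / seq_short) ** 2
print(attention_ratio)  # 4.0 -> 4x cheaper attention at 1024 tokens
```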
| Dataset | Samples | Purpose |
|---|---|---|
| Gretel SQL | 100,000 | SQL generation pairs |
| Finetome | 100,000 | General instruction following |
| UltraChat | 200,000 | Conversational diversity |
| Total | 400,000 | Mixed training corpus |
All datasets converted to Llama-3.1 chat template format before training.
| Artifact | Location | Size | Format | Purpose |
|---|---|---|---|---|
| LoRA Adapter | `sql-genie-lora/` | ~33 MB | PyTorch | Fine-tuned weights only |
| GGUF Model | HuggingFace Hub | ~9-10 GB | GGUF | Intermediate checkpoint |
| Merged + Quantized | HuggingFace Hub | ~4-5 GB | Safetensors (NVFP4) | Production deployment |
| Checkpoints | GDrive `/models/Infra/` | ~33 MB each | LoRA | Training saves (every 200 steps) |
Deployment Artifact: Use the Safetensors NVFP4 model from HuggingFace Hub. Requires Blackwell architecture (A5090) for hardware-accelerated inference.
The model uses a two-stage quantization approach optimized for Blackwell architecture inference:
NVFP4 is NVIDIA's proprietary 4-bit floating-point format native to Blackwell GPUs. It provides hardware-accelerated inference with minimal accuracy loss compared to fp16.
Compression achieved: ~16GB (fp16) → ~4-5GB NVFP4 (4× reduction)
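The compression figure can be sanity-checked with back-of-envelope arithmetic. The ~8.03B parameter count for Llama-3.1 8B is an approximation; the real NVFP4 artifact is somewhat larger than the raw weight bytes because of quantization scales and metadata.

```python
# Back-of-envelope check of the fp16 -> NVFP4 compression figures.
params = 8.03e9                 # approximate Llama-3.1 8B parameter count
fp16_gb = params * 2 / 1e9      # fp16 = 2 bytes per parameter
nvfp4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter
print(f"fp16: {fp16_gb:.1f} GB, NVFP4: {nvfp4_gb:.1f} GB")
# Raw weights: ~16.1 GB -> ~4.0 GB; scales/metadata push the
# shipped artifact toward the ~4-5 GB range quoted above.
```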
During token generation, attention key-value caches are stored in FP8 format with tensor-wise dynamic quantization. This further reduces memory consumption during inference without affecting output quality.
Configuration:
- Format: FP8 (8-bit floating point)
- Strategy: Tensor-wise (per-tensor quantization)
- Dynamic: Disabled (static quantization)
- Symmetric: True (symmetric range)
Quantization uses a calibration dataset of 128 representative SQL generation samples to determine optimal quantization parameters. Samples cover diverse SQL patterns (aggregations, joins, filtering, grouping) to ensure the quantized model generalizes well across different schemas.
Combined NVFP4 weights + FP8 KV-cache achieves:
- Model weights: 4× smaller (16GB → 4-5GB)
- Peak inference VRAM: 27-28GB → 14-16GB on A5090
- No degradation in SQL generation quality
- Hardware-accelerated inference on Blackwell GPUs
Hardware Limitation: The A5000's Ada architecture does not support the NVFP4 quantization format; the full-precision fp16 model is required.
Model Instance: 1× Llama-3.1 8B-Instruct (fp16, unquantized)
| Metric | Value | Notes |
|---|---|---|
| Base Load VRAM | 23-24 GB | Full precision weights |
| Peak VRAM (256 tokens) | 23.8 GB | Limited headroom for KV-cache |
| Throughput (batch=1) | 18-22 tokens/sec | Slower than A5090 due to precision |
| Latency (128 tokens) | 5.8-6.4 sec | Higher latency vs quantized |
| Latency (256 tokens) | 11.6-12.8 sec | Memory bandwidth limited |
| Max Batch Size | 1 | No room for batching |
| Max Seq Length | 32000 | Limited by tight memory margins |
Note: For A5000 deployment, use unquantized models or alternative quantization formats (8-bit) instead of NVFP4.
Hardware Advantage: A5090 uses Blackwell architecture with full support for NVFP4 quantization, enabling 3-4× better performance than A5000.
Model Configuration: 2× parallel instances of NVFP4 quantized models (full capacity)
| Metric | A5000 Single | A5000 Dual (fp16) | A5090 Dual (NVFP4) | A5090 Improvement |
|---|---|---|---|---|
| Quantization | Full precision | Full precision | NVFP4 (Blackwell) | Hardware-accelerated |
| Base Load VRAM/GPU | 23-24 GB | 23-24 GB | 4-5 GB | 4.8-5.5× smaller |
| Available for KV-cache | <1 GB | <1 GB | 27-28 GB | 27× more headroom |
| Throughput | 18-22 tok/sec | 36-44 tok/sec | 68-75 tok/sec | 1.9× vs A5000 dual |
| Latency P99 | 6.4 sec | 3.2 sec | 1.7 sec | 1.9× vs A5000 dual |
| Max Concurrent Requests | 1 | 2-4 | 20-24 | 5-10× vs A5000 dual |
| Max Seq Length | 32000 | 32000 | 128000 | 4× longer contexts |
*A5090 NVFP4 compression (4-5GB) vs A5000 fp16 (23-24GB) enables dramatically better scaling and concurrency
Model Load Phase (A5000 24GB - Full Precision):
Base Model Weights (fp16): ~19 GB
Attention Buffers: ~2.5 GB
CUDA Runtime Overhead: ~0.5-1 GB
─────────────────────────────────
Total: ~22-24 GB (max capacity)
Headroom for KV-cache: <1 GB
Model Load Phase (A5090 32GB - NVFP4 Quantized):
NVFP4 weights combined with FP8 KV-cache quantization:
Base Model Weights (NVFP4): ~4-5 GB (4× compression vs fp16)
Attention Buffers: ~1-1.5 GB
CUDA Runtime Overhead: ~0.5-1 GB
─────────────────────────────────
Total Model: ~6-7 GB
Available for KV-cache (FP8): ~25-26 GB
Generation Phase (KV-Cache Growth - A5090 NVFP4):
Model Load: ~6-7 GB
Prompt Processing (512 tok): ~10-12 GB (FP8 KV-cache)
Token-by-token Generation: ~12-24 GB (grows linearly, FP8)
Formula (with FP8): VRAM_gen ≈ base_load + (seq_len × layers × hidden_size / 8)
A5090 Blackwell advantage: 25GB+ available for KV-cache enables:
- Very long sequences (128k tokens)
- Higher batch sizes (4-6 concurrent requests)
- No memory spillover
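The KV-cache growth above can be estimated more explicitly. This sketch assumes Llama-3.1 8B architecture values (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and 1 byte per element for fp8; these numbers are assumptions for illustration, not measurements from the deployment.

```python
# Hedged KV-cache size estimator, assuming Llama-3.1 8B shapes:
# 32 layers, 8 KV heads (GQA), head_dim 128, fp8 = 1 byte/element.
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    # K and V each hold n_kv_heads * head_dim elements per token per layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers
    return seq_len * per_token / 1e9

print(kv_cache_gb(128_000))                    # ~8.4 GB: one 128k sequence, fp8
print(kv_cache_gb(128_000, bytes_per_elem=2))  # ~16.8 GB: same sequence in fp16
```

Under these assumptions, even a full 128k-token sequence in fp8 fits comfortably inside the ~25 GB of headroom left after loading NVFP4 weights, which is what enables the batch sizes quoted above.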
Architecture Requirement:
NVFP4 weights quantization only works on Blackwell architecture (A5090). Ada architecture (A5000) cannot load NVFP4 models. This is a hardware-level limitation due to Blackwell's native FP4 support.
All prompts must be formatted using the Llama-3.1 chat template before sending to vLLM. This is done on the client side, not the server.
Why: vLLM's completions endpoint receives pre-formatted strings and does not apply the chat template itself. The template is automatically pulled from the model's tokenizer_config.json when the tokenizer is loaded.
Llama-3.1 Chat Template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
This template is automatically applied when you load the tokenizer from HuggingFace and call apply_chat_template() with your messages. The add_generation_prompt=True parameter appends the assistant header to signal the model to start generating.
The formatted prompt string is then sent to vLLM's API endpoint.
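For illustration, the template above can be reproduced as a pure-string builder. This is a sketch only; in practice you should call `tokenizer.apply_chat_template()`, which reads the canonical template from `tokenizer_config.json`. The trailing assistant header here corresponds to `add_generation_prompt=True`.

```python
# String-level sketch of the Llama-3.1 chat template shown above.
# Prefer tokenizer.apply_chat_template() in real code.
def build_llama31_prompt(system_message: str, user_query: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_query}<|eot_id|>"
        # Assistant header signals the model to start generating:
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama31_prompt(
    "Schema: CREATE TABLE orders (id INT, amount DECIMAL);",
    "Total sales per customer over the last 30 days?",
)
```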
| Flag | Value | Purpose |
|---|---|---|
| `--tensor-parallel-size` | 1 | Single-GPU inference (no model sharding) |
| `--kv-cache-dtype` | fp8 | Reduces KV-cache memory by 2× |
| `--max-model-len` | 128000 (A5090) / 32000 (A5000) | Supports long context windows |
| `--dtype` | float16 | Fast mixed-precision compute |
| `--enable-prefix-caching` | enabled | Reuses KV-cache for repeated prompts |
| `--served-model-name` | custom | OpenAI-compatible model identifier |
Important: Prompts sent to vLLM must be pre-formatted on the client side using the Llama-3 tokenizer's apply_chat_template() method. vLLM's completions endpoint expects already-formatted prompt strings, not raw text.
The vLLM server exposes an OpenAI-compatible API. The inference workflow is:
- Load the tokenizer from the HuggingFace Hub
- Build messages with a system prompt (schema) and the user query
- Apply the template using `apply_chat_template(messages, add_generation_prompt=True)`
- POST the formatted prompt to the vLLM endpoint: `http://<pod-url>:8000/v1/completions`
- Receive the generated SQL in the response
Request format:
{
"model": "sql-genie",
"prompt": "[ChatML formatted prompt from tokenizer]",
"max_tokens": 200,
"temperature": 0.0
}

Response format:
{
"id": "cmpl-12345",
"object": "text_completion",
"created": 1710510234,
"model": "sql-genie",
"choices": [
{
"text": "SELECT customer_id, SUM(amount) FROM orders WHERE created_at >= NOW() - INTERVAL 30 DAY GROUP BY customer_id;",
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 45,
"completion_tokens": 32,
"total_tokens": 77
}
}

Key requirements:
- Format prompts locally using the Llama-3.1 tokenizer's `apply_chat_template()` method
- Follow the Llama-3.1 chat format (system + user roles, generation prompt appended)
- Send the formatted string (not raw text) to the vLLM API
- Use temperature 0.0 for deterministic SQL generation
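The requirements above can be sketched as a minimal stdlib-only client. The payload fields and model name come from the request format shown earlier; the base URL and the `generate_sql` helper name are placeholders for illustration.

```python
# Minimal client sketch for vLLM's OpenAI-compatible /v1/completions
# endpoint. The prompt is assumed to already be formatted with
# apply_chat_template(); base_url is a placeholder for your RunPod URL.
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    return {
        "model": "sql-genie",
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.0,  # deterministic SQL generation
    }

def generate_sql(base_url: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

payload = build_payload("[formatted prompt from apply_chat_template]")
```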
Primary Dataset: Gretel SQL
| Dataset | Samples | Format | Use Case |
|---|---|---|---|
| `gretelai/synthetic_text_to_sql` | ~100k | ChatML | SQL generation |
Data Pipeline:
Raw Dataset (JSON)
↓
ChatML Conversion
↓
Tokenization (Llama-3 template)
↓
Arrow/Parquet (vectorized training)
↓
Streaming to GPU during training
| Artifact | Location | Size | Use Case |
|---|---|---|---|
| LoRA Adapter | `sql-genie-lora/` | ~33 MB | Fine-tuned weights |
| Merged Model | HuggingFace Hub | ~16 GB | Full model (unquantized) |
| Quantized Model (GGUF) | `sql-genie-gguf/*.gguf` | ~4-8 GB | Q4_K_M deployment artifact |
| Checkpoints | `checkpoints/` | ~33 MB each | Mid-training saves (every 200 steps) |
For local development without Colab:
chmod +x setup.sh && ./setup.sh
source venv/bin/activate
pip install -r requirements.txt
# Run fine-tuning locally (requires NVIDIA GPU with 24GB+ VRAM)
jupyter notebook main_fine_tuning_pipeline.ipynb

| Component | Technology | Purpose |
|---|---|---|
| Model | Llama-3.1 8B-Instruct | Base LLM for SQL generation |
| Fine-tuning | Unsloth + Liger Kernels | 2-10× training speedup |
| Quantization | NVFP4 (Blackwell) | Hardware-accelerated 4-bit format |
| Inference | vLLM | High-throughput batch processing |
| Parallel Inference | Multi-GPU instances | Independent servers per GPU |
| Deployment | OpenAI-compatible API | Standardized client interface |
| Model Hub | HuggingFace | Model versioning & distribution |
NVFP4 Note: Quantization format requires Blackwell architecture for hardware-accelerated inference (A5090, newer H100 variants). Ada architecture (A5000, A6000) requires fp16 deployment.
- Increase batch size if VRAM allows
- Enable prefix caching for repeated prompts
- Use fp8 KV-cache instead of fp16
- Add more GPU instances for parallel request handling
- Reduce `max-model-len` if extended context windows are not required
- Use a smaller batch size (1-2) for single concurrent requests
- Pre-warm the model with dummy requests to stabilize execution
- Enable prefix caching for commonly repeated prompts
Issue: CUDA OOM error during fine-tuning
Resolution: Reduce BATCH_SIZE from 8 to 4, reduce MAX_SEQ_LENGTH from 2048 to 1024, or reduce LORA_R from 64 to 32. Gradient checkpointing is enabled by default in Unsloth and provides additional memory savings.
Issue: vLLM fails to load model on target GPU
Resolution: Verify --kv-cache-dtype is set to fp8. Consider reducing --max-model-len from 64000 to 32000 if extended context is not required. Lower batch size in inference client to reduce peak memory consumption. Switch to quantized GGUF artifact if available.
Issue: Inference throughput below 20 tokens/second
Resolution: Check GPU utilization with nvidia-smi to determine if memory bandwidth is the bottleneck. Verify vLLM process is bound to correct GPU via logs. Increase batch size if request latency allows. Profile CPU-GPU communication for potential bottlenecks.