
SIGMA-VL

SigLIP + Gemma Vision-Language Architecture

The Sum of Vision and Language

About the Name

SIGMA-VL stands for SigLIP + Gemma Vision-Language Architecture, representing the fusion of two powerful components:

  • SigLIP: The vision encoder that processes and understands images
  • Gemma: The language model that generates text and understands language
  • VL: Vision-Language, the multimodal domain this project operates in

The name SIGMA (Σ) is inspired by the mathematical summation symbol, capturing how this architecture "sums" vision and language modalities into a unified multimodal system. Rather than processing vision and language separately, SIGMA-VL combines them through multimodal fusion, producing a whole that is greater than the sum of its parts.


A PyTorch implementation of vision-language models built from scratch. This project implements the SigLIP vision encoder, the Gemma language model, and the multimodal fusion mechanism from first principles, along with a complete training and inference pipeline optimized for production use.

What This Project Is

This is a fully functional vision-language model that can understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning. The architecture combines a 27-layer Vision Transformer for image processing with an 18-layer decoder-only language model, totaling 2.92 billion parameters.

I built this implementation to demonstrate deep understanding of transformer architectures and to have full control over the model components. All attention mechanisms, normalization layers, and fusion strategies are implemented from first principles rather than using high-level abstractions. This approach makes it easier to customize, debug, and understand what's happening under the hood.

Architecture Overview

The model has three main components:

Vision Encoder (SigLIP)

  • 27-layer Vision Transformer that processes 224×224 images
  • Divides images into 14×14 patches, producing 256 image tokens
  • Each token has dimension 1152
  • Uses bidirectional self-attention with 16 heads
  • Total parameters: ~400 million

Multimodal Projector

  • Simple linear layer mapping 1152-dim vision embeddings to 2048-dim language space
  • This is where vision and language modalities meet
  • Parameters: ~2.4 million (1152 × 2048 weight matrix)

Language Model (Gemma)

  • 18-layer decoder-only transformer
  • Hidden dimension: 2048
  • Vocabulary: 257,216 tokens
  • Max sequence length: 8,192 tokens
  • Parameters: ~2.4 billion
  • Uses Grouped-Query Attention, Rotary Position Embedding, RMSNorm, and SwiGLU activation

Total: 2,923,466,480 parameters (2.92B). Model size is 5.85 GB in FP16, 11.69 GB in FP32. Peak GPU memory during inference is around 13.8 GB with FP16.
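
The size figures above follow directly from the parameter count. A quick arithmetic check (using decimal GB, 1 GB = 10⁹ bytes, which is what the numbers above imply):

```python
# Sanity-check of the model-size figures: bytes = parameters x bytes per element.
PARAMS = 2_923_466_480

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per FP16 weight
fp32_gb = PARAMS * 4 / 1e9  # 4 bytes per FP32 weight

print(f"FP16: {fp16_gb:.2f} GB, FP32: {fp32_gb:.2f} GB")
```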

How It Works

The processing pipeline flows like this:

  1. Image Encoding: The SigLIP encoder takes an image and extracts visual features through patch embeddings and 27 transformer layers. Output is 256 image tokens, each representing a 14×14 patch.

  2. Modality Alignment: The projector transforms vision embeddings from 1152 to 2048 dimensions so they can be processed by the language model.

  3. Sequence Construction: Image tokens get inserted into the text sequence. The format is <image> tokens first, then BOS token, then the user's prompt. The processor handles all the tokenization and formatting details.

  4. Language Modeling: The Gemma decoder processes the combined sequence using causal attention. Each of the 18 layers applies Grouped-Query Attention with RoPE, followed by RMSNorm and SwiGLU MLP.

  5. Text Generation: Generation happens autoregressively with KV-cache to avoid recomputing attention for previous tokens. Stops at EOS token or max length.
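
The shape bookkeeping in steps 1-3 can be sketched with small stand-in modules (the `nn.Linear` layers below are illustrative placeholders, not the repository's actual classes; dimensions follow the architecture described above):

```python
import torch
import torch.nn as nn

B, V_DIM, L_DIM, N_IMG = 1, 1152, 2048, 256   # batch, vision dim, language dim, image tokens

vision_encoder = nn.Linear(3 * 14 * 14, V_DIM)   # stands in for the 27 ViT layers
projector = nn.Linear(V_DIM, L_DIM)               # step 2: modality alignment

patches = torch.randn(B, N_IMG, 3 * 14 * 14)      # 256 flattened 14x14 RGB patches
img_tokens = projector(vision_encoder(patches))   # (1, 256, 2048)

text_emb = torch.randn(B, 7, L_DIM)               # BOS + prompt embeddings (length 7 is arbitrary)
sequence = torch.cat([img_tokens, text_emb], dim=1)  # step 3: image tokens first
print(sequence.shape)  # torch.Size([1, 263, 2048])
```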

Key Technical Details

Grouped-Query Attention (GQA)

I implemented GQA to reduce memory usage during inference. Instead of a separate key-value head for every query head, groups of query heads share the same KV heads. With 2 KV heads instead of 16, this cuts KV-cache memory by about 87% compared to standard multi-head attention, with minimal quality loss.

The implementation uses 16 query heads and 2 key-value heads (each shared across 8 query heads). This is a common pattern in modern language models like Llama 2.
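
A minimal GQA sketch with that 16:2 head configuration (illustrative; the repository's class may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """GQA sketch: 16 query heads share 2 key-value heads."""
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves n_heads // n_kv_heads = 8 query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(B, T, -1))
```

Note that only the K/V projections shrink; the memory win comes from caching 2 heads' worth of KV states instead of 16.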

Rotary Position Embedding (RoPE)

RoPE encodes position by rotating query and key vectors in the complex plane. This gives relative position information that generalizes better to longer sequences than absolute embeddings.

The math: for position m and dimension i, we compute θ_i = 10000^(-2i/d), then apply a rotation matrix R(θ_i * m) to the query and key vectors. This creates position-dependent rotations that encode relative distances.
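
The rotation above can be written compactly by pairing adjacent dimensions and rotating each pair by m · θ_i (a sketch, not the repository's exact implementation):

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotate dimension pairs (2i, 2i+1) of x by angle m * theta_i,
    where theta_i = base^(-2i/d) and m is the token position."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # theta_i per pair
    angles = positions[:, None].float() * theta[None, :]  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard 2D rotation applied to each (x1, x2) pair, then re-interleave.
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)
```

Because rotations are norm-preserving and position 0 gets the identity rotation, attention scores between rotated queries and keys depend only on relative offsets.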

RMSNorm

I used RMSNorm instead of LayerNorm because it's slightly more efficient (no mean centering) and works just as well for training stability. The formula is y = (x / sqrt(mean(x²) + ε)) * (1 + γ), where γ is a learnable gain and ε is a small constant for numerical stability.
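
As a module, that formula is only a few lines (a sketch; the repository's version may differ cosmetically):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with Gemma-style (1 + gamma) scaling; gamma starts at zero
    so the layer is an identity-scale normalization at initialization."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # gamma

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * (1.0 + self.weight)
```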

KV-Cache

This is crucial for fast generation. Without a KV-cache, generating 100 tokens means re-running attention over the full prefix at every step (1 + 2 + ... + 100 = 5,050 token-level attention computations). With the cache, each step computes attention only for the new token (100 computations total). End to end, that works out to roughly a 10x generation speedup for longer sequences.

The cache stores key-value states per layer, and we concatenate new states as we generate. This is implemented in the KVCache class which maintains separate caches for each transformer layer.
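
The concatenation mechanism looks roughly like this (a sketch of the idea; the repository's KVCache class may differ in interface):

```python
import torch

class KVCache:
    """Per-layer key/value cache: new states are concatenated along the
    sequence axis (dim=2) at each generation step."""
    def __init__(self, num_layers):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def update(self, layer, k, v):
        # k, v: (batch, kv_heads, new_tokens, head_dim)
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)
        return self.keys[layer], self.values[layer]
```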

Multimodal Fusion

Image tokens are integrated through a token replacement mechanism. The processor identifies <image> placeholder tokens in the tokenized text and replaces them with actual image embeddings from the vision encoder. This lets the language model attend to both image and text tokens uniformly - they're all just tokens in the sequence from the model's perspective.
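
The replacement step can be expressed with a boolean mask and `masked_scatter` (the token id, shapes, and sizes below are illustrative, not the repository's actual values):

```python
import torch

IMAGE_TOKEN_ID = 257152  # illustrative placeholder id

# Two <image> placeholders followed by BOS and two prompt tokens.
input_ids = torch.tensor([[IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2, 10, 11]])
text_embeds = torch.randn(1, 5, 2048)   # embeddings looked up for input_ids
image_embeds = torch.randn(1, 2, 2048)  # two projected image tokens

# Overwrite placeholder positions with the image embeddings, in order.
mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1)           # (1, 5, 1)
fused = text_embeds.masked_scatter(mask.expand_as(text_embeds),
                                   image_embeds.reshape(-1))
```

After this, the language model sees one uniform embedding sequence; causal attention treats image and text positions identically.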

Training and Fine-tuning

The training infrastructure supports both full fine-tuning and parameter-efficient methods.

QLoRA

I implemented QLoRA for efficient fine-tuning. It combines 4-bit quantization with Low-Rank Adaptation, reducing memory by about 75% while keeping model quality. The base model weights are quantized to 4 bits, and we only train small LoRA adapters (default rank 64, alpha 16). Only the adapter weights get gradients; the base model stays frozen.

This makes it possible to fine-tune on consumer GPUs. You can train on a 16GB GPU that would normally need 40GB+ for full fine-tuning.
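
The adapter mechanics reduce to a frozen base projection plus a trainable low-rank update scaled by alpha / r. A minimal sketch (without the 4-bit quantization of the base weight, which bitsandbytes handles in the real pipeline):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA sketch: y = Wx + (alpha/r) * B A x, with W frozen and
    B initialized to zero so training starts from the base model."""
    def __init__(self, base: nn.Linear, r=64, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # base stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only `A` and `B` receive gradients, which is why optimizer state and gradient memory shrink so dramatically.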

Training Features

  • Gradient accumulation for effective larger batch sizes
  • Mixed precision (FP16/BF16) for memory efficiency
  • Learning rate scheduling with warmup and linear decay
  • Gradient checkpointing for additional memory savings
  • Checkpointing and resume capability
  • Weights & Biases integration for experiment tracking
  • Ready for distributed training (DDP/FSDP)
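
The warmup-plus-linear-decay schedule from the list above can be written as a multiplier function suitable for `torch.optim.lr_scheduler.LambdaLR` (a sketch; the step counts are illustrative defaults, not the trainer's actual values):

```python
def lr_lambda(step, warmup_steps=100, total_steps=1000):
    """Multiplier on the base LR: linear ramp 0 -> 1 over warmup_steps,
    then linear decay 1 -> 0 over the remaining steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```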

Data Formats

The pipeline supports multiple formats:

JSONL (one example per line):

{"image": "path/to/image.jpg", "text": "A description of the image"}

VQA format:

{"image": "path/to/image.jpg", "question": "What is in this image?", "answer": "A cat"}

COCO format: Standard COCO annotation files

The data loader auto-detects format and handles preprocessing.
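
Auto-detection amounts to inspecting each example's keys. An illustrative version of the heuristic (the loader's actual rules may differ):

```python
import json

def detect_format(line: str) -> str:
    """Guess the example format from its JSON keys."""
    ex = json.loads(line)
    if "question" in ex and "answer" in ex:
        return "vqa"
    if "annotations" in ex or "images" in ex:
        return "coco"
    if "image" in ex and "text" in ex:
        return "jsonl_caption"
    raise ValueError(f"Unrecognised example keys: {sorted(ex)}")
```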

Inference and Performance

Inference Optimizations

The enhanced inference module includes:

  • KV-cache for faster generation
  • Batch processing for multiple images
  • Advanced sampling (top-p, top-k, temperature)
  • Repetition penalty to reduce loops
  • Stop token handling
  • torch.compile integration (~30% speedup)
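
The sampling strategies listed above compose into a single decoding step. A sketch (not the repository's exact code; repetition penalty omitted for brevity):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """One decoding step combining temperature, top-k, and top-p (nucleus)
    sampling over a (batch, vocab) tensor of logits."""
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit.
    k = min(top_k, logits.size(-1))
    kth = torch.topk(logits, k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep the smallest prefix of sorted tokens whose mass covers top_p.
    sorted_probs, sorted_idx = torch.softmax(logits, -1).sort(-1, descending=True)
    cum = sorted_probs.cumsum(-1)
    sorted_probs[cum - sorted_probs > top_p] = 0.0  # drop tokens past the nucleus
    sorted_probs /= sorted_probs.sum(-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```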

Benchmark Results

Tested on Tesla T4 GPU (16GB VRAM) with FP16:

  • Average latency: 340 ms per image
  • Median latency: 329 ms
  • P95 latency: 507 ms
  • Throughput: 2.94 requests/second
  • Peak memory: 13.8 GB

Latency ranged from 182 ms (minimum) to 507 ms (maximum) within this benchmark, with a standard deviation of 100 ms. The variation is driven mainly by output length: very short outputs (1-3 words) finish in 150-260 ms, while long, multi-sentence descriptions can take 800-900 ms, beyond the range sampled in the benchmark above.

Memory Requirements

Configuration      GPU Memory           Notes
FP16               5.85 GB allocated    Current default
FP16 (peak)        13.8 GB reserved     With activations
FP32               ~11.7 GB             Estimated
8-bit quantized    ~6 GB                Estimated
4-bit quantized    ~3 GB                Estimated
QLoRA training     ~4 GB                Estimated

Evaluation Results

COCO Captioning

Evaluated on 100 images from COCO 2017 validation set:

Metric     Score
BLEU-1     0.0425
BLEU-2     0.0425
BLEU-3     0.0425
BLEU-4     0.0425
ROUGE-L    0.1326

The evaluation framework can handle the full dataset (5,000+ images). These results are from a subset for quick testing.

VQAv2 Question Answering

Evaluated on 50 questions from VQAv2 validation set:

Metric              Score
Overall Accuracy    40.00%
Yes/No Questions    42.55%
Number Questions    0.00%
Other Questions     0.00%

The framework supports the full dataset (214,354+ questions). The model struggles with counting tasks, which is a known limitation of vision-language models without specialized training.

Robustness

Tested edge cases:

  • Very long prompts: Handles truncation gracefully
  • Very short prompts: Produces reasonable responses
  • Complex hypothetical questions: Can handle some reasoning
  • Counting tasks: Limited accuracy (as expected)

All tested edge cases processed without errors (4/4 success rate).

Production Features

API Server

FastAPI-based server with:

  • Async request handling
  • Batch processing
  • Health checks
  • Request/response logging
  • Error handling
  • Docker support

Deployment

Docker configuration included:

  • GPU support via nvidia-docker
  • Environment variable config
  • Volume mounting for models/data
  • Health checks

Monitoring

  • Weights & Biases integration
  • Performance metrics logging
  • Error tracking
  • Resource monitoring

Installation

Requirements

  • Python 3.10+
  • PyTorch 2.3.0+
  • CUDA 11.8+ (for GPU)
  • 16GB+ RAM (32GB recommended)
  • GPU with 12GB+ VRAM for FP16 inference
  • 6GB+ VRAM with 8-bit quantization
  • 3GB+ VRAM with 4-bit quantization

Setup

git clone https://github.com/imcoza/SIGMA-VL.git
cd SIGMA-VL
pip install -r requirements.txt

# Download model weights from HuggingFace
# Place in ./models/paligemma-3b-pt-224/

Model weights need to be downloaded from HuggingFace. The implementation uses pretrained weights for fine-tuning, which is standard practice - training 3B models from scratch requires massive compute resources.

Usage

Basic Inference

from inference_enhanced import EnhancedInference

inferencer = EnhancedInference(
    model_path="./models/paligemma-3b-pt-224",
    use_torch_compile=True,  # Optional speedup
)

result = inferencer.generate(
    image_path="path/to/image.jpg",
    prompt="describe this image in detail",
    max_new_tokens=100,
    temperature=0.8,
)
print(result)

Command Line

python inference_enhanced.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --prompt "describe this image" \
    --image_file_path image.jpg \
    --max_tokens_to_generate 100 \
    --temperature 0.8 \
    --use_torch_compile

Training

# Full fine-tuning
python train.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --train_data_path ./data/train.jsonl \
    --val_data_path ./data/val.jsonl \
    --output_dir ./checkpoints \
    --num_epochs 3 \
    --batch_size 4 \
    --learning_rate 2e-4

# QLoRA (memory efficient)
python train.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --train_data_path ./data/train.jsonl \
    --use_qlora \
    --qlora_r 64 \
    --qlora_alpha 16

API Server

python api_server.py \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path ./models/paligemma-3b-pt-224

# Or with Docker
docker-compose up

API docs at http://localhost:8000/docs

Project Structure

sigma-vl/
├── config.py                 # Configuration management
├── data_utils.py             # Data loading and preprocessing
├── evaluation.py             # Evaluation metrics and benchmarking
├── inference.py              # Basic inference script
├── inference_enhanced.py     # Enhanced inference with optimizations
├── modeling_gemma.py         # Gemma language model implementation
├── modeling_siglip.py        # SigLIP vision encoder
├── processing_paligemma.py   # Image/text preprocessing
├── train.py                  # Training and fine-tuning script
├── utils.py                  # Model loading utilities
├── utils_optimization.py     # Optimization utilities
├── api_server.py             # FastAPI production server
├── benchmarks/               # Benchmark evaluation scripts
│   └── coco_eval.py          # COCO captioning evaluation
├── tests/                    # Unit tests
│   └── test_model.py
├── Dockerfile                # Docker configuration
├── docker-compose.yml        # Docker Compose setup
├── requirements.txt          # Python dependencies
└── README.md                 # This file

Why From Scratch?

I built this implementation from scratch for several reasons:

  1. Understanding: Implementing attention, RoPE, and normalization from first principles forces you to understand how they actually work, not just how to use them.

  2. Customization: Having full control over the architecture makes it easy to experiment with modifications, add new features, or optimize for specific use cases.

  3. Debugging: When something goes wrong, you can trace through your own code rather than debugging through layers of abstractions.

  4. Learning: This is a great way to learn transformer architectures deeply. Every component is implemented with detailed comments explaining the math and shapes.

  5. Production: Understanding the internals helps with optimization, debugging production issues, and making informed decisions about quantization, pruning, and other optimizations.

The codebase follows software engineering best practices: comprehensive type hints, error handling, logging, and documentation. The evaluation framework has been tested on standard benchmarks (COCO, VQAv2) with results documented above.

Technical Highlights

Architecture

  • Vision encoder: 27 transformer layers with patch embeddings
  • Language model: 18 decoder layers with GQA, RoPE, RMSNorm
  • Multimodal fusion: Learned projection with token replacement
  • All components implemented from scratch

Training

  • Complete pipeline with gradient accumulation and mixed precision
  • QLoRA for efficient fine-tuning (75% memory reduction)
  • Gradient checkpointing for memory savings
  • Learning rate scheduling with warmup/decay
  • Checkpointing and resume

Inference

  • KV-cache for 10x faster generation
  • torch.compile for 30% speedup
  • Batch processing
  • Advanced sampling (top-p, top-k, temperature, repetition penalty)

Production

  • FastAPI server with async support
  • Docker containerization with GPU support
  • Evaluation metrics (BLEU, ROUGE-L, VQA accuracy)
  • Weights & Biases integration
  • Monitoring and observability

Limitations and Future Work

The current implementation has some limitations:

  • Counting accuracy is poor (0% on VQA number questions) - this is a known limitation of vision-language models without specialized training
  • BLEU scores are relatively low compared to state-of-the-art models - this could be improved with better fine-tuning or larger models
  • The evaluation was done on small subsets for speed - full dataset evaluation would give more reliable metrics

Future improvements could include:

  • Better fine-tuning strategies for improved performance
  • Support for more vision-language tasks (object detection, segmentation)
  • Optimizations for edge deployment
  • Multi-GPU training improvements
  • Better handling of long sequences

Contributing

Contributions welcome. Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Follow existing code style and type hints
  5. Submit a pull request with a clear description

License

This project is for educational and research purposes. Please check the original PaliGemma license from Google Research for usage terms.

Acknowledgments

  • Google Research for the PaliGemma architecture (SIGMA-VL is inspired by PaliGemma)
  • HuggingFace for the transformers library and model weights
  • The open-source ML community

Additional Resources

For implementation details, see:

  • modeling_gemma.py and modeling_siglip.py for architecture code
  • Code comments for training best practices
  • evaluation.py for evaluation framework documentation
