SigLIP + Gemma Vision-Language Architecture
The Sum of Vision and Language
SIGMA-VL stands for SigLIP + Gemma Vision-Language Architecture, representing the fusion of two powerful components:
- SigLIP: The vision encoder that processes and understands images
- Gemma: The language model that generates text and understands language
- VL: Vision-Language, the multimodal domain this project operates in
The name SIGMA (Σ) is inspired by the mathematical summation symbol, perfectly capturing how this architecture "sums" or combines vision and language modalities into a unified multimodal intelligence system. SIGMA-VL doesn't just process vision and language separately, but intelligently combines them, creating a sum that is greater than its parts through advanced multimodal fusion.
A PyTorch implementation of vision-language models built from scratch. This project implements the SigLIP vision encoder, the Gemma language model, and the multimodal fusion mechanism from first principles, along with a complete training and inference pipeline optimized for production use.
This is a fully functional vision-language model that can understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning. The architecture combines a 27-layer Vision Transformer for image processing with an 18-layer decoder-only language model, totaling 2.92 billion parameters.
I built this implementation to demonstrate deep understanding of transformer architectures and to have full control over the model components. All attention mechanisms, normalization layers, and fusion strategies are implemented from first principles rather than using high-level abstractions. This approach makes it easier to customize, debug, and understand what's happening under the hood.
The model has three main components:
Vision Encoder (SigLIP)
- 27-layer Vision Transformer that processes 224×224 images
- Divides images into 14×14 patches, producing 256 image tokens
- Each token has dimension 1152
- Uses bidirectional self-attention with 12 heads
- Total parameters: ~400 million
Multimodal Projector
- Simple linear layer mapping 1152-dim vision embeddings to 2048-dim language space
- This is where vision and language modalities meet
- Parameters: ~2.4 million (1152 × 2048 weights plus bias)
Language Model (Gemma)
- 18-layer decoder-only transformer
- Hidden dimension: 2048
- Vocabulary: 257,216 tokens
- Max sequence length: 8,192 tokens
- Parameters: ~2.4 billion
- Uses Grouped-Query Attention, Rotary Position Embedding, RMSNorm, and SwiGLU activation
Total: 2,923,466,480 parameters (2.92B). Model size is 5.85 GB in FP16, 11.69 GB in FP32. Peak GPU memory during inference is around 13.8 GB with FP16.
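The weight sizes follow directly from the parameter count; a quick sanity check in plain Python (decimal GB, 1 GB = 10⁹ bytes):

```python
# Back-of-the-envelope memory math for the 2.92B-parameter model.
params = 2_923_466_480

fp16_gb = params * 2 / 1e9  # 2 bytes per parameter in FP16
fp32_gb = params * 4 / 1e9  # 4 bytes per parameter in FP32

print(f"FP16 weights: {fp16_gb:.2f} GB")  # 5.85 GB
print(f"FP32 weights: {fp32_gb:.2f} GB")  # 11.69 GB
```

Peak inference memory is higher than the weight footprint because activations and the KV-cache come on top.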
The processing pipeline flows like this:
- Image Encoding: The SigLIP encoder takes an image and extracts visual features through patch embeddings and 27 transformer layers. Output is 256 image tokens, each representing a 14×14 patch.
- Modality Alignment: The projector transforms vision embeddings from 1152 to 2048 dimensions so they can be processed by the language model.
- Sequence Construction: Image tokens get inserted into the text sequence. The format is `<image>` tokens first, then the BOS token, then the user's prompt. The processor handles all the tokenization and formatting details.
- Language Modeling: The Gemma decoder processes the combined sequence using causal attention. Each of the 18 layers applies Grouped-Query Attention with RoPE, followed by RMSNorm and SwiGLU MLP.
- Text Generation: Generation happens autoregressively with KV-cache to avoid recomputing attention for previous tokens. Stops at EOS token or max length.
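The first three steps can be sketched at the shape level (tensor names here are illustrative, not the repo's actual classes):

```python
import torch
import torch.nn as nn

# Illustrative shapes only: batch of 1 image plus a 10-token text prompt.
vision_tokens = torch.randn(1, 256, 1152)   # SigLIP output: 256 patch tokens
projector = nn.Linear(1152, 2048)           # multimodal projector
projected = projector(vision_tokens)        # (1, 256, 2048)

text_embeds = torch.randn(1, 10, 2048)      # embedded prompt tokens
sequence = torch.cat([projected, text_embeds], dim=1)  # image tokens first
print(sequence.shape)  # torch.Size([1, 266, 2048])
```

From here the Gemma decoder sees one uniform token sequence of width 2048.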
Grouped-Query Attention (GQA)
I implemented GQA to reduce memory usage during inference. Instead of a separate key-value head for each query head, groups of query heads share the same KV heads. The implementation uses 16 query heads and 2 key-value heads (each shared across 8 query heads), shrinking the KV-cache by a factor of 8 compared to standard multi-head attention, with minimal quality loss. This is a common pattern in modern language models such as Llama 2.
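A minimal sketch of the head-sharing idea (variable names are hypothetical; the repo's implementation differs in detail):

```python
import torch

batch, seq, head_dim = 1, 8, 128
n_q_heads, n_kv_heads = 16, 2
group = n_q_heads // n_kv_heads  # 8 query heads per KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # only 2 KV heads stored
v = torch.randn(batch, n_kv_heads, seq, head_dim)  # -> 8x smaller KV-cache

# Expand KV heads to match the query heads just before attention.
k = k.repeat_interleave(group, dim=1)  # (1, 16, 8, 128)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn = scores.softmax(dim=-1) @ v
print(attn.shape)  # torch.Size([1, 16, 8, 128])
```

Only the 2-head `k` and `v` tensors ever live in the cache; the expansion is transient.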
Rotary Position Embedding (RoPE)
RoPE encodes position by rotating query and key vectors in the complex plane. This gives relative position information that generalizes better to longer sequences than absolute embeddings.
The math: for position m and dimension i, we compute θ_i = 10000^(-2i/d), then apply a rotation matrix R(θ_i * m) to the query and key vectors. This creates position-dependent rotations that encode relative distances.
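The frequency schedule translates directly into code; a small pure-Python sketch of θ_i and the 2-D rotation it induces (function names are illustrative):

```python
import math

def rope_theta(i: int, d: int, base: float = 10000.0) -> float:
    """theta_i = base^(-2i/d) for coordinate-pair index i."""
    return base ** (-2 * i / d)

def rotate_pair(x0: float, x1: float, m: int, i: int, d: int):
    """Rotate the coordinate pair (x0, x1) by angle m * theta_i."""
    angle = m * rope_theta(i, d)
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

d = 256  # head dimension; coordinates are rotated in pairs
# Position 0 leaves vectors unchanged; higher pair indices rotate slower.
print(rotate_pair(1.0, 0.0, m=0, i=0, d=d))  # (1.0, 0.0)
print(rope_theta(0, d) > rope_theta(64, d))  # True
```

Because the rotation angle depends only on m, the dot product between a rotated query at position m and a rotated key at position n depends only on m − n, which is what makes the encoding relative.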
RMSNorm
I used RMSNorm instead of LayerNorm because it's slightly more efficient (no mean centering) and works just as well for training stability. The formula is y = (x / sqrt(mean(x²) + ε)) * (1 + γ), where γ is a learnable parameter and ε is a small constant for numerical stability.
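The formula maps almost line-for-line onto code; a sketch of the Gemma-style variant (note the (1 + γ) scaling, unlike standard RMSNorm's plain γ):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # gamma; zero init => scale 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)

norm = RMSNorm(4)
out = norm(torch.tensor([[2.0, -2.0, 2.0, -2.0]]))
print(out)  # every element normalized to magnitude ~1
```

Initializing γ to zero makes the layer start as a pure normalization, which is the convention Gemma's checkpoints assume.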
KV-Cache
This is crucial for fast generation. Without KV-cache, generating 100 tokens requires recomputing keys and values for every previous token at each step (1 + 2 + ... + 100 = 5,050 token computations). With KV-cache, each token's keys and values are computed exactly once (100 computations total), a ~50x reduction in attention work for this example. In practice this translates to roughly a 10x end-to-end speedup for longer sequences.
The cache stores key-value states per layer, and we concatenate new states as we generate. This is implemented in the KVCache class which maintains separate caches for each transformer layer.
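A stripped-down sketch of the caching pattern (the repo's KVCache class is similar in spirit; names and details here are illustrative):

```python
import torch

class SimpleKVCache:
    """Per-layer key/value cache, grown by concatenation along the seq dim."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=-2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=-2)
        return self.keys[layer], self.values[layer]

cache = SimpleKVCache(num_layers=18)
# Prefill with 5 tokens, then decode 1 token: shapes (batch, kv_heads, seq, dim).
cache.update(0, torch.randn(1, 2, 5, 128), torch.randn(1, 2, 5, 128))
keys, _ = cache.update(0, torch.randn(1, 2, 1, 128), torch.randn(1, 2, 1, 128))
print(keys.shape)  # torch.Size([1, 2, 6, 128])
```

Note the cached tensors only have the 2 KV heads from GQA, which is where the memory savings compound.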
Multimodal Fusion
Image tokens are integrated through a token replacement mechanism. The processor identifies <image> placeholder tokens in the tokenized text and replaces them with actual image embeddings from the vision encoder. This lets the language model attend to both image and text tokens uniformly - they're all just tokens in the sequence from the model's perspective.
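The replacement step can be sketched with boolean indexing (the token id and tensor names are illustrative; the repo's processor handles the real bookkeeping):

```python
import torch

IMAGE_TOKEN_ID = 257152  # placeholder id; illustrative value

# A tokenized sequence whose first 4 positions are <image> placeholders.
input_ids = torch.tensor([[IMAGE_TOKEN_ID] * 4 + [2, 10, 11, 12]])
text_embeds = torch.randn(1, 8, 2048)    # embeddings for all 8 positions
image_embeds = torch.randn(1, 4, 2048)   # projected vision tokens

mask = input_ids == IMAGE_TOKEN_ID       # (1, 8) boolean mask
fused = text_embeds.clone()
fused[mask] = image_embeds.reshape(-1, 2048)  # overwrite placeholder slots
print(fused.shape)  # torch.Size([1, 8, 2048])
```

After this point the decoder has no notion of "image" versus "text": every position is just a 2048-dim embedding.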
The training infrastructure supports both full fine-tuning and parameter-efficient methods.
QLoRA
I implemented QLoRA for efficient fine-tuning. It combines 4-bit quantization with Low-Rank Adaptation, reducing memory by about 75% while keeping model quality. The base model weights are quantized to 4 bits, and we only train small LoRA adapters (default rank 64, alpha 16). Only the adapter weights get gradients; the base model stays frozen.
This makes it possible to fine-tune on consumer GPUs. You can train on a 16GB GPU that would normally need 40GB+ for full fine-tuning.
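The adapter math itself is small; a NumPy sketch of the low-rank update with the defaults above (rank 64, alpha 16), leaving out the 4-bit quantization of the frozen base for clarity:

```python
import numpy as np

d_in, d_out, r, alpha = 2048, 2048, 64, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))          # frozen base weight (full precision here)
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero init

def lora_forward(x):
    # y = W x + (alpha / r) * B A x ; only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op.
print(np.allclose(lora_forward(x), W @ x))  # True
```

Because only A and B are trained, the optimizer state covers ~0.26M parameters per adapted matrix instead of ~4.2M, which is where most of the memory saving comes from.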
Training Features
- Gradient accumulation for effective larger batch sizes
- Mixed precision (FP16/BF16) for memory efficiency
- Learning rate scheduling with warmup and linear decay
- Gradient checkpointing for additional memory savings
- Checkpointing and resume capability
- Weights & Biases integration for experiment tracking
- Ready for distributed training (DDP/FSDP)
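The warmup-plus-linear-decay schedule is easy to mis-specify, so here is a concrete sketch (step counts are illustrative; a function like this plugs into `torch.optim.lr_scheduler.LambdaLR`):

```python
def lr_lambda(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier on the base LR: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-4
warmup, total = 100, 1000
print(lr_lambda(0, warmup, total) * base_lr)     # 0.0   (start of warmup)
print(lr_lambda(100, warmup, total) * base_lr)   # 2e-4  (peak, end of warmup)
print(lr_lambda(1000, warmup, total) * base_lr)  # 0.0   (fully decayed)
```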
Data Formats
The pipeline supports multiple formats:
JSONL (one example per line):
```json
{"image": "path/to/image.jpg", "text": "A description of the image"}
```
VQA format:
```json
{"image": "path/to/image.jpg", "question": "What is in this image?", "answer": "A cat"}
```
COCO format: Standard COCO annotation files
The data loader auto-detects format and handles preprocessing.
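Auto-detection can key off which fields each record carries; a simplified sketch of the idea (the repo's loader is more thorough, and `detect_format` is a hypothetical name):

```python
import json

def detect_format(line: str) -> str:
    """Classify one JSONL record by its keys."""
    record = json.loads(line)
    if "question" in record and "answer" in record:
        return "vqa"
    if "text" in record:
        return "caption"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")

print(detect_format('{"image": "a.jpg", "text": "A dog"}'))  # caption
print(detect_format('{"image": "a.jpg", "question": "What?", "answer": "A cat"}'))  # vqa
```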
Inference Optimizations
The enhanced inference module includes:
- KV-cache for faster generation
- Batch processing for multiple images
- Advanced sampling (top-p, top-k, temperature)
- Repetition penalty to reduce loops
- Stop token handling
- torch.compile integration (~30% speedup)
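Of the sampling strategies listed, top-p (nucleus) sampling is the least obvious: it keeps the smallest set of tokens whose probabilities sum to at least p. A NumPy sketch (function name is illustrative):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Zero out tokens outside the nucleus, then renormalize."""
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # keep just enough mass
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
out = top_p_filter(probs, p=0.7)
print(out)  # keeps {0.5, 0.3}, renormalized to [0.625, 0.375, 0, 0]
```

In a real decode loop this filter is applied to the softmaxed logits after temperature scaling, and the next token is drawn from the filtered distribution.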
Benchmark Results
Tested on Tesla T4 GPU (16GB VRAM) with FP16:
- Average latency: 340 ms per image
- Median latency: 329 ms
- P95 latency: 507 ms
- Throughput: 2.94 requests/second
- Peak memory: 13.8 GB
Latency ranges from 182 ms (minimum) to 507 ms (maximum), with standard deviation of 100 ms. The variation depends on output length - shorter outputs (1-3 words) finish in 150-260 ms, while longer descriptions take 800-900 ms.
Memory Requirements
| Configuration | GPU Memory | Notes |
|---|---|---|
| FP16 | 5.85 GB allocated | Current default |
| FP16 Peak | 13.8 GB reserved | With activations |
| FP32 | ~11.7 GB | Estimated |
| 8-bit quantized | ~6 GB | Estimated |
| 4-bit quantized | ~3 GB | Estimated |
| QLoRA training | ~4 GB | Estimated |
COCO Captioning
Evaluated on 100 images from COCO 2017 validation set:
| Metric | Score |
|---|---|
| BLEU-1 | 0.0425 |
| BLEU-2 | 0.0425 |
| BLEU-3 | 0.0425 |
| BLEU-4 | 0.0425 |
| ROUGE-L | 0.1326 |
The evaluation framework can handle the full dataset (5,000+ images). These results are from a subset for quick testing.
VQAv2 Question Answering
Evaluated on 50 questions from VQAv2 validation set:
| Metric | Score |
|---|---|
| Overall Accuracy | 40.00% |
| Yes/No Questions | 42.55% |
| Number Questions | 0.00% |
| Other Questions | 0.00% |
The framework supports the full dataset (214,354+ questions). The model struggles with counting tasks, which is a known limitation of vision-language models without specialized training.
Robustness
Tested edge cases:
- Very long prompts: Handles truncation gracefully
- Very short prompts: Produces reasonable responses
- Complex hypothetical questions: Can handle some reasoning
- Counting tasks: Limited accuracy (as expected)
All tested edge cases processed without errors (4/4 success rate).
API Server
FastAPI-based server with:
- Async request handling
- Batch processing
- Health checks
- Request/response logging
- Error handling
- Docker support
Deployment
Docker configuration included:
- GPU support via nvidia-docker
- Environment variable config
- Volume mounting for models/data
- Health checks
Monitoring
- Weights & Biases integration
- Performance metrics logging
- Error tracking
- Resource monitoring
Requirements
- Python 3.10+
- PyTorch 2.3.0+
- CUDA 11.8+ (for GPU)
- 16GB+ RAM (32GB recommended)
- GPU with 8GB+ VRAM (full model)
- 12GB+ VRAM for FP16 inference
- 6GB+ VRAM for 8-bit quantization
- 3GB+ VRAM for 4-bit quantization
Setup
```bash
git clone https://github.com/imcoza/SIGMA-VL.git
cd SIGMA-VL
pip install -r requirements.txt
# Download model weights from HuggingFace
# Place in ./models/paligemma-3b-pt-224/
```
Model weights need to be downloaded from HuggingFace. The implementation uses pretrained weights for fine-tuning, which is standard practice; training 3B models from scratch requires massive compute resources.
Basic Inference
```python
from inference_enhanced import EnhancedInference

inferencer = EnhancedInference(
    model_path="./models/paligemma-3b-pt-224",
    use_torch_compile=True,  # Optional speedup
)

result = inferencer.generate(
    image_path="path/to/image.jpg",
    prompt="describe this image in detail",
    max_new_tokens=100,
    temperature=0.8,
)
print(result)
```
Command Line
```bash
python inference_enhanced.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --prompt "describe this image" \
    --image_file_path image.jpg \
    --max_tokens_to_generate 100 \
    --temperature 0.8 \
    --use_torch_compile
```
Training
```bash
# Full fine-tuning
python train.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --train_data_path ./data/train.jsonl \
    --val_data_path ./data/val.jsonl \
    --output_dir ./checkpoints \
    --num_epochs 3 \
    --batch_size 4 \
    --learning_rate 2e-4

# QLoRA (memory efficient)
python train.py \
    --model_path ./models/paligemma-3b-pt-224 \
    --train_data_path ./data/train.jsonl \
    --use_qlora \
    --qlora_r 64 \
    --qlora_alpha 16
```
API Server
```bash
python api_server.py \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path ./models/paligemma-3b-pt-224

# Or with Docker
docker-compose up
```
API docs at http://localhost:8000/docs
```
sigma-vl/
├── config.py                # Configuration management
├── data_utils.py            # Data loading and preprocessing
├── evaluation.py            # Evaluation metrics and benchmarking
├── inference.py             # Basic inference script
├── inference_enhanced.py    # Enhanced inference with optimizations
├── modeling_gemma.py        # Gemma language model implementation
├── modeling_siglip.py       # SigLIP vision encoder
├── processing_paligemma.py  # Image/text preprocessing
├── train.py                 # Training and fine-tuning script
├── utils.py                 # Model loading utilities
├── utils_optimization.py    # Optimization utilities
├── api_server.py            # FastAPI production server
├── benchmarks/              # Benchmark evaluation scripts
│   └── coco_eval.py         # COCO captioning evaluation
├── tests/                   # Unit tests
│   └── test_model.py
├── Dockerfile               # Docker configuration
├── docker-compose.yml       # Docker Compose setup
├── requirements.txt         # Python dependencies
└── README.md                # This file
```
I built this implementation from scratch for several reasons:
- Understanding: Implementing attention, RoPE, and normalization from first principles forces you to understand how they actually work, not just how to use them.
- Customization: Having full control over the architecture makes it easy to experiment with modifications, add new features, or optimize for specific use cases.
- Debugging: When something goes wrong, you can trace through your own code rather than debugging through layers of abstractions.
- Learning: This is a great way to learn transformer architectures deeply. Every component is implemented with detailed comments explaining the math and shapes.
- Production: Understanding the internals helps with optimization, debugging production issues, and making informed decisions about quantization, pruning, and other optimizations.
The codebase follows software engineering best practices: comprehensive type hints, error handling, logging, and documentation. The evaluation framework has been tested on standard benchmarks (COCO, VQAv2) with results documented above.
Architecture
- Vision encoder: 27 transformer layers with patch embeddings
- Language model: 18 decoder layers with GQA, RoPE, RMSNorm
- Multimodal fusion: Learned projection with token replacement
- All components implemented from scratch
Training
- Complete pipeline with gradient accumulation and mixed precision
- QLoRA for efficient fine-tuning (75% memory reduction)
- Gradient checkpointing for memory savings
- Learning rate scheduling with warmup/decay
- Checkpointing and resume
Inference
- KV-cache for 10x faster generation
- torch.compile for 30% speedup
- Batch processing
- Advanced sampling (top-p, top-k, temperature, repetition penalty)
Production
- FastAPI server with async support
- Docker containerization with GPU support
- Evaluation metrics (BLEU, ROUGE-L, VQA accuracy)
- Weights & Biases integration
- Monitoring and observability
The current implementation has some limitations:
- Counting accuracy is poor (0% on VQA number questions) - this is a known limitation of vision-language models without specialized training
- BLEU scores are relatively low compared to state-of-the-art models - this could be improved with better fine-tuning or larger models
- The evaluation was done on small subsets for speed - full dataset evaluation would give more reliable metrics
Future improvements could include:
- Better fine-tuning strategies for improved performance
- Support for more vision-language tasks (object detection, segmentation)
- Optimizations for edge deployment
- Multi-GPU training improvements
- Better handling of long sequences
Contributions welcome. Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Follow existing code style and type hints
- Submit a pull request with a clear description
This project is for educational and research purposes. Please check the original PaliGemma license from Google Research for usage terms.
- Google Research for the PaliGemma architecture (SIGMA-VL is inspired by PaliGemma)
- HuggingFace for the transformers library and model weights
- The open-source ML community
For implementation details, see:
- modeling_gemma.py and modeling_siglip.py for architecture code
- Code comments for training best practices
- evaluation.py for evaluation framework documentation