This repository provides a sample implementation inspired by my prior internship work on vision-language model (VLM) benchmarking and fine-tuning for safety-critical perception tasks (e.g., traffic light and scene understanding).
All data and code are synthetic and intended solely for research demonstration purposes.
This project presents a reproducible framework for fine-tuning, evaluating, and benchmarking modern Vision-Language Models (VLMs) on perception-aligned tasks.
The design emphasizes robustness under visual uncertainty and cross-modal consistency, mirroring the core ideas behind large-scale safety-perception evaluation pipelines.
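The unified model interface described below can be sketched as a small adapter layer. This is a minimal illustration, not the repository's actual API: the names `VLMBackend`, `EchoBackend`, and `load_backend` are hypothetical, and the echo backend stands in for a real model such as LLaVA-NeXT.

```python
from abc import ABC, abstractmethod
from typing import Any


class VLMBackend(ABC):
    """Hypothetical adapter contract: every backend maps (image, prompt) -> text."""

    @abstractmethod
    def generate(self, image: Any, prompt: str) -> str:
        ...


class EchoBackend(VLMBackend):
    """Toy stand-in backend used here instead of a real VLM."""

    def generate(self, image: Any, prompt: str) -> str:
        # A real backend would run preprocessing and model inference here.
        return f"[echo] {prompt}"


# Registry mapping backend names to classes, so callers select models by name.
BACKENDS: dict[str, type[VLMBackend]] = {"echo": EchoBackend}


def load_backend(name: str) -> VLMBackend:
    """Look up and instantiate a registered backend by name."""
    return BACKENDS[name]()


model = load_backend("echo")
print(model.generate(image=None, prompt="Is the traffic light red?"))
```

Keeping every model behind one `generate` signature lets the fine-tuning and evaluation code stay model-agnostic.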
- VLM Integration Layer: unified interface for models such as LLaVA-NeXT, PaliGemma, Fuyu, and InternVideo2.
- Fine-tuning Pipeline: modular PyTorch pipeline for supervised or instruction-tuned adaptation on multimodal data.
- Evaluation Metrics: supports visual-textual accuracy, consistency, and reasoning-alignment metrics.
- Prompt and Response Benchmarking: evaluates LLM reasoning coherence under degraded perception (e.g., low-light or occluded scenes).
- Fully Open and Synthetic: no private datasets or internal assets are included.
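A consistency metric under degraded perception, as mentioned in the benchmarking bullet above, can be sketched as follows. Everything here is synthetic and hypothetical: `toy_vlm` stands in for a real model, `degrade` is a toy low-light simulation, and `consistency_rate` simply measures how often the answer survives degradation.

```python
import random


def toy_vlm(pixels: list[float], prompt: str) -> str:
    """Toy stand-in for a VLM: answers from mean brightness alone."""
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


def degrade(pixels: list[float], factor: float = 0.4) -> list[float]:
    """Simulate low-light conditions by scaling intensities toward zero."""
    return [p * factor for p in pixels]


def consistency_rate(images: list[list[float]], prompt: str) -> float:
    """Fraction of images whose answer is unchanged after degradation."""
    agree = sum(
        toy_vlm(img, prompt) == toy_vlm(degrade(img), prompt) for img in images
    )
    return agree / len(images)


# Synthetic "images": flat lists of pixel intensities in [0, 1].
random.seed(0)
images = [[random.random() for _ in range(16)] for _ in range(50)]
rate = consistency_rate(images, "Describe the scene brightness.")
print(f"consistency under low-light degradation: {rate:.2f}")
```

In a real pipeline the degradation would be an image augmentation (gamma shift, occlusion masks) and the comparison would run over model generations, but the metric shape (agreement rate between clean and degraded inputs) is the same.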