[FEATURE] Add GGUF Model Support via llama.cpp
User Story: As a user with a VRAM-limited GPU (e.g., 8GB-12GB), I want to be able to use quantized GGUF Vision-Language Models so that I can run the application's core features with reasonable performance and without running out of memory.
Problem: The current application relies exclusively on the transformers library, which loads unquantized PyTorch/SafeTensors models. These models are very large and require significant VRAM (16GB+ recommended), making the application slow or unusable for a large portion of the potential user base.
Proposed Solution: Integrate the llama-cpp-python library as an alternative backend for model loading and inference.
- Modify `vlm_profiles.py` to include new loader and generation functions specifically for GGUF models (a hedged sketch of what these could look like follows this list).
- The UI should allow users to select GGUF models, potentially by pointing to a local file.
- The application will need to detect the model type and route it to the correct backend (`transformers` or `llama.cpp`); a possible routing helper is also sketched below.
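A minimal sketch of what the new `vlm_profiles.py` helpers could look like, assuming `llama-cpp-python` with a LLaVA-style vision chat handler. The function names, parameters, and the separate CLIP projector path are illustrative assumptions, not existing PlotCaption code:

```python
# Hypothetical additions to vlm_profiles.py -- a sketch, not the final API.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def load_gguf_model(model_path: str, clip_model_path: str, n_gpu_layers: int = -1) -> Llama:
    """Load a quantized GGUF VLM plus its CLIP projector via llama.cpp."""
    chat_handler = Llava15ChatHandler(clip_model_path=clip_model_path)
    return Llama(
        model_path=model_path,
        chat_handler=chat_handler,
        n_ctx=4096,                 # context window; tune per model
        n_gpu_layers=n_gpu_layers,  # -1 offloads every layer that fits in VRAM
        verbose=False,
    )


def generate_gguf_caption(llm: Llama, image_uri: str, prompt: str) -> str:
    """Run one chat-completion turn with an image and return the generated text."""
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": [
                    # image_uri can be a data URI (base64-encoded image) or a file URI
                    {"type": "image_url", "image_url": {"url": image_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        max_tokens=512,
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]
```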
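Backend routing could be as simple as keying off the file extension, under the assumption that GGUF models are always supplied as local `.gguf` files while `transformers` models keep their current identifiers. `detect_backend`, `generate_caption`, and `run_transformers_caption` are hypothetical names used only for illustration:

```python
# A possible routing shim -- purely illustrative.
from pathlib import Path


def detect_backend(model_ref: str) -> str:
    """Return "llama.cpp" for GGUF files, "transformers" for everything else."""
    return "llama.cpp" if Path(model_ref).suffix.lower() == ".gguf" else "transformers"


def generate_caption(model_ref: str, image_uri: str, prompt: str) -> str:
    """Dispatch a captioning request to the backend matching the model type."""
    if detect_backend(model_ref) == "llama.cpp":
        # GGUF path: reuse the llama.cpp helpers sketched above; the projector
        # path is an assumption and would come from the user's model selection.
        llm = load_gguf_model(model_ref, clip_model_path="mmproj-model-f16.gguf")
        return generate_gguf_caption(llm, image_uri, prompt)
    # Otherwise fall through to the existing transformers-based path;
    # run_transformers_caption is a stand-in for whatever that code is called.
    return run_transformers_caption(model_ref, image_uri, prompt)
```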
Goal: Dramatically lower the VRAM and system RAM requirements, making PlotCaption accessible and performant for a much wider range of hardware. This is the top priority for improving accessibility.
Source: This feature was suggested and proven to be viable by user willdone on Reddit, who successfully patched in a Q8_0 GGUF for testing.