[FEATURE] Add GGUF Model Support via llama.cpp
User Story: As a user with a VRAM-limited GPU (e.g., 8GB-12GB), I want to be able to use quantized GGUF Vision-Language Models so that I can run the application's core features with reasonable performance and without running out of memory.
Problem: The current application relies exclusively on the transformers library, which loads unquantized PyTorch/SafeTensors models. These models are very large and require significant VRAM (16GB+ recommended), making the application slow or unusable for a large portion of the potential user base.
Proposed Solution: Integrate the llama-cpp-python library as an alternative backend for model loading and inference.
- Modify `vlm_profiles.py` to include new loader and generation functions specifically for GGUF models (a hedged sketch of what these could look like follows this list).
- The UI should allow users to select GGUF models, potentially by pointing to a local file.
- The application will need to detect the model type and route it to the correct backend (`transformers` or `llama.cpp`); a possible routing helper is also sketched below.
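A minimal sketch of what the new `vlm_profiles.py` helpers could look like, assuming `llama-cpp-python` with a LLaVA-style vision chat handler. The function names, parameters, and the separate CLIP projector path are illustrative assumptions, not existing PlotCaption code:

```python
# Hypothetical additions to vlm_profiles.py -- a sketch, not the final API.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def load_gguf_model(model_path: str, clip_model_path: str, n_gpu_layers: int = -1) -> Llama:
    """Load a quantized GGUF VLM plus its CLIP projector via llama.cpp."""
    chat_handler = Llava15ChatHandler(clip_model_path=clip_model_path)
    return Llama(
        model_path=model_path,
        chat_handler=chat_handler,
        n_ctx=4096,                 # context window; tune per model
        n_gpu_layers=n_gpu_layers,  # -1 offloads every layer that fits in VRAM
        verbose=False,
    )


def generate_gguf_caption(llm: Llama, image_uri: str, prompt: str) -> str:
    """Run one chat-completion turn with an image and return the generated text."""
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": [
                    # image_uri can be a data URI (base64-encoded image) or a file URI
                    {"type": "image_url", "image_url": {"url": image_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        max_tokens=512,
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]
```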
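Backend routing could be as simple as keying off the file extension, under the assumption that GGUF models are always supplied as local `.gguf` files while `transformers` models keep their current identifiers. `detect_backend`, `generate_caption`, and `run_transformers_caption` are hypothetical names used only for illustration:

```python
# A possible routing shim -- purely illustrative.
from pathlib import Path


def detect_backend(model_ref: str) -> str:
    """Return "llama.cpp" for GGUF files, "transformers" for everything else."""
    return "llama.cpp" if Path(model_ref).suffix.lower() == ".gguf" else "transformers"


def generate_caption(model_ref: str, image_uri: str, prompt: str) -> str:
    """Dispatch a captioning request to the backend matching the model type."""
    if detect_backend(model_ref) == "llama.cpp":
        # GGUF path: reuse the llama.cpp helpers sketched above; the projector
        # path is an assumption and would come from the user's model selection.
        llm = load_gguf_model(model_ref, clip_model_path="mmproj-model-f16.gguf")
        return generate_gguf_caption(llm, image_uri, prompt)
    # Otherwise fall through to the existing transformers-based path;
    # run_transformers_caption is a stand-in for whatever that code is called.
    return run_transformers_caption(model_ref, image_uri, prompt)
```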
Goal: Dramatically lower the VRAM and system RAM requirements, making PlotCaption accessible and performant for a much wider range of hardware. This is the top priority for improving accessibility.
Source: This feature was suggested and proven to be viable by user willdone on Reddit, who successfully patched in a Q8_0 GGUF for testing.