This app uses Vision Language Models (VLMs) to generate detailed captions for images through multi-turn conversations, using a separately hosted VLM via an "OpenAI-compatible" API (including locally hosted models).
This is currently "beta" and features may change.
A common use case would be to automate captioning large numbers of images for later text-to-image diffusion model training or classification tasks.
Slightly old video, app overview with install: VLM Caption, multi-turn, data-driven image captioning
API Service Setup (for local LLM users)
API Key use (for cloud API users)
- Visit https://github.com/victorchall/vlm-caption/releases and click on `Assets` to expand the download links. You have two options to use the application.

  a. The CLI (command line) app is just a zip file with `caption.yaml` and `caption_openai.exe`. Unzip it wherever you like. Edit `caption.yaml` with the text editor of your choice, then run `caption_openai.exe`. Delete it to uninstall. To update, delete the old version or unzip the new one to the same location and overwrite.

  b. The Electron version is a GUI app delivered as a one-click installer that gives you a GUI to edit configurations instead of manually editing the `caption.yaml` file. It will install and add a shortcut to your desktop. You will need to enable CORS on your VLM service; this is a simple toggle in LM Studio. Uninstall via `Add or Remove programs` in Windows. To update, just download a newer version and install again; it will overwrite the old version.
- Multi-turn Conversations: Configure a series of prompts to guide the VLM through detailed image analysis
- Multiple API Support: Compatible with most common local/offline LLM servers, OpenAI, Anthropic, Google Gemini, etc.
- Global Metadata: Optionally include character databases, location reference, or other material in system prompts
- Hint Sources: Optionally include additional per-image or per-directory metadata (file paths, folder metadata.json, {imagename}.json) in prompts on a per-image basis, with easy developer extension
- Bulk Processing: Process entire directories of images
- Output: Saves captions as .txt files alongside original images
The GUI is a simple wrapper for editing caption.yaml, so you can just use the CLI if you prefer.
All settings are configured through caption.yaml when using the CLI version. The GUI version has the same config, just presented in GUI form for ease.
The Configuration file MUST be modified for your project and system.
Most of this is covered in API Service Setup for local users.
If you just want to use the CLI, the entire app is driven by caption.yaml. Edit it, then run the CLI with `python caption_openai.py`.
# API endpoint - can be local LLM server or cloud service
base_url: "http://localhost:1234/v1"
# base_url: "https://api.openai.com/v1"
# API key handling
api_key: "" # Leave empty for local servers
# api_key: "OPENAI_API_KEY" # Use env var name for cloud APIs
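# For cloud APIs, set the environment variable in your shell or OS first, e.g.
#   export OPENAI_API_KEY="sk-..."   (Linux/macOS)   or   setx OPENAI_API_KEY "sk-..."   (Windows)
# then put the variable NAME above so the key itself never has to live in caption.yaml.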
# Model selection
model: "gemma-3-27b-it"
# model: "gpt-4o-mini"
# model: "claude-sonnet-4-20250514"
max_tokens: 16384

System Prompt: Base instructions for the VLM. Think of this as a global instruction that you likely will not modify per project.

system_prompt: "You are to analyze an image and provide information based on what is visible in the image. Do not embellish, preferring to focus on factual information."

Global Metadata: Optional file to include additional context (e.g., character databases)

global_metadata_file: "character_info.txt"

This is a great way to add a large text file with descriptions of all the locations, characters, and objects that might appear in any of the images to help the VLM identify things by their proper names instead of generic pronouns.
If you don't have this information, set it to empty quotes "" like this or leave it empty on the GUI:
global_metadata_file: ""See the example provided, character_info.txt, that includes detailed descriptions about the Final Fantasy VII Rebirth universe. Keep in mind that very large documents will increase VRAM usage, but this is also a very useful feature to help aid the VLM identifying things like specific characters and locations by their proper names. Otherwise, VLMs may not recognize specific things in the image unless they are very common on the internet. I.e. A good VLM model will probably recognize Mario or Cloud Strife, but not some side character that is less popular.
Multi-turn Prompts: Sequential questions asked to the VLM
prompts:
- "What character(s) do you think are present? Support your decision based on physical features."
- "What other objects are present in the image? What is the general scene?"
- "Describe their outfits in detail."
- "Describe the general composition from a photographer's perspective."
- "Can you categorize the artistic style or medium?"
- "Summarize the description in four sentences. No markdown or special formatting."-
The final prompt's response becomes the saved caption. I strongly recommend asking for a summary similar to the above example.
-
The more prompts you include the more VRAM and context will be required.
-
Check your service documentation for configuration to make sure the context window is sufficient.
-
The above example is very strong when used with a more powerful VLM (like Gemma3 27B) and a large global metadata file with detailed descriptions.
-
Experiment with small amounts of data to tweak your prompts.
Enable additional context sources that get prepended to the first prompt.
hint_sources:
- full_path # Includes file path information
- json # tries to read a .json file of the same basename for each image
- metadata # tries to read a metadata.json in the same folder as each image

If a source isn't available for a given image, it is skipped. For instance, if you use json above and an image named cloud_strife.jpeg has no cloud_strife.json in the same directory, the hint is skipped.
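For example, a per-image sidecar for cloud_strife.jpeg would be a file named cloud_strife.json in the same directory. The fields below are hypothetical; exactly how each hint source formats its content into the prompt is described in HINTSOURCES.md.

```json
{
  "source_url": "https://example.com/gallery/42",
  "page_title": "Cloud Strife fan art gallery",
  "alt_text": "Cloud Strife holding the Buster Sword"
}
```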
See HINTSOURCES.md for details on available hint sources and how they work.
Developers can also add their own custom hint sources.
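Even without writing a custom source, you can feed the built-in `json` hint source by generating `{imagename}.json` sidecar files yourself, for example while scraping. The sketch below is a minimal illustration that assumes only the naming convention described above; the field names are arbitrary examples and `write_sidecar` is not part of the app.

```python
import json
from pathlib import Path

def write_sidecar(image_path: str, source_url: str, page_title: str, alt_text: str) -> None:
    """Write an {imagename}.json file next to the image so the `json` hint source can find it."""
    image = Path(image_path)
    sidecar = image.with_suffix(".json")  # e.g. cloud_strife.jpeg -> cloud_strife.json
    hint = {
        "source_url": source_url,   # arbitrary example fields; use whatever metadata you have
        "page_title": page_title,
        "alt_text": alt_text,
    }
    sidecar.write_text(json.dumps(hint, indent=2), encoding="utf-8")

# Example usage inside a scraper loop (path is hypothetical and must already exist):
write_sidecar(
    "C:/myimages/cloud strife/cloud_strife.jpeg",
    source_url="https://example.com/gallery/42",
    page_title="Cloud Strife fan art gallery",
    alt_text="Cloud Strife holding the Buster Sword",
)
```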
base_directory: "C:/path/to/images" # Root directory to process
# base_directory: "/mnt/path/to/images" # POSIX style path
recursive: false # Enable recursive subdirectory processing

Mostly self-explanatory. Paste in the path to the directory you want processed and set recursive to true to walk all subdirectories.
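For instance, with a hypothetical layout like the one below and recursive: true, both subdirectories are processed and each caption is written next to its image:

```text
C:/myimages/                  <- base_directory
├── cloud strife/
│   ├── cloud_01.jpg
│   └── cloud_01.txt          <- caption written after processing
└── rufus shinra/
    ├── rufus_01.png
    └── rufus_01.txt
```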
- Prompt Tuning: Read PROMPTS.MD for more tips on tuning your system prompt and prompt series.
- Presort Data: Sort your data into subdirectories ahead of time with directory names that might help steer the model, such as `c:/myimages/cloud strife` and `c:/myimages/rufus shinra/`, then use the `full_path` hint_source.
- Utilize metadata: Try writing a text file with details of the overall "universe" of your image data and reference the txt file with `global_metadata_file` in `caption.yaml`. Use the example `global_metadata_file: "character_info.txt"` as a reference for what to do.
- Model Selection: Different models may excel at different types of image analysis.

If you have a >=24GB GPU, I highly recommend Gemma 3 27B (`gemma-3-27b-it`). Q4_K_M is about 19.5GB, leaving ~4GB for context and KV caching, which is typically sufficient. If you have 32GB or more, use a larger quantization (Q5_K_M, Q8_0, etc.). InternVL3-14B is a good alternative smaller model for users with 16GB or less.

Models published by `unsloth` or `lmstudio-community` are generally reliable and will work. Sometimes models are not properly marked as vision capable or are misconfigured; if a model doesn't seem to work, see if there is an unsloth or lmstudio-community version. Not all models are suitable for multi-turn conversation. Try different models, or try a single prompt.

Test the model directly in the Chat window in LM Studio to see how it responds and to workshop your series of prompts, or try running VLM Caption on a directory with just a few images, then check the .txt outputs.
- Cuda OOM or Failures: You may need to reduce the number of prompts in your chain if you run out of VRAM, select a smaller quantization of the model (Q3_K_S, etc.), or select a smaller model. You may also need to configure your service to increase the context size, as the default is often 4096, which you can exceed with very long chains of prompts or very large metadata sources, leading to unintended outputs depending on how your service truncates context. Refer to the documentation of your service for more information.
- Metadata collection: Adding context is incredibly powerful. Think ahead when you decide to collect and build a new dataset, considering how you can capture other metadata. For instance, if you are writing webscrapers, make sure to collect metadata from the webpage as you go rather than blindly downloading each image. Perhaps you might include the website address or full URI of the webpage, the `<title>` tag from the webpage, or the `alt-text` field. Save this information with each image, or in a database, then feed it into the VLM with a `hint_source` (see the sidecar-writing sketch in the hint_sources discussion above). New hint sources are very easy for an amateur Python programmer to write, or you can have an LLM write one for you. See HINTSOURCES.md for more information.
- Batch concurrency: If you use a VLM host that supports concurrent sequence serving, such as llama.cpp (via the `-np N` arg) or vLLM (via the `--max-num-seqs N` arg), you can increase overall throughput by configuring the host and VLM Caption to operate with concurrent requests. This is not supported by LM Studio, so if you care about throughput of bulk captioning, llama.cpp or, even better, vLLM is suggested as your host.

Generally, the higher the concurrency, the more memory is required (larger context and KV cache), but the higher the aggregate token throughput. This setting can increase overall processing rates by 2-8x on consumer GPUs and possibly far more on workstation or datacenter grade hardware.

This is not to be confused with "batch processing" in the host documentation, which means batched processing of prefill tokens. The concurrency setting here is effectively "batch" for the decode process (token prediction).

If you have multiple GPUs, consider using vLLM, which supports tensor parallelism as well.
Example vllm serve command to serve 16 parallel requests, using 2x3090 GPUs with tensor parallel:
#!/bin/bash
vllm serve QuantTrio/Qwen3-VL-32B-Instruct-AWQ \
--dtype auto \
--limit-mm-per-prompt '{"video":0,"image":1}' \
--media-io-kwargs '{"image":{"width":768,"height":768}}' \
--max-model-len '64K' \
--max-num-seqs 16 \
--max-num-batched-tokens 256 \
--tensor-parallel-size 2 \
--override-generation-config '{"temperature": 0.5, "top_p": 0.95, "min_p": 0.05, "top_k": 30}'

Example llama.cpp serving for 8 parallel requests; the context (-c 32768) is divided across them.
llama-server -np 8 -c 32768 --mmproj "mmproj-Qwen3-VL-32B-Instruct-F16.gguf" --model "Qwen3-VL-32B-Instruct-Q4_K_M.gguf" -dev cuda0 --top-k 30 --top-p 0.95 --min-p 0.05 --temp 0.5

VLM Caption has a cap on the concurrent batch size it will dispatch to the host, set in the UI or caption.yaml. This should be set equal to or higher than the host configuration to ensure the host is saturated.
