A Python script that uses a local Ollama multimodal model to generate captions for your images in bulk. You can customize the prompt to guide the vision model, for example to include certain keywords or to describe a specific person by name. It features a rich, interactive terminal user interface (TUI) for easy operation, configuration, and live progress tracking. This is primarily a helper tool for preparing image datasets for training with FLUX, which, unlike Stable Diffusion, relies on natural-language captions rather than keyword lists.
- Interactive TUI: A user-friendly, menu-driven interface built with Rich and Gum. No need to edit the script to change settings!
- Flexible Image Selection: Process an entire directory of images or use the file picker to select specific images.
- Live Progress Logging: A beautiful, real-time table shows you which files are being processed, their status, and a preview of the generated caption.
- Smart Feedback: Uses emojis and colors to clearly indicate successes, skips, failures, and warnings for low-quality (e.g., single-word) captions.
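The low-quality warning described above boils down to a simple word-count check. A minimal sketch (the helper name and threshold are illustrative, not taken from the script):

```python
def caption_quality(caption: str, min_words: int = 2) -> str:
    """Classify a generated caption: 'fail' for empty output,
    'warn' for suspiciously short (e.g., single-word) output, else 'ok'.
    Hypothetical helper; the real script's names may differ."""
    words = caption.split()
    if not words:
        return "fail"
    if len(words) < min_words:
        return "warn"
    return "ok"
```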
- Persistent Configuration: Your last-used settings (model, prompt, image source) are automatically saved to a config.json file for your next session.
- Cross-Platform: Built with Python, it's designed to be compatible with macOS, Linux, and Windows.
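Persistence like this is typically a small JSON round-trip with safe fallbacks. A minimal sketch, assuming illustrative key names rather than the script's actual schema:

```python
import json
from pathlib import Path

CONFIG_PATH = Path("config.json")
# Illustrative defaults; the real script's keys may differ.
DEFAULTS = {"model": "moondream", "prompt": "Describe this image.", "image_source": "."}

def load_config() -> dict:
    """Return saved settings merged over defaults; fall back to defaults
    if the file is missing or corrupt."""
    try:
        return {**DEFAULTS, **json.loads(CONFIG_PATH.read_text())}
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(DEFAULTS)

def save_config(cfg: dict) -> None:
    """Write settings back to config.json for the next session."""
    CONFIG_PATH.write_text(json.dumps(cfg, indent=2))
```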
Before you begin, ensure you have the following installed and running:
- Python 3.x
- Ollama: The script requires a running Ollama instance.
- A Multimodal Ollama Model: You need a model capable of processing images, such as moondream.
ollama pull moondream
- Rich: A Python library for rich text and beautiful formatting in the terminal.
pip install rich
- Gum: A tool for glamorous shell scripts, used for the interactive menus.
- macOS:
brew install gum
- Other Systems: See the official Gum installation guide.
- Install Dependencies: Make sure you have installed Python, Rich, and Gum as listed in the requirements section.
- Start Ollama: Ensure the Ollama application is running and the server is active.
- Run the Script: Save the code as ollama_captionizer.py and run it from your terminal:
python3 ollama_captionizer.py
- Use the Menu: You will be greeted by the main menu, where you can:
- Set Image Source: Choose a directory or select specific image files.
- Edit Prompt: Customize the prompt sent to the model.
- Start Captioning: Begin the process.
Captions will be saved as .txt files with the same name as the original image (e.g., my_photo.jpg -> my_photo.txt).
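Deriving the sidecar caption path is a one-liner with pathlib; a minimal sketch:

```python
from pathlib import Path

def caption_path(image_path: str) -> Path:
    """Return the .txt path next to an image, e.g. my_photo.jpg -> my_photo.txt."""
    return Path(image_path).with_suffix(".txt")
```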
This script is written in Python and is designed to be cross-platform. It should work on macOS, Linux, and Windows provided the dependencies are met.
A key feature is that it communicates with the Ollama server over its network API (e.g., http://localhost:11434). This means you do not need to modify the script to handle different executable names like ollama.exe on Windows.
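Concretely, captioning one image amounts to a single POST to Ollama's /api/generate endpoint with the image base64-encoded. A stdlib-only sketch (the payload shape follows Ollama's documented API; function names and error handling are illustrative, and the call requires a running Ollama server with a multimodal model pulled):

```python
import base64
import json
import urllib.request

def build_request(image_bytes: bytes, model: str = "moondream",
                  prompt: str = "Describe this image.") -> dict:
    """Build the JSON payload Ollama's /api/generate expects for a vision model."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one complete JSON response instead of a token stream
    }

def caption_image(path: str, host: str = "http://localhost:11434") -> str:
    """Send one image to the local Ollama server and return the generated caption."""
    with open(path, "rb") as f:
        payload = build_request(f.read())
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Because only the HTTP endpoint matters, the same code works whether the server binary is ollama or ollama.exe.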
The primary consideration for cross-platform use is ensuring that the gum command-line tool is properly installed and accessible in your system's PATH.
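A startup check with shutil.which turns a missing gum into a clear error instead of a confusing mid-run failure; a minimal sketch (the message text is illustrative):

```python
import shutil
import sys

def require_tool(name: str) -> bool:
    """Return True if `name` is on PATH; shutil.which also resolves
    executable extensions like .exe on Windows."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    if not require_tool("gum"):
        sys.exit("Error: 'gum' not found on PATH. See the Gum installation guide.")
```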
