ImageInsightLLM is a creative, edge-ready Streamlit app that transforms your images and ideas into meaningful captions, summaries, and even new AI-generated visuals! Powered by open-source LLMs, BLIP, OCR, and Hugging Face APIs, it's your all-in-one visual intelligence agent.
- Image Captioning: Upload any image and get a smart, context-aware caption using BLIP (Salesforce/blip-image-captioning-base).
- OCR (Text Extraction): Extracts readable text from images using EasyOCR, filtering for clarity and relevance.
- Dynamic Summarization: Combines the image caption and extracted text, then summarizes it with a state-of-the-art LLM (facebook/bart-large-cnn).
- Text-to-Image Generation: Enter a prompt and generate a brand new image using the FLUX.1-dev model (black-forest-labs/FLUX.1-dev) through the Hugging Face API.
- Edge-Ready: Designed to run efficiently on local devices; no GPU required for core features.
- Modern UI: Clean, tabbed Streamlit interface for seamless user experience.
1. Upload Image → 2. Caption Generated (BLIP) → 3. OCR Text Extraction → 4. Dynamic LLM Summary
- If text is found: Caption + Text β Summarized.
- If no text: Caption alone is summarized.
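The branching above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function names, the joining phrase "Text visible in the image", and the fallback template are all assumptions made for the example, and `summarizer` stands in for whatever callable wraps the BART model.

```python
from typing import Callable, List


def build_summary_input(caption: str, extracted_text: List[str]) -> str:
    """Combine the caption with OCR text when present, else use the caption alone."""
    words = [w.strip() for w in extracted_text if w.strip()]
    if words:
        # Hypothetical joining phrase; the app may combine the two differently.
        return f"{caption.strip()} Text visible in the image: {' '.join(words)}."
    return caption.strip()


def summarize_with_fallback(summary_input: str, summarizer: Callable[[str], str]) -> str:
    """Run the summarizer; fall back to a fixed template if it raises."""
    try:
        return summarizer(summary_input)
    except Exception:
        # Hypothetical manual template standing in for the app's fallback summary.
        return f"Image summary: {summary_input}"
```

Passing the summarizer in as a callable keeps the branching logic testable without loading a model.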
1. Enter Prompt → 2. Image Generated via Hugging Face API → 3. View & Download
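As a rough sketch of the text-to-image steps, the request and the response handling might look like the following. Several details are assumptions, since this README does not state them: the endpoint URL pattern, the `{"inputs": ...}` payload shape, and the `"image"` field name used for a Base64 JSON response. Check the current Hugging Face documentation before relying on any of them. The example uses only the standard library and builds the request without sending it.

```python
import base64
import json
import urllib.request

# Assumed endpoint pattern for the hosted inference API (not given in this README).
API_URL = "https://api-inference.huggingface.co/models/black-forest-labs/FLUX.1-dev"


def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the prompt and the token."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_image_bytes(content_type: str, body: bytes) -> bytes:
    """Return raw image bytes from a binary response or a Base64-in-JSON response."""
    if content_type.startswith("application/json"):
        data = json.loads(body.decode("utf-8"))
        # "image" is a hypothetical field name for the Base64 payload.
        return base64.b64decode(data["image"])
    return body
```

The returned bytes can then be opened with Pillow, for example `Image.open(io.BytesIO(image_bytes))`, before being displayed in Streamlit.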
- Upload Image:
- Generated Caption:
"A group of people standing on top of a mountain with their hands raised."
- Extracted Text:
(If present in the image, e.g., "SUMMIT 2025")
- Summary:
"A group of people celebrate at the mountain summit. The text 'SUMMIT 2025' is visible."
- Prompt:
"A futuristic city skyline at sunset, with flying cars and neon lights."
- Generated Image:
- Result:
The app displays the generated image with your prompt as the caption.
flowchart TD
A[ImageInsightLLM App Start] --> B{User Selects Tab}
%% Image Captioning Flow
B -->|Tab 1| C[Image Captioning Tab]
C --> D[Upload Image<br/>JPG/JPEG/PNG]
D --> E{Image Uploaded?}
E -->|No| D
E -->|Yes| F[Display Uploaded Image]
F --> G[Load BLIP Model<br/>Salesforce/blip-image-captioning-base]
G --> H[Generate Image Caption<br/>BLIP Processing]
H --> I[Display Caption Result]
I --> J[Start OCR Process<br/>EasyOCR Reader]
J --> K[Extract Text from Image]
K --> L[Filter Text<br/>Confidence > 0.5<br/>Length > 2 chars<br/>Alphabetic only]
L --> M{Text Found?}
M -->|Yes| N[Display Extracted Text]
M -->|No| O[No Text Detected]
N --> P[Prepare Summary Input<br/>Caption + Extracted Text]
O --> Q[Prepare Summary Input<br/>Caption Only]
P --> R[Load BART Summarizer<br/>facebook/bart-large-cnn]
Q --> R
R --> S[Generate Dynamic Summary]
S --> T{Summary Success?}
T -->|Yes| U[Display AI Summary]
T -->|No| V[Fallback Summary<br/>Manual Template]
U --> W[Process Complete]
V --> W
%% Text-to-Image Flow
B -->|Tab 2| X[Text-to-Image Generation Tab]
X --> Y[Enter Text Prompt]
Y --> Z{Prompt Entered?}
Z -->|No| Y
Z -->|Yes| AA[Click Generate Image Button]
AA --> BB[Loading Spinner Active]
BB --> CC[Call Hugging Face API<br/>black-forest-labs/FLUX.1-dev]
CC --> DD[Send POST Request<br/>with Authorization Token]
DD --> EE{API Response?}
EE -->|Success 200| FF[Receive Image Data]
EE -->|Error| GG[Display Error Message]
FF --> HH{Image Format?}
HH -->|Raw Binary| II[Convert BytesIO to PIL Image]
HH -->|Base64 JSON| JJ[Decode Base64 to PIL Image]
II --> KK[Display Generated Image]
JJ --> KK
GG --> LL[Ready for Next Attempt]
KK --> MM[Generation Complete<br/>Image Ready for Download]
%% Styling
classDef startEnd fill:#e1f5fe,stroke:#01579b,stroke-width:3px,color:#000
classDef process fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000
classDef decision fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
classDef success fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
classDef error fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
classDef model fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#000
class A,W,MM,LL startEnd
class C,D,F,G,H,I,J,K,L,N,O,P,Q,R,S,X,Y,AA,BB,CC,DD,FF,II,JJ,KK process
class B,E,M,T,Z,EE,HH decision
class U,MM success
class GG,V error
class G,R,CC model
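The "Filter Text" step in the flowchart can be illustrated with a small helper. EasyOCR's `readtext` returns a list of `(bounding_box, text, confidence)` tuples; the function below is only one possible reading of the three rules (confidence above 0.5, more than 2 characters, alphabetic only), and its name and parameters are hypothetical rather than taken from app.py.

```python
from typing import Iterable, List, Sequence, Tuple

# One EasyOCR detection: (bounding box, text, confidence).
Detection = Tuple[Sequence, str, float]


def filter_ocr_results(
    results: Iterable[Detection],
    min_confidence: float = 0.5,
    min_length: int = 2,
) -> List[str]:
    """Keep detections that are confident, longer than min_length, and purely alphabetic."""
    kept = []
    for _bbox, text, confidence in results:
        cleaned = text.strip()
        if (
            confidence > min_confidence
            and len(cleaned) > min_length
            and cleaned.isalpha()  # strict reading of "alphabetic only"
        ):
            kept.append(cleaned)
    return kept
```

Under this strict reading a detection such as "2025" is dropped, so the app's real filter may be looser than `str.isalpha()` if it is expected to keep text like "SUMMIT 2025".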
ImageInsightLLM/
├── app.py               # Main Streamlit app
├── requirements.txt     # Python dependencies
├── Dockerfile           # Containerization support
├── .env                 # Hugging Face API token (not committed)
├── test_hf_token.py     # Token validation script
├── README.md            # Project documentation
└── LICENSE              # MIT License
- Clone the repo:
  git clone https://github.com/NikithaKunapareddy/ImageInsightLLM.git
  cd ImageInsightLLM
- Set up Python 3.11 (required).
- Install dependencies:
  pip install -r requirements.txt
- Add your Hugging Face token:
  - Create a .env file:
    HF_TOKEN=your_huggingface_token
- Run the app:
  streamlit run app.py
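To show how the token from the .env file might be picked up at startup, here is a standard-library sketch. It is an assumption about the mechanism, not a description of app.py or test_hf_token.py, which may well use a library such as python-dotenv instead. The helper names are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_value(key: str, env_path: str = ".env") -> Optional[str]:
    """Return key from the process environment, else from a simple KEY=VALUE file."""
    if os.environ.get(key):
        return os.environ[key]
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'") or None
    return None


def require_hf_token(env_path: str = ".env") -> str:
    """Fail fast with a clear message when the token is missing or still a placeholder."""
    token = read_env_value("HF_TOKEN", env_path)
    if not token or token == "your_huggingface_token":
        raise RuntimeError("HF_TOKEN is missing; add it to .env or the environment.")
    return token
```

Failing early like this surfaces a missing token before the first API call rather than as an authorization error in the UI.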
- Python 3.11
- See requirements.txt for all dependencies
Models actually used in this app:
- Summarization model: facebook/bart-large-cnn (for dynamic text/image summaries)
- Image captioning model: Salesforce/blip-image-captioning-base (for generating captions from images)
- Text-to-image model: black-forest-labs/FLUX.1-dev (for generating images from text prompts)
Other core components: EasyOCR (text extraction) and Streamlit (user interface).
This project is designed to work with open-source LLMs that are suitable for machines with 8GB of RAM. Some recommended models you can try:
- DistilBART (sshleifer/distilbart-cnn-12-6): lightweight summarization model
- DistilGPT2 (distilgpt2): compact text generation
- TinyLlama (TinyLlama/TinyLlama-1.1B-Chat-v1.0): chat and general LLM tasks
- Phi-2 (microsoft/phi-2): small, efficient LLM for reasoning (at 2.7B parameters, it may need half precision or quantization to fit in 8GB)
These models are open-source, easy to run locally, and ideal for edge devices or laptops with limited memory.
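Swapping the summarizer for a lighter model could look like the sketch below. The model IDs come from this README; the profile names ("default", "light") and both function names are hypothetical. `transformers.pipeline("summarization", model=...)` is the standard Hugging Face entry point, imported lazily here so that picking a model does not require the library to be installed.

```python
from typing import Dict

# Model IDs taken from this README; the profile names are made up for illustration.
SUMMARIZER_MODELS: Dict[str, str] = {
    "default": "facebook/bart-large-cnn",
    "light": "sshleifer/distilbart-cnn-12-6",
}


def pick_summarizer(profile: str = "default") -> str:
    """Map a profile name to a Hugging Face model ID."""
    try:
        return SUMMARIZER_MODELS[profile]
    except KeyError:
        options = sorted(SUMMARIZER_MODELS)
        raise ValueError(f"Unknown profile {profile!r}; expected one of {options}") from None


def load_summarizer(profile: str = "default"):
    """Create a summarization pipeline for the chosen profile (downloads the model)."""
    # Lazy import: only needed when a pipeline is actually created.
    from transformers import pipeline

    return pipeline("summarization", model=pick_summarizer(profile))
```

A call such as `load_summarizer("light")(text, max_length=60, min_length=10)[0]["summary_text"]` would then return the summary string; the length limits shown are illustrative values, not the app's settings.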
- Hugging Face for their open models & API
- Salesforce Research for BLIP
- Facebook AI for BART
- JaidedAI for EasyOCR
- Streamlit for rapid prototyping
- All open-source contributors & the Python community
ImageInsightLLM was built to empower users with instant, intelligent visual understanding, right from their desktop. Whether you're a developer, researcher, or creative, this app brings the power of multimodal AI to your fingertips.
Unleash the power of vision and language, right from your desktop.
MIT License. See LICENSE for details.
