Multi-Modal Image Analysis and Generation Platform

A full-stack web application that leverages cutting-edge AI models for image analysis and generation. Built with FastAPI (backend) and Next.js (frontend).

🚀 Features

Image Analysis (3 Features)

  • Caption Generation - Automatically generates descriptive captions for images
  • Visual Question Answering (VQA) - Ask questions about images and get AI-powered answers
  • Object Detection - Identifies and lists objects present in images

Image Generation (2 Features)

  • Text-to-Image - Generate images from text descriptions
  • Image Variation - Create variations of existing images with custom modifications

🛠️ Tech Stack

Backend:

  • FastAPI (Python web framework)
  • SQLAlchemy (Database ORM)
  • Google Gemini 2.5 Flash (Vision analysis)
  • HuggingFace Stable Diffusion XL (Image generation)
  • AWS S3 (Image storage)
  • SQLite/PostgreSQL (Database)

Frontend:

  • Next.js 16 (React framework)
  • TypeScript
  • Tailwind CSS
  • App Router

📋 Prerequisites

  • Python 3.11+
  • Node.js 18+
  • AWS Account (S3 bucket)
  • Google AI Studio API Key
  • HuggingFace API Token

🔧 Setup

Backend Setup

  1. Navigate to the backend directory:

     cd backend

  2. Install dependencies:

     uv sync

  3. Create a .env file:

     # AI API Keys
     GOOGLE_API_KEY=your_google_api_key
     HUGGINGFACE_TOKEN=your_huggingface_token

     # AWS S3
     AWS_ACCESS_KEY_ID=your_aws_access_key
     AWS_SECRET_ACCESS_KEY=your_aws_secret_key
     AWS_REGION=ap-southeast-1
     AWS_BUCKET_NAME=your-bucket-name

     # Database (optional - defaults to SQLite)
     DATABASE_URL=sqlite:///./app.db

     # Generation Provider (optional)
     GENERATION_PROVIDER=huggingface

  4. Start the backend server:

     uv run uvicorn app.main:app --reload --port 8000

Backend will be available at http://127.0.0.1:8000
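As a minimal sketch of how the backend might consume these environment variables (using only the standard library; `load_settings` is an illustrative name, not the project's actual settings code), note that the two optional variables fall back to the defaults shown above:

```python
import os

def load_settings() -> dict:
    """Read backend configuration from environment variables.

    The two optional variables fall back to SQLite and the
    HuggingFace provider, matching the .env defaults above.
    """
    return {
        "google_api_key": os.environ["GOOGLE_API_KEY"],        # required
        "huggingface_token": os.environ["HUGGINGFACE_TOKEN"],  # required
        "database_url": os.getenv("DATABASE_URL", "sqlite:///./app.db"),
        "generation_provider": os.getenv("GENERATION_PROVIDER", "huggingface"),
    }
```

Keeping secrets in environment variables (rather than in code) is what lets the same codebase run locally against SQLite and in production against PostgreSQL.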

Frontend Setup

  1. Navigate to the frontend directory:

     cd frontend

  2. Install dependencies:

     npm install

  3. Create a .env.local file:

     NEXT_PUBLIC_API_URL=http://127.0.0.1:8000

  4. Start the development server:

     npm run dev

Frontend will be available at http://localhost:3000

🚀 Quick Start (Using Scripts)

Easy way to start both servers at once:

Windows (PowerShell)

.\scripts\start-app-split.ps1

Linux/Mac (Bash)

chmod +x scripts/*.sh    # Make scripts executable (first time only)
./scripts/start-app-split.sh

This opens two terminal windows: one for the backend server and one for the frontend server.

See scripts/README.md for more options (single terminal, individual servers, etc.)

🎯 Usage

  1. Start both backend and frontend servers (see Quick Start above)
  2. Open http://localhost:3000 in your browser
  3. Select a feature from the home page
  4. Upload images or enter prompts
  5. View AI-generated results

🏗️ Application Architecture

High-Level Overview

┌─────────────┐      HTTP/REST      ┌──────────────┐
│   Next.js   │◄───────────────────►│   FastAPI    │
│   Frontend  │     (CORS enabled)   │   Backend    │
│  (Port 3000)│                      │  (Port 8000) │
└─────────────┘                      └──────┬───────┘
                                            │
                    ┌───────────────────────┼───────────────────────┐
                    │                       │                       │
              ┌─────▼──────┐         ┌─────▼──────┐         ┌─────▼──────┐
              │   Google   │         │ HuggingFace│         │   AWS S3   │
              │  Gemini    │         │   Stable   │         │   Storage  │
              │ 2.5 Flash  │         │ Diffusion  │         │            │
              │  (Vision)  │         │    (Gen)   │         │ (Presigned │
              └────────────┘         └────────────┘         │    URLs)   │
                                                            └─────┬──────┘
                                                                  │
                                                            ┌─────▼──────┐
                                                            │  SQLite/   │
                                                            │ PostgreSQL │
                                                            │  Database  │
                                                            └────────────┘

Component Breakdown

Frontend Layer (Next.js)

  • UI Components: Reusable React components (ImageUpload, LoadingSpinner, etc.)
  • Pages: Route-specific pages for each feature (/caption, /vqa, etc.)
  • API Service: TypeScript client for backend communication with type safety
  • State Management: React hooks for local state and async operations

Backend Layer (FastAPI)

  • API Endpoints: RESTful endpoints for analysis and generation
  • AI Service: Integration layer for AI model APIs (Gemini, HuggingFace)
  • Database Models: SQLAlchemy models for data persistence
  • S3 Service: Image upload/storage with presigned URL generation
  • Background Tasks: Async job processing for image generation
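To illustrate the background-task pattern described above (class and method names here are illustrative, not the project's actual code), a minimal in-memory job registry might look like:

```python
import uuid

class JobStore:
    """Minimal in-memory job registry sketching the async generation flow:
    a job is created as 'pending', then a background worker marks it
    'completed' (with a result image URL) or 'failed'."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def create(self, task_type: str) -> str:
        # Short random ID, in the spirit of the "abc123" example responses
        job_id = uuid.uuid4().hex[:6]
        self._jobs[job_id] = {
            "task_type": task_type,
            "status": "pending",
            "result_image_url": None,
        }
        return job_id

    def complete(self, job_id: str, image_url: str) -> None:
        self._jobs[job_id].update(status="completed", result_image_url=image_url)

    def status(self, job_id: str) -> dict:
        return self._jobs[job_id]
```

In the real backend this state is persisted via the SQLAlchemy models rather than a dict, so job status survives a server restart.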

Data Flow

  1. User uploads image → Frontend sends to /upload endpoint
  2. Backend uploads to S3 → Returns presigned URL
  3. User triggers analysis/generation → API calls AI services
  4. Results stored in database → Returned to frontend
  5. Frontend displays results with images from S3

📖 API Usage Examples

Example 1: Caption Generation

Request:

curl -X POST "http://127.0.0.1:8000/api/analyze/caption" \
  -H "Content-Type: application/json" \
  -d '{
    "image_url": "https://your-bucket.s3.amazonaws.com/uploads/image.png"
  }'

Response:

{
  "success": true,
  "data": {
    "id": 1,
    "caption": "A serene mountain landscape with clouds at sunset",
    "created_at": "2025-12-20T10:30:00"
  }
}
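All endpoints wrap their results in the same success/data envelope, so a client can unwrap responses uniformly. A sketch of that unwrapping (`unwrap` is a hypothetical helper, not part of the repo):

```python
import json

def unwrap(response_text: str) -> dict:
    """Parse an API response and return its `data` payload,
    raising if the call did not succeed."""
    body = json.loads(response_text)
    if not body.get("success"):
        raise RuntimeError(f"API call failed: {body}")
    return body["data"]

# The caption response shown above:
caption_response = """{"success": true,
  "data": {"id": 1,
           "caption": "A serene mountain landscape with clouds at sunset",
           "created_at": "2025-12-20T10:30:00"}}"""
print(unwrap(caption_response)["caption"])
# → A serene mountain landscape with clouds at sunset
```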

Example 2: Visual Question Answering

Request:

curl -X POST "http://127.0.0.1:8000/api/analyze/vqa" \
  -H "Content-Type: application/json" \
  -d '{
    "image_url": "https://your-bucket.s3.amazonaws.com/uploads/image.png",
    "question": "What color is the car?"
  }'

Response:

{
  "success": true,
  "data": {
    "id": 2,
    "question": "What color is the car?",
    "answer": "The car is red",
    "created_at": "2025-12-20T10:31:00"
  }
}

Example 3: Text-to-Image Generation (Async)

Step 1 - Start Generation:

curl -X POST "http://127.0.0.1:8000/api/generate/text-to-image" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A futuristic city with flying cars at night"
  }'

Response:

{
  "success": true,
  "message": "Image generation job started",
  "data": {
    "job_id": "abc123",
    "status": "pending",
    "check_status_url": "/api/jobs/abc123"
  }
}

Step 2 - Check Status (Poll every 3-5 seconds):

curl -X GET "http://127.0.0.1:8000/api/jobs/abc123"

Response (completed):

{
  "success": true,
  "data": {
    "job_id": "abc123",
    "task_type": "text_to_image",
    "status": "completed",
    "result_image_url": "https://your-bucket.s3.amazonaws.com/generated/result.png",
    "created_at": "2025-12-20T10:32:00",
    "completed_at": "2025-12-20T10:32:25"
  }
}
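The poll loop above can be sketched as a small helper. Here `fetch_status` is a stand-in for whatever HTTP client you use to GET the `check_status_url`; the helper itself only encodes the wait/retry logic:

```python
import time

def poll_job(fetch_status, interval: float = 3.0, timeout: float = 120.0) -> dict:
    """Call fetch_status() every `interval` seconds until the job reaches a
    terminal state ('completed' or 'failed') or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("generation job did not finish in time")
```

With the 3-5 second interval suggested above, a typical text-to-image job (roughly 25 seconds in the sample response) takes a handful of polls.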

🔑 API Key Configuration Guide

1. Google Gemini API Key

  1. Visit Google AI Studio
  2. Sign in with your Google account
  3. Click "Get API Key" → "Create API key"
  4. Copy the key and add to backend .env:
    GOOGLE_API_KEY=AIzaSyXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    

2. HuggingFace API Token

  1. Visit HuggingFace Tokens
  2. Sign in or create an account
  3. Click "New token" → Select "Read" access
  4. Copy the token and add to backend .env:
    HUGGINGFACE_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    

3. AWS S3 Configuration

  1. Log into AWS Console
  2. Navigate to IAM → Users → Create User
  3. Attach policy: AmazonS3FullAccess (or custom policy)
  4. Create Access Key → Download credentials
  5. Create an S3 bucket in your preferred region
  6. Configure CORS (see S3_CORS_SETUP.md for details)
  7. Add credentials to backend .env:
    AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
    AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    AWS_REGION=ap-southeast-1
    AWS_BUCKET_NAME=your-bucket-name
    

Important:

  • For image downloads to work properly, configure S3 CORS (see S3_CORS_SETUP.md)
  • For production, use IAM roles and restrict S3 bucket policies appropriately
  • Download feature works without CORS using fallback methods, but CORS improves UX
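As an illustration only (S3_CORS_SETUP.md is the authoritative reference for this project), an S3 CORS rule allowing the local frontend to fetch and upload images might look like:

```json
[
  {
    "AllowedOrigins": ["http://localhost:3000"],
    "AllowedMethods": ["GET", "PUT"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": [],
    "MaxAgeSeconds": 3000
  }
]
```

For production, replace the localhost origin with your deployed frontend's domain.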

📁 Project Structure

Multi-Modal-Image-Analysis-and-Generation-Platform/
├── backend/
│   ├── app/
│   │   ├── main.py           # FastAPI app & endpoints
│   │   ├── ai_service.py     # AI model integration
│   │   ├── models.py         # Database models
│   │   ├── database.py       # Database configuration
│   │   └── s3_service.py     # S3 upload handling
│   ├── requirements.txt
│   └── pyproject.toml
├── frontend/
│   ├── app/
│   │   ├── page.tsx          # Home page
│   │   ├── caption/          # Caption generation page
│   │   ├── vqa/              # Visual Q&A page
│   │   ├── object-detection/ # Object detection page
│   │   ├── text-to-image/    # Text-to-image page
│   │   └── variation/        # Image variation page
│   ├── components/           # Shared React components
│   ├── lib/
│   │   └── api.ts            # API service layer
│   └── package.json
└── EVALUATION.md             # Testing documentation

📊 API Endpoints

Analysis Endpoints

  • POST /upload - Upload image to S3
  • POST /api/analyze/caption - Generate image caption
  • POST /api/analyze/vqa - Visual question answering
  • POST /api/analyze/object-detection - Detect objects

Generation Endpoints (Async)

  • POST /api/generate/text-to-image - Generate image from text
  • POST /api/generate/variation - Create image variation
  • GET /api/jobs/{job_id} - Check generation job status

✅ Testing

All features have been tested and verified working. See EVALUATION.md for detailed test results and examples.

Test Results: 6/6 features passing (100%)

🎥 Demo

Demo Video: Link

To create a demo video:

  1. Record a 5-minute walkthrough showing all features
  2. Upload to YouTube, Loom, or similar platform
  3. Add the link above

🔐 Security Notes

  • S3 bucket uses presigned URLs (7-day expiry)
  • CORS enabled for frontend access
  • API keys managed via environment variables
  • Database credentials secured

📝 License

This project is for educational purposes.

👨‍💻 Author

Akhilesh Malthi


Note: Make sure both backend and frontend servers are running for the application to work properly.
