"This is... Requiem. What you are seeing is indeed the truth. But you will never arrive at the truth that is going to happen." — Giorno Giovanna
Pearl Jam Requiem is an AI-powered Stand that watches your mom's cooking videos and breaks them down into step-by-step recipes — automatically. You upload a video, and the AI pipeline hears every word (Whisper), sees every frame (LLaVA), and remembers every step (SQLite). The result is an interactive recipe player that loops each cooking step so you can follow along without rewinding.
Named after Tonio Trussardi's Stand Pearl Jam from JoJo's Bizarre Adventure: Diamond is Unbreakable — a Stand that channels culinary perfection. This is its Requiem evolution: it doesn't just cook, it understands cooking.
| Ability | What It Does | Powered By |
|---|---|---|
| HEAR | Transcribes spoken instructions from video (Hindi → English) | Faster-Whisper (medium, int8) |
| SEE | Analyzes extracted video frames to describe cooking actions | LLaVA via Ollama |
| EXTRACT | Pulls key frames at the start of each spoken segment | FFmpeg |
| REMEMBER | Stores every recipe and step persistently | SQLite + SQLAlchemy |
| PLAY | Interactive step-by-step player with auto-looping video | React + TypeScript |
Pearl_Jam_Requiem/
├── backend/ # FastAPI — The Stand's Brain
│ ├── app/
│ │ ├── main.py # App entry, CORS, static mounts
│ │ ├── api/routes.py # API endpoints (upload, list, detail)
│ │ ├── db/
│ │ │ ├── database.py # SQLite connection (nusqa.db)
│ │ │ └── models.py # Recipe & RecipeStep ORM models
│ │ ├── schemas/recipe.py # Pydantic response schemas
│ │ └── services/
│ │ ├── audio.py # Whisper transcription pipeline
│ │ ├── vision.py # LLaVA frame analysis
│ │ └── video.py # FFmpeg frame extraction
│ ├── media/
│ │ ├── uploads/ # Uploaded video files
│ │ └── temp/ # Extracted frame images
│ └── requirements.txt
│
└── frontend/ # React + Vite — The Stand's Face
└── src/
├── App.tsx # Router (/ and /recipe/:id)
├── Home.tsx # Recipe grid + upload
├── RecipePlayer.tsx # Video player + step guide
├── index.css # Tailwind + global styles
└── main.tsx # React entry point
When you upload a cooking video, this is what happens behind the scenes:
📹 Video Upload
│
├─ 1. HEAR (Whisper)
│ └─ Transcribes audio → segments with timestamps
│ Model: faster-whisper (medium, int8 quantized for CPU)
│ Language: Hindi → English translation
│
├─ 2. For each segment:
│ │
│ ├── EXTRACT (FFmpeg)
│ │ └─ Pulls a high-quality JPEG frame at segment start
│ │
│ ├── SEE (LLaVA via Ollama)
│ │ └─ Describes the cooking action in the frame
│ │ Uses Whisper transcript as context for accuracy
│ │
│ └── SAVE (SQLite)
│ └─ Saves step incrementally (so you see progress live)
│
└─ ✅ Recipe fully processed — all steps ready to play
The entire pipeline runs as a background task — the upload returns immediately while the AI works. No waiting.
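The per-segment loop above can be sketched roughly like this (the function and field names are illustrative, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the video
    end: float
    text: str      # Whisper's (translated) transcript for this span

def build_steps(segments):
    """Turn Whisper segments into ordered recipe-step records.

    In the real pipeline each iteration would also extract a frame
    (FFmpeg), describe it (LLaVA), and commit the row (SQLite) before
    moving on — which is why progress is visible while processing runs.
    """
    steps = []
    for i, seg in enumerate(segments, start=1):
        steps.append({
            "step_number": i,
            "start_time": seg.start,
            "end_time": seg.end,
            "instruction": seg.text.strip(),
        })
    return steps

demo = build_steps([Segment(0.0, 15.5, "Add oil to the pan "),
                    Segment(15.5, 30.0, "Add chopped onions")])
```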
┌─────────────────────────┐        ┌──────────────────────────────────┐
│ recipes                 │        │ recipe_steps                     │
├─────────────────────────┤        ├──────────────────────────────────┤
│ id INT (PK)             │◄──┐    │ id INT (PK)                      │
│ title STRING            │   └────│ recipe_id INT (FK)               │
│ video_filename STRING   │        │ step_number INT                  │
│ created_at STRING       │        │ start_time FLOAT                 │
└─────────────────────────┘        │ end_time FLOAT                   │
                                   │ instruction TEXT                 │
                                   │ visual_description TEXT          │
                                   │ video_loop_url STRING            │
                                   └──────────────────────────────────┘
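A minimal sqlite3 sketch of this schema (the project itself defines these tables as SQLAlchemy models in db/models.py; column names follow the diagram, the sample values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the app uses ./nusqa.db on disk
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE recipes (
    id             INTEGER PRIMARY KEY,
    title          TEXT,
    video_filename TEXT,
    created_at     TEXT
);
CREATE TABLE recipe_steps (
    id                 INTEGER PRIMARY KEY,
    recipe_id          INTEGER REFERENCES recipes(id),
    step_number        INTEGER,
    start_time         REAL,
    end_time           REAL,
    instruction        TEXT,
    visual_description TEXT,
    video_loop_url     TEXT
);
""")
conn.execute("INSERT INTO recipes (title, video_filename, created_at) "
             "VALUES (?, ?, ?)", ("Chicken Curry", "curry.mp4", "2026-03-07"))
conn.execute("INSERT INTO recipe_steps (recipe_id, step_number, start_time, "
             "end_time, instruction) VALUES (1, 1, 0.0, 15.5, 'Heat the oil')")
row = conn.execute("SELECT r.title, s.instruction FROM recipes r "
                   "JOIN recipe_steps s ON s.recipe_id = r.id").fetchone()
```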
All recipe routes are prefixed with /api; the bare root / serves as a health check.
| Method | Endpoint | Params | Description |
|---|---|---|---|
| GET | / | — | Health check. Returns {"stand_user": "Faraz"} |
| POST | /api/upload | title (query), file (form) | Upload a video — triggers background AI pipeline |
| GET | /api/recipes | skip, limit (query) | List all recipes with their steps |
| GET | /api/recipes/{id} | id (path) | Get a single recipe with all processed steps |
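For illustration, the endpoints above can be addressed like this (helper names and the default skip/limit values are assumptions, not part of the API):

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:8000"  # default uvicorn address

def upload_url(title: str) -> str:
    """POST target for /api/upload — the title goes in the query string,
    while the video file travels in the multipart form body."""
    return f"{BASE}/api/upload?{urlencode({'title': title})}"

def recipes_url(skip: int = 0, limit: int = 20) -> str:
    """GET target for listing recipes with pagination params."""
    return f"{BASE}/api/recipes?{urlencode({'skip': skip, 'limit': limit})}"

# e.g. with the requests library (not in the backend's requirements):
# requests.post(upload_url("Mummy's Chicken Curry"),
#               files={"file": open("chicken_curry.mp4", "rb")})
```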
Response shape for a recipe:
{
"id": 1,
"title": "Mummy's Chicken Curry",
"video_filename": "chicken_curry.mp4",
"created_at": "2026-03-07T14:30:00",
"steps": [
{
"id": 1,
"step_number": 1,
"start_time": 0.0,
"end_time": 15.5,
"instruction": "Add oil to the pan and heat it up",
"visual_description": "A hand pouring oil into a heated wok",
"video_loop_url": "/media/temp/frame_at_0.jpg"
}
]
}
- Recipe Grid — Cards showing title, date, and step count
- Upload Button — Accepts video/*, sends file + auto-generated title
- Empty State — Friendly message when no recipes exist yet
- Error Banner — Shows if the backend is unreachable
- Split Layout — Video on the left, step guide on the right (responsive)
- Auto-Looping Video — Each step loops within its start_time → end_time range
- Step Navigation — Click any step to seek the video instantly
- Current Step Overlay — Shows the active instruction on the video
- Play/Pause Control — Manual override button
- AI Vision Notes — Each step shows what LLaVA "saw" in the frame
- Processing State — "Still analyzing..." message if the AI pipeline hasn't finished
- Python 3.10+
- Node.js 18+
- FFmpeg installed and on PATH
- Ollama installed (ollama.com)
git clone https://github.com/farazmirzax/pearl-jam-requiem.git
cd pearl-jam-requiem

cd backend
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
uvicorn app.main:app --reload

Backend runs on http://127.0.0.1:8000
cd frontend
npm install
npm run dev

Frontend runs on http://localhost:5173
ollama serve # Start the Ollama server
ollama pull llava # Download the LLaVA vision model (~4.7GB)

- Open http://localhost:5173
- Click Upload Video and select a cooking video
- Watch the terminal — the AI pipeline logs every step as it processes
- Once done, click the recipe card to open the interactive step-by-step player
| Setting | Location | Default | Description |
|---|---|---|---|
| Whisper model | services/audio.py | medium | Model size (tiny, base, small, medium, large) |
| Compute type | services/audio.py | int8 | Quantization (optimized for 16GB RAM / CPU) |
| Audio language | services/audio.py | hi (Hindi) | Source language for transcription |
| Task | services/audio.py | translate | translate (→ English) or transcribe (keep original) |
| Vision model | services/vision.py | llava | Ollama model for frame analysis |
| Database | db/database.py | sqlite:///./nusqa.db | SQLite database file |
| CORS origins | app/main.py | localhost:5173, 127.0.0.1:5173 | Allowed frontend origins |
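As a rough sketch, the Whisper settings from the table might be wired up like this (the dict and its keys are illustrative, not the project's actual code in services/audio.py):

```python
# Settings mirroring the configuration table above.
WHISPER_SETTINGS = {
    "model_size": "medium",   # tiny | base | small | medium | large
    "compute_type": "int8",   # quantization for CPU / 16GB RAM
    "language": "hi",         # source language (Hindi)
    "task": "translate",      # translate -> English, or transcribe
}

# Usage (requires faster-whisper installed):
# from faster_whisper import WhisperModel
# model = WhisperModel(WHISPER_SETTINGS["model_size"],
#                      compute_type=WHISPER_SETTINGS["compute_type"])
# segments, info = model.transcribe("video.mp4",
#                                   language=WHISPER_SETTINGS["language"],
#                                   task=WHISPER_SETTINGS["task"])
```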
| Package | Purpose |
|---|---|
| FastAPI | Web framework + background tasks |
| Uvicorn | ASGI server |
| SQLAlchemy | ORM for SQLite |
| Pydantic v2 | Request/response validation |
| faster-whisper | Speech-to-text (CPU-optimized) |
| ollama (Python) | Client for local LLaVA model |
| ffmpeg-python | Video frame extraction |
| Pillow | Image processing |
| Package | Purpose |
|---|---|
| React 19 | UI library |
| TypeScript | Type safety |
| Vite | Build tool + dev server |
| Tailwind CSS 4 | Utility-first styling |
| React Router 7 | Client-side routing |
| Axios | HTTP client |
| Lucide React | Icon library |
In JoJo's Bizarre Adventure: Diamond is Unbreakable, Tonio Trussardi is a chef whose Stand, Pearl Jam, infuses his cooking with healing power. Every dish he makes is perfect — tailored to the person eating it.
This project is Pearl Jam's Requiem evolution. It doesn't cook the food — it watches someone cook and breaks down the knowledge into something anyone can follow. It hears, it sees, it remembers.
Your mom's recipes, preserved by a Stand. Arrivederci to forgotten family dishes.
Faraz — @farazmirzax