The Multimodal AI Assistant is a Streamlit-powered interactive system that generates text, images, audio, and video from a single input message. A LangGraph-powered router inspects each user query and dispatches it to the appropriate generation model. It supports:
- Natural language text generation (Gemini 1.5 Flash)
- Image creation (Gemini 2.0 Flash Image Model)
- Audio generation (ElevenLabs)
- Video generation (ModelsLab)
### Tech Stack

| Component | Tech / API |
|---|---|
| Text Generation | Google Gemini 1.5 Flash |
| Image Generation | Google Gemini 2.0 Flash (Image Preview) |
| Audio Generation | ElevenLabs API |
| Video Generation | ModelsLab CogVideoX |
| State Management | LangGraph (StateGraph + Router Node) |
| Web UI | Streamlit |
| Prompt Refinement | LLM-based System Prompts |
### Environment Variables

Ensure you have the following environment variables defined in a `.env` file:

```
GOOGLE_API_KEY=your_gemini_api_key
ELEVEN_API_KEY=your_elevenlabs_api_key
STABLE_DIFFUSION_API_KEY=your_modelslab_api_key
```

On startup, the app:

- Validates the presence of all required API keys.
- Stops execution if any are missing.
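The key check can be sketched as a small pure helper (the names here are illustrative, not taken from the codebase):

```python
import os

# Keys the assistant needs before any model client is constructed.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "ELEVEN_API_KEY", "STABLE_DIFFUSION_API_KEY")

def missing_keys(env=os.environ):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# In the Streamlit app this would be followed by something like:
#   if missing_keys():
#       st.error(f"Missing API keys: {missing_keys()}")
#       st.stop()
```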
### Model Loading

Loads and caches all model clients:

- Google Gemini LLM
- Google Gemini image model
- ElevenLabs client

Returns a centralized `models` dictionary.
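The caching pattern can be sketched without the real SDKs; `make_llm` and friends below are hypothetical placeholders for the actual client constructors, and `functools.cache` stands in for Streamlit's `@st.cache_resource`:

```python
from functools import cache

def make_llm():          # placeholder for the Gemini chat-model constructor
    return object()

def make_image_model():  # placeholder for the Gemini image-model constructor
    return object()

def make_tts_client():   # placeholder for the ElevenLabs client constructor
    return object()

@cache  # like @st.cache_resource: build the clients once, reuse on every rerun
def load_models():
    return {
        "llm": make_llm(),
        "image": make_image_model(),
        "tts": make_tts_client(),
    }
```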
### Graph State

A `TypedDict` defining the shared state:

- `messages: List[BaseMessage]`
- `generation_output: str`
- `generation_type: str`

### Text Generation

- Sends user messages to Gemini and returns the AI response.
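A text-generation node in this style can be sketched as follows; the factory shape and names are illustrative, assuming the LangChain chat-model interface (`invoke` returning an object with a `.content` attribute):

```python
def make_text_node(llm):
    """Build a LangGraph node that sends the message history to the LLM."""
    def generate_text(state):
        # `state` follows the GraphState shape: messages in, output + type out.
        response = llm.invoke(state["messages"])
        return {"generation_output": response.content, "generation_type": "text"}
    return generate_text
```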
### Audio Generation

- Extracts "speakable" text using a system prompt.
- Uses ElevenLabs for text-to-speech (TTS).
- Saves the output as `generated_audio.mp3`.
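Saving the result can be sketched as below, assuming (as newer ElevenLabs SDKs do) that the TTS call yields audio as an iterable of byte chunks:

```python
def save_audio(audio_chunks, path="generated_audio.mp3"):
    """Write streamed audio bytes to an MP3 file and return its path."""
    with open(path, "wb") as f:
        for chunk in audio_chunks:
            f.write(chunk)
    return path
```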
### Image Generation

- Uses the Gemini 2.0 preview model to generate a base64-encoded image.
- Decodes the payload and saves it as `generated_image.png`.
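The decode-and-save step (the model call itself is omitted) can be sketched as:

```python
import base64

def save_base64_image(b64_data, path="generated_image.png"):
    """Decode a base64 image payload and write the raw bytes to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))
    return path
```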
### Video Generation

- Refines the input with a system prompt.
- Calls ModelsLab's `/text2video` API.
- Polls for the result and returns the video URL.
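The polling loop can be sketched generically; the response shape (`status` / `output` fields) follows the ModelsLab pattern but should be treated as an assumption:

```python
import time

def poll_for_video(fetch_status, interval=5.0, max_attempts=20, sleep=time.sleep):
    """Poll until the job reports success, then return the video URL.

    `fetch_status` is any callable returning a dict such as
    {"status": "processing"} or {"status": "success", "output": ["<url>"]}.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        if result.get("status") == "success":
            return result["output"][0]
        sleep(interval)  # wait before asking again
    raise TimeoutError("video generation did not finish in time")
```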
### Router Node

Uses basic keyword matching to infer generation intent:

- "say" or "speak" → route to audio
- "draw" or "image" → route to image
- "video" or "clip" → route to video
- otherwise → default to text

The decision is stored in the state as `generation_type`.
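The keyword matching above can be sketched directly (the function name is illustrative):

```python
def route_decision(message):
    """Map a user message to a generation type via simple keyword matching."""
    text = message.lower()
    if "say" in text or "speak" in text:
        return "audio"
    if "draw" in text or "image" in text:
        return "image"
    if "video" in text or "clip" in text:
        return "video"
    return "text"
```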
### Graph Construction

The LangGraph `StateGraph` handles:

- Message state updates
- Routing to the appropriate generation function

Nodes:

- `"router"` → `router_node`
- `"text"`, `"audio"`, `"image"`, `"video"` → the respective generation functions
Example:

```python
graph = StateGraph(GraphState)
graph.add_node("router", router_node)
graph.add_node("generate_text", generate_text)
graph.add_node("generate_audio", generate_audio)
graph.add_node("generate_image", generate_image)
graph.add_node("generate_video", generate_video)
graph.add_conditional_edges(
    "router",
    lambda state: state["generation_type"],
    {
        "text": "generate_text",
        "audio": "generate_audio",
        "image": "generate_image",
        "video": "generate_video",
    },
)
graph.set_entry_point("router")
for node in ("generate_text", "generate_audio", "generate_image", "generate_video"):
    graph.set_finish_point(node)
app = graph.compile()
```

### Streamlit UI

- Automatically sets the page title and icon.
- Loads models only once via `@st.cache_resource`.
- Renders output conditionally:
  - Text via `st.markdown`
  - Audio via `st.audio`
  - Image via `st.image`
  - Video via `st.video`
### Example Prompts

| Prompt | Output |
|---|---|
| "Say 'Hello, welcome to the AI world!'" | Audio |
| "Draw a futuristic city with flying cars" | Image |
| "Generate a cinematic video of a dragon flying over mountains" | Video |
| "What is quantum computing?" | Text |
### Key Features

- Dynamic prompt refinement using LLMs.
- Multi-modal routing logic with LangGraph.
- Real-time TTS, image, and video output integration.
- Robust error handling and fallback messaging.
### Future Improvements

- Add a retry mechanism for video polling.
- Allow user to choose between multiple image/video styles.
- Integrate user upload for image-to-video tasks.
- Add analytics/logging for usage tracking.