The Multimodal AI Assistant is a Streamlit-powered interactive system that generates text, images, audio, and video from a single input message. A LangGraph-powered router inspects each user query and dispatches it to the appropriate generation model. It supports:
- Natural language text generation (Gemini 1.5 Flash)
- Image creation (Gemini 2.0 Flash Image Model)
- Audio generation (ElevenLabs)
- Video generation (ModelsLab)
### Tech Stack

| Component | Tech / API |
|---|---|
| Text Generation | Google Gemini 1.5 Flash |
| Image Generation | Google Gemini 2.0 Flash (Image Preview) |
| Audio Generation | ElevenLabs API |
| Video Generation | ModelsLab CogVideoX |
| State Management | LangGraph (StateGraph + Router Node) |
| Web UI | Streamlit |
| Prompt Refinement | LLM-based System Prompts |
### Environment Variables

Ensure you have the following environment variables defined in a `.env` file:

```
GOOGLE_API_KEY=your_gemini_api_key
ELEVEN_API_KEY=your_elevenlabs_api_key
STABLE_DIFFUSION_API_KEY=your_modelslab_api_key
```

On startup, the app:

- Validates the presence of all required API keys.
- Stops execution if any are missing.
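The key check can be sketched as a small pure helper (the names here are illustrative, not taken from the codebase):

```python
import os

# Keys the assistant needs before any model client is constructed.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "ELEVEN_API_KEY", "STABLE_DIFFUSION_API_KEY")

def missing_keys(env=os.environ):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# In the Streamlit app this would be followed by something like:
#   if missing_keys():
#       st.error(f"Missing API keys: {missing_keys()}")
#       st.stop()
```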
### Model Loading

Loads and caches all model clients:

- Google Gemini LLM
- Google Gemini image model
- ElevenLabs client

Returns a centralized `models` dictionary.
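The caching pattern can be sketched without the real SDKs; `make_llm` and friends below are hypothetical placeholders for the actual client constructors, and `functools.cache` stands in for Streamlit's `@st.cache_resource`:

```python
from functools import cache

def make_llm():          # placeholder for the Gemini chat-model constructor
    return object()

def make_image_model():  # placeholder for the Gemini image-model constructor
    return object()

def make_tts_client():   # placeholder for the ElevenLabs client constructor
    return object()

@cache  # like @st.cache_resource: build the clients once, reuse on every rerun
def load_models():
    return {
        "llm": make_llm(),
        "image": make_image_model(),
        "tts": make_tts_client(),
    }
```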
### Graph State

A `TypedDict` defining the shared state:

- `messages: List[BaseMessage]`
- `generation_output: str`
- `generation_type: str`

### Text Generation

- Sends user messages to Gemini and returns the AI response.
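A text-generation node in this style can be sketched as follows; the factory shape and names are illustrative, assuming the LangChain chat-model interface (`invoke` returning an object with a `.content` attribute):

```python
def make_text_node(llm):
    """Build a LangGraph node that sends the message history to the LLM."""
    def generate_text(state):
        # `state` follows the GraphState shape: messages in, output + type out.
        response = llm.invoke(state["messages"])
        return {"generation_output": response.content, "generation_type": "text"}
    return generate_text
```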
### Audio Generation

- Extracts "speakable" text using a system prompt.
- Uses ElevenLabs for text-to-speech (TTS).
- Saves the output as `generated_audio.mp3`.
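Saving the result can be sketched as below, assuming (as newer ElevenLabs SDKs do) that the TTS call yields audio as an iterable of byte chunks:

```python
def save_audio(audio_chunks, path="generated_audio.mp3"):
    """Write streamed audio bytes to an MP3 file and return its path."""
    with open(path, "wb") as f:
        for chunk in audio_chunks:
            f.write(chunk)
    return path
```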
### Image Generation

- Uses the Gemini 2.0 preview model to generate a base64-encoded image.
- Decodes the payload and saves it as `generated_image.png`.
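The decode-and-save step (the model call itself is omitted) can be sketched as:

```python
import base64

def save_base64_image(b64_data, path="generated_image.png"):
    """Decode a base64 image payload and write the raw bytes to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))
    return path
```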
### Video Generation

- Refines the input with a system prompt.
- Calls ModelsLab's `/text2video` API.
- Polls for the result and returns the video URL.
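The polling loop can be sketched generically; the response shape (`status` / `output` fields) follows the ModelsLab pattern but should be treated as an assumption:

```python
import time

def poll_for_video(fetch_status, interval=5.0, max_attempts=20, sleep=time.sleep):
    """Poll until the job reports success, then return the video URL.

    `fetch_status` is any callable returning a dict such as
    {"status": "processing"} or {"status": "success", "output": ["<url>"]}.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        if result.get("status") == "success":
            return result["output"][0]
        sleep(interval)  # wait before asking again
    raise TimeoutError("video generation did not finish in time")
```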
### Router Node

Uses basic keyword matching to infer generation intent:

- "say" or "speak" → route to audio
- "draw" or "image" → route to image
- "video" or "clip" → route to video
- otherwise → default to text

The decision is stored in the state as `generation_type`.
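The keyword matching above can be sketched directly (the function name is illustrative):

```python
def route_decision(message):
    """Map a user message to a generation type via simple keyword matching."""
    text = message.lower()
    if "say" in text or "speak" in text:
        return "audio"
    if "draw" in text or "image" in text:
        return "image"
    if "video" in text or "clip" in text:
        return "video"
    return "text"
```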
### Graph Construction

The LangGraph `StateGraph` handles:

- Message state updates
- Routing to the appropriate generation function

Nodes:

- `"router"` → `router_node`
- `"text"`, `"audio"`, `"image"`, `"video"` → the respective generation functions
Example:

```python
graph = StateGraph(GraphState)
graph.add_node("router", router_node)
graph.add_node("generate_text", generate_text)
graph.add_node("generate_audio", generate_audio)
graph.add_node("generate_image", generate_image)
graph.add_node("generate_video", generate_video)
graph.add_conditional_edges(
    "router",
    lambda state: state["generation_type"],
    {
        "text": "generate_text",
        "audio": "generate_audio",
        "image": "generate_image",
        "video": "generate_video",
    },
)
graph.set_entry_point("router")
for node in ("generate_text", "generate_audio", "generate_image", "generate_video"):
    graph.set_finish_point(node)
app = graph.compile()
```

### Streamlit UI

- Automatically sets the page title and icon.
- Loads models only once via `@st.cache_resource`.
- Renders output conditionally:
  - Text via `st.markdown`
  - Audio via `st.audio`
  - Image via `st.image`
  - Video via `st.video`
### Example Prompts

| Prompt | Output |
|---|---|
| "Say 'Hello, welcome to the AI world!'" | Audio |
| "Draw a futuristic city with flying cars" | Image |
| "Generate a cinematic video of a dragon flying over mountains" | Video |
| "What is quantum computing?" | Text |
### Key Features

- Dynamic prompt refinement using LLMs.
- Multi-modal routing logic with LangGraph.
- Real-time TTS, image, and video output integration.
- Robust error handling and fallback messaging.
### Future Improvements

- Add a retry mechanism for video polling.
- Allow user to choose between multiple image/video styles.
- Integrate user upload for image-to-video tasks.
- Add analytics/logging for usage tracking.