sidhyaashu/multimodal-chatbot

📄 Multimodal AI Assistant — Documentation

🚀 Overview

The Multimodal AI Assistant is a Streamlit app that produces text, image, audio, or video output from a single input message. A LangGraph router inspects each query and dispatches it to the matching generation model. It supports:

  • Natural language text generation (Gemini 1.5 Flash)
  • Image creation (Gemini 2.0 Flash Image Model)
  • Audio generation (ElevenLabs)
  • Video generation (ModelsLab)

🧠 Architecture & Components

⚙️ Technologies Used

| Component | Tech / API |
| --- | --- |
| Text Generation | Google Gemini 1.5 Flash |
| Image Generation | Google Gemini 2.0 Flash (Image Preview) |
| Audio Generation | ElevenLabs API |
| Video Generation | ModelsLab CogVideoX |
| State Management | LangGraph (StateGraph + Router Node) |
| Web UI | Streamlit |
| Prompt Refinement | LLM-based System Prompts |

🛠️ Environment Setup

Ensure you have the following environment variables defined in a .env file:

GOOGLE_API_KEY=your_gemini_api_key
ELEVEN_API_KEY=your_elevenlabs_api_key
STABLE_DIFFUSION_API_KEY=your_modelslab_api_key

🧩 Modules & Functionality

🔐 get_and_verify_keys()

  • Validates presence of all required API keys.
  • Stops execution if any are missing.
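A minimal sketch of this check, using only the standard library. The variable names come from the .env listing above; the helper `find_missing_keys` and the choice to raise an exception (rather than stop the Streamlit script) are assumptions made for illustration, and loading the .env file itself is left out.

```python
import os

# Variable names taken from the .env listing in Environment Setup.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "ELEVEN_API_KEY", "STABLE_DIFFUSION_API_KEY")


def find_missing_keys(env=None):
    """Return the names of required keys that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


def get_and_verify_keys(env=None):
    """Return the required keys as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = find_missing_keys(env)
    if missing:
        # Assumption: the real app reports this in the UI and halts; this sketch raises.
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.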

🧠 initialize_models()

  • Loads and caches all model clients:

    • Google Gemini LLM
    • Google Gemini image model
    • ElevenLabs client
  • Returns a centralized models dictionary.

🧬 GraphState

TypedDict structure defining state:

messages: List[BaseMessage]
generation_output: str
generation_type: str

🔁 Modular Generation Functions

1️⃣ generate_text_response(state)

  • Sends user messages to Gemini and returns AI response.

2️⃣ generate_audio(state)

  • Extracts "speakable" text using a system prompt.
  • Uses ElevenLabs for TTS (Text-to-Speech).
  • Saves output as generated_audio.mp3.
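The save step might look like the sketch below. It assumes the TTS client hands back either raw bytes or an iterable of byte chunks; the helper name `save_audio` is hypothetical and the ElevenLabs call itself is not shown.

```python
from pathlib import Path


def save_audio(audio, path="generated_audio.mp3"):
    """Write TTS output to `path`; accepts bytes or an iterable of byte chunks."""
    chunks = [audio] if isinstance(audio, (bytes, bytearray)) else audio
    out = Path(path)
    with out.open("wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    return out
```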

3️⃣ generate_image(state)

  • Uses the Gemini 2.0 preview model to generate a base64-encoded image.
  • Decodes the image and saves it as generated_image.png.
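The decode-and-save step can be sketched with the standard library alone. The helper name `save_base64_image` and the handling of a `data:` URL prefix are assumptions for illustration; how the base64 string is pulled out of the Gemini response is not shown.

```python
import base64
from pathlib import Path


def save_base64_image(image_b64, path="generated_image.png"):
    """Decode a base64 string and write the raw bytes to `path`."""
    # Assumption: the payload may arrive as a data URL ("data:image/png;base64,...").
    if image_b64.startswith("data:") and "," in image_b64:
        image_b64 = image_b64.split(",", 1)[1]
    data = base64.b64decode(image_b64)
    out = Path(path)
    out.write_bytes(data)
    return out
```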

4️⃣ generate_video(state)

  • Refines the input with a system prompt.
  • Calls ModelsLab's /text2video API.
  • Polls result and returns video URL.
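The polling step could be structured roughly as follows. This is a sketch under stated assumptions: `fetch` stands in for whatever HTTP call retrieves the job status, and the `"status"` and `"output"` keys are assumed response fields, not confirmed details of the ModelsLab API.

```python
import time


def poll_for_video_url(fetch, interval=5.0, max_attempts=24, sleep=time.sleep):
    """Call `fetch()` until it reports success, then return the first output URL.

    `fetch` is any zero-argument callable returning a dict. The "status" and
    "output" keys are assumptions about the response shape.
    """
    for attempt in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "success" and result.get("output"):
            return result["output"][0]
        if status in ("error", "failed"):
            raise RuntimeError(f"Video generation failed: {result}")
        if attempt < max_attempts - 1:
            sleep(interval)  # still processing; wait before asking again
    raise TimeoutError(f"No video after {max_attempts} attempts")
```

Capping the number of attempts keeps a stalled job from blocking the UI indefinitely.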

🔀 Routing Logic

🧭 router_node(state)

  • Uses basic keyword matching to infer generation intent:
if "say" or "speak" → route to audio  
if "draw" or "image" → route to image  
if "video" or "clip" → route to video  
else → default to text

This decision is stored in the state as generation_type.
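The keyword rules above can be sketched as a small pure function. Matching on whole words (so that "essay" does not trigger "say") and checking the rules in the listed order are assumptions of this sketch; the name `infer_generation_type` is hypothetical, and the real router_node may match differently.

```python
import re

# Keyword sets taken from the rules above; checked in this order.
ROUTES = (
    ("audio", ("say", "speak")),
    ("image", ("draw", "image")),
    ("video", ("video", "clip")),
)


def infer_generation_type(text):
    """Return "audio", "image", "video", or "text" for a user message."""
    # Assumption: match whole lowercase words rather than substrings.
    words = set(re.findall(r"[a-z]+", text.lower()))
    for generation_type, keywords in ROUTES:
        if words.intersection(keywords):
            return generation_type
    return "text"
```

A router node would then write the returned value into the state under generation_type.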


🧪 LangGraph Integration

  • The LangGraph StateGraph handles:

    • Message state updates
    • Routing to the appropriate generation function
  • Nodes:

    • "router" → router_node
    • "text", "audio", "image", "video" → respective generation functions

Example:

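from langgraph.graph import StateGraph, END

graph = StateGraph(GraphState)
# Register the router and one node per output type.
graph.add_node("router", router_node)
graph.add_node("text", generate_text_response)
graph.add_node("audio", generate_audio)
graph.add_node("image", generate_image)
graph.add_node("video", generate_video)
graph.set_entry_point("router")
# Branch on the generation_type that router_node stored in the state.
graph.add_conditional_edges(
    "router",
    lambda state: state["generation_type"],
    {"text": "text", "audio": "audio", "image": "image", "video": "video"},
)
# Every generation node ends the run.
for name in ("text", "audio", "image", "video"):
    graph.add_edge(name, END)
app = graph.compile()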

🌐 Streamlit UI

Features:

  • Automatically sets page title and icon.

  • Loads models only once via @st.cache_resource.

  • Renders output conditionally:

    • Text in st.markdown
    • Audio in st.audio
    • Image in st.image
    • Video in st.video
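The conditional rendering could be written as a small dispatch, sketched below. The function name `render_output` and the choice to pass the `streamlit` module in as a parameter are assumptions made so the logic can be exercised without a running app.

```python
def render_output(st, generation_type, output):
    """Display `output` with the Streamlit widget matching `generation_type`.

    `st` is the imported `streamlit` module (or any object exposing the same
    markdown/audio/image/video callables).
    """
    renderers = {
        "audio": st.audio,
        "image": st.image,
        "video": st.video,
    }
    # "text" and any unrecognized type fall back to markdown.
    renderers.get(generation_type, st.markdown)(output)
```

In the app this would be called as `render_output(st, state["generation_type"], state["generation_output"])` after `import streamlit as st`.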

🧪 Sample Prompt Use Cases

| Prompt | Output |
| --- | --- |
| "Say 'Hello, welcome to the AI world!'" | Audio |
| "Draw a futuristic city with flying cars" | Image |
| "Generate a cinematic video of a dragon flying over mountains" | Video |
| "What is quantum computing?" | Text |

✅ Achievements

  • Dynamic prompt refinement using LLMs.
  • Multi-modal routing logic with LangGraph.
  • Real-time TTS, image, and video output integration.
  • Robust error handling and fallback messaging.

📌 Future Enhancements

  • Add retry mechanism for video polling.
  • Allow user to choose between multiple image/video styles.
  • Integrate user upload for image-to-video tasks.
  • Add analytics/logging for usage tracking.
