🔗 Live Demo: genai-project-6b5bre75bfmupqpa8npwz5.streamlit.app
This project converts images into engaging audio stories using image captioning, text generation, and browser-based speech synthesis.
The app performs the following steps:
- 🖼️ Image-to-Text: Captions images using Salesforce/blip-image-captioning-base.
- 📜 Text-to-Story: Expands captions into stories using GPT-2.
- 🔊 Text-to-Speech: Converts stories into audio using the browser's built-in SpeechSynthesis API.
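The first two steps above can be sketched as calls to the hosted Hugging Face Inference API. This is a minimal sketch, not the app's actual code: the endpoint URLs follow the public `api-inference.huggingface.co` convention, and the prompt template in `build_story_prompt` is an assumption. Step 3 (speech) runs in the browser, so it is not shown here.

```python
import json
import urllib.request

API_BASE = "https://api-inference.huggingface.co/models"  # hosted Inference API (needs an HF token)

def _post(url: str, token: str, data: bytes, content_type: str) -> list:
    """POST a payload to an Inference API endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

def caption_image(image_bytes: bytes, token: str) -> str:
    """Step 1: image-to-text with BLIP."""
    out = _post(f"{API_BASE}/Salesforce/blip-image-captioning-base",
                token, image_bytes, "application/octet-stream")
    return out[0]["generated_text"]

def build_story_prompt(caption: str) -> str:
    """Turn the caption into a seed prompt for GPT-2 (template is hypothetical)."""
    return f"Write a short story about the following scene: {caption}\n\nOnce upon a time,"

def generate_story(caption: str, token: str) -> str:
    """Step 2: text-to-story with GPT-2 text generation."""
    payload = json.dumps({"inputs": build_story_prompt(caption),
                          "parameters": {"max_new_tokens": 120}}).encode()
    out = _post(f"{API_BASE}/gpt2", token, payload, "application/json")
    return out[0]["generated_text"]
```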
- Python: Core backend processing
- Streamlit: Frontend UI framework
- Hugging Face Transformers: Access to GPT-2 and image models
- Hugging Face Inference API: For accessing models like BLIP and GPT-2
- BLIP (Salesforce): Image captioning model
- Browser SpeechSynthesis: In-browser TTS using JavaScript (no external TTS API needed)
```bash
git clone https://github.com/fahad10inb/GenAI-Project.git
cd GenAI-Project
pip install -r requirements.txt
streamlit run app.py
```

- Upload an image.
- View the AI-generated caption.
- Generate a story from the caption.
- Listen to the story using browser-based audio playback.
No external TTS models or installations are needed — audio is generated with the browser's built-in SpeechSynthesis API.
💡 Works out of the box on Chrome, Edge, and Firefox with natural voices.
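The in-browser playback can be sketched as a small HTML/JS snippet assembled in Python. This is a minimal sketch under assumptions (the helper name `tts_html` is hypothetical); in a Streamlit app the returned string would be rendered with `streamlit.components.v1.html(...)` so the JavaScript executes in the visitor's browser:

```python
import json

def tts_html(story: str, rate: float = 1.0) -> str:
    """Build an HTML snippet whose JavaScript reads `story` aloud via the
    browser's built-in SpeechSynthesis API (no server-side TTS)."""
    # json.dumps safely embeds the story as a JS string literal
    text_js = json.dumps(story)
    return f"""
    <button onclick="speak()">🔊 Play story</button>
    <script>
      function speak() {{
        const u = new SpeechSynthesisUtterance({text_js});
        u.rate = {rate};
        window.speechSynthesis.speak(u);
      }}
    </script>
    """
```

Because `SpeechSynthesisUtterance` is a standard Web Speech API interface, this works in Chrome, Edge, and Firefox without any extra dependency.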
- Improve caption-to-story creativity with fine-tuned LLMs.
- Add multilingual support for narration.
- Allow custom voice selection and speech rate control.
- Optional export of audio to a downloadable `.wav` file using ESPnet locally.
- GitHub: Fahad10inb
- Email: fahadrahiman10@gmail.com