This is the code for a Streamlit-based application that implements the evaluation pipeline for the Ideas Recall Telegram bot. The pipeline is described in full in this article: https://vvk93.substack.com/p/building-an-ai-powered-telegram-bot. It was first used to test prompts and AI capabilities, and was later reused for the real Telegram bot with minor changes.
This project implements an automated evaluation pipeline that:
- Downloads transcripts from YouTube videos
- Generates summaries and flashcards using AI
- Evaluates the quality of generated content through multiple stages
- Provides detailed metrics and human feedback capabilities
- Automatic transcript download from YouTube videos
- Support for various video formats and languages
- Generates concise summaries from video transcripts
- Creates educational flashcards for key concepts
- Uses configurable AI models for generation
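The transcript download step has to start from a video ID, since that is what transcript lookups are keyed on. The sketch below shows one way to pull the ID out of common YouTube URL shapes using only the standard library. The function name `extract_video_id` and the set of supported URL forms are assumptions for illustration, not taken from the project's code.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        # Embed and Shorts links carry the ID as the second path segment.
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1] or None

    return None
```

Returning `None` instead of raising lets the UI show a friendly message when the pasted URL is not a YouTube link.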
- Stage 1: Automated Checks
  - JSON format validation
  - Length verification
  - BERTScore semantic similarity check
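The Stage 1 checks can be sketched as a single function that reports each result separately. This is a minimal illustration under stated assumptions: the `summary` field name, the `max_chars` and `min_similarity` defaults, and the `similarity_fn` hook are all hypothetical. In a real run, `similarity_fn` could wrap a BERTScore F1 computation; here it is left pluggable so the sketch has no heavy dependencies.

```python
import json
from typing import Callable, Optional


def run_automated_checks(
    raw_output: str,
    transcript: str,
    max_chars: int = 1500,          # assumed length limit
    min_similarity: float = 0.8,    # assumed similarity threshold
    similarity_fn: Optional[Callable[[str, str], float]] = None,
) -> dict:
    """Run the three Stage 1 checks and report each result separately."""
    results = {"valid_json": False, "length_ok": False, "similarity_ok": None}

    # Check 1: the model output must parse as a JSON object.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results
    if not isinstance(data, dict):
        return results
    results["valid_json"] = True

    # Check 2: the summary must be non-empty and within the length limit.
    summary = str(data.get("summary", ""))
    results["length_ok"] = 0 < len(summary) <= max_chars

    # Check 3: semantic similarity, evaluated only when a scorer is supplied.
    if similarity_fn is not None:
        score = similarity_fn(summary, transcript)
        results["similarity_ok"] = score >= min_similarity

    return results
```

Keeping the cheap checks first means an output that is not valid JSON never reaches the more expensive similarity scoring.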
- Stage 2: AI Judge Assessment
  - Accuracy evaluation
  - Completeness scoring
  - Relevance assessment
  - Clarity measurement
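One way to turn the four judge criteria into a pass/fail decision is to average them and compare the mean with a target score. The sketch below assumes the judge model replies with a JSON object keyed by criterion name and that scores lie on a 1 to 5 scale. Both the reply shape and the scale are assumptions, as are the names `aggregate_judge_scores` and `target_score`.

```python
import json

# The four criteria assessed by the AI judge.
CRITERIA = ("accuracy", "completeness", "relevance", "clarity")


def aggregate_judge_scores(judge_reply: str, target_score: float = 4.0) -> dict:
    """Parse a judge reply and compare the mean score with a target."""
    scores = json.loads(judge_reply)

    # Fail loudly if the judge skipped a criterion.
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")

    values = {name: float(scores[name]) for name in CRITERIA}
    mean = sum(values.values()) / len(CRITERIA)
    return {"scores": values, "mean": mean, "passed": mean >= target_score}
```

An unweighted mean is the simplest choice; weighting accuracy more heavily would be a reasonable variation if factual errors matter most.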
- Stage 3: Human Feedback
  - User utility rating system
  - Interactive feedback collection
- Real-time pipeline execution status
- Detailed evaluation metrics
- Historical run logs
- Configuration management
- Raw output inspection
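Historical run logs and human feedback both need somewhere to live between sessions. A minimal sketch, assuming a JSON Lines file with one record per run, is shown below. The file format, the function names `append_run` and `load_runs`, and the record fields in the usage example are hypothetical and not taken from the project.

```python
import json
import time
from pathlib import Path


def append_run(log_path: Path, record: dict) -> None:
    """Append one pipeline run to a JSON Lines log file."""
    entry = {"timestamp": time.time(), **record}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def load_runs(log_path: Path) -> list:
    """Read all logged runs, oldest first; return [] if no log exists yet."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `append_run(Path("runs.jsonl"), {"video_id": "abc123", "user_rating": 4})` would store a run together with its human utility rating. Appending line by line keeps earlier records intact even if a later write fails.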
- Frontend: Streamlit
- AI Models: OpenAI API
- Video Processing: YouTube API
- Evaluation Metrics: BERTScore, Custom AI Judge
- Clone the repository
- Install dependencies:
  pip install -r requirements.txt
- Set up environment variables:
  - OPENAI_API_KEY: Your OpenAI API key
  - GOOGLE_API_KEY: Your Google API key
- Run the application:
  streamlit run app.py
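Because the application depends on two API keys, it can help to check for them at startup rather than fail midway through a run. The sketch below is one way to do that. The helper name `missing_api_keys` is hypothetical; only the two variable names come from the setup steps above.

```python
import os

# Environment variables named in the setup steps.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None) -> list:
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Accepting an optional `env` mapping keeps the check easy to test without touching the real environment.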
The application allows customization of:
- AI model selection
- Evaluation thresholds
- Token limits
- System prompts
- Target scores
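The customizable settings listed above could be grouped into a single configuration object, which makes them easy to validate and to log alongside each run. This is a sketch under assumptions: the class name `PipelineConfig`, every field name, and every default value (including the model name and the 0 to 1 and 1 to 5 scales) are placeholders, not the application's actual settings.

```python
from dataclasses import asdict, dataclass


@dataclass
class PipelineConfig:
    model: str = "gpt-4o-mini"           # placeholder model name
    max_tokens: int = 1024               # token limit per generation call
    similarity_threshold: float = 0.8    # Stage 1 pass mark (assumed 0-1 scale)
    target_score: float = 4.0            # Stage 2 pass mark (assumed 1-5 scale)
    system_prompt: str = "You summarize video transcripts."

    def validate(self) -> None:
        """Raise ValueError if any setting is outside its assumed range."""
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0 and 1")
        if not 1.0 <= self.target_score <= 5.0:
            raise ValueError("target_score must be between 1 and 5")
```

Calling `asdict(PipelineConfig())` yields a plain dictionary that can be stored with each run, so results stay traceable to the settings that produced them.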
- Enter a YouTube URL in the sidebar
- Click "Download & Run Pipeline"
- View results across multiple tabs:
  - Configuration & Prompts
  - Generated Output
  - Evaluation Results
  - Run Log & History