# Loom Transcript Scraper

A tool for automatically extracting transcripts from Loom videos using Selenium web automation.
## Overview

This project provides a Python-based solution for scraping transcripts from Loom video URLs. It uses Selenium WebDriver to automate browser interactions, navigate to Loom video pages, and extract transcript content.

## Features
- Automated browser navigation to Loom video pages
- Transcript extraction from Loom's internal data structure
- Robust error handling with detailed debugging
- Support for processing multiple videos in batch
- LLM transcript processing capabilities for AI model ingestion
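Loom's internal data structure is not documented and may change, so the extraction step has to be adapted to what the page actually serves. As a rough sketch, assuming a hypothetical JSON layout in which the page state holds a list of timed segments, pulling out the transcript text might look like:

```python
import json

def parse_transcript_segments(raw_json: str) -> str:
    """Join transcript segment texts from a page-state JSON blob.

    The {"segments": [{"text": ...}]} shape is a hypothetical example;
    inspect the actual page source to find the real layout.
    """
    data = json.loads(raw_json)
    segments = data.get("segments", [])
    # Keep only segments that carry text, trimming stray whitespace.
    return " ".join(seg["text"].strip() for seg in segments if seg.get("text"))

example = '{"segments": [{"text": "Hello "}, {"text": "world"}]}'
print(parse_transcript_segments(example))  # Hello world
```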
## Requirements

- Python 3.x
- Chrome or Firefox web browser
- Selenium WebDriver
- Required Python packages (see Installation)
## Installation

1. Clone this repository:

   ```shell
   git clone https://github.com/workingpleasewait/loom-transcript-scraper.git
   cd loom-transcript-scraper
   ```

2. Install the required packages:

   ```shell
   pip3 install selenium webdriver_manager
   ```

3. Ensure your browser driver is properly configured.
## Usage

1. Add Loom video URLs to the `loom-videos.txt` file, one URL per line.
2. Run the processor script:

   ```shell
   python3 process.py
   ```

3. Extracted transcripts will be saved in the `data/` directory.
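Reading the URL list amounts to taking one URL per non-empty line. A minimal sketch (skipping `#` comment lines is an assumption here, not a documented feature of `loom-videos.txt`):

```python
from pathlib import Path

def load_video_urls(path: str = "loom-videos.txt") -> list[str]:
    """Read Loom URLs, one per line, ignoring blanks and '#' comment lines."""
    lines = Path(path).read_text().splitlines()
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.strip().startswith("#")
    ]
```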
## Project Structure

- `process.py` - Main script for processing Loom videos and extracting transcripts
- `debug.py` - Helper script with debug functionality
- `loom-videos.txt` - Input file containing Loom video URLs to process
- `data/` - Directory where extracted transcripts are stored
- `debug_screenshots/` - Directory for browser screenshots (for debugging)
- `debug_output/` - Directory for HTML page sources and button information (for debugging)
- `integrated_solution.py` - Integration script for LLM processing
- `process_llm_integration.py` - Handles LLM transcript processing integration
- `process_transcripts_for_llm.py` - Processes transcripts for LLM ingestion
- `README_LLM_INTEGRATION.md` - Detailed guide for LLM integration
- `IMPLEMENTATION_GUIDE.md` - Implementation guide for LLM processing
## Debugging

This project includes two special directories for debugging purposes:

### `debug_screenshots/`

This directory stores browser screenshots taken at various stages of the scraping process. These screenshots are valuable for troubleshooting when the script encounters issues with page rendering, element visibility, or automation steps.
The screenshots are named according to the processing stage and timestamp, allowing developers to trace the execution path visually.
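A stage-plus-timestamp name can be produced with a small helper like the following (the exact format shown is illustrative, not necessarily the script's actual scheme):

```python
from datetime import datetime
from pathlib import Path

def debug_screenshot_path(stage: str, directory: str = "debug_screenshots") -> Path:
    """Build a screenshot path such as debug_screenshots/open_transcript_20240101-120000.png."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    Path(directory).mkdir(parents=True, exist_ok=True)
    return Path(directory) / f"{stage}_{stamp}.png"
```

The resulting path can be passed directly to Selenium's `driver.save_screenshot()`.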
### `debug_output/`

This directory contains:
- HTML page sources for problematic pages
- JSON files with detailed information about buttons and interactive elements
- Shadow DOM information and other page structural details
These files provide context and data needed to debug complex web scraping issues, particularly when the target elements are located within shadow DOM or are dynamically loaded.
## LLM Transcript Processing

This project includes functionality to process Loom video transcripts for ingestion into Large Language Models (LLMs). The processing:
- Formats transcripts in a standardized way for optimal LLM consumption
- Cleans and structures text for better AI processing
- Supports both integrated and standalone usage approaches
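As an illustration of the kind of cleaning involved (the actual steps live in `process_transcripts_for_llm.py`; the whitespace normalization and metadata header below are assumptions for the sketch):

```python
import re

def format_for_llm(title: str, video_id: str, transcript: str) -> str:
    """Collapse whitespace and prepend a simple metadata header."""
    body = re.sub(r"\s+", " ", transcript).strip()
    header = f"# {title} (Loom ID: {video_id})"
    return f"{header}\n\n{body}\n"
```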
For detailed information about using the LLM transcript processing features, refer to:
- `README_LLM_INTEGRATION.md` - Contains integration guidelines
- `IMPLEMENTATION_GUIDE.md` - Provides detailed implementation steps
Transcripts prepared for LLM ingestion are organized as follows:
- Raw transcripts are stored in the configured download directory
- LLM-ready transcripts are stored in the `llm_ready_transcripts/` directory
- Each LLM-ready file is named with the video title and ID, with an `_llm` suffix
- The directory structure preserves the relationship between original videos and their processed transcripts
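A sketch of the naming rule described above, assuming a simple slugified title (the slug rules and `.txt` extension are illustrative):

```python
import re

def llm_filename(title: str, video_id: str) -> str:
    """Build a name like 'my-video_abc123_llm.txt' from a title and video ID."""
    # Lowercase the title and replace runs of non-alphanumerics with hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_{video_id}_llm.txt"

print(llm_filename("My Video!", "abc123"))  # my-video_abc123_llm.txt
```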
## Troubleshooting

If the script fails to extract a transcript:

- Check the `debug_screenshots/` directory for visual references of the browser state
- Examine the `debug_output/` directory for page structure information
- Ensure the Loom video has a transcript (not all videos do)
- Verify your browser driver is up to date
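The debug artifacts are most useful when captured at the moment of failure. One possible pattern is to wrap the extraction call so a capture callback fires before the error propagates (the function names here are hypothetical, not the script's API):

```python
def run_with_debug_capture(extract, save_debug_state):
    """Call extract(); on failure, invoke save_debug_state(exc) and re-raise.

    save_debug_state would typically save a screenshot and the page source.
    """
    try:
        return extract()
    except Exception as exc:
        save_debug_state(exc)
        raise
```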
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is open source and available under the MIT License.