Transcripts for SPS By the Numbers

This repository provides a pipeline for diarizing and transcribing YouTube videos and then publishing the results on the web in a way that enables stable-URL citation.

Having such a repository allows searching the full historical archive using existing search engines or newer tools like ChatGPT, which can be instructed to ingest the transcripts and perform the kind of textual analysis that LLMs are particularly good at.

Furthermore, a repository of transcripts enables on-demand translation. Linking the translation and original transcript to the video matters more than raw text alone because it lets a reader check context such as tone of voice or facial expressions for sections of particular interest.

Web serving stack

The site is built with Next.js and hosted on Google Firebase. Resource consumption beyond storage is very low, as most of the content is static. Firebase was chosen for its free-tier features; because the site is basic and nearly static, the code could be adapted to almost any tech stack.

Transcription pipeline

Transcription is done with the WhisperX project, which combines OpenAI's Whisper large-v2 model with Hugging Face's pyannote speaker-diarization-3.0 model.

Contrary to initial expectations, this does NOT require a powerful GPU. What it does require is plenty of GPU memory (> 10 GB) for Whisper's large-v2 model, plus many CPU cores (16+ recommended) and plenty of RAM (32 GB or more) to parallelize the diarization clustering without swapping.

If you skip the GPU, have too few CPU cores, or start swapping, transcription and diarization runtimes will go from a few minutes per meeting to many hours.

The pipeline was run on machines rented from vast.ai and costs very little per meeting. Processing the entire archive of Seattle School Board and Seattle City Council recordings (around 2000 videos, most a couple of hours long) took about $70, which included many failed runs while the scripts were being fixed. The machines used often rented for about $0.20/hr.
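For orientation, below is a rough sketch of driving WhisperX for a single recording, based on WhisperX's public Python API. It is not the repository's actual pipeline script; the file path and hf_token value are placeholders, and exact module paths may differ between WhisperX releases.

import whisperx

device = "cuda"             # CPU-only runs work but are dramatically slower
audio_file = "meeting.mp4"  # placeholder path to a downloaded recording
hf_token = "HF_TOKEN"       # placeholder Hugging Face token for the pyannote models

# 1. Transcribe with Whisper large-v2 (needs > 10 GB of GPU memory in float16).
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align words to timestamps for the detected language.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize with pyannote and attach speaker labels to each segment.
#    (Newer WhisperX releases may expose this as whisperx.diarize.DiarizationPipeline.)
diarize_model = whisperx.DiarizationPipeline(
    model_name="pyannote/speaker-diarization-3.0", use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"])  # each segment now carries text, timestamps, and a speaker label

The speaker-labeled segments are what get published to the site, so each quote can be cited by a stable URL and traced back to the original video.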

Getting started for development

TODO: Document firebase setup

npm install
npm run dev &
npx tsc --watch

Production site is https://transcripts.sps-by-the-numbers.com/

Deployment

GitHub should deploy automatically via the workflow, but you can also run it manually:

npx firebase deploy
