dimakan-dev/conduit-transcripts

Conduit Podcast Transcripts

This is an archive of transcriptions of the Conduit Podcast, generated with Whisper, Audio Hijack, and related tools. It is meant to be used as a source of data.

Getting Started

Installation

This project uses uv for fast, reliable Python package management. To set up:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and create virtual environment
uv sync

Environment Configuration

Load the environment variables:

# Using direnv (recommended)
direnv allow

# Or manually source the .envrc file
source .envrc

The .envrc file should contain connection strings for OpenSearch, PostgreSQL, and other configuration.
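
A quick pre-flight check can catch an unloaded .envrc before a long transcription or ingestion run. The sketch below is only an illustration: the variable names OPENSEARCH_URL and DATABASE_URL are hypothetical placeholders, not names confirmed by this repository, so substitute whatever your .envrc actually exports.

```python
import os
from collections.abc import Iterable, Mapping

# Hypothetical names for illustration; use the names your .envrc exports.
REQUIRED_VARS = ("OPENSEARCH_URL", "DATABASE_URL")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names that are unset or blank in the given environment."""
    return [name for name in names if not environ.get(name, "").strip()]
```

Calling missing_env_vars() with no arguments checks the current process environment; an empty list means every listed variable is set.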

Usage

All commands below use uv run for environment isolation. If you've activated the virtual environment, you can omit uv run.

Transcription

Transcribe episodes from the Conduit website using OpenAI Whisper:

# Transcribe the latest episode
uv run python src/transcribe.py ep

# Transcribe specific episodes
uv run python src/transcribe.py ep 100 101 102

# Transcribe a range of episodes
uv run python src/transcribe.py ep --range 100-105

# Transcribe all episodes (with confirmation)
uv run python src/transcribe.py ep --all

# Transcribe a local audio file
uv run python src/transcribe.py file path/to/audio.mp3 --output path/to/output.txt
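
The --range 100-105 form above takes a start and an end episode number. How transcribe.py parses it is not shown here; as a rough sketch, assuming the range is inclusive on both ends, a hypothetical helper named parse_episode_range could expand it like this:

```python
def parse_episode_range(spec: str) -> list[int]:
    """Expand a 'START-END' string into a list of episode numbers.

    Assumption: both ends are inclusive, so '100-105' yields six episodes.
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {spec!r}")
    # int() raises ValueError on empty or non-numeric parts such as '100-'.
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid episode range: {spec!r}")
    return list(range(start, end + 1))
```

Rejecting reversed ranges up front avoids silently transcribing nothing when the two numbers are swapped.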

Data Ingestion

Load transcripts into PostgreSQL and/or OpenSearch:

# Load all transcripts into both databases
uv run python src/quick_upload.py files

# Load specific files
uv run python src/quick_upload.py files --file transcripts/episode1.md --file transcripts/episode2.md

# Load with index recreation (destroys existing OpenSearch index)
uv run python src/quick_upload.py files --reindex

# Load to PostgreSQL only
uv run python src/quick_upload.py files --pg-only

# Load to OpenSearch only
uv run python src/quick_upload.py files --os-only
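
The --pg-only and --os-only flags narrow a load to a single backend. The sketch below shows one way such flags could be resolved into a set of targets. It is not taken from quick_upload.py: the helper name select_targets, the target labels, and the choice to reject both flags together are all assumptions.

```python
def select_targets(pg_only: bool = False, os_only: bool = False) -> set[str]:
    """Map the two 'only' flags onto the backends that should be loaded."""
    if pg_only and os_only:
        # Assumption: asking for "only" both backends is treated as an error.
        raise ValueError("--pg-only and --os-only cannot be combined")
    if pg_only:
        return {"postgres"}
    if os_only:
        return {"opensearch"}
    # No flag given: load both backends, matching the default described above.
    return {"postgres", "opensearch"}
```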

Index Management

# Create or recreate OpenSearch index
uv run python src/os_index.py

Project Structure

  • src/ - Application code
    • transcribe.py - Whisper transcription and episode processing
    • url_finder.py - Web scraping for episode metadata and audio URLs
    • os_ingest.py - OpenSearch data ingestion
    • os_index.py - OpenSearch index creation
    • pg_ingest.py - PostgreSQL data processing with embeddings
    • quick_upload.py - Unified data loader for both databases
    • download_audio_file.py - Audio file download utility
  • transcripts/ - Generated markdown files with metadata and transcriptions

Troubleshooting

Virtual environment issues: Run uv sync, then either prefix commands with uv run or activate the virtual environment.

Missing environment variables: Load .envrc with direnv allow or source .envrc.

Whisper model: The first run downloads the "base" model (~140 MB), which requires network access.

Database connection: Services are hosted on Aiven; verify your credentials and network access.

Table recreation: pg_ingest.py and quick_upload.py can drop tables, so use --reindex carefully.

Technology Stack

  • Python 3.12.5
  • OpenAI Whisper (transcription)
  • LangChain (text processing)
  • PostgreSQL with pgvector extension
  • OpenSearch (vector search)
  • SQLAlchemy (ORM)
  • Typer (CLI framework)

Usage and License

Conduit Podcast Transcripts by Jay Miller and Kathy Campbell, with original downloads from Whisper work done by Pilix, is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
