Conduit Podcast Transcripts

This is an archive of transcriptions generated with NVIDIA Parakeet, Audio Hijack, and related tools, intended for use as a source of data.

The text comes from the Conduit Podcast.

Getting Started

Installation

This project uses uv for fast, reliable Python package management. To set up:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and create virtual environment
uv sync

Environment Configuration

Load the environment variables:

# Using direnv (recommended)
direnv allow

# Or manually source the .envrc file
source .envrc

The .envrc file should contain connection strings for PostgreSQL and other configuration.
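As a sketch, the `.envrc` might look like the following; the variable names here are assumptions for illustration, so check them against what the code actually reads:

```shell
# Hypothetical .envrc -- variable names are illustrative assumptions
export DATABASE_URL="postgresql://user:password@host:5432/conduit"

# Optional: LLM model used for RAG (see CLI usage below)
export LLM_MODEL="llama3"
```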

Usage

The project provides multiple interfaces for accessing and managing transcripts:

  • CLI Tool - conduit command for local operations
  • REST API - FastAPI server for programmatic access
  • MCP Server - Model Context Protocol server for Claude integration

Docker Setup

Start all services with Docker Compose:

# Start PostgreSQL and API server
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

The API will be available at http://localhost:8000 with interactive docs at /docs.

CLI Usage (via Docker)

docker compose run --rm app python -m cli.main [command] [options]
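Since every CLI invocation repeats the same `docker compose` prefix, it can help to wrap it in a small shell function. The `conduit` function name below is just a suggestion, not something the project ships:

```shell
# Optional convenience wrapper so commands read as `conduit <command> [options]`
conduit() {
  docker compose run --rm app python -m cli.main "$@"
}

# Then, for example:
#   conduit search "search term"
#   conduit list
```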

Transcription

Transcribe episodes from the Conduit website using NVIDIA Parakeet:

# Transcribe a specific episode (transcribes and ingests by default)
docker compose run --rm app python -m cli.main transcribe <episode_number>

# Use a specific model size/name (default: nvidia/parakeet-rnnt-1.1b)
docker compose run --rm app python -m cli.main transcribe <episode_number> --model nvidia/parakeet-rnnt-1.1b

# Configure the LLM model for RAG (Retrieval Augmented Generation)
# Note: docker compose run options such as -e must come before the service name
docker compose run --rm -e LLM_MODEL=llama3 app python -m cli.main transcribe <episode_number>

Data Ingestion

Load transcripts into PostgreSQL:

# Ingest all files in transcripts directory
docker compose run --rm app python -m cli.main ingest

# Ingest specific file
docker compose run --rm app python -m cli.main ingest --file transcripts/episode1.md

# Recreate tables before ingestion
docker compose run --rm app python -m cli.main ingest --reindex

Search

Search through ingested transcripts:

# Text search (default)
docker compose run --rm app python -m cli.main search "search term"

# Vector semantic search
docker compose run --rm app python -m cli.main search "search phrase" --vector

MCP Server Usage

The project includes a Model Context Protocol (MCP) server that allows AI assistants (like Claude) to directly query the transcript database.

Configuration

Add the server to your MCP client configuration (e.g., claude_desktop_config.json):

{
  "mcpServers": {
    "conduit": {
      "command": "docker",
      "args": [
        "compose",
        "run",
        "--rm",
        "app",
        "python",
        "-m",
        "app.mcp.server"
      ]
    }
  }
}

Or if connecting to a running instance (e.g. via SSE):

{
  "mcpServers": {
    "conduit": {
        "url": "https://conduit.kjaymiller.dev/mcp/sse",
        "transport": "sse"
    }
  }
}

Available Tools

The MCP server provides the following tools:

  • search_transcripts: Search through transcripts using keyword or vector search.

    • query (string): The search text.
    • limit (int, optional): Max results (default 10).
    • use_vector (bool, optional): Use semantic vector search (default True).
    • episode_number (int, optional): Filter by episode number.
  • get_episode: Retrieve full content and metadata for an episode.

    • episode_number (int): The episode number to retrieve.
  • list_episodes: List available episodes with metadata.

    • limit (int, optional): Max results (default 20).
    • start_date (string, optional): Filter by start date (YYYY-MM-DD).
    • end_date (string, optional): Filter by end date (YYYY-MM-DD).
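For reference, a `tools/call` request for `search_transcripts` would carry arguments shaped like the following (values are illustrative; most MCP clients construct this envelope for you):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_transcripts",
    "arguments": {
      "query": "databases",
      "limit": 5,
      "use_vector": true
    }
  }
}
```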

Management

# Check episode status
docker compose run --rm app python -m cli.main status <episode_number>

# List recent episodes
docker compose run --rm app python -m cli.main list

Project Structure

  • app/ - FastAPI application
    • api/ - REST API endpoints (search, episodes, health)
    • mcp/ - MCP Server for Claude integration
    • main.py - Main FastAPI application
  • cli/ - Command-line interface
  • podcast_transcription/ - Shared library code
    • database/ - Database operations (PostgreSQL)
    • models/ - SQLAlchemy models
    • transcription/ - Transcription logic (NVIDIA Parakeet)
    • utils/ - Shared utilities
  • transcripts/ - Generated markdown files with metadata and transcriptions

Troubleshooting

Virtual environment issues: Run uv sync and ensure you're using uv run or have activated the venv

Missing environment variables: Load .envrc with direnv allow or source .envrc

Parakeet model: The first run downloads the default model (nvidia/parakeet-rnnt-1.1b), which requires network access

Database connection: Services are on Aiven; verify credentials and network access

Technology Stack

  • Python 3.13+
  • NVIDIA Parakeet (transcription)
  • LangChain (text processing)
  • PostgreSQL with pgvector extension
  • SQLAlchemy (ORM)
  • Click (CLI framework)
  • FastAPI (REST API)
  • MCP (Model Context Protocol)

Usage and License

Conduit Podcast Transcripts by Jay Miller and Kathy Campbell, with original downloads from Whisper transcription work by Pilix, is licensed under Attribution-NonCommercial-ShareAlike 4.0 International
