Dima slush demo #1

Open
dimakan-dev wants to merge 2 commits into main from dima-slush-demo

dimakan-dev (Owner) commented Mar 7, 2026

Added logic for parsing a new podcast (Vector Podcast, though any podcast works) from its RSS feed and transcribing the episodes with the Whisper model.

PR Summary

This branch expands the ingestion/transcription pipeline to support RSS-based workflows, timestamp-aware transcripts, and OpenSearch indexing improvements, while adding a large batch of podcast transcript content.

Branch Diff vs main

  • Commits ahead: 2
  • Files changed: 181
  • Insertions/Deletions: +123,848 / -25
  • Largest change category: transcript content additions and reorganizations

Key Code Changes

src/transcribe.py

  • Added lazy Whisper loading (model/import only when needed)
  • Added RSS transcription command (rss) with:
    • --latest, --all, --range
    • --show (subdirectory output)
    • --skip-if-exists
    • --with-timestamps (stores Whisper segments and writes output to transcripts_with_timestamps/)
  • Improved audio download behavior with browser-like headers and redirect visibility
  • Added output path helper logic and temp file cleanup
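
For reviewers, the lazy-loading item above can be pictured roughly as follows. This is a minimal sketch rather than the code in this branch: `get_model` and its `module_name` parameter are hypothetical names, and the only Whisper call assumed is `whisper.load_model`.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model(name: str = "base", module_name: str = "whisper"):
    """Import the Whisper package and load a model on first use only.

    `module_name` is a hypothetical hook so the loader can be exercised
    without Whisper installed; the branch may simply import whisper inside
    the function body instead.
    """
    whisper = importlib.import_module(module_name)  # deferred import
    return whisper.load_model(name)  # lru_cache reuses the result on later calls
```

With this shape, commands that never transcribe (for example, listing RSS episodes) do not pay the import or model-load cost.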

src/os_ingest.py

  • Added Typer CLI entrypoint with ingest command
  • Added ingestion modes for:
    • single episode
    • all episodes
    • show-scoped ingest
  • Added timestamp-enabled ingest (--with-timestamps) targeting a separate index
  • Added YouTube URL/video ID extraction from description
  • Added chunk-to-Whisper-segment timestamp mapping
  • Improved publication date parsing fallback logic
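
The YouTube ID extraction and chunk-to-segment mapping listed above might look something like the sketch below. All function names are hypothetical, the regex covers only a few common URL shapes, and the mapping assumes the transcript text is the plain concatenation of the Whisper segment texts, which may not match how this branch chunks transcripts.

```python
import re
from typing import Optional

# Common YouTube URL shapes; video IDs are 11 URL-safe characters.
_YT_RE = re.compile(
    r"(?:youtube\.com/watch\?(?:[^\s#]*&)?v=|youtu\.be/|youtube\.com/embed/)"
    r"([A-Za-z0-9_-]{11})"
)


def extract_video_id(description: str) -> Optional[str]:
    """Return the first YouTube video ID found in an episode description."""
    match = _YT_RE.search(description)
    return match.group(1) if match else None


def chunk_start_time(chunk_start_char: int, segments: list) -> Optional[float]:
    """Return the start time of the segment containing a character offset.

    Assumes `segments` are dicts with "start" and "text" keys (as in Whisper's
    transcribe output) and that the transcript is their texts joined in order.
    """
    offset = 0
    for seg in segments:
        end = offset + len(seg["text"])
        if offset <= chunk_start_char < end:
            return seg["start"]
        offset = end
    return None


def timestamped_url(video_id: str, seconds: float) -> str:
    """Build a YouTube link that starts playback at the given second."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"
```

A chunk that starts at character offset 6 of a transcript whose first segment is six characters long would map to the start time of the second segment.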

src/os_index.py

  • Improved mappings:
    • title.keyword subfield for aggregation
    • episode_number as integer
    • added image_url
  • Added timestamp index creation flow with:
    • youtube_url
    • youtube_video_id
    • timestamp
    • chunk_index
  • Added mapping verification output and CLI behavior
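
To make the mapping changes above concrete, here is a sketch of an index body containing the listed fields. `build_mappings` is a hypothetical helper, and the field types for `image_url`, `youtube_url`, `youtube_video_id`, `timestamp`, and `chunk_index` are assumptions; only `title.keyword` and the integer `episode_number` are stated in this PR.

```python
def build_mappings(with_timestamps: bool = False) -> dict:
    """Return an OpenSearch index body with the fields named in this PR."""
    properties = {
        # text for search, plus a keyword subfield for aggregations
        "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "episode_number": {"type": "integer"},
        "image_url": {"type": "keyword"},  # assumed type
    }
    if with_timestamps:
        # Extra fields for the separate timestamp index (types assumed)
        properties.update({
            "youtube_url": {"type": "keyword"},
            "youtube_video_id": {"type": "keyword"},
            "timestamp": {"type": "float"},
            "chunk_index": {"type": "integer"},
        })
    return {"mappings": {"properties": properties}}
```

A body like this could be passed to an index-creation call, and keeping both variants in one helper reflects the note that the timestamp index coexists with the standard one.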

src/quick_upload.py

  • Fixed a syntax error in the process_frontmatter definition (missing colon)

New Scripts / Utilities

  • src/rss_parser.py — RSS feed parsing and episode lookup helpers
  • src/apply_rewrite_rules.py — batch transcript rewrite tool (supports in-place and timestamp dirs)
  • src/add_title_keyword_field.py — mapping update helper for title.keyword
  • src/count_episodes.py — unique episode counting helper in OpenSearch
  • test_rss_feasibility.py — RSS + audio-download feasibility test
  • rewrite_rules.json — transcription rewrite rule set
  • STREAMLIT_UPDATE_PROMPT.md — notes for adding timestamp-aware URL behavior in Streamlit
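
As an illustration of the RSS parsing and episode lookup helpers listed above, the sketch below reads standard RSS 2.0 `<item>` elements with the standard library. `parse_episodes` and `select_episodes` are hypothetical names, not the functions in src/rss_parser.py, and the selection logic assumes the feed lists the newest episode first and that a range is a pair of inclusive zero-based positions.

```python
import xml.etree.ElementTree as ET


def parse_episodes(rss_xml: str) -> list:
    """Extract title, publication date, and audio URL from each RSS item."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")  # RSS 2.0 carries audio here
        episodes.append({
            "title": (item.findtext("title") or "").strip(),
            "pub_date": item.findtext("pubDate"),
            "audio_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes


def select_episodes(episodes: list, latest: bool = False, index_range=None) -> list:
    """Pick episodes in the spirit of --latest, --range, and --all.

    Assumes newest-first feed order; `index_range` is an inclusive
    (start, end) pair of zero-based positions, which is an assumption.
    """
    if latest:
        return episodes[:1]
    if index_range is not None:
        start, end = index_range
        return episodes[start:end + 1]
    return list(episodes)
```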

Content / Data Changes

  • Moved existing Conduit transcripts into transcripts/conduit_podcast/ (rename-only, R100)
  • Added new Vector Podcast transcripts under:
    • transcripts/vector-podcast/
    • transcripts_with_timestamps/vector-podcast/

Notes

  • Timestamp-enabled index coexists with the standard index
  • Existing non-timestamp workflows remain supported
