…iption quality and support of multiple podcasts
Added logic for parsing a new podcast (Vector Podcast, but it can be any podcast) from its RSS feed and transcribing the episodes with the Whisper model.
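For readers unfamiliar with podcast feeds, here is a minimal sketch of how episode metadata could be pulled from an RSS document using only the Python standard library. It is not the code in `src/rss_parser.py`; the function name `parse_episodes`, the returned keys, and the sample feed are all hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-episode feed; real feeds carry many more fields.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 2</title>
      <pubDate>Tue, 02 Jan 2024 10:00:00 GMT</pubDate>
      <enclosure url="https://example.com/audio/ep2.mp3" type="audio/mpeg" length="0"/>
    </item>
    <item>
      <title>Episode 1</title>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
      <enclosure url="https://example.com/audio/ep1.mp3" type="audio/mpeg" length="0"/>
    </item>
  </channel>
</rss>"""


def parse_episodes(feed_xml: str) -> list[dict]:
    """Return one dict per <item>, in feed order (illustrative helper)."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        # The audio file URL lives on the <enclosure> element, if present.
        enclosure = item.find("enclosure")
        episodes.append({
            "title": item.findtext("title", default="").strip(),
            "published": item.findtext("pubDate"),
            "audio_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes


if __name__ == "__main__":
    for episode in parse_episodes(SAMPLE_FEED):
        print(episode["title"], episode["audio_url"])
```

Many feeds list the newest episode first, but that ordering is a convention rather than a guarantee, so a "latest" lookup is safer if it sorts on the parsed publication date.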
PR Summary
This branch expands the ingestion/transcription pipeline to support RSS-based workflows, timestamp-aware transcripts, and OpenSearch indexing improvements, while adding a large batch of podcast transcript content.
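To make "timestamp-aware transcripts" concrete, the sketch below flattens Whisper-style segments into per-chunk records. It assumes the result shape of the open-source `openai-whisper` package, where `transcribe()` returns a dict whose `segments` entries carry `start`, `end`, and `text`. The helper name `segments_to_records`, the `episode_id` key, and the choice to map a segment's start time to `timestamp` and its position to `chunk_index` are assumptions, not the branch's actual schema.

```python
def segments_to_records(result: dict, episode_id: str) -> list[dict]:
    """Turn a Whisper-style result dict into one record per segment.

    Illustrative only: the record layout is an assumption.
    """
    records = []
    for index, segment in enumerate(result.get("segments", [])):
        records.append({
            "episode_id": episode_id,
            "chunk_index": index,                  # position within the episode
            "timestamp": float(segment["start"]),  # segment start, in seconds
            "end": float(segment["end"]),
            "text": segment["text"].strip(),
        })
    return records


if __name__ == "__main__":
    # Hand-written stand-in for a transcription result, so no model is needed.
    fake_result = {
        "text": " Hello and welcome. Today we talk about search.",
        "segments": [
            {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
            {"start": 2.5, "end": 6.0, "text": " Today we talk about search."},
        ],
    }
    for record in segments_to_records(fake_result, "example-episode"):
        print(record)
```

Keeping the start time on each record is what would let a front end link to a specific moment in an episode.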
Branch Diff vs main
Key Code Changes
- `src/transcribe.py`
  - … (`rss`) with:
    - `--latest`, `--all`, `--range`
    - `--show` (subdirectory output)
    - `--skip-if-exists`
    - `--with-timestamps` (stores Whisper segments and outputs to `transcripts_with_timestamps/`)
- `src/os_ingest.py`
  - `ingest` command
  - … (`--with-timestamps`) targeting separate index
- `src/os_index.py`
  - `title.keyword` subfield for aggregation
  - `episode_number` as `integer`
  - `image_url`, `youtube_url`, `youtube_video_id`, `timestamp`, `chunk_index`
- `src/quick_upload.py`
  - … (`process_frontmatter` missing colon)
New Scripts / Utilities
- `src/rss_parser.py`: RSS feed parsing and episode lookup helpers
- `src/apply_rewrite_rules.py`: batch transcript rewrite tool (supports in-place and timestamp dirs)
- `src/add_title_keyword_field.py`: mapping update helper for `title.keyword`
- `src/count_episodes.py`: unique episode counting helper in OpenSearch
- `test_rss_feasibility.py`: RSS + audio-download feasibility test
- `rewrite_rules.json`: transcription rewrite rule set
- `STREAMLIT_UPDATE_PROMPT.md`: notes for adding timestamp-aware URL behavior in Streamlit
Content / Data Changes
- `transcripts/conduit_podcast/` (rename-only, `R100`)
- `transcripts/vector-podcast/`
- `transcripts_with_timestamps/vector-podcast/`
Notes
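As an illustration of why a `title.keyword` subfield helps with aggregation and unique-episode counting, here is a hedged sketch of an OpenSearch mapping and a cardinality query expressed as plain Python dicts. Only the `title.keyword` subfield and `episode_number` as `integer` come from the change list; the types chosen for the remaining fields, and the names `INDEX_MAPPING` and `UNIQUE_EPISODES_QUERY`, are assumptions and may differ from `src/os_index.py`.

```python
import json

# Sketch of an index mapping. Types marked "assumed" are not stated in the PR.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            # Full-text search on title, plus an exact-match subfield
            # that aggregations can run against.
            "title": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "episode_number": {"type": "integer"},
            "image_url": {"type": "keyword"},         # assumed
            "youtube_url": {"type": "keyword"},       # assumed
            "youtube_video_id": {"type": "keyword"},  # assumed
            "timestamp": {"type": "float"},           # assumed: seconds
            "chunk_index": {"type": "integer"},       # assumed
        }
    }
}

# Approximate count of distinct episode titles. Analyzed `text` fields do
# not support aggregations by default, hence the `title.keyword` subfield.
UNIQUE_EPISODES_QUERY = {
    "size": 0,
    "aggs": {
        "unique_episodes": {"cardinality": {"field": "title.keyword"}}
    },
}

if __name__ == "__main__":
    print(json.dumps(INDEX_MAPPING, indent=2))
    print(json.dumps(UNIQUE_EPISODES_QUERY, indent=2))
```

Note that the `cardinality` aggregation returns an approximate count, which is usually adequate for a podcast-sized catalog.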