Merged
Implement BoundaryDetector::group_by_titles that detects labeled and dotted numbering (e.g. "Module 2", "1.5.1") and falls back to batch grouping when the signal is weak. Update ingest use cases to use title-based grouping. Add a GitHub Actions step to free disk space on Linux runners. Tune Cargo profiles: reduce debug info for dev/test, use Thin LTO, increase codegen-units and enable stripping to reduce link-time resource usage.
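The title-based grouping described above could look roughly like the sketch below. It is an illustration, not the code from this PR: `module_key`, the free-function form of `group_by_titles`, the `batch_size` and `min_signal` parameters, and the rule that unnumbered titles join the current group are all assumptions.

```rust
/// Hypothetical helper: the top-level module number hinted at by a title.
/// "Module 2 Intro" -> Some(2), "1.5.1 Topic" -> Some(1), "Bonus" -> None.
fn module_key(title: &str) -> Option<u32> {
    let t = title.trim();

    // Labeled form: "Module <n>" (label matched case-insensitively).
    let lower = t.to_ascii_lowercase();
    if let Some(rest) = lower.strip_prefix("module") {
        let digits: String = rest
            .trim_start()
            .chars()
            .take_while(|c| c.is_ascii_digit())
            .collect();
        return digits.parse().ok();
    }

    // Dotted form: the first token must look like "<digits>.<digits>[...]".
    let head = t.split_whitespace().next()?;
    let mut parts = head.split('.');
    let first = parts.next()?;
    let second = parts.next()?;
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    if all_digits(first) && all_digits(second) {
        first.parse().ok()
    } else {
        None
    }
}

/// Groups video indices into modules. Falls back to fixed-size batches when
/// fewer than `min_signal` (a fraction in 0.0..=1.0) of the titles carry a key.
fn group_by_titles(titles: &[&str], batch_size: usize, min_signal: f64) -> Vec<Vec<usize>> {
    if titles.is_empty() {
        return Vec::new();
    }
    let keys: Vec<Option<u32>> = titles.iter().map(|t| module_key(t)).collect();
    let hits = keys.iter().filter(|k| k.is_some()).count();

    // Weak signal: ignore titles and chunk the videos in their original order.
    if (hits as f64) / (titles.len() as f64) < min_signal {
        let indices: Vec<usize> = (0..titles.len()).collect();
        return indices.chunks(batch_size.max(1)).map(|c| c.to_vec()).collect();
    }

    // Strong signal: start a new group whenever the module number changes.
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut current: Option<u32> = None;
    for (i, key) in keys.iter().enumerate() {
        let boundary =
            groups.is_empty() || (key.is_some() && current.is_some() && *key != current);
        if boundary {
            groups.push(Vec::new());
        }
        if key.is_some() {
            current = *key;
        }
        if let Some(group) = groups.last_mut() {
            group.push(i);
        }
    }
    groups
}
```

In this sketch an unnumbered title such as "Bonus" stays with the module that precedes it, which keeps stray videos from fragmenting the course into single-item modules.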
- SubtitleCleaner: detect more VTT headers (KIND, LANGUAGE, STYLE, NOTE), strip speaker
labels (">>", "[NAME]:", "NAME:"), normalize whitespace, remove duplicate consecutive lines,
simplify timestamp/timecode checks, strip inline tags, and expand unit tests.
- Persistence: reorganize imports, use model row_to_* mappers, add ON CONFLICT DO UPDATE for
courses and modules, and import VideoSource/YouTubeVideoId.
- UI hooks: add backend_key and use_keyed_effect to prevent redundant effects; replace several
use_effect usages with keyed effects keyed by backend pointer and resource ids.
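To make the SubtitleCleaner bullet above concrete, here is a minimal, standard-library-only sketch of that kind of cleaning pass. The function names, the 24-character cap on bare speaker labels, and the rule that bare labels must be all uppercase are assumptions made for illustration; the PR's actual SubtitleCleaner may differ.

```rust
/// True for WebVTT header or metadata lines (WEBVTT, KIND, LANGUAGE, STYLE, NOTE).
fn is_vtt_header(line: &str) -> bool {
    let t = line.trim();
    let upper = t.to_ascii_uppercase();
    // Block keywords are matched case-sensitively so prose like "Note that" survives.
    let block = |kw: &str| t == kw || t.starts_with(&format!("{} ", kw));
    block("WEBVTT")
        || block("STYLE")
        || block("NOTE")
        || upper.starts_with("KIND:")
        || upper.starts_with("LANGUAGE:")
}

/// Removes inline tags such as <c>, </c>, or <00:00:01.000>.
/// Limitation: an unmatched '<' swallows the rest of the line.
fn strip_inline_tags(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut in_tag = false;
    for c in line.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Strips leading speaker labels of the form ">>", "[NAME]:", or "NAME:".
fn strip_speaker_label(line: &str) -> &str {
    let t = line.trim_start();
    if let Some(rest) = t.strip_prefix(">>") {
        return rest.trim_start_matches('>').trim_start();
    }
    if let Some(colon) = t.find(':') {
        let (label, rest) = t.split_at(colon);
        let bracketed = label.len() > 2 && label.starts_with('[') && label.ends_with(']');
        // Assumption: bare labels are short and fully uppercase, e.g. "ALICE".
        let shouted = label.len() <= 24
            && label.chars().any(|c| c.is_ascii_uppercase())
            && label.chars().all(|c| c.is_ascii_uppercase() || c == ' ');
        if bracketed || shouted {
            return rest[1..].trim_start(); // skip the ':' itself
        }
    }
    t
}

/// Returns cleaned transcript lines with consecutive duplicates removed.
fn clean_subtitles(raw: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for line in raw.lines() {
        // Cue timing lines always contain the "-->" arrow.
        if is_vtt_header(line) || line.contains("-->") {
            continue;
        }
        let no_tags = strip_inline_tags(line);
        // Collapse runs of whitespace into single spaces.
        let text = strip_speaker_label(&no_tags)
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ");
        if text.is_empty() || out.last() == Some(&text) {
            continue;
        }
        out.push(text);
    }
    out
}
```

The sketch drops only the `NOTE` line itself; a real cleaner would also need to skip the remaining lines of a multi-line NOTE or STYLE block.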
Add SQL migrations to make videos.youtube_id nullable (up/down). Introduce LoadResult<T> for hooks and update UI to use .data/.state. Export title_number_sequence and sort local media by numeric sequences. Exclude FTS search_index tables from Diesel schema generation.
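Sorting local media by numeric sequences might work along these lines. The name `title_number_sequence` comes from this PR, but its signature and behavior here (collecting every run of ASCII digits as a `u64`) are assumed, and `sort_by_numeric_sequence` is a hypothetical helper.

```rust
/// Assumed behavior: collect every maximal run of ASCII digits in the title.
/// "1.5.1 Topic" -> [1, 5, 1]; "Module 2" -> [2]; "Intro" -> [].
/// Runs too long to fit in a u64 are skipped.
fn title_number_sequence(title: &str) -> Vec<u64> {
    title
        .split(|c: char| !c.is_ascii_digit())
        .filter(|run| !run.is_empty())
        .filter_map(|run| run.parse::<u64>().ok())
        .collect()
}

/// Hypothetical helper: orders titles by their numeric sequence, comparing
/// element by element, then falls back to plain string order on ties.
fn sort_by_numeric_sequence(titles: &mut [String]) {
    titles.sort_by(|a, b| {
        title_number_sequence(a)
            .cmp(&title_number_sequence(b))
            .then_with(|| a.cmp(b))
    });
}
```

Comparing the sequences as vectors avoids the usual lexicographic trap where "10 End" sorts before "2 Middle"; titles without any digits yield an empty sequence and therefore sort first.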
This pull request introduces significant improvements to video module grouping and subtitle cleaning, along with several optimizations for CI and build configuration. The main highlights are a new title-aware algorithm for grouping videos into modules, enhanced subtitle cleaning to handle more formats and speaker labels, and several changes to reduce disk usage and improve build reliability in CI environments.
Video Grouping Improvements:
- New BoundaryDetector::group_by_titles method that groups videos into modules based on hierarchical and labeled numbering patterns in titles (e.g., "Module 2", "1.5.1 Topic"), with a fallback to batch grouping if the signal is weak. This leads to more accurate and meaningful module splits for courses with structured video titles. [1] [2] [3] [4] [5]

Subtitle Cleaning Enhancements:
- SubtitleCleaner now strips out additional metadata (e.g., "KIND:", "LANGUAGE:"), removes speaker labels (e.g., "[John]:", ">>"), and handles more subtitle formats, resulting in cleaner and more readable transcripts. [1] [2] [3]

Build and CI Optimizations:
- Cargo.toml build profiles tuned to save disk space and prevent CI errors, including switching to Thin LTO, increasing codegen units, and stripping symbols.

Database and Schema:
- diesel.toml updated to exclude search index tables from schema generation, preventing unnecessary type derivations.
- Migrations make the youtube_id column in the videos table nullable, with proper up and down SQL scripts for SQLite. [1] [2]

Other:
- .env.example.