Some cleanup by k5602 · Pull Request #21 · k5602/course_pilot

k5602 · 2026-01-26T20:39:39Z

This pull request introduces significant improvements to video module grouping and subtitle cleaning, along with several optimizations for CI and build configuration. The main highlights are a new title-aware algorithm for grouping videos into modules, enhanced subtitle cleaning to handle more formats and speaker labels, and several changes to reduce disk usage and improve build reliability in CI environments.

Video Grouping Improvements:

Implemented a new BoundaryDetector::group_by_titles method that groups videos into modules based on hierarchical and labeled numbering patterns in titles (e.g., "Module 2", "1.5.1 Topic"), with a fallback to batch grouping if the signal is weak. This leads to more accurate and meaningful module splits for courses with structured video titles. [1] [2] [3] [4] [5]
Updated local ingestion and playlist ingestion logic to use this new title-aware grouping, and improved sorting within groups using extracted number sequences from titles for more logical video ordering. [1] [2] [3]
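The grouping and ordering described above can be illustrated with a minimal sketch. This is not the PR's actual implementation: the free functions below, the "fewer than half the titles are numbered" weak-signal threshold, and the `batch_size` parameter are assumptions made for illustration. Only the names `group_by_titles` and `title_number_sequence` come from the PR text.

```rust
use std::collections::BTreeMap;

/// Extract every run of ASCII digits from a title, in order of appearance.
/// "1.5.1 Topic" yields [1, 5, 1]; "Module 2" yields [2].
/// Digit runs too long to fit in a u32 are dropped in this sketch.
fn title_number_sequence(title: &str) -> Vec<u32> {
    title
        .split(|c: char| !c.is_ascii_digit())
        .filter(|run| !run.is_empty())
        .filter_map(|run| run.parse().ok())
        .collect()
}

/// Group titles by their first number and sort each group by the full
/// number sequence. Falls back to fixed-size batches when the signal is
/// weak (assumed here: fewer than half of the titles contain a number).
fn group_by_titles(titles: &[&str], batch_size: usize) -> Vec<Vec<String>> {
    let numbered = titles
        .iter()
        .filter(|t| !title_number_sequence(t).is_empty())
        .count();

    if numbered * 2 < titles.len() {
        // Weak signal: plain batch grouping in the original order.
        return titles
            .chunks(batch_size.max(1))
            .map(|chunk| chunk.iter().map(|t| t.to_string()).collect())
            .collect();
    }

    // BTreeMap keeps the modules ordered by their leading number.
    let mut groups: BTreeMap<u32, Vec<(Vec<u32>, String)>> = BTreeMap::new();
    for title in titles {
        let seq = title_number_sequence(title);
        // Hypothetical choice: unnumbered titles land in group 0.
        let key = seq.first().copied().unwrap_or(0);
        groups.entry(key).or_default().push((seq, title.to_string()));
    }

    groups
        .into_values()
        .map(|mut group| {
            // Lexicographic comparison of Vec<u32> puts 1.2 before 1.10.
            group.sort_by(|a, b| a.0.cmp(&b.0));
            group.into_iter().map(|(_, title)| title).collect()
        })
        .collect()
}
```

Comparing number sequences rather than raw strings is what keeps "1.10" after "1.2"; a plain string sort would order them the other way round.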

Subtitle Cleaning Enhancements:

Extended the SubtitleCleaner to strip out additional metadata (e.g., "KIND:", "LANGUAGE:"), remove speaker labels (e.g., "[John]:", ">>"), and handle more subtitle formats, resulting in cleaner and more readable transcripts. [1] [2] [3]
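As a rough sketch of the cleaning steps listed above (not the actual `SubtitleCleaner` code), the functions below drop WebVTT header lines and timecodes, strip the three speaker-label shapes, normalize whitespace, and skip consecutive duplicates. The function names and the heuristic for "NAME:" labels (a short all-uppercase prefix before a colon) are assumptions. Multi-line NOTE and STYLE blocks are not handled here; only their first line is dropped.

```rust
/// True for WebVTT header or metadata lines such as "WEBVTT",
/// "Kind: captions", "Language: en", "STYLE" or "NOTE ...".
fn is_metadata(line: &str) -> bool {
    let line = line.trim();
    let upper = line.to_ascii_uppercase();
    upper.starts_with("KIND:")
        || upper.starts_with("LANGUAGE:")
        // Block keywords are matched case-sensitively so that ordinary
        // speech such as "Note that ..." is kept.
        || ["WEBVTT", "STYLE", "NOTE"]
            .iter()
            .any(|kw| line == *kw || line.starts_with(&format!("{kw} ")))
}

/// Remove a leading speaker label: ">>", "[NAME]:" or "NAME:".
fn strip_speaker_label(line: &str) -> &str {
    let line = line.trim();
    if let Some(rest) = line.strip_prefix(">>") {
        return rest.trim_start();
    }
    if line.starts_with('[') {
        if let Some(end) = line.find("]:") {
            return line[end + 2..].trim_start();
        }
    }
    if let Some((name, rest)) = line.split_once(':') {
        // Assumed heuristic: a bare label is short and all uppercase.
        let looks_like_name = !name.is_empty()
            && name.len() <= 20
            && name.chars().all(|c| c.is_ascii_uppercase() || c == ' ');
        if looks_like_name {
            return rest.trim_start();
        }
    }
    line
}

/// Clean a subtitle file into a list of transcript lines.
fn clean(input: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for raw in input.lines() {
        let line = raw.trim();
        // Skip blanks, metadata, and timecode lines ("a --> b").
        if line.is_empty() || is_metadata(line) || line.contains("-->") {
            continue;
        }
        // Collapse runs of whitespace into single spaces.
        let text = strip_speaker_label(line)
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ");
        // Drop empty results and consecutive duplicates.
        if text.is_empty() || out.last() == Some(&text) {
            continue;
        }
        out.push(text);
    }
    out
}
```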

Build and CI Optimizations:

Reduced debug info and optimized release profile in Cargo.toml to save disk space and prevent CI errors, including switching to Thin LTO, increasing codegen units, and stripping symbols.
Added a workflow step to free disk space on Linux CI runners by removing large unused directories.

Database and Schema:

Added a filter to diesel.toml to exclude search index tables from schema generation, preventing unnecessary type derivations.
Added a new migration to make the youtube_id column in the videos table nullable, with proper up and down SQL scripts for SQLite. [1] [2]

Other:

Minor: Removed trailing blank line from .env.example.

Implement BoundaryDetector::group_by_titles that detects labeled and dotted numbering (e.g. "Module 2", "1.5.1") and falls back to batch grouping when the signal is weak. Update ingest use cases to use title-based grouping. Add a GitHub Actions step to free disk space on Linux runners. Tune Cargo profiles: reduce debug info for dev/test, use Thin LTO, increase codegen-units and enable stripping to reduce link-time resource usage.

- SubtitleCleaner: detect more VTT headers (KIND, LANGUAGE, STYLE, NOTE), strip speaker labels (">>", "[NAME]:", "NAME:"), normalize whitespace, remove duplicate consecutive lines, simplify timestamp/timecode checks, strip inline tags, and expand unit tests.
- Persistence: reorganize imports, use model row_to_* mappers, add ON CONFLICT DO UPDATE for courses and modules, and import VideoSource/YouTubeVideoId.
- UI hooks: add backend_key and use_keyed_effect to prevent redundant effects; replace several use_effect usages with keyed effects keyed by backend pointer and resource ids.

Add SQL migrations to make videos.youtube_id nullable (up/down). Introduce LoadResult<T> for hooks and update UI to use .data/.state. Export title_number_sequence and sort local media by numeric sequences. Exclude FTS search_index tables from Diesel schema generation.
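The PR text only names `LoadResult<T>` and its `.data` and `.state` fields, so the shape below is a guess at what such a type might look like rather than the actual definition. The `LoadState` enum, its variants, and the constructor names are all hypothetical.

```rust
/// Hypothetical loading states for a hook; the real variants may differ.
#[derive(Debug, Clone, PartialEq)]
enum LoadState {
    Loading,
    Ready,
    Error(String),
}

/// Pairs the loaded value with its loading state so the UI can read
/// `.data` and `.state` separately.
#[derive(Debug, Clone, PartialEq)]
struct LoadResult<T> {
    data: T,
    state: LoadState,
}

impl<T: Default> LoadResult<T> {
    /// Placeholder value shown while a request is still in flight.
    fn loading() -> Self {
        Self { data: T::default(), state: LoadState::Loading }
    }

    /// Convert a finished request into a LoadResult, keeping a default
    /// value for `data` when the request failed.
    fn from_result(result: Result<T, String>) -> Self {
        match result {
            Ok(data) => Self { data, state: LoadState::Ready },
            Err(message) => Self {
                data: T::default(),
                state: LoadState::Error(message),
            },
        }
    }
}
```

One reason to keep `data` always present, instead of wrapping it in an `Option`, is that UI code can render a list unconditionally and only branch on `state` for spinners or error banners. Whether the PR made that choice is not stated.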

k5602 added 3 commits January 26, 2026 16:24

k5602 merged commit e2c7c96 into master Jan 26, 2026
1 check passed

k5602 deleted the some_cleanup branch January 26, 2026 20:39

k5602 merged 3 commits into master from some_cleanup
