Some cleanup#21

Merged
k5602 merged 3 commits into master from some_cleanup on Jan 26, 2026

k5602 commented Jan 26, 2026

This pull request introduces significant improvements to video module grouping and subtitle cleaning, along with several optimizations for CI and build configuration. The main highlights are a new title-aware algorithm for grouping videos into modules, enhanced subtitle cleaning to handle more formats and speaker labels, and several changes to reduce disk usage and improve build reliability in CI environments.

Video Grouping Improvements:

  • Implemented a new BoundaryDetector::group_by_titles method that groups videos into modules based on hierarchical and labeled numbering patterns in titles (e.g., "Module 2", "1.5.1 Topic"), with a fallback to batch grouping if the signal is weak. This leads to more accurate and meaningful module splits for courses with structured video titles. [1] [2] [3] [4] [5]
  • Updated local ingestion and playlist ingestion logic to use this new title-aware grouping, and improved sorting within groups using extracted number sequences from titles for more logical video ordering. [1] [2] [3]
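A minimal sketch of the idea behind title-aware grouping, not the actual `BoundaryDetector::group_by_titles` implementation. The names `number_sequence`, `group_titles`, `batch_size`, and `min_signal` are hypothetical, and the rule "group by the first number in the title" is an assumption made for illustration. It extracts the number sequence from each title, falls back to fixed-size batches when too few titles carry numbers, and sorts each group by its extracted sequence.

```rust
use std::collections::BTreeMap;

/// Extracts every run of ASCII digits from a title, in order of appearance.
/// For example, "1.5.1 Topic" yields [1, 5, 1] and "Module 2" yields [2].
fn number_sequence(title: &str) -> Vec<u32> {
    title
        .split(|c: char| !c.is_ascii_digit())
        .filter(|run| !run.is_empty())
        // Runs too large for u32 are dropped rather than treated as errors.
        .filter_map(|run| run.parse::<u32>().ok())
        .collect()
}

/// Groups titles into modules keyed by the first number in each title.
/// Assumption: if the share of numbered titles is below `min_signal`,
/// the signal is considered weak and fixed-size batches are used instead.
fn group_titles(titles: &[&str], batch_size: usize, min_signal: f32) -> Vec<Vec<String>> {
    if titles.is_empty() {
        return Vec::new();
    }

    let numbered = titles
        .iter()
        .filter(|t| !number_sequence(t).is_empty())
        .count();

    // Weak signal: fall back to batch grouping in the original order.
    if (numbered as f32) / (titles.len() as f32) < min_signal {
        return titles
            .chunks(batch_size.max(1))
            .map(|chunk| chunk.iter().map(|t| t.to_string()).collect())
            .collect();
    }

    // Strong signal: bucket by leading number; unnumbered titles go last.
    let mut modules: BTreeMap<u32, Vec<&str>> = BTreeMap::new();
    for &title in titles {
        let key = number_sequence(title).first().copied().unwrap_or(u32::MAX);
        modules.entry(key).or_default().push(title);
    }

    modules
        .into_values()
        .map(|mut group| {
            // Order videos inside a module by their full number sequence.
            group.sort_by_key(|t| number_sequence(t));
            group.into_iter().map(String::from).collect()
        })
        .collect()
}
```

With these assumptions, `group_titles(&["1.2 Loops", "2.1 Traits", "1.1 Variables"], 10, 0.5)` yields two modules, and the first is ordered as "1.1 Variables" then "1.2 Loops". A list of titles without any digits is split into batches of `batch_size` instead.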

Subtitle Cleaning Enhancements:

  • Extended the SubtitleCleaner to strip out additional metadata (e.g., "KIND:", "LANGUAGE:"), remove speaker labels (e.g., "[John]:", ">>"), and handle more subtitle formats, resulting in cleaner and more readable transcripts. [1] [2] [3]
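The cleaning steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `SubtitleCleaner` code from this PR: the function names are hypothetical, header matching is case-sensitive and line-based (multi-line STYLE and NOTE blocks are not handled), and the "NAME:" rule assumes speaker names are written in uppercase.

```rust
/// Returns true for WebVTT header/metadata lines and cue timing lines.
/// Assumption: a simple prefix check is enough for this sketch.
fn is_metadata(line: &str) -> bool {
    const HEADERS: [&str; 5] = ["WEBVTT", "KIND:", "LANGUAGE:", "STYLE", "NOTE"];
    HEADERS.iter().any(|h| line.starts_with(h)) || line.contains("-->")
}

/// Drops inline tags such as `<c>` or `<00:00:01.000>`.
/// A '>' outside of a tag (as in ">>") is kept for the speaker step.
fn strip_tags(line: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in line.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Removes a leading speaker label: ">>", "[NAME]:" or "NAME:".
fn strip_speaker(line: &str) -> &str {
    let line = line.trim_start_matches('>').trim_start();
    if let Some((label, rest)) = line.split_once(':') {
        let bracketed = label.starts_with('[') && label.ends_with(']');
        // Assumption: an unbracketed label counts only if it is all uppercase.
        let upper_name = label.chars().any(|c| c.is_alphabetic())
            && label.chars().all(|c| c.is_uppercase() || c == ' ');
        if bracketed || upper_name {
            return rest.trim_start();
        }
    }
    line
}

/// Cleans raw subtitle text into transcript lines: skips metadata,
/// strips tags and speaker labels, normalizes whitespace, and removes
/// duplicate consecutive lines.
fn clean(raw: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for line in raw.lines() {
        let line = line.trim();
        if line.is_empty() || is_metadata(line) {
            continue;
        }
        let no_tags = strip_tags(line);
        let text = strip_speaker(&no_tags)
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ");
        if text.is_empty() || out.last() == Some(&text) {
            continue;
        }
        out.push(text);
    }
    out
}
```

Tags are stripped before speaker labels so that a `>` closing a tag is never mistaken for a `>>` marker. Prefix matching is deliberately crude: dialogue that happens to start with "NOTE" would also be dropped, which a real cleaner would need to guard against.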

Build and CI Optimizations:

  • Reduced debug info for the dev and test profiles and tuned the release profile in Cargo.toml (Thin LTO, more codegen units, stripped symbols) to save disk space and lower link-time resource usage in CI.
  • Added a workflow step to free disk space on Linux CI runners by removing large unused directories.

Database and Schema:

  • Added a filter to diesel.toml to exclude search index tables from schema generation, preventing unnecessary type derivations.
  • Added a new migration to make the youtube_id column in the videos table nullable, with proper up and down SQL scripts for SQLite. [1] [2]

Other:

  • Minor: Removed trailing blank line from .env.example.

k5602 added 3 commits January 26, 2026 16:24
Implement BoundaryDetector::group_by_titles that detects labeled and dotted
numbering (e.g. "Module 2", "1.5.1") and falls back to batch grouping when the
signal is weak. Update ingest use cases to use title-based grouping.

Add a GitHub Actions step to free disk space on Linux runners.

Tune Cargo profiles: reduce debug info for dev/test, use Thin LTO, increase
codegen-units and enable stripping to reduce link-time resource usage.
- SubtitleCleaner: detect more VTT headers (KIND, LANGUAGE, STYLE, NOTE), strip speaker
  labels (">>", "[NAME]:", "NAME:"), normalize whitespace, remove duplicate consecutive lines,
  simplify timestamp/timecode checks, strip inline tags, and expand unit tests.
- Persistence: reorganize imports, use model row_to_* mappers, add ON CONFLICT DO UPDATE for
  courses and modules, and import VideoSource/YouTubeVideoId.
- UI hooks: add backend_key and use_keyed_effect to prevent redundant effects; replace several
  use_effect usages with keyed effects keyed by backend pointer and resource ids.
Add SQL migrations to make videos.youtube_id nullable (up/down).
Introduce LoadResult<T> for hooks and update UI to use .data/.state.
Export title_number_sequence and sort local media by numeric sequences.
Exclude FTS search_index tables from Diesel schema generation.
k5602 merged commit e2c7c96 into master Jan 26, 2026
1 check passed
k5602 deleted the some_cleanup branch January 26, 2026 20:39