Automatically converts horizontal videos into vertical format for TikTok, Instagram Reels, and YouTube Shorts.
Instead of a static center crop, the script analyzes each scene using AI (YOLOv8), detects people, and decides whether to crop tightly on the subjects or letterbox to preserve the full shot.
```bash
git clone https://github.com/kamilstanuch/AutoCrop-Vertical.git
cd AutoCrop-Vertical
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 main.py -i video.mp4 -o vertical.mp4
```

The `yolov8n.pt` model weights are downloaded automatically on first run.
Prerequisites: Python 3.8+ and FFmpeg (ffmpeg + ffprobe) in your PATH.
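Since the script shells out to FFmpeg, it helps to verify the prerequisites programmatically. A minimal sketch of such a check (the `missing_tools` helper is illustrative, not part of the repo):

```python
import shutil

def missing_tools(names):
    """Return the subset of command names not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# Check for the FFmpeg binaries the script depends on.
absent = missing_tools(["ffmpeg", "ffprobe"])
if absent:
    print(f"Missing required tools: {', '.join(absent)}")
```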
```bash
# Basic: 9:16 vertical, balanced quality
python3 main.py -i video.mp4 -o vertical.mp4

# Instagram feed (4:5) with high quality
python3 main.py -i video.mp4 -o vertical.mp4 --ratio 4:5 --quality high

# Fast encode, square format
python3 main.py -i video.mp4 -o vertical.mp4 --ratio 1:1 --quality fast

# Preview the processing plan without encoding
python3 main.py -i video.mp4 -o vertical.mp4 --plan-only

# Full control over encoding parameters
python3 main.py -i video.mp4 -o vertical.mp4 --crf 20 --preset medium

# Use hardware encoder (macOS VideoToolbox / NVIDIA NVENC)
python3 main.py -i video.mp4 -o vertical.mp4 --encoder hw

# Maximum accuracy scene detection (slower)
python3 main.py -i video.mp4 -o vertical.mp4 --frame-skip 0
```

Output:
| Flag | Default | Description |
|---|---|---|
| `-i, --input` | (required) | Path to input video |
| `-o, --output` | (required) | Path to output video (`.mp4` appended if no extension) |
| `--ratio` | `9:16` | Output aspect ratio. Examples: `9:16`, `4:5`, `1:1` |
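The `--ratio` value is a plain `W:H` string. A sketch of how such a flag can be parsed and validated (the function name is illustrative, not the script's actual code):

```python
def parse_ratio(value):
    """Parse a 'W:H' aspect-ratio string into a (width, height) tuple of ints."""
    try:
        w, h = value.split(":")
        w, h = int(w), int(h)
    except ValueError:
        raise ValueError(f"Invalid ratio {value!r}; expected W:H, e.g. 9:16")
    if w <= 0 or h <= 0:
        raise ValueError("Ratio components must be positive")
    return w, h

# 9:16 means the output frame is 9/16 as wide as it is tall,
# so for a 1080-pixel-tall source the crop is int(1080 * 9 / 16) pixels wide.
w, h = parse_ratio("9:16")
```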
Encoding quality:

| Flag | Default | Description |
|---|---|---|
| `--quality` | `balanced` | Preset: `fast`, `balanced`, or `high` (see table below) |
| `--encoder` | `auto` | `auto` = libx264 (software), `hw` = hardware if available, or explicit name like `h264_videotoolbox` |
| `--crf` | (from quality) | Override CRF directly, 0-51, lower = better (libx264 only) |
| `--preset` | (from quality) | Override x264 preset directly: `ultrafast`..`veryslow` (libx264 only) |
Quality presets (libx264):

| `--quality` | CRF | Preset | Typical use |
|---|---|---|---|
| `fast` | 28 | veryfast | Quick previews, drafts |
| `balanced` | 23 | fast | Good quality, reasonable speed |
| `high` | 18 | slow | Best quality, largest file, slowest |
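The preset table maps directly to libx264 settings, with `--crf` and `--preset` acting as overrides. A sketch of that resolution logic (names are illustrative, not the script's internals):

```python
# Quality-name -> libx264 settings, as documented in the table above.
QUALITY_PRESETS = {
    "fast":     {"crf": 28, "preset": "veryfast"},
    "balanced": {"crf": 23, "preset": "fast"},
    "high":     {"crf": 18, "preset": "slow"},
}

def encode_settings(quality="balanced", crf=None, preset=None):
    """Resolve CRF/preset from a quality name, honoring explicit overrides."""
    settings = dict(QUALITY_PRESETS[quality])
    if crf is not None:
        settings["crf"] = crf        # --crf beats the quality preset
    if preset is not None:
        settings["preset"] = preset  # --preset beats the quality preset
    return settings
```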
Scene detection tuning:

| Flag | Default | Description |
|---|---|---|
| `--frame-skip` | `0` | Frames to skip during scene detection. 0 = every frame (most accurate), 1 = every other frame (~2x faster). Higher = faster but may miss cuts |
| `--downscale` | `0` (auto) | Downscale factor for scene detection. 0 = auto; 2-4 = faster but may miss subtle cuts |
Other:

| Flag | Default | Description |
|---|---|---|
| `--plan-only` | off | Run scene detection + analysis only, print the plan, exit without encoding |
- Content-Aware Cropping: YOLOv8 detects people and centers the vertical frame on them.
- Automatic Letterboxing: When people are too spread out for a vertical crop, black bars are added to preserve the full shot.
- Scene-by-Scene Processing: Decisions are made per scene for consistent, logical edits.
- Native Resolution: Output height matches the source to prevent quality loss from upscaling.
- Frame-Accurate Processing: Every frame is processed individually with the correct per-scene strategy — no timestamp rounding or scene boundary drift.
- Hardware Encoder Support: Optional `--encoder hw` auto-detects VideoToolbox (macOS) or NVENC (NVIDIA), with automatic fallback to libx264.
- VFR Handling: Variable frame rate sources are automatically normalized before processing.
- Audio Sync: Non-zero stream start times are detected and compensated to keep audio/video aligned.
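The crop-vs-letterbox choice comes down to whether the detected people fit inside the vertical crop window. A simplified sketch of that rule (my own illustration; the repo's actual heuristics may also weigh face positions and person counts):

```python
def choose_strategy(person_boxes, frame_h, ratio=(9, 16)):
    """Return 'TRACK' if all people fit in the vertical crop window, else 'LETTERBOX'.

    person_boxes: list of (x1, y1, x2, y2) detections in pixels.
    """
    # The crop keeps the source height, so its width follows from the ratio.
    crop_w = int(frame_h * ratio[0] / ratio[1])
    if not person_boxes:
        return "TRACK"  # nothing to protect: a centered crop is fine
    left = min(b[0] for b in person_boxes)
    right = max(b[2] for b in person_boxes)
    # If the span of all people exceeds the crop width, letterbox instead.
    return "TRACK" if (right - left) <= crop_w else "LETTERBOX"
```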
```
Input Video
        |
        v
+-------------------------------+
| 1. Scene Detection            |  PySceneDetect splits the video into scenes
|    (--frame-skip, --downscale)|
+---------------+---------------+
                |
                v
+-------------------------------+
| 2. Content Analysis           |  YOLOv8 detects people in each scene's
|    (middle frame per scene)   |  middle frame; Haar cascade finds faces
+---------------+---------------+
                |
                v
+-------------------------------+
| 3. Strategy Decision          |  Per scene: TRACK (crop on subject)
|                               |  or LETTERBOX (scale + black bars)
+---------------+---------------+
                |
                v
+-------------------------------+
| 4. Frame Processing           |  Per-frame crop/scale/pad via OpenCV
|    (--quality, --encoder)     |  piped to FFmpeg for encoding
+---------------+---------------+
                |
                v
+-------------------------------+
| 5-6. Audio extract + merge    |  Audio synced with start-time offset
+---------------+---------------+
                |
                v
          Output Video
```
Steps 1-3 are the "planning" phase (Python + AI). Step 4 applies the plan frame-by-frame and encodes via FFmpeg.
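For a TRACK scene, the crop window is centered on the subjects and clamped so it never leaves the frame. A sketch of that arithmetic (illustrative, not the repo's exact code):

```python
def crop_window(subject_cx, frame_w, frame_h, ratio=(9, 16)):
    """Return (x, width) of a vertical crop centered on subject_cx, clamped to the frame."""
    # Crop keeps the source height; width follows from the target ratio.
    crop_w = min(frame_w, int(frame_h * ratio[0] / ratio[1]))
    x = int(subject_cx - crop_w / 2)
    x = max(0, min(x, frame_w - crop_w))  # clamp to the frame edges
    return x, crop_w
```

For a 1920x1080 source at 9:16 the crop is 607 pixels wide; a subject at the far edge simply pins the window against that edge instead of sliding out of frame.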
Benchmarks on Apple M1 MacBook Pro (AC power):
| Resolution | Duration | Total time | Speed |
|---|---|---|---|
| 1280x720 | 49s | ~6s | 8.3x real-time |
| 1920x1080 | 12 min | ~51s | 13.7x real-time |
Scene detection is the dominant bottleneck (~50% of total time).
This script is built on a pipeline that uses specialized libraries for each step:
- Core Libraries:
  - `PySceneDetect`: accurate, content-aware scene cut detection.
  - `Ultralytics` (YOLOv8): fast and reliable person detection.
  - `OpenCV`: frame manipulation, face detection (as a fallback), and reading video properties.
  - `FFmpeg`/`ffprobe`: the backbone of video encoding, audio extraction, and media stream analysis.
  - `tqdm`: clean and informative progress bars in the console.
- Processing Pipeline:
  - (Pre-processing) If the source is VFR, it is normalized to a constant frame rate.
  - `PySceneDetect` scans the video and returns a list of scene timestamps.
  - For each scene, `OpenCV` extracts a sample frame and `YOLOv8` detects people in it.
  - A set of rules determines the strategy (`TRACK` or `LETTERBOX`) for each scene based on the number and position of detected people.
  - `OpenCV` reads every frame sequentially. Each frame is cropped/resized (TRACK) or scaled/padded (LETTERBOX) according to its scene's strategy, then piped as raw pixels to FFmpeg for encoding. This frame-by-frame approach guarantees frame-accurate scene boundaries with no timestamp rounding errors.
  - Audio is extracted separately (with start-time offset correction), then merged with the processed video.
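"Piped as raw pixels" means FFmpeg reads headerless `rawvideo` frames on stdin. A minimal sketch of how such a command can be assembled (these are standard FFmpeg flags, but the exact command the script builds may differ):

```python
def ffmpeg_pipe_cmd(width, height, fps, out_path, crf=23, preset="fast"):
    """Build an FFmpeg command that encodes raw BGR frames arriving on stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",        # input is a headerless pixel stream
        "-pix_fmt", "bgr24",     # OpenCV frames are BGR, 8 bits per channel
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",               # read the stream from stdin
        "-c:v", "libx264",
        "-crf", str(crf),
        "-preset", preset,
        "-pix_fmt", "yuv420p",   # output pixel format playable everywhere
        out_path,
    ]

# Usage sketch: proc = subprocess.Popen(cmd, stdin=subprocess.PIPE), then
# proc.stdin.write(frame.tobytes()) for each processed OpenCV frame.
```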
- Fixed incorrect crop/letterbox decisions near scene boundaries. The v1.3 `filter_complex` pipeline used seconds-based `trim` filters, which caused floating-point misalignment with the frame-based scene boundaries from PySceneDetect. This led to frames at scene transitions receiving the wrong strategy (e.g., a properly tracked person switching to letterbox mid-scene, or a group shot being cropped instead of letterboxed). Restored the original frame-by-frame processing pipeline, which uses exact frame numbers for scene boundary matching, guaranteeing frame-accurate results.
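The fix hinges on comparing exact integer frame numbers instead of float timestamps. A sketch of such a lookup (illustrative, not the repo's code):

```python
import bisect

def strategy_for_frame(frame_idx, scene_starts, strategies):
    """Map a frame index to its scene's strategy using exact frame numbers.

    scene_starts: sorted list of each scene's first-frame index.
    strategies:   parallel list of 'TRACK' / 'LETTERBOX' decisions.
    """
    # bisect_right - 1 finds the last scene whose start is <= frame_idx;
    # pure integer comparison means no floating-point boundary drift.
    scene = bisect.bisect_right(scene_starts, frame_idx) - 1
    return strategies[scene]
```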
New Features:
- Hardware encoder support (`--encoder`). New flag with three modes: `auto` (libx264, default for best quality/compatibility), `hw` (auto-detect VideoToolbox on macOS or NVENC on NVIDIA), or an explicit encoder name. Quality presets (`--quality`) map automatically per encoder type.
- Configurable scene detection (`--frame-skip`, `--downscale`). Power users can tune the speed/accuracy trade-off. The default `--frame-skip 0` processes every frame for maximum accuracy; increase it for faster detection on longer videos.
New Features:
- Configurable aspect ratio (`--ratio`). Output is no longer locked to 9:16. Use `--ratio 4:5` for Instagram feed, `--ratio 1:1` for square, or any custom W:H ratio.
- Quality presets (`--quality`). Choose between `fast` (CRF 28, veryfast), `balanced` (CRF 23, fast; the default), or `high` (CRF 18, slow). Power users can override directly with `--crf` and `--preset`.
- Dry-run mode (`--plan-only`). Runs scene detection and analysis only, prints the processing plan, and exits without encoding. Useful for previewing decisions before committing to a long encode.
- Fixed output pixel format. The encoder now outputs `yuv420p` instead of `yuv444p`, which is compatible with all players and platforms and produces smaller files.
- Improved logging and progress reporting. Input file summary upfront (resolution, duration, fps, codec, file size, frame count), progress bars on all slow operations, and a final summary with output size, compression ratio, and processing speed.
Bug Fixes:
- Fixed audio/video desynchronization. This was caused by two separate issues:
  - The frame rate was being read from PySceneDetect while frames were read by OpenCV. A mismatch between the two (e.g. 29.97 vs 30.0) caused the encoded video duration to drift from the audio. FPS is now read from OpenCV (the same backend that reads the frames), with explicit `-vsync cfr` enforcement.
  - Many source files (especially YouTube downloads) have a non-zero `start_time` on the video stream (e.g. audio at 0.0s, video at 1.8s). The script now detects this offset via `ffprobe` and trims the extracted audio to match, so the two streams stay aligned.
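The start-time compensation can be derived from `ffprobe -show_streams` output, which reports a `start_time` per stream. A sketch of the offset calculation on already-parsed stream data (the helper is mine; the field names match ffprobe's JSON output):

```python
def audio_trim_offset(streams):
    """Seconds to trim from the extracted audio so it aligns with the video.

    streams: list of dicts as parsed from
    `ffprobe -print_format json -show_streams`,
    each with 'codec_type' and 'start_time' keys.
    """
    def start(kind):
        for s in streams:
            if s.get("codec_type") == kind:
                return float(s.get("start_time", 0.0))
        return 0.0
    # Video starting later than audio means the audio head must be trimmed.
    return max(0.0, start("video") - start("audio"))
```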
- Fixed crash on videos without an audio stream. The script now detects whether audio exists using `ffprobe` and skips the audio extraction/merge steps gracefully.
- Fixed hardcoded `.aac` temp audio file. The temp audio container is now `.mkv`, which accepts any audio codec. Previously, source files with non-AAC audio (MP3, Opus, AC3, etc.) could fail or produce corrupt output.
- Fixed crash when the output path has no file extension. The script now auto-appends `.mp4` if no extension is provided.
- Fixed orphaned temp files on failure. Temporary files are now cleaned up on all exit paths, not just on success.
Improvements:
- Variable frame rate (VFR) handling. Phone-recorded videos often use VFR, which caused frame timing drift. The script now detects VFR sources via `ffprobe` and normalizes them to a constant frame rate before processing.
- Corrupt frame resilience. If a frame fails to process (bad crop, corrupt data), it is duplicated from the previous good frame instead of being dropped. This preserves the total frame count and prevents audio drift.
- Lazy model loading. YOLO and Haar cascade models are now loaded on first use instead of at import time. Heavy library imports (`torch`, `ultralytics`, `cv2`, etc.) are deferred until after argument parsing, so `--help` is instant.
- Pinned dependency versions. `requirements.txt` now specifies compatible version ranges to prevent breakage from upstream changes.
- Replaced `exit()` with `sys.exit(1)`. Ensures proper exit codes and reliable behavior in all environments.
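The corrupt-frame resilience described above amounts to "carry the last good frame forward". A sketch of that loop (illustrative; `process` stands in for the per-frame crop/pad step):

```python
def process_stream(frames, process):
    """Apply `process` to each frame; on failure, repeat the last good result.

    Duplicating instead of dropping keeps the frame count stable,
    which is what keeps the audio in sync.
    """
    out, last_good = [], None
    for frame in frames:
        try:
            last_good = process(frame)
        except Exception:
            if last_good is None:
                continue  # no good frame yet; nothing to duplicate
        out.append(last_good)
    return out
```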
