Speaksee

AI-Powered Speaker-Following Subtitles

An AI system that detects the active speaker and renders dynamic, speech-bubble-style subtitles that follow each person on screen.

Speaksee: AI-Powered Speaker-Following Subtitles
Speaksee is an AI system that automatically detects who is speaking in a video and displays dynamic, speech-bubble-style subtitles that follow each speaker on screen. In movies, TV dramas, recorded podcasts, and interviews, it links every spoken line to the correct person, so viewers no longer have to guess who said what.

Using multi-modal audio-visual processing, Speaksee identifies active speakers, tracks their positions frame by frame, and renders subtitles either in real time or during post-production. This makes content easier to follow, especially in scenes with several speakers.
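As a rough illustration of how a transcribed line could be linked to the right person, the sketch below assigns each speech segment to the speaker track whose speaking intervals overlap it the most. The data layout (`speaking`, `segments`) and the function names are assumptions made for this example, not Speaksee's actual internals.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_speaker(segment: Interval,
                   speaking_intervals: Dict[str, List[Interval]]) -> Optional[str]:
    """Return the track id that speaks longest during `segment`, or None."""
    best_id, best_overlap = None, 0.0
    for track_id, intervals in speaking_intervals.items():
        total = sum(overlap(segment, iv) for iv in intervals)
        if total > best_overlap:
            best_id, best_overlap = track_id, total
    return best_id


# Hypothetical detector output: when each tracked face was judged to be speaking.
speaking = {
    "person_0": [(0.0, 2.5), (6.0, 8.0)],
    "person_1": [(2.4, 5.9)],
}
# Hypothetical STT output: (start, end, text) for each transcribed segment.
segments = [(0.2, 2.3, "Hi there."), (2.6, 5.5, "Hello!"), (9.0, 9.5, "...")]

for start, end, text in segments:
    print(assign_speaker((start, end), speaking), text)
```

A segment that overlaps no speaking interval is left unassigned (`None`), so the caller can decide whether to fall back to a conventional caption.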

🎯 Features

1️⃣ Core Functionality

Dynamic Subtitle Placement — Speech-bubble-style subtitles that follow each speaker on screen
Multi-modal Active Speaker Detection — Accurately identifies speakers using both video and audio
Automatic STT & Segment Processing — Extracts text from audio and matches it to each speaker per segment
Non-intrusive Layout — Calculates optimal subtitle positions without obscuring other people’s faces
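The non-intrusive layout idea can be sketched as a search over a few candidate positions around the speaker's face, keeping the first one that stays inside the frame and does not cover anyone else's face. The candidate order, the `Box` layout, and the function names are illustrative assumptions rather than the project's actual algorithm.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_bubble(speaker: Box, others: List[Box], bubble_w: int, bubble_h: int,
                 frame_w: int, frame_h: int, gap: int = 10) -> Optional[Box]:
    """Try below, above, right, then left of the speaker's face."""
    x1, y1, x2, y2 = speaker
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    candidates = [
        (cx - bubble_w // 2, y2 + gap),             # below
        (cx - bubble_w // 2, y1 - gap - bubble_h),  # above
        (x2 + gap, cy - bubble_h // 2),             # right
        (x1 - gap - bubble_w, cy - bubble_h // 2),  # left
    ]
    for bx, by in candidates:
        box = (bx, by, bx + bubble_w, by + bubble_h)
        inside = bx >= 0 and by >= 0 and box[2] <= frame_w and box[3] <= frame_h
        if inside and not any(intersects(box, o) for o in others):
            return box
    return None  # caller decides on a fallback, e.g. a fixed bottom caption


speaker_face = (100, 100, 200, 200)
other_faces = [(80, 220, 220, 320)]  # someone directly below the speaker
print(place_bubble(speaker_face, other_faces, 160, 40, 640, 360))
```

Here the position below the speaker is rejected because it would cover the second face, so the bubble lands above the speaker instead.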

2️⃣ Customization Options

Allows users to personalize subtitle appearance and layout to suit their preferences and viewing comfort.

Font Size & Thickness — Adjust text size and boldness
Font Color — Choose text color (BGR format)
Bubble Color & Transparency — Customize speech bubble background and opacity
Padding — Set distance between text and bubble border

⚙️ Arguments & Usage

🎨 Customization Arguments

Video Processing Parameters

--videoName

Type: str
Default: "sample"
Description: Name of the video to process. Typically the filename without its extension (e.g., sample for sample.mp4).

--videoFolder

Type: str
Default: "demo"
Description: Specifies the folder path where input videos, temporary files, and output result files will be stored.

Text Style Parameters

--fontSize

Type: float
Default: 1.0
Description: Sets the font size to apply to subtitles or text. Corresponds to the fontScale value used in OpenCV's cv2.putText.

--fontColor

Type: int, nargs=3
Default: [255, 255, 255] (B, G, R)
Description: Sets the text color. Input in BGR order, which is OpenCV's default color format.
- Example: [255, 255, 255] → White
- Example: [0, 255, 0] → Green
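Because most color pickers report colors as RGB hex codes, a small helper like the following can translate one into the three BGR integers that `--fontColor` expects. The helper name `hex_to_bgr` is hypothetical and is not part of Speaksee.

```python
from typing import Tuple


def hex_to_bgr(hex_color: str) -> Tuple[int, int, int]:
    """Convert an RGB hex string such as '#00FF00' to a (B, G, R) tuple."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected a 6-digit hex color like '#RRGGBB'")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return (b, g, r)  # OpenCV channel order is blue, green, red


# Pure red in RGB becomes 0 0 255 on the command line.
print(*hex_to_bgr("#FF0000"))
```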

--thickness

Type: int
Default: 2
Description: Sets the thickness of text or outlines. Works identically to the thickness value in OpenCV's cv2.putText.

Speech Bubble Style Parameters

--bubbleColor

Type: int, nargs=3
Default: [0, 0, 0] (B, G, R)
Description: Background color of the speech bubble. Input in BGR order.
- Default [0, 0, 0] → Black

--bubbleAlpha

Type: float
Default: 0.7
Description: Transparency (alpha value) of the speech bubble. Controls the blend ratio used when the bubble is composited over the background video.
- 0.0 → Fully transparent
- 1.0 → Fully opaque
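For intuition, alpha compositing of this kind is commonly a per-channel weighted average of the bubble color and the underlying frame pixel. The sketch below shows that arithmetic on a single BGR pixel in plain Python. It is an assumption about how the blend works, and the actual renderer may use an OpenCV routine instead.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]  # (B, G, R), each 0-255


def blend_pixel(bubble: Pixel, frame: Pixel, alpha: float) -> Pixel:
    """Return alpha * bubble + (1 - alpha) * frame, rounded per channel."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return tuple(round(alpha * b + (1.0 - alpha) * f)
                 for b, f in zip(bubble, frame))


black_bubble = (0, 0, 0)       # default --bubbleColor
frame_pixel = (200, 100, 50)   # some pixel of the video frame
print(blend_pixel(black_bubble, frame_pixel, 0.7))  # default --bubbleAlpha
```

With the default black bubble, a higher alpha simply darkens the video behind the text more strongly.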

--padding

Type: int
Default: 10
Description: Internal padding of the bubble box, i.e., the spacing between the text and the bubble border.
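To make the role of the padding concrete, the sketch below computes a bubble rectangle from a text size and a padding value. It assumes the text dimensions come from something like `cv2.getTextSize`, which returns a width, a height, and a baseline. The function name `bubble_rect` and the exact geometry are illustrative and are not taken from Speaksee's renderer.

```python
from typing import Tuple


def bubble_rect(org: Tuple[int, int], text_w: int, text_h: int,
                baseline: int, padding: int) -> Tuple[int, int, int, int]:
    """Box (x1, y1, x2, y2) around text whose bottom-left corner is `org`.

    `org` follows the cv2.putText convention: the text's baseline origin.
    """
    x, y = org
    return (x - padding,              # left edge
            y - text_h - padding,     # top edge, above the glyphs
            x + text_w + padding,     # right edge
            y + baseline + padding)   # bottom edge, below descenders


# Hypothetical text size of 120x22 px with a 6 px baseline, default padding 10.
print(bubble_rect((50, 100), 120, 22, 6, 10))
```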

💻 Usage Example

python speaksee.py --videoName "my_video" \
                 --videoFolder "videos" \
                 --fontSize 1.5 \
                 --fontColor 0 255 0 \
                 --thickness 3 \
                 --bubbleColor 50 50 50 \
                 --bubbleAlpha 0.8 \
                 --padding 15
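The arguments documented above map naturally onto Python's `argparse`. The following is a minimal sketch of such a parser, reconstructed only from the documented types and defaults. The real `speaksee.py` may declare them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Speaker-following subtitles")
    # Video processing
    parser.add_argument("--videoName", type=str, default="sample")
    parser.add_argument("--videoFolder", type=str, default="demo")
    # Text style (colors are given as B G R)
    parser.add_argument("--fontSize", type=float, default=1.0)
    parser.add_argument("--fontColor", type=int, nargs=3, default=[255, 255, 255])
    parser.add_argument("--thickness", type=int, default=2)
    # Speech bubble style
    parser.add_argument("--bubbleColor", type=int, nargs=3, default=[0, 0, 0])
    parser.add_argument("--bubbleAlpha", type=float, default=0.7)
    parser.add_argument("--padding", type=int, default=10)
    return parser


# Parse an explicit argument list instead of reading sys.argv.
args = build_parser().parse_args(
    ["--videoName", "my_video", "--fontColor", "0", "255", "0", "--bubbleAlpha", "0.8"]
)
print(args.videoName, args.fontColor, args.bubbleAlpha, args.padding)
```

Using `nargs=3` is what lets the colors be passed as three space-separated integers, as in `--fontColor 0 255 0`.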

👥 Contributors

Team Members

김하늘
이주승
