YouTube Pipeline

A flexible, open-source pipeline to process YouTube videos using Google's Gemini models. It supports transcript analysis, native video understanding (multimodality), batch processing, and structured outputs.

Features

Flexible Input: Process single videos, lists of URLs, or entire channels.
Advanced Filters: Filter channel videos by date (e.g., "1y"), duration (e.g., "medium"), and limit.
Multimodality:
- Transcript Mode: Fast, text-based analysis using video captions.
- Video Mode: Deep visual and audio analysis using Gemini's native YouTube support (no downloads required).
Structured Output: Get results in plain text or structured JSON using defined schemas.
Clean Interface: A simple class-based API built on Source, Command, and Output.

Installation

Clone the repository:

```
git clone https://github.com/GtPluto/YoutubePipeline.git
cd YoutubePipeline
```

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate
```

Install the package:
```
pip install -e .
```

Configuration

Create a .env file in the root directory with your API keys:

```
YOUTUBE_API_KEY=your_youtube_api_key
GEMINI_API_KEY=your_gemini_api_key
```

YouTube API Key: Get it from Google Cloud Console.
Gemini API Key: Get it from Google AI Studio.
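
It can help to fail early when a key is missing, before the pipeline is constructed. The sketch below is not part of the package: `missing_keys` is a hypothetical helper that only checks an environment mapping for the two variable names listed above.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("YOUTUBE_API_KEY", "GEMINI_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"YOUTUBE_API_KEY": "abc", "GEMINI_API_KEY": " "}
print(missing_keys(example_env))  # ['GEMINI_API_KEY']
```

Call `missing_keys()` with no arguments after `load_dotenv()` to check the real environment.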

Usage

Basic Example

```
import os
from dotenv import load_dotenv
from yt_pipeline.core import Pipeline, Source, Command, Output

load_dotenv()

pipeline = Pipeline(
    youtube_api_key=os.getenv("YOUTUBE_API_KEY"),
    gemini_api_key=os.getenv("GEMINI_API_KEY")
)

# 1. Define Source
source = Source("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# 2. Define Command
command = Command(prompt="Summarize this video.", modality="transcript")

# 3. Define Output
output = Output(format="text")

# 4. Process
result = pipeline.process(source, command, output)
print(result)
```

Advanced Examples

Channel Batch Processing with Filters

Fetch the 5 most recent medium-length videos that a channel published in the last year.

```
source = Source(
    value="https://www.youtube.com/@GoogleDevelopers",  # Google Developers Channel URL
    filters={
        "limit": 5,
        "duration": "medium",  # short (<4m), medium (4-20m), long (>20m)
        "date": "1y",          # 1y, 30d, 24h, or YYYY-MM-DD
        "order": "date"
    }
)
command = Command(prompt="What is the main topic?", modality="transcript")
output = Output(format="text")

results = pipeline.process(source, command, output)
for res in results:
    print(f"[{res['title']}] {res['result']}")
```

Visual Analysis (Video Modality)

Analyze the visual content of a video directly (uses Gemini's native video understanding).

```
source = Source("https://www.youtube.com/watch?v=98DcoXwGX6I")
command = Command(
    prompt="Describe the visual style and key scenes.",
    modality="video"  # Uses native video understanding
)
output = Output(format="text")

result = pipeline.process(source, command, output)
print(result)
```

Structured JSON Output

Force the model to return a specific JSON structure.

```
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["title", "sentiment", "tags"]
}

source = Source("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
command = Command(prompt="Analyze sentiment and extract tags.", modality="transcript")
output = Output(format="json", schema=schema)

result = pipeline.process(source, command, output)
print(result)
# Output: {'title': '...', 'sentiment': 'positive', 'tags': ['music', '80s']}
```
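
If you want to double-check a returned dictionary yourself, a small standard-library check is often enough. The sketch below is an illustration, not something the package provides: `check_result` is a hypothetical helper that covers only the `required`, `type`, and `enum` keywords used in the schema above. It is not a full JSON Schema validator and does not inspect array items.

```python
from typing import Any, Dict, List

# Maps the JSON Schema type names used in this example to Python types.
_TYPES = {"string": str, "array": list, "object": dict}

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "sentiment", "tags"],
}


def check_result(result: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in result:
            problems.append(f"missing required key: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in result:
            continue
        value = result[key]
        expected = _TYPES.get(rule.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{key}: expected {rule['type']}")
        elif "enum" in rule and value not in rule["enum"]:
            problems.append(f"{key}: {value!r} not in {rule['enum']}")
    return problems


print(check_result({"title": "t", "sentiment": "happy", "tags": []}, schema))
```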

API Reference

Classes

`Source`

Defines the input content and filtering rules.

```
Source(value: Union[str, List[str]], filters: Dict[str, Any] = None)
```

value: A single YouTube URL, a Channel ID (starting with UC), a Handle (e.g., @google), a Channel URL, or a list of them.
filters: A dictionary of filters to apply (mostly for channel fetching).
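
To make the accepted input forms concrete, here is a rough sketch of how a string could be sorted into those categories. `classify_source` is a hypothetical helper written for this README. The package's own detection logic may differ, and the host list and URL paths checked here are assumptions.

```python
from urllib.parse import urlparse

_YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")


def classify_source(value: str) -> str:
    """Guess the input kind: 'handle', 'channel_id', 'video_url',
    'channel_url', or 'unknown'."""
    value = value.strip()
    if value.startswith("@"):
        return "handle"
    if value.startswith("UC") and "/" not in value:
        return "channel_id"
    parsed = urlparse(value)
    if parsed.netloc.lower() in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return "video_url"
        if parsed.path.startswith(("/@", "/channel/")):
            return "channel_url"
    return "unknown"


for item in ["https://www.youtube.com/watch?v=dQw4w9WgXcQ",
             "https://www.youtube.com/@GoogleDevelopers",
             "@google"]:
    print(item, "->", classify_source(item))
```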

Supported Filters:

limit (int): Maximum number of videos to fetch from a channel (default: 10).
order (str): Sort order for channel videos. Values: "date", "rating", "relevance", "title", "videoCount", "viewCount".
date (str): Filter by publish date.
- Relative: "1y" (1 year), "30d" (30 days), "24h" (24 hours).
- Absolute: "YYYY-MM-DD".
duration (str): Filter by video duration.
- Values: "short" (<4m), "medium" (4-20m), "long" (>20m).
- Aliases: "<4m", "4-20m", ">20m".
language (str): Filter by language code (e.g., "en", "es").
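
The date and duration values above can be read as follows. This is a minimal sketch, not the package's implementation: `parse_date_filter` and `normalize_duration` are hypothetical names, and treating a year as 365 days and using UTC are both assumptions.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hours per unit; a year is treated as 365 days (an assumption).
_UNIT_HOURS = {"h": 1, "d": 24, "y": 24 * 365}
_DURATION_ALIASES = {"<4m": "short", "4-20m": "medium", ">20m": "long"}


def parse_date_filter(value: str, now: Optional[datetime] = None) -> datetime:
    """Turn '1y', '30d', '24h', or 'YYYY-MM-DD' into a UTC cutoff datetime."""
    now = now or datetime.now(timezone.utc)
    match = re.fullmatch(r"(\d+)([hdy])", value)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(hours=amount * _UNIT_HOURS[unit])
    # Anything else must be an absolute date; strptime raises ValueError otherwise.
    return datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def normalize_duration(value: str) -> str:
    """Map an alias such as '<4m' to its canonical name."""
    canonical = _DURATION_ALIASES.get(value, value)
    if canonical not in ("short", "medium", "long"):
        raise ValueError(f"unknown duration filter: {value!r}")
    return canonical


print(parse_date_filter("30d"))
print(normalize_duration("4-20m"))
```

Videos published on or after the returned cutoff would pass the date filter.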

`Command`

Defines the instruction and processing mode.

```
Command(prompt: str, modality: str = "transcript")
```

prompt: The natural language instruction for Gemini (e.g., "Summarize this").
modality: The method of analysis.
- "transcript": Uses the video captions. Fast and cost-effective. Best for speech-heavy content.
- "video": Uses Gemini's native video understanding. Analyzes visual frames and audio. Best for visual content.

`Output`

Defines the structure and format of the result.

```
Output(format: str = "text", schema: Any = None, destination: str = None)
```

format: "text" (default) or "json".
schema: A Python dictionary (JSON Schema) or Pydantic model defining the structure. Required if format="json".
destination: (Optional) Path to save the output.

`Pipeline`

`process`

Executes the pipeline.

```
def process(self, source: Source, command: Command, output: Output) -> Union[str, List[Dict]]
```

Returns:
- If processing a single video: The result string (or dict if JSON).
- If processing multiple videos (batch/channel): A list of dictionaries containing video_id, title, and result.
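
Because the return shape depends on how many videos were processed, calling code may want one shape to iterate over. A minimal sketch, assuming only the two shapes described above: `as_records` is a hypothetical helper, and the `None` placeholders for `video_id` and `title` are a choice made for this example.

```python
from typing import Any, Dict, List, Union

Result = Union[str, Dict[str, Any], List[Dict[str, Any]]]


def as_records(result: Result) -> List[Dict[str, Any]]:
    """Wrap a single-video result so callers can always iterate over records."""
    if isinstance(result, list):
        return result
    # Single video: a plain string, or a dict when JSON output was requested.
    return [{"video_id": None, "title": None, "result": result}]


for record in as_records("A short summary."):
    print(record["title"], record["result"])
```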
