⚡ Build structured YouTube datasets for NLP, ML, sentiment analysis & RAG in minutes.
A Python tool for quickly fetching thousands of videos from a YouTube channel, along with structured transcripts and additional metadata. Export data easily as CSV, TXT, or JSON.
- Installation
- Quick CLI Usage
- Basic Usage (Python API)
- Features
- Fetching Specific Channel Tabs (Videos / Shorts / Streams)
- Using Different Fetchers
- Retrieve Different Languages
- Filtering
- Converting ChannelData to Rows
- SQLite Cache
- Fetching Only Manually Created Transcripts
- Exporting
- Comments
- Other Methods
- Proxy Configuration
- Advanced HTTP Configuration (Optional)
- CLI (Advanced)
- Docker Quick Start
- Contributing
- Running Tests
- Related Projects
- License
- Contributors
Install from PyPI:
pip install ytfetcher
Fetch 50 video transcripts + metadata from a channel and save as JSON:
ytfetcher channel TheOffice -m 50 -f json
Here’s how you can get transcripts and metadata such as channel name, description, and published date from a single channel with the from_channel method:
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=2
)
channel_data = fetcher.fetch_youtube_data()
for video in channel_data:
    print(video.metadata.title)
    print(video.metadata.description)
    print(video.transcripts)
This will return a list of ChannelData with metadata in DLSnippet objects:
[
    ChannelData(
        video_id='video1',
        transcripts=[
            Transcript(
                text="Hey there",
                start=0.0,
                duration=1.54
            ),
            Transcript(
                text="Happy coding!",
                start=1.56,
                duration=4.46
            )
        ],
        metadata=DLSnippet(
            video_id='video1',
            title='VideoTitle',
            description='VideoDescription',
            url='https://youtu.be/video1',
            duration=120,
            view_count=1000,
            thumbnails=[{'url': 'thumbnail_url'}]
        )
    ),
    # Other ChannelData objects...
]
You can also preview this data using the PreviewRenderer class from ytfetcher.services.
from ytfetcher.services import PreviewRenderer
channel_data = fetcher.fetch_with_comments(max_comments=10)
#print(channel_data)
preview = PreviewRenderer()
preview.render(data=channel_data, limit=4)
This will preview the first 4 results of the data in a formatted terminal view, including metadata, transcript snippets, and comments.
- Fetch full transcripts from a YouTube channel.
- Get video metadata: title, description, thumbnails, published date.
- Support for fetching by channel handle, playlist ID, custom video IDs, or search query.
- Fetch comments in bulk.
- Concurrent fetching for high performance.
- Built-in cache support.
- Export fetched data as txt, csv or json.
- CLI support.
Use the tab parameter in from_channel() to select which section of a channel to fetch.
Available options:
- 'videos' (default)
- 'shorts'
- 'streams'
If not specified, the fetcher defaults to the Videos tab.
# Fetch regular videos (default)
YTFetcher.from_channel(channel_handle="handle")
# Fetch Shorts
YTFetcher.from_channel(channel_handle="handle", tab="shorts")
# Fetch live streams
YTFetcher.from_channel(channel_handle="handle", tab="streams")
ytfetcher supports several other fetching options:
- Fetching from a playlist ID with the from_playlist_id method.
- Fetching from video IDs with the from_video_ids method.
- Fetching from a search query with the from_search method.
Use from_playlist_id to retrieve metadata and transcripts for every video within a public or unlisted YouTube playlist.
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_playlist_id(
    playlist_id="playlistid1254"
)
# Rest is same ...
If you already have specific video identifiers, from_video_ids allows you to target them directly.
This is the most efficient way to fetch data when you have an external list of URLs or IDs.
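If your external list contains URLs rather than bare IDs, they need to be reduced to IDs first. The sketch below shows one way to do that with the standard library. `extract_video_id` is a hypothetical helper, not part of ytfetcher, and it assumes the common 11-character ID format and the usual `watch`, `youtu.be`, `shorts`, `embed`, and `live` URL shapes.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from the URL-safe base64 alphabet.
_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")
_YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "music.youtube.com"}


def extract_video_id(value: str) -> Optional[str]:
    """Hypothetical helper: return the video ID for a URL or bare ID, else None."""
    value = value.strip()
    if _ID_RE.fullmatch(value):
        return value  # already a bare ID

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("shorts", "embed", "live"):
                candidate = parts[1]

    if candidate and _ID_RE.fullmatch(candidate):
        return candidate
    return None


urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://youtu.be/dQw4w9WgXcQ?t=10",
    "not a video",
]
# Keep only the inputs that could be resolved to an ID.
video_ids = [vid for vid in map(extract_video_id, urls) if vid]
```

The resulting `video_ids` list can then be passed to from_video_ids.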
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_video_ids(
    video_ids=['video1', 'video2', 'video3']
)
# Rest is same ...
The from_search method allows you to discover videos based on a keyword or phrase, similar to using the YouTube search bar. You can control the breadth of the search using the max_results parameter.
from ytfetcher import YTFetcher
# Searches for the top 10 videos matching 'Artificial Intelligence'
fetcher = YTFetcher.from_search(
    query="Artificial Intelligence",
    max_results=10
)
YTFetcher provides a simple interface for customizing your fetching process with several optional parameters:
- languages: Specify preferred transcript languages (e.g., ["en", "tr"]).
- filters: Apply filters to video metadata before transcripts are fetched.
- manually_created: Fetch only manually created transcripts for more precise results.
- proxy_config: Provide custom proxy settings to reduce the risk of bans.
- http_config: Define custom HTTP headers.
- cache_enabled: Enable or disable the SQLite transcript cache. Enabled by default.
- cache_path: Choose where the cache file (cache.sqlite3) is stored.
These options can be passed to any of the fetcher methods (from_channel, from_video_ids, from_playlist_id, or from_search) to tailor the fetching process to your needs. You can use the FetchOptions dataclass from ytfetcher.config to configure your options.
See below for usage examples.
You can use the languages param to retrieve your desired language (default: en).
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
video_ids = ['video1', 'video2']  # placeholder IDs
options = FetchOptions(
    languages=['tr', 'en']
)
fetcher = YTFetcher.from_video_ids(video_ids=video_ids, options=options)
Here's the equivalent CLI command for the languages param:
ytfetcher channel TheOffice -m 50 -f csv --languages tr en
ytfetcher first tries to fetch the Turkish transcript. If it's not available, it falls back to English.
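The fallback order described above can be pictured as a first-match search over the preference list. This is only an illustration of that rule: `pick_language` is a hypothetical function, not ytfetcher's internal code.

```python
from typing import Optional, Sequence


def pick_language(preferred: Sequence[str], available: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: first preferred language code that is available, else None."""
    available_set = set(available)
    for code in preferred:  # earlier entries win
        if code in available_set:
            return code
    return None


# Turkish is preferred but missing here, so English is selected.
chosen = pick_language(["tr", "en"], ["en", "de"])
print(chosen)
```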
ytfetcher allows you to filter videos before fetching transcripts, which helps you focus on specific content and save processing time. Filters are applied to video metadata (duration, view count, title) and work with all fetcher methods.
The following filter functions are available in ytfetcher.filters:
- min_duration(sec: float) - Filter videos with duration greater than or equal to the specified seconds
- max_duration(sec: float) - Filter videos with duration less than or equal to the specified seconds
- min_views(n: int) - Filter videos with view count greater than or equal to the specified number
- max_views(n: int) - Filter videos with view count less than or equal to the specified number
- filter_by_title(search_query: str) - Filter videos whose title contains the search query (case-insensitive)
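Conceptually, each filter in the list above is a predicate over video metadata. The sketch below shows how such predicates can be built and combined, using stand-in names (`VideoMeta`, `duration_at_least`, `views_at_least`, `title_contains`, `apply_filters`) that are not ytfetcher's own. It assumes that a video must pass every filter to be kept.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VideoMeta:
    """Stand-in for video metadata; not ytfetcher's DLSnippet."""
    title: str
    duration: float
    view_count: int


Predicate = Callable[[VideoMeta], bool]


def duration_at_least(sec: float) -> Predicate:
    return lambda v: v.duration >= sec


def views_at_least(n: int) -> Predicate:
    return lambda v: v.view_count >= n


def title_contains(query: str) -> Predicate:
    needle = query.lower()  # case-insensitive match
    return lambda v: needle in v.title.lower()


def apply_filters(videos: Iterable[VideoMeta], filters: List[Predicate]) -> List[VideoMeta]:
    # Assumption: filters combine with AND.
    return [v for v in videos if all(f(v) for f in filters)]


videos = [
    VideoMeta("Python Tutorial", 900, 12000),
    VideoMeta("Short clip", 45, 50000),
    VideoMeta("Advanced TUTORIAL", 1200, 300),
]
kept = apply_filters(
    videos,
    [views_at_least(5000), duration_at_least(600), title_contains("tutorial")],
)
print([v.title for v in kept])
```

Filtering on metadata first means transcripts only need to be requested for the videos that survive.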
Pass a list of filter functions to the filters parameter when creating a fetcher:
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
from ytfetcher.filters import min_duration, min_views, filter_by_title
options = FetchOptions(
    filters=[
        min_views(5000),
        min_duration(600),  # At least 10 minutes
        filter_by_title("tutorial")
    ]
)
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=50,
    options=options
)
You can use filter arguments directly in the CLI:
# Filter by minimum views
ytfetcher channel TheOffice -m 50 -f json --min-views 1000
# Filter by minimum duration (in seconds)
ytfetcher channel TheOffice -m 50 -f csv --min-duration 300
# Filter by title substring
ytfetcher channel TheOffice -m 50 -f json --includes-title "episode"
# Combine multiple filters
ytfetcher channel TheOffice -m 50 -f json --min-views 1000 --min-duration 300 --includes-title "tutorial"
If you want a flat, row-based structure for ML workflows (Pandas, HuggingFace datasets, JSON/Parquet), you can use the helper in ytfetcher.utils to join transcript segments. Comments are only included if you fetched them with fetch_with_comments or fetch_comments.
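To make the idea of "one row per video with joined transcript text" concrete, here is a small sketch using stand-in dataclasses. `Segment`, `Video`, `to_rows`, and the row keys are assumptions for illustration. The actual columns produced by the ytfetcher helper may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """Stand-in for one transcript segment."""
    text: str
    start: float
    duration: float


@dataclass
class Video:
    """Stand-in for one fetched video; not ytfetcher's ChannelData."""
    video_id: str
    title: str
    segments: List[Segment]
    comments: List[str] = field(default_factory=list)


def to_rows(videos: List[Video], include_comments: bool = False) -> List[Dict[str, object]]:
    rows: List[Dict[str, object]] = []
    for video in videos:
        row: Dict[str, object] = {
            "video_id": video.video_id,
            "title": video.title,
            # Join all segments into a single text field.
            "text": " ".join(seg.text for seg in video.segments),
        }
        if include_comments:
            row["comments"] = list(video.comments)
        rows.append(row)
    return rows


sample = [
    Video(
        "video1",
        "VideoTitle",
        [Segment("Hey there", 0.0, 1.54), Segment("Happy coding!", 1.56, 4.46)],
    )
]
rows = to_rows(sample)
print(rows[0]["text"])
```

A list of flat dictionaries like this can be passed directly to tools that expect records, such as a DataFrame constructor.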
from ytfetcher import YTFetcher
from ytfetcher.utils import channel_data_to_rows
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
channel_data = fetcher.fetch_with_comments(max_comments=5)
rows = channel_data_to_rows(channel_data, include_comments=True)
ytfetcher now uses a local SQLite cache for transcripts. This significantly speeds up repeated fetches by reusing transcripts that were already fetched with the same transcript options.
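As a rough mental model of such a cache, the sketch below stores transcripts in SQLite keyed by video ID plus a string describing the transcript options, with an optional expiry in days. `TranscriptCache`, its table layout, and the `options_key` format are all hypothetical and do not reflect ytfetcher's actual schema. The only behavior taken from this README is that a TTL of 0 disables expiration.

```python
import json
import sqlite3
import time
from typing import List, Optional

SECONDS_PER_DAY = 86400


class TranscriptCache:
    """Hypothetical stand-in, not ytfetcher's cache implementation."""

    def __init__(self, path: str = ":memory:", ttl_days: int = 0) -> None:
        self.ttl_days = ttl_days  # 0 means entries never expire
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS transcripts ("
            "video_id TEXT, options_key TEXT, payload TEXT, fetched_at REAL, "
            "PRIMARY KEY (video_id, options_key))"
        )

    def put(self, video_id: str, options_key: str, segments: List[dict],
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO transcripts VALUES (?, ?, ?, ?)",
            (video_id, options_key, json.dumps(segments), now),
        )
        self.conn.commit()

    def get(self, video_id: str, options_key: str,
            now: Optional[float] = None) -> Optional[List[dict]]:
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT payload, fetched_at FROM transcripts "
            "WHERE video_id = ? AND options_key = ?",
            (video_id, options_key),
        ).fetchone()
        if row is None:
            return None  # never cached with these options
        payload, fetched_at = row
        if self.ttl_days > 0 and now - fetched_at > self.ttl_days * SECONDS_PER_DAY:
            return None  # entry is older than the TTL
        return json.loads(payload)


cache = TranscriptCache(ttl_days=3)
segments = [{"text": "Hey there", "start": 0.0, "duration": 1.54}]
cache.put("video1", "en", segments, now=0)
fresh = cache.get("video1", "en", now=2 * SECONDS_PER_DAY)
stale = cache.get("video1", "en", now=4 * SECONDS_PER_DAY)
other_options = cache.get("video1", "tr", now=0)
```

Including the options in the key is what keeps a transcript fetched in one language from being returned for a request in another.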
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
options = FetchOptions(
    cache_enabled=True,
    cache_path="./.ytfetcher_cache"
)
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=20,
    options=options,
)
Disable cache when needed:
from ytfetcher.config import FetchOptions
options = FetchOptions(cache_enabled=False)
Control cache expiration with TTL (days):
from ytfetcher.config import FetchOptions
# Keep cached transcripts for 3 days
options = FetchOptions(cache_ttl=3)
# Disable expiration entirely
options = FetchOptions(cache_ttl=0)
Use --no-cache to skip reading/writing cache for a command:
ytfetcher channel TheOffice -m 20 --no-cache -f json
Set a custom cache directory:
ytfetcher channel TheOffice -m 20 --cache-path ./my_cache -f json
Set cache TTL in days (0 disables expiration):
ytfetcher channel TheOffice -m 20 --cache-ttl 3 -f json
Clear cached transcripts:
ytfetcher cache --clean
Or clear a custom cache path:
ytfetcher cache --clean --cache-path ./my_cache
ytfetcher allows you to fetch only manually created transcripts from a channel, which gives you more precise transcripts.
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
options = FetchOptions(
    manually_created=True
)
fetcher = YTFetcher.from_channel(channel_handle="TEDx", options=options)
You can also enable this feature with the --manually-created argument in the CLI.
ytfetcher channel TEDx -f csv --manually-created
Use the BaseExporter class to export ChannelData in csv, json, or txt:
from ytfetcher.services import JSONExporter # OR you can import other exporters: TXTExporter, CSVExporter
channel_data = fetcher.fetch_youtube_data()
exporter = JSONExporter(
    channel_data=channel_data,
    allowed_metadata_list=['title'],  # You can customize this
    timing=True,                      # Include transcript start/duration
    filename='my_export',             # Base filename
    output_dir='./exports'            # Optional output directory
)
exporter.write()
You can also pass arguments on the command line to exclude timings and choose which metadata to keep.
ytfetcher channel TheOffice -m 20 -f json --no-timing --metadata title description
This command will exclude timings from transcripts and keep only title and description as metadata.
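The effect of those two switches can be illustrated on plain dictionaries. In the sketch below, `build_export` and the record layout are assumptions made for the example, not the format ytfetcher's exporters write. It only demonstrates the two ideas: a metadata allow-list and an optional timing strip.

```python
import json
from typing import Dict, List, Sequence


def build_export(records: List[Dict], allowed_metadata: Sequence[str],
                 timing: bool = True) -> List[Dict]:
    """Hypothetical helper: keep selected metadata keys, optionally drop timings."""
    allowed = set(allowed_metadata)
    exported = []
    for record in records:
        transcripts = [
            dict(seg) if timing else {"text": seg["text"]}
            for seg in record["transcripts"]
        ]
        metadata = {k: v for k, v in record["metadata"].items() if k in allowed}
        exported.append(
            {"video_id": record["video_id"], "metadata": metadata, "transcripts": transcripts}
        )
    return exported


records = [
    {
        "video_id": "video1",
        "metadata": {"title": "VideoTitle", "description": "VideoDescription", "view_count": 1000},
        "transcripts": [{"text": "Hey there", "start": 0.0, "duration": 1.54}],
    }
]
# Mirrors the intent of: --no-timing --metadata title description
result = build_export(records, ["title", "description"], timing=False)
print(json.dumps(result, indent=2))
```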
ytfetcher allows you to fetch comments in bulk, either together with metadata and transcripts or on their own.
Performance: Comment fetching is a resource-intensive process. The speed of extraction depends significantly on the user's internet connection and the total volume of comments being retrieved.
To fetch comments alongside transcripts and metadata, use the fetch_with_comments method.
fetcher = YTFetcher.from_channel("TheOffice", max_results=5)
channel_data_with_comments = fetcher.fetch_with_comments(max_comments=10)
This fetches the top 10 comments for every video alongside the transcript data.
Here's an example structure:
[
    ChannelData(
        video_id='id1',
        transcripts=list[Transcript(...)],
        metadata=DLSnippet(...),
        comments=list[Comment(
            text='Comment one.',
            like_count=20,
            author='@author',
            time_text='8 days ago'
        )]
    )
]
To fetch comments without transcripts, use the fetch_comments method.
fetcher = YTFetcher.from_channel("TheOffice", max_results=5)
comments = fetcher.fetch_comments(max_comments=20)
This will return a list of Comment objects like this:
[
    Comment(
        text='Comment one.',
        like_count=20,
        author='@author',
        time_text='8 days ago'
    )
    # Other Comment objects...
]
Fetching comments with the CLI is straightforward.
To fetch comments with transcripts you can use --comments argument:
ytfetcher channel TheOffice -m 20 --comments 10 -f json
To fetch only comments with metadata, use the --comments-only argument:
ytfetcher channel TheOffice -m 20 --comments-only 10 -f json
You can also fetch only transcript data or metadata with video IDs using fetch_transcripts and fetch_snippets.
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
data = fetcher.fetch_transcripts()
print(data)
data = fetcher.fetch_snippets()
print(data)
YTFetcher supports proxy usage for fetching YouTube transcripts:
from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig, WebshareProxyConfig, FetchOptions
options = FetchOptions(
    # Pass either a GenericProxyConfig or a WebshareProxyConfig instance
    # (constructor arguments are omitted here).
    proxy_config=GenericProxyConfig()
)
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=3,
    options=options
)
YTFetcher already uses custom headers to mimic real browser behavior, but if you want to change them, you can use a custom HTTPConfig class.
from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig, FetchOptions
custom_config = HTTPConfig(
    headers={"User-Agent": "ytfetcher/1.0"}
)
options = FetchOptions(
    http_config=custom_config
)
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=10,
    options=options
)
YTFetcher comes with a simple CLI so you can fetch data directly from your terminal.
ytfetcher -h
usage: ytfetcher [-h] {channel,playlist,video,search} ...
Fetch YouTube transcripts for a channel
positional arguments:
  {channel,playlist,video,search}
    channel             Fetch data from channel handle with max_results.
    playlist            Fetch data from a specific playlist id.
    video               Fetch data from your custom video ids.
    search              Fetch data from youtube with search query.
options:
  -h, --help            show this help message and exit
ytfetcher channel <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>
Use --tab to choose which channel feed should be fetched.
# Default: videos
ytfetcher channel TheOffice -m 20 --tab videos -f json
# Fetch from the Shorts tab
ytfetcher channel TheOffice -m 20 --tab shorts -f json
# Fetch from the Live/Streams tab
ytfetcher channel TheOffice -m 20 --tab streams -f json
### Fetching by Video IDs
```bash
ytfetcher video video_id1 video_id2 ... -f json
```
ytfetcher playlist playlistid123 -f csv -m 25
ytfetcher search "AI Getting Jobs" -f json -m 25
ytfetcher <CHANNEL_HANDLE> -f json --webshare-proxy-username "<USERNAME>" --webshare-proxy-password "<PASSWORD>"
ytfetcher <CHANNEL_HANDLE> -f json --http-proxy "http://user:pass@host:port" --https-proxy "https://user:pass@host:port"
The recommended way to run or develop YTFetcher is using Docker to ensure a clean, stable environment without needing local Python or dependency management.
docker-compose build
Use docker-compose run to execute your desired command inside the container.
docker-compose run ytfetcher poetry run ytfetcher channel -c TheOffice -m 20 -f json
git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install
poetry run pytest
All type checks must pass before you contribute to ytfetcher.
poetry run mypy ytfetcher
This project is licensed under the MIT License; see the LICENSE file for details.
Thanks to everyone who has contributed to ytfetcher ❤️
⭐ If you find this useful, please star the repo or open an issue with feedback!