Break Through Tech AI Studio × Google Challenge (YouTube Trending & Virality)
Business Context: YouTube's recommendation algorithm is central to video discovery and creator success. This project aligns with Google's mission to "organize the world's information and make it universally accessible and useful" by providing data-driven insights into video virality. By understanding the key factors that contribute to a video's trending potential, we can support:
- YouTube's Goals: Improving the accuracy and fairness of the recommendation algorithm to surface high-quality, engaging content.
- Creators' Strategy: Empowering creators with actionable insights to optimize their content and reach a wider audience.
Objective: To build, train, and validate a machine learning model that can successfully predict a YouTube video's likelihood of becoming trending based on its metadata and early engagement metrics.
| Name | GitHub Handle | Contributions |
|---|---|---|
| Brenda | BrendaG04 | ... |
| Shyla | shylabud | ... |
| Shahriar | Shahking | ... |
| Kristel | kristel777 | ... |
| Miles | Miles1744 | ... |
| Nancy Nakyung | nancy1404 | ... |
| Rishika Vats | Irishsss | ... |
Challenge Advisor: Woon Ket Wong · Haziel Andrade
- Analyzed 2.9M+ YouTube Trending records across 11 countries using the Kaggle YouTube Trending Video dataset.
- Framed virality as a binary classification task (top decile by views) and built models to predict whether a video will be viral vs. non-viral.
- Engineered time-to-trending and engagement-velocity features (likes/hour, comments/hour, engagement/hour) to capture how fast videos gain traction.
- Trained and compared multiple models (Logistic Regression, Random Forest, XGBoost, Naive Bayes) with strong ROC-AUC and recall on the viral class.
- Built per-country notebooks plus a Global notebook to study regional differences in virality and trending speed.
- Performed targeted error analysis (false positives/negatives) to uncover slow-burn virality and country-specific behaviors that raw metrics miss.
```
git clone https://github.com/BrendaG04/Google1D.git
cd Google1D
```

Create and activate a virtual environment:

```
python3 -m venv venv
source venv/bin/activate        # on macOS / Linux
# .\venv\Scripts\activate       # on Windows (PowerShell)
```

If you have a requirements.txt:

```
pip install -r requirements.txt
```

Typical packages used: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `xgboost`, `jupyter`.
- Go to the Kaggle dataset: "YouTube Trending Video Dataset".
- Download all country files (e.g., `US_youtube_trending_data.csv`, `KR_youtube_trending_data.csv`, etc.).
- Place them in the `datasets/` folder (this folder is not tracked in git due to file size):
```
youtube-trending-analysis/
└── datasets/
    ├── US_youtube_trending_data.csv
    ├── CA_youtube_trending_data.csv
    ├── ...
    └── KR_youtube_trending_data.csv
```
If you also use YouTube API-based files (e.g., category ID → category name mappings), place them in the same `datasets/` folder.
Launch Jupyter:

```
jupyter notebook
```

Open:
- `notebooks/Global_Notebook.ipynb` for the global pipeline.
- `notebooks/XX_Notebook.ipynb` for each country (US, CA, GB, DE, FR, BR, MX, IN, RU, JP, KR).
This project was completed as part of the Break Through Tech AI Studio program, in partnership with Google.
The challenge focused on understanding YouTube virality and Trending behavior across countries.
Core questions:
- Virality: Can we predict which videos will become "viral" (top 10% by views) using only early engagement and metadata?
- Trending Speed: Given a video that reaches Trending, can we estimate how long it takes to get there (e.g., days from publish to first Trending appearance)?
YouTube hosts millions of uploads per day; it's impossible for humans to manually screen or prioritize them.
Understanding signals of virality and trending speed can help:
- Content teams design better posting strategies.
- Platforms monitor algorithmic amplification and potential bias.
- Creators interpret whether early signals are promising or not.
- Source: Kaggle "YouTube Trending Video Dataset"
- Countries: 11 (US, CA, GB, DE, FR, BR, MX, IN, RU, JP, KR)
- Rows: ~2.9 million total
Raw fields:
- Engagement: `view_count`, `likes`, `dislikes` (legacy), `comment_count`
- Metadata: `title`, `tags`, `channel_title`, `category_id`, `description`
- Time: `publish_time`, `trending_date`
- Other: `thumbnail_link`, `comments_disabled`, `ratings_disabled`
Across notebooks, we typically:
- Parsed and aligned date/time columns (`publish_time`, `trending_date`).
- Removed or imputed rows with missing core fields (views, likes, comments).
- Dropped fields not usable for modeling (e.g., raw thumbnail links).
- Filtered out extreme outliers where needed when training regression models.
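As a minimal sketch of these cleaning steps, assuming the Kaggle column names (a tiny inline frame stands in for a real country CSV such as `datasets/US_youtube_trending_data.csv`):

```python
import pandas as pd

# Tiny inline sample standing in for a country CSV from datasets/.
df = pd.DataFrame({
    "publish_time": ["2020-08-11T19:20:14Z", "2020-08-12T01:00:00Z"],
    "trending_date": ["2020-08-12T00:00:00Z", None],
    "view_count": [1514614, None],
    "likes": [156908, 1000],
    "comment_count": [35313, 20],
    "thumbnail_link": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})

# Parse and align date/time columns; unparseable values become NaT.
df["publish_time"] = pd.to_datetime(df["publish_time"], utc=True, errors="coerce")
df["trending_date"] = pd.to_datetime(df["trending_date"], utc=True, errors="coerce")

# Remove rows with missing core engagement fields.
df = df.dropna(subset=["view_count", "likes", "comment_count"])

# Drop fields not usable for modeling (raw thumbnail links).
df = df.drop(columns=["thumbnail_link"])
```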
Some of the main engineered features included:
Time-based:
- `publish_hour`
- `publish_dayofweek`
- `days_to_trending` and/or `hours_to_trending` (difference between `publish_time` and first `trending_date`)
Metadata richness:
- `title_length` (characters)
- `tag_count` (number of tags)
Engagement-velocity (core to virality):
- `likes_per_hour`
- `comments_per_hour`
- `engagement_per_hour` (e.g., `(likes + comments) / hours_since_publish`)
Virality label:
- For each country, defined a "viral" flag as being in the top 10% of `view_count` (or views per day) at the time of observation.
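A condensed sketch of the feature engineering above, assuming a cleaned frame with parsed datetime columns (the inline data is synthetic, and hours-since-publish is proxied here by `hours_to_trending`):

```python
import pandas as pd

# Synthetic rows with the Kaggle column names.
df = pd.DataFrame({
    "publish_time": pd.to_datetime(["2020-08-11 00:00", "2020-08-11 12:00",
                                    "2020-08-10 00:00", "2020-08-09 00:00"], utc=True),
    "trending_date": pd.to_datetime(["2020-08-12", "2020-08-13",
                                     "2020-08-12", "2020-08-12"], utc=True),
    "title": ["Short", "A somewhat longer video title", "Mid title", "Another one"],
    "tags": ["a|b|c", "a", "a|b", "a|b|c|d"],
    "view_count": [1_000_000, 50_000, 20_000, 10_000],
    "likes": [90_000, 4_000, 1_500, 800],
    "comment_count": [10_000, 500, 200, 100],
})

# Time-based features.
df["publish_hour"] = df["publish_time"].dt.hour
df["publish_dayofweek"] = df["publish_time"].dt.dayofweek
df["hours_to_trending"] = (df["trending_date"] - df["publish_time"]).dt.total_seconds() / 3600
df["days_to_trending"] = df["hours_to_trending"] / 24

# Metadata richness.
df["title_length"] = df["title"].str.len()
df["tag_count"] = df["tags"].str.split("|").str.len()

# Engagement velocity (floor at 1 hour to avoid division blow-ups).
hours = df["hours_to_trending"].clip(lower=1)
df["likes_per_hour"] = df["likes"] / hours
df["comments_per_hour"] = df["comment_count"] / hours
df["engagement_per_hour"] = (df["likes"] + df["comment_count"]) / hours

# Virality label: top 10% of view_count within the country.
df["is_viral"] = (df["view_count"] >= df["view_count"].quantile(0.9)).astype(int)
```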
- Viral videos typically have much higher engagement velocity early on, not just more total views.
- Some markets (e.g., KR, IN, RU) show faster days-to-trending compared with others (e.g., US, CA, JP).
- Category effects exist but are often weaker than engagement and timing features.
We framed the work as two related tasks:
- Classification: Predict whether a video is viral (top decile) vs. non-viral.
- Regression: Estimate time-to-trending for videos that hit Trending.
Classification:
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
- Naive Bayes (as a simpler baseline)
Regression:
- Linear Regression on log-transformed targets
- Random Forest Regressor
- XGBoost Regressor
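The log-transformed linear baseline can be sketched as follows (synthetic features and target here; the actual notebooks fit on the engineered features with `days_to_trending` as the target):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for engineered features and a positive-valued target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.exp(0.5 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Fit on log(1 + target) to tame the right-skewed distribution,
# then invert predictions back to the original scale.
model = LinearRegression()
model.fit(X, np.log1p(y))
pred_days = np.expm1(model.predict(X))
```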
- Train/validation split (e.g., 70/30) within each country or region.
- Scaling numeric features (e.g., `StandardScaler`) and one-hot encoding categorical fields using `ColumnTransformer`.
- Main classification metric: ROC-AUC, with recall on the viral class as a secondary focus.
- Main regression metrics: MAE, RMSE, and R².
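The split/scale/encode setup might look like this (synthetic stand-in data; the column names mirror the engineered features, and Logistic Regression stands in for any of the classifiers above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the engineered feature frame.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "likes_per_hour": rng.exponential(100, n),
    "comments_per_hour": rng.exponential(10, n),
    "publish_hour": rng.integers(0, 24, n),
    "category_id": rng.choice(["10", "24", "25"], n),
})
y = (df["likes_per_hour"] > df["likes_per_hour"].quantile(0.9)).astype(int)

numeric = ["likes_per_hour", "comments_per_hour", "publish_hour"]
categorical = ["category_id"]

# Scale numeric columns, one-hot encode categoricals, then classify.
pre = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

# 70/30 train/validation split, stratified on the viral label.
X_tr, X_te, y_tr, y_te = train_test_split(
    df, y, test_size=0.3, random_state=1, stratify=y)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```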
Because only ~10% of samples are labeled viral, we experimented with:
- Class weights (e.g., `class_weight="balanced"`).
- Threshold tuning (moving away from 0.5) to recover better viral recall.
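Both imbalance tactics can be sketched together (synthetic data; the 0.3 threshold is an illustrative choice, not the tuned value from the notebooks):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced problem: roughly 10% positive labels.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.8).astype(int)

# Tactic 1: reweight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Tactic 2: lower the decision threshold to trade precision for recall
# on the rare (viral) class.
proba = clf.predict_proba(X)[:, 1]
default_pred = (proba >= 0.5).astype(int)
tuned_pred = (proba >= 0.3).astype(int)

r_default = recall_score(y, default_pred)
r_tuned = recall_score(y, tuned_pred)
```

Lowering the threshold can only add positive predictions, so viral-class recall is monotonically non-decreasing as the threshold drops.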
Note: Exact metrics may vary by country notebook; this section summarizes the overall behavior observed across runs.
- Tree-based models (Random Forest, XGBoost) consistently outperformed baselines.
- ROC-AUC for the best models was typically high (≈ 0.96+ in many regions), with strong separation between viral and non-viral classes.
- Engagement-velocity features (`likes_per_hour`, `comments_per_hour`, `engagement_per_hour`) were almost always among the top feature importances.
- Best regression models achieved MAE of roughly 1.5–2 days and moderate R² (the platform has inherent randomness and unobserved factors).
- Some markets trend faster on average; others have more latency between publish and Trending, even for videos that eventually go viral.
Velocity > raw counts
A video with modest total views but high early engagement rate is more likely to be predicted viral than a slow-growing video with bigger absolute numbers.
Country differences:
- Certain countries show more "flash" virality (quick spikes, fast Trending).
- Others exhibit slow-burn trajectories where videos accumulate views over time before finally hitting Trending.
Error analysis:
- False positives: High early engagement that never quite crosses the Trending threshold (e.g., niche but very loyal audiences).
- False negatives: Videos that start slow but later surge due to external events, news cycles, or creator promotion, a pattern that simple early-time features don't fully capture.
With more time, or under production constraints, we would explore:
- Use multilingual models (e.g., BERT variants) to embed titles, descriptions, and tags.
- Replace single "snapshot" features with time-series curves (engagement over 12–48 hours) and model them via RNNs, TCNs, or temporal transformers.
- Examine model performance across categories, countries, and channel sizes to see where predictions might systematically favor or penalize certain creators.
- Instead of only viral vs. non-viral, predict future peak views or watch time as a continuous outcome.
- Wrap the best model in a simple API and create a lightweight dashboard for "what-if" analyses (e.g., "What if we shift publish hour?").
```
youtube-trending-analysis/
├── notebooks/
│   ├── Global_Notebook.ipynb
│   ├── US_Notebook.ipynb
│   ├── CA_Notebook.ipynb
│   ├── GB_Notebook.ipynb
│   ├── DE_Notebook.ipynb
│   ├── FR_Notebook.ipynb
│   ├── BR_Notebook.ipynb
│   ├── MX_Notebook.ipynb
│   ├── IN_Notebook.ipynb
│   ├── RU_Notebook.ipynb
│   ├── JP_Notebook.ipynb
│   └── KR_Notebook.ipynb
├── datasets/          # (not tracked in git; add CSVs here locally)
├── slides/            # final AI Studio presentation (to be added)
├── README.md
├── requirements.txt   # (if used)
└── .gitignore
```
This project is licensed under the MIT License.
- Kaggle: YouTube Trending Video Dataset
- Break Through Tech AI Studio curriculum materials (ML, MLOps, fairness modules)
- scikit-learn documentation
- XGBoost documentation
Huge thanks to:
- Google / YouTube for sponsoring the challenge.
- Break Through Tech AI for the curriculum, mentorship, and infrastructure.
- Team Google 1D members and coaches for feedback on modeling, EDA, and communication.