This project explores YouTube video performance using data-driven storytelling and machine learning inside Power BI.
It combines data visualization, predictive modeling, and clustering analysis to uncover what makes videos trend across regions and time zones.
The goal of this project is to understand what drives a YouTube video's success - from engagement metrics like likes and views to publishing time and regional trends.
By applying descriptive analytics, linear regression, and K-Means clustering, this dashboard turns raw YouTube metrics into meaningful insights and recommendations for content creators and marketers.
| Attribute | Description |
|---|---|
| Video ID | Unique identifier for each video |
| Title | Title of the YouTube video |
| Channel | Channel or creator name |
| Views | Total number of video views |
| Likes | Total number of likes received |
| Region | Country/region code (e.g., IN, US) |
| Published | Date & time of upload |
- Source: YouTube Data API v3 (via REST API / Power BI web connector)
- Size: ~800 rows × 7 columns
- Data refreshed dynamically via API or batch queries
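As a rough illustration of how the seven attributes above could be pulled from the API, the sketch below builds a `videos.list` request URL and flattens a response into rows. It assumes the `mostPopular` chart was the query used, which the project notes do not confirm; `build_request_url`, `flatten_items`, the `DUMMY_KEY` placeholder, and the sample payload are hypothetical names and data made up for this example. No network call is made.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_request_url(region_code: str, api_key: str, max_results: int = 50) -> str:
    """Build a videos.list URL for one region (helper name is illustrative)."""
    params = {
        "part": "snippet,statistics",
        "chart": "mostPopular",  # assumption: the trending chart was queried
        "regionCode": region_code,
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


def flatten_items(response: dict, region_code: str) -> list[dict]:
    """Turn one API response into rows with the seven dataset attributes."""
    rows = []
    for item in response.get("items", []):
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        rows.append({
            "Video ID": item.get("id"),
            "Title": snippet.get("title"),
            "Channel": snippet.get("channelTitle"),
            # Counts arrive as strings; likeCount can be missing when hidden.
            "Views": int(stats.get("viewCount", 0)),
            "Likes": int(stats.get("likeCount", 0)),
            "Region": region_code,
            "Published": snippet.get("publishedAt"),
        })
    return rows


# Hand-written payload in the shape of a videos.list response (not real data).
sample = {
    "items": [
        {
            "id": "abc123",
            "snippet": {
                "title": "Example video",
                "channelTitle": "Example channel",
                "publishedAt": "2024-01-01T01:30:00Z",
            },
            "statistics": {"viewCount": "1000", "likeCount": "50"},
        }
    ]
}

url = build_request_url("IN", "DUMMY_KEY")
rows = flatten_items(sample, "IN")
```

In Power BI the same URL could be passed to the web connector instead, with the flattening done in Power Query.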
Data cleaning and feature engineering were performed using Power Query and Python scripts embedded inside Power BI:
- Extracted Publish Hour from timestamp to study time-based engagement.
- Created an Engagement Rate metric, defined as Likes / Views.
- Applied log-transformed scaling for numeric stability.
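The three steps above can be sketched in pandas roughly as follows. This is a minimal example rather than the project's actual script: the function name `engineer_features`, the derived column names, the use of `log1p` for the log transform, and the choice to read the hour in UTC are all assumptions, and the three rows are made-up sample data.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add Publish Hour, Engagement Rate, and log-scaled counts."""
    out = df.copy()

    # Publish Hour: parse the ISO timestamp and keep the hour (UTC here).
    published = pd.to_datetime(out["Published"], utc=True)
    out["Publish Hour"] = published.dt.hour

    # Engagement Rate = Likes / Views, with zero-view rows mapped to 0.
    safe_views = out["Views"].where(out["Views"] > 0)
    out["Engagement Rate"] = (out["Likes"] / safe_views).fillna(0.0)

    # log1p keeps zero counts finite (assumed form of the log transform).
    out["Log Views"] = np.log1p(out["Views"])
    out["Log Likes"] = np.log1p(out["Likes"])
    return out


# Made-up rows for illustration only.
videos = pd.DataFrame({
    "Views": [1000, 0, 250000],
    "Likes": [50, 0, 5000],
    "Published": [
        "2024-01-01T01:30:00Z",
        "2024-01-01T14:05:00Z",
        "2024-01-02T23:59:00Z",
    ],
})
features = engineer_features(videos)
```

Inside a Power BI Python script step, the input would be the `dataset` DataFrame that Power BI supplies rather than a hand-built frame. If hours should reflect each region's local time, the timestamps would need a `tz_convert` to that region's zone before the hour is extracted.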
| Algorithm | Purpose | Description |
|---|---|---|
| Linear Regression | Predictive | Predicts views based on Likes, Region, and Publish Hour using scikit-learn |
| K-Means Clustering | Segmentation | Groups videos into High, Medium, and Low performers based on engagement |
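A hedged sketch of how the two models in the table might be set up with scikit-learn is shown below. The project's real feature set and hyperparameters are not documented here, so the details are assumptions: Region is one-hot encoded, Likes and Views are modeled on a `log1p` scale, K-Means runs on the Likes / Views ratio with three clusters, and the function names are illustrative. The `demo` frame is synthetic data generated for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def fit_view_regression(df: pd.DataFrame):
    """Fit log1p(Views) on log1p(Likes), Publish Hour, and one-hot Region."""
    X = pd.DataFrame({
        "Log Likes": np.log1p(df["Likes"]),
        "Publish Hour": df["Publish Hour"],
        "Region": df["Region"],
    })
    y = np.log1p(df["Views"])
    encode = ColumnTransformer(
        [("region", OneHotEncoder(handle_unknown="ignore"), ["Region"])],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    model = make_pipeline(encode, LinearRegression())
    model.fit(X, y)
    return model, X, y


def segment_by_engagement(df: pd.DataFrame, random_state: int = 0) -> pd.Series:
    """Cluster videos into three groups and name them by engagement level."""
    rate = (df["Likes"] / df["Views"]).to_numpy().reshape(-1, 1)
    km = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(rate)
    # Cluster ids are arbitrary, so rank the centers to attach labels.
    order = np.argsort(km.cluster_centers_.ravel())
    names = {int(c): n for c, n in zip(order, ["Low", "Medium", "High"])}
    labels = [names[int(c)] for c in km.labels_]
    return pd.Series(labels, index=df.index, name="Segment")


# Synthetic demo data: three engagement levels, random likes/region/hour.
rng = np.random.default_rng(0)
n = 60
levels = np.tile([0.01, 0.05, 0.12], n // 3)
demo = pd.DataFrame({
    "Likes": rng.integers(100, 50_000, size=n),
    "Region": rng.choice(["IN", "US"], size=n),
    "Publish Hour": rng.integers(0, 24, size=n),
})
demo["Views"] = (demo["Likes"] / levels).astype(int)

model, X, y = fit_view_regression(demo)
r2_in_sample = model.score(X, y)  # R^2 on the training rows only
segments = segment_by_engagement(demo)
```

The in-sample R² is optimistic by construction; a held-out split (for example with `train_test_split`) would give a fairer read on how well views are predicted.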
The Power BI dashboard consists of three main sections:
Data Storytelling, Data Art, and Data Showcasing.
- Scatter Plot (Likes vs Views): Shows engagement correlation
- Bar Chart (Top 10 Channels by Views): Highlights leading creators
- Heatmap (Publish Hour vs Avg Views): Reveals peak posting hours
- Treemap (Channel Contribution by Region): Visualizes view share and geographic dominance
- Regression Model (Actual vs Predicted Views): Evaluates model accuracy
- K-Means Segmentation (Video Performance): Groups content by performance level
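For readers who want to reproduce the numbers behind the Publish Hour heatmap outside Power BI, one possible aggregation in pandas is sketched below. The helper names `hourly_average_views` and `peak_hours` are hypothetical, the rows are made-up sample data, and a `Publish Hour` column is assumed to exist already.

```python
import pandas as pd


def hourly_average_views(df: pd.DataFrame) -> pd.DataFrame:
    """Average views per region (rows) and publish hour (columns)."""
    return df.pivot_table(
        index="Region", columns="Publish Hour", values="Views", aggfunc="mean"
    )


def peak_hours(df: pd.DataFrame, top_n: int = 3) -> dict:
    """Top publish hours by average views, listed per region."""
    avg = df.groupby(["Region", "Publish Hour"])["Views"].mean()
    return {
        region: grp.droplevel("Region").nlargest(top_n).index.tolist()
        for region, grp in avg.groupby(level="Region")
    }


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Region": ["IN", "IN", "IN", "IN", "US", "US"],
    "Publish Hour": [1, 1, 2, 14, 9, 20],
    "Views": [100, 300, 500, 50, 400, 100],
})

heat = hourly_average_views(sample)
peaks = peak_hours(sample, top_n=2)
```

With only a few hundred rows, some region and hour cells will rest on one or two videos, so it may be worth checking the per-cell counts before reading a cell as a genuine peak.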
- Likes strongly correlate with views - engagement drives visibility.
- T-Series dominates the global landscape with over 34M views.
- Best publishing times: 1 AM to 3 AM (India region) for maximum reach.
- Regression model accurately predicts general view patterns, though extreme viral cases deviate.
- K-Means clustering reveals clear High/Medium/Low performer segments for strategic content targeting.
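One way to check the likes and views relationship numerically is a Pearson correlation, computed on both the raw and the log scale, since a single viral video can dominate the raw figure. The sketch below uses made-up numbers with one deliberate outlier; the `pearson` helper is a hypothetical name and the `log1p` scaling is an assumption.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Made-up counts: views are 20x likes, except one viral outlier.
likes = np.array([10, 100, 1_000, 10_000, 100_000])
views = likes * 20
views[-1] = 50_000_000

raw_r = pearson(likes, views)
log_r = pearson(np.log1p(likes), np.log1p(views))
```

Reporting the log-scale value alongside the raw one makes it easier to tell whether the correlation holds across typical videos or is carried by a few extreme cases.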
- Focus on high-engagement formats - niche content often yields loyal audiences.
- Post during region-specific peak hours to boost reach.
- Benchmark against top creators (e.g., T-Series, Universal Music India).
- Use clustering insights to tailor optimization strategies for underperforming videos.
- Microsoft Power BI Desktop
- Python (scikit-learn, pandas) via Power Query scripting
- DAX & Power Query for data transformation
- K-Means & Linear Regression for ML integration
(Add screenshots or exported visuals from your Power BI dashboard here)
| Visual | Description |
|---|---|
| ![]() | Likes vs Views correlation |
| ![]() | Optimal publish hours |
| ![]() | Performance segmentation |
- Add comment sentiment and keyword trend analysis.
- Include watch time, shares, and comments metrics for deeper engagement modeling.
- Test Gradient Boosting and Neural Networks for improved prediction accuracy.
- Extend dataset to multi-year timeframes for trend forecasting.
Ei Ei Khaing
Graduate Certificate in Artificial Intelligence & Machine Learning | Fanshawe College
Email: ellenkhaing@gmail.com


