This project involves a comprehensive analysis of Netflix's movies and TV shows data using SQL. The goal is to extract valuable insights and answer various business questions based on the dataset. The following README provides a detailed account of the project's objectives, business problems, solutions, findings, and conclusions.
- Analyze the distribution of content types (movies vs TV shows).
- Identify the most common ratings for movies and TV shows.
- List and analyze content based on release years, countries, and durations.
- Explore and categorize content based on specific criteria and keywords.
The data for this project is sourced from the Kaggle dataset:
- Dataset Link: Movies Dataset
CREATE TABLE IF NOT EXISTS netflix_titles
(
show_id VARCHAR(10) PRIMARY KEY,
typess VARCHAR(10),
title VARCHAR(250),
director VARCHAR(550),
casts VARCHAR(1050),
country VARCHAR(550),
date_added VARCHAR(55),
release_year INT,
rating VARCHAR(15),
duration VARCHAR(15),
listed_in VARCHAR(250),
description VARCHAR(550)
);SELECT
typess,COUNT(*)
FROM netflix_titles
GROUP BY 1;Objective: Determine the distribution of content types on Netflix.
SELECT
typess,rating
FROM (
SELECT
typess,rating, count(*),
RANK () OVER(PARTITION BY typess ORDER BY count(*) desc) as ranking
FROM netflix_titles
GROUP BY 1,2
ORDER BY 1
) AS T1
WHERE ranking = 1;Objective: Identify the most frequently occurring rating for each type of content.
SELECT
title, casts, director
FROM netflix_titles
WHERE
typess = 'Movie' AND release_year = 2020;Objective: Retrieve all movies released in a specific year.
SELECT
UNNEST(STRING_TO_ARRAY(country, ',')) as new_country,
COUNT(*) as total_content
FROM netflix_titles
GROUP BY 1
ORDER BY total_content DESC
LIMIT 5;Objective: Identify the top 5 countries with the highest number of content items.
SELECT
*
FROM netflix_titles
WHERE typess = 'Movie'
ORDER BY SPLIT_PART(duration, ' ', 1)::INT DESC;Objective: Find the movie with the longest duration.
SELECT
*
FROM netflix_titles
WHERE TO_DATE(date_added, 'Month DD, YYYY') >= CURRENT_DATE - INTERVAL '5 years';Objective: Retrieve content added to Netflix in the last 5 years.
SELECT * FROM netflix_titles
WHERE director ilike 'Rajiv Chilaka'
-- OR
SELECT
*
FROM(
SELECT
UNNEST(STRING_TO_ARRAY(director, ',')) as new_director, *
FROM netflix_titles
) WHERE new_director = 'Rajiv Chilaka';Objective: List all content directed by 'Rajiv Chilaka'.
SELECT
*
FROM netflix_titles
WHERE typess = 'TV Show'
AND SPLIT_PART(duration, ' ', 1):: int > 5;Objective: Identify TV shows with more than 5 seasons.
SELECT
UNNEST(STRING_TO_ARRAY(listed_in,',')) AS genre,
count(*)
FROM netflix_titles
GROUP BY 1
ORDER BY 2 desc;
Objective: Count the number of content items in each genre.
return top 5 year with highest avg content release!
SELECT
country,
release_year,
COUNT(show_id) as total_release,
ROUND(
COUNT(show_id)::numeric/
(SELECT COUNT(show_id) FROM netflix_titles WHERE country = 'India')::numeric * 100
,2
)
as avg_release
FROM netflix_titles
WHERE country = 'India'
GROUP BY country, 2
ORDER BY avg_release DESC
LIMIT 5;Objective: Calculate and rank years by the average number of content releases by India.
SELECT *
FROM (
SELECT
UNNEST(STRING_TO_ARRAY(listed_in,',')) as listed,
*
FROM netflix_titles
) as t1
WHERE listed = 'Documentaries';Objective: Retrieve all movies classified as documentaries.
SELECT
*
FROM netflix_titles
WHERE director IS NULL;Objective: List content that does not have a director.
SELECT
*
FROM (
SELECT
UNNEST (STRING_TO_ARRAY(casts,',')) as actors, *
FROM netflix_titles
) WHERE actors = 'Salman Khan'
AND
release_year >= (EXTRACT(YEAR FROM CURRENT_DATE) -10);
-- OR
SELECT * FROM netflix_titles
WHERE
casts LIKE '%Salman Khan%'
AND
release_year > EXTRACT(YEAR FROM CURRENT_DATE) - 10;Objective: Count the number of movies featuring 'Salman Khan' in the last 10 years.
SELECT
UNNEST (STRING_TO_ARRAY(casts,',')) as actors,count(*)
FROM netflix_titles
WHERE country = 'India'
GROUP BY 1
ORDER BY 2 desc
LIMIT 10;Objective: Identify the top 10 actors with the most appearances in Indian-produced movies.
SELECT
*,
CASE
WHEN description ILIKE '%kill%'
OR
description ILIKE '%violence%' THEN 'Bad_Content'
ELSE 'Good_Content'
END category
FROM netflix_titles;Objective: Categorize content as 'Bad' if it contains 'kill' or 'violence' and 'Good' otherwise. Count the number of items in each category.
- Content Distribution: The dataset contains a diverse range of movies and TV shows with varying ratings and genres.
- Common Ratings: Insights into the most common ratings provide an understanding of the content's target audience.
- Geographical Insights: The top countries and the average content releases by India highlight regional content distribution.
- Content Categorization: Categorizing content based on specific keywords helps in understanding the nature of content available on Netflix.
This analysis provides a comprehensive view of Netflix's content and can help inform content strategy and decision-making.
Developed by psyccho00
