Microsoft Studios Movie Analysis
Authors: Danielle Rossman, Nick Indorf, Jeff Marvel
Summary:
This project analyzes movie data from the last ten years (2010-2019) to determine what characteristics have driven box office success as measured by Return on Investment percentage (ROI%). The project culminated in three business recommendations for the launch of Microsoft's new movie studio. Presentation here
Data and sources:
- IMDB datasets: multiple datasets on general movie info, genre, ratings, and staff (director, actor, etc.) from the past 10 years.
- The Numbers: data on profitability (worldwide gross, production budget) for each movie.
- MovieTweeting: repository of "well-structured" tweets that contain a movie review.
Navigating the repository:
In the main directory:
data_cleaning_final.ipynb: Run this first. Notebook that consolidates the datasets and performs other cleaning steps. Outputs the clean CSV files used in the final analysis.
Movie_analysis_final.ipynb: Notebook that performs the analysis underlying each recommendation. Uses the final CSVs produced by the data cleaning notebook, provides a rationale for each recommendation, and prints and saves the final graphs used in the presentation.
In the zippedData directory:
The original datasets for IMDB:
- imdb.title.basics.csv.gz
- imdb.title.ratings.csv.gz
- imdb.title.principals.csv.gz
- imdb.title.crew.csv.gz
- imdb.name.basics.csv.gz
The original datasets for The Numbers:
tn.movie_budgets.csv.gz
The original datasets for MovieTweeting:
- movies.dat
- ratings.dat
In the cleanData directory:
The cleaned datasets described below.
In the images directory:
Figures generated in the recommendation analyses, used in the final presentation.
Approach and methodology:
- IMDB sub-datasets were first merged together to create a large composite IMDB dataset.
  - saved as cleanData/cleanIMDB/imdb_comp.csv
  - name keys saved as cleanData/cleanIMDB/imdb_namekey.csv
- The Numbers dataset was processed to adjust profit and cost for inflation and to calculate ROI%.
  - saved in the cleanData directory as theNumbers_clean.csv
- The IMDB composite and The Numbers were merged on a concatenation of year and movie title (so that different movies sharing a title are not matched to each other).
  - saved in the cleanData directory as imdb_combined_prof.csv
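The year-and-title merge described above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the toy titles, the lowercase/strip normalization, and the ROI% definition (profit divided by production budget) are all assumptions.

```python
import pandas as pd

# Toy stand-ins for the two tables; titles, values, and column names are hypothetical.
imdb = pd.DataFrame({
    "primary_title": ["Night Shift", "Night Shift", "Quiet Town"],
    "start_year": [2010, 2018, 2015],
})
numbers = pd.DataFrame({
    "movie": ["Night Shift", "Quiet Town"],
    "year": [2018, 2015],
    "production_budget": [1_000_000, 2_000_000],
    "worldwide_gross": [5_000_000, 3_000_000],
})

def make_key(title: pd.Series, year: pd.Series) -> pd.Series:
    """Concatenate year and a normalized title into a single merge key."""
    return year.astype(str) + "_" + title.str.strip().str.lower()

imdb["key"] = make_key(imdb["primary_title"], imdb["start_year"])
numbers["key"] = make_key(numbers["movie"], numbers["year"])

# Assumed ROI definition: profit as a percentage of the production budget.
numbers["roi_pct"] = (
    (numbers["worldwide_gross"] - numbers["production_budget"])
    / numbers["production_budget"] * 100
)

# An inner join keeps only movies present in both sources.
combined = imdb.merge(
    numbers[["key", "production_budget", "worldwide_gross", "roi_pct"]],
    on="key",
    how="inner",
)
print(combined[["key", "roi_pct"]])
```

Because the key includes the year, the 2010 "Night Shift" row does not pick up the budget data that belongs to the 2018 movie of the same name.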
Further data cleaning steps were taken, including:
- Removing a handful of null values
- Removing outliers
  - ROI% > 6000 (2 movies) - skewed dataset too highly
  - Worldwide Gross < $100K (52 movies) - potentially just a limited run with low scope - not relevant to Microsoft's interests
- For the analysis of Twitter reviews, removed movies that received under 50 reviews (to prevent skew in the results)
  - saved in the cleanData directory as twitter_reviews_clean.csv
This yielded a composite dataset for downstream analysis, saved in the cleanData directory as imdb_combined_prof.csv.
A histogram of the ROI distribution is saved in the images directory as comp_data_hist.png.
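The null and outlier filters listed above could look something like this in pandas. The thresholds come from the text; the column names (`roi_pct`, `worldwide_gross`) and the toy rows are assumptions made for illustration.

```python
import pandas as pd

ROI_CAP = 6000        # ROI% above this was treated as an outlier
MIN_GROSS = 100_000   # worldwide gross below this was treated as a limited run

def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, extreme ROI%, or very low worldwide gross."""
    df = df.dropna(subset=["roi_pct", "worldwide_gross"])
    keep = (df["roi_pct"] <= ROI_CAP) & (df["worldwide_gross"] >= MIN_GROSS)
    return df.loc[keep].reset_index(drop=True)

# Hypothetical rows: one valid, one ROI outlier, one limited run, one null.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "roi_pct": [100.0, 7000.0, 50.0, None],
    "worldwide_gross": [5_000_000, 10_000_000, 50_000, 1_000_000],
})
cleaned = drop_outliers(toy)
print(cleaned["title"].tolist())
```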
Recommendation #1: Horror / Thriller movies tend to perform significantly better than average at the box office.
We grouped the dataset by genre and sorted by average ROI. Horror, Mystery, and Thriller movies stand out from the pack: their ROI is 108% higher than average. This is not necessarily surprising, since Horror movies, while they may not win any Oscars, consistently draw large crowds and perform well. The below graph illustrates this relationship.
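A minimal sketch of the genre grouping, assuming genres are stored as a comma-separated string per movie and that "higher than average" means the genre's mean ROI% relative to the overall mean ROI%. The data and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical movies; a movie with several genres counts once per genre.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Horror,Thriller", "Drama", "Comedy,Drama", "Horror"],
    "roi_pct": [400.0, 100.0, 50.0, 600.0],
})

by_genre = (
    movies.assign(genre=movies["genres"].str.split(","))
    .explode("genre")                      # one row per (movie, genre) pair
    .groupby("genre")["roi_pct"]
    .agg(mean_roi="mean", n_movies="count")
    .sort_values("mean_roi", ascending=False)
)

# Assumed comparison: each genre's mean ROI% relative to the overall mean.
overall_mean = movies["roi_pct"].mean()
by_genre["pct_vs_avg"] = (by_genre["mean_roi"] / overall_mean - 1) * 100
print(by_genre)
```

Keeping `n_movies` next to the mean makes it easier to spot genres whose average rests on only a few titles.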
Recommendation #2: Target 6 "to-hire" movie cast/crew who consistently outperform their peers.
We grouped the dataset by person within each role (meaning job in the cast/crew), then sorted to find the top performers. This subset was filtered twice, once for people who have worked on at least 2 movies and once for at least 3 movies, to remove 'one-hit wonders'. A set of scatterplots was generated, with number of movies on the x-axis and ROI% on the y-axis. Then, a set of box plots was generated for the top 5 performers in each role. For all plots, the averages of number of movies and ROI% are added where applicable. In the set of box plots for 2+ movies worked, 6 cast/crew members significantly outperformed their peers.
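The per-person ranking could be sketched as below. The helper name `top_performers`, the column names, and the sample credits are hypothetical; the minimum-movie thresholds (2 and 3) and the top-5 cutoff follow the text.

```python
import pandas as pd

# Hypothetical credits table: one row per (person, movie) credit.
credits = pd.DataFrame({
    "person": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "role": ["director", "director", "director", "director", "actor", "actor"],
    "roi_pct": [300.0, 500.0, 400.0, 2000.0, 100.0, 200.0],
})

def top_performers(df: pd.DataFrame, min_movies: int = 2, top_n: int = 5) -> pd.DataFrame:
    """Rank people within each role by mean ROI%, dropping those with too few movies."""
    stats = (
        df.groupby(["role", "person"])["roi_pct"]
        .agg(n_movies="count", mean_roi="mean")
        .reset_index()
    )
    stats = stats[stats["n_movies"] >= min_movies]   # removes one-hit wonders
    return (
        stats.sort_values(["role", "mean_roi"], ascending=[True, False])
        .groupby("role")
        .head(top_n)
        .reset_index(drop=True)
    )

print(top_performers(credits, min_movies=2))
print(top_performers(credits, min_movies=3))
```

In this toy data, P2 has the highest single ROI% but only one credit, so the filter excludes them at either threshold.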
Recommendation #3: Focus on making good movies and invest in social media presence.
Details: Our dataset included two sources of reviews: (1) reviews on IMDB.com and (2) reviews from "well-structured" tweets on Twitter.
The first step of the analysis was to determine whether there is a relationship between review quality and profitability. Our analysis suggests that there is. For movies with "good" reviews (defined as 8+ on IMDB), ROI is 68% higher than average. This was particularly true for Horror / Thriller movies, which had 100% higher ROI than average. We used 8 as the cutoff for "good" because the population average is around 7; people don't appear to leave bad reviews. This result suggests that there is real value, particularly for Horror movies, in clearing the bar of an 8 or higher review.
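As a rough illustration of how such a "percent higher than average" figure can be computed, the sketch below compares the mean ROI% of movies rated 8 or above with the mean over all movies. The baseline (all movies in the table), the helper name `roi_uplift_pct`, the column names, and the numbers are assumptions, not the notebook's actual method.

```python
import pandas as pd

GOOD_RATING = 8.0  # cutoff for a "good" IMDB rating, from the text

def roi_uplift_pct(df: pd.DataFrame, mask: pd.Series) -> float:
    """Percent by which the mean ROI% of the masked rows exceeds the overall mean."""
    return (df.loc[mask, "roi_pct"].mean() / df["roi_pct"].mean() - 1) * 100

# Hypothetical ratings and ROI values.
movies = pd.DataFrame({
    "averagerating": [8.4, 6.9, 7.2, 8.1],
    "roi_pct": [300.0, 100.0, 150.0, 250.0],
})

good = movies["averagerating"] >= GOOD_RATING
uplift = roi_uplift_pct(movies, good)
print(f"ROI uplift for well-reviewed movies: {uplift:.1f}%")
```

The same helper can be applied to a genre subset by passing a table restricted to that genre, in which case the baseline becomes that genre's own average.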
The second step of the analysis was to determine whether social media engagement drove higher ROI. The hypothesis is that the total number of tweets, as a rough proxy for social media "buzz", would lead to higher ROI. The data bear this hypothesis out. Movies with high Twitter engagement (defined as 500+ well-structured tweets) materially outperformed average ROI. This is particularly true in more recent data: for movies made since 2017, high engagement on Twitter is associated with an ROI that is 115% higher than average. The punchline is that as social media becomes more and more embedded in our lives, this relationship only becomes more important.
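The engagement comparison can be sketched as follows, assuming one row per tweet review and a shared `movie_id` column (both assumptions). The 50-review minimum and the 500-tweet cutoff come from the text; the data are invented.

```python
import pandas as pd

MIN_REVIEWS = 50       # movies with fewer tweet reviews were dropped
HIGH_ENGAGEMENT = 500  # 500+ tweets counts as high engagement

# Hypothetical inputs: one row per tweet, and one row per movie with its ROI%.
tweets = pd.DataFrame({"movie_id": [1] * 600 + [2] * 60 + [3] * 10})
movies = pd.DataFrame({"movie_id": [1, 2, 3], "roi_pct": [400.0, 100.0, 900.0]})

def engagement_summary(tweets: pd.DataFrame, movies: pd.DataFrame) -> pd.Series:
    """Mean ROI% for high- and low-engagement movies, after the minimum-review filter."""
    counts = tweets.groupby("movie_id").size().rename("n_tweets").reset_index()
    df = movies.merge(counts, on="movie_id", how="inner")
    df = df[df["n_tweets"] >= MIN_REVIEWS]
    level = (df["n_tweets"] >= HIGH_ENGAGEMENT).map({True: "high", False: "low"})
    return df.groupby(level)["roi_pct"].mean()

summary = engagement_summary(tweets, movies)
print(summary)
```

Movie 3 has the highest ROI% in the toy data but only 10 tweets, so it is excluded before the comparison, mirroring the skew-prevention step described in the cleaning section.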
Conclusion:
- Horror, Mystery, and Thriller movies pay
- There is a subset of movie talent that consistently produces high-ROI movies
- Good reviews and social media engagement matter for the eventual box office success of a new movie
Further analysis:
- Interaction of Genre and Cast/Crew
- Twitter analysis with a more complete dataset, as the original was only for "well-structured" tweets