This project builds a content-based movie recommendation system from a dataset of movie attributes. The goal is to recommend movies similar to a given movie based on content similarity.
Dataset Link - https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
The dataset includes the following columns: budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, runtime, spoken_languages, status, tagline, title, vote_average, and vote_count.
- Handling Missing Values: Missing values were identified and handled. Specifically:
  - `overview` had 3 missing values, which were removed.
  - Other columns had no missing values.
- No duplicate entries were found in the dataset.
- Data Transformation:
  - Genres, Keywords, Cast, and Crew: Extracted and converted relevant information to a list format.
  - Tags Creation: Combined `overview`, `genres`, `keywords`, `cast`, and `crew` into a single `tags` column for each movie.
  - Text Normalization: Applied stemming to the `tags` column using the Porter Stemmer to standardize the text data.
  - Count Vectorization: Converted the `tags` into numerical features using `CountVectorizer` with a maximum of 5000 features and excluding common stop words.
  - Cosine Similarity: Calculated pairwise cosine similarity between movies based on their vectorized tags to quantify content similarity.
- Functionality:
  - The `recommend` function finds the most similar movies to a given movie based on cosine similarity scores.
  - It excludes the movie itself and returns the top 5 most similar movies.
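
The vectorization, similarity, and lookup steps described above can be illustrated on a toy scale. What follows is a minimal, dependency-free sketch, not the project's code. It stands in a plain `collections.Counter` for `CountVectorizer`, skips stemming, stop-word removal, and the 5000-feature cap, and uses made-up titles and tags. The names `TAGS`, `VECTORS`, `count_vector`, and `cosine` are hypothetical, and the `top_n` parameter of `recommend` is an assumption.

```python
import math
import re
from collections import Counter

# Toy "tags" strings (hypothetical data, not taken from the TMDB dataset).
TAGS = {
    "Space Battle": "alien space war future soldier",
    "Star Raiders": "alien space ship war pilot",
    "Garden Story": "family garden summer friendship",
    "Robot Dawn": "robot future war soldier",
}


def count_vector(text):
    """Bag-of-words counts for one document (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One count vector per movie, built once up front.
VECTORS = {title: count_vector(tags) for title, tags in TAGS.items()}


def recommend(title, top_n=5):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    if title not in VECTORS:
        raise KeyError(f"unknown title: {title!r}")
    target = VECTORS[title]
    scores = [
        (cosine(target, vec), other)
        for other, vec in VECTORS.items()
        if other != title
    ]
    # Sort by descending similarity; ties keep insertion order.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scores[:top_n]]


print(recommend("Space Battle", top_n=2))
```

Precomputing the vectors once keeps each lookup to a single pass over the catalogue. With a full dataset, the same idea is usually expressed as one precomputed similarity matrix that `recommend` indexes by row.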
For the movie "Avatar", the system recommended:
- "Titan A.E."
- "Small Soldiers"
- "Independence Day"
- "Ender's Game"
- "Aliens vs Predator: Requiem"
This recommendation system suggests movies whose content is similar to the input movie, using only textual analysis of each movie's tags.