Movie recommendation system (content-based) using TF-IDF on overviews and a feature “soup” (keywords, cast, director, genres) + similarity (linear_kernel/cosine). Includes EDA and examples.

JulianAndresPrego/Movie-Recommendation-System

🎬 Movie Recommendation System — Content-Based

👥 Contributors

Julián Prego · María Remírez · Jesús Roldán

🇺🇸 English

This project implements a content-based movie recommendation system. We provide two recommenders:

  1. Overview-based TF-IDF: recommends the 10 most similar movies by comparing TF-IDF vectors of the textual overviews with scikit-learn's linear_kernel.
  2. Feature “soup”: combines keywords, cast, director, and genres into a single text field and uses cosine similarity to reorder and refine the top results of the first model.
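
A minimal sketch of the first recommender on a toy corpus may help here. The four overviews are invented for illustration and are not taken from the dataset. It also shows why linear_kernel is enough: TfidfVectorizer L2-normalizes each row by default, so the plain dot product gives the same numbers as cosine_similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy overviews, invented for illustration (not taken from the dataset).
overviews = [
    "A thief enters dreams to steal corporate secrets.",
    "A hacker learns that reality is a computer simulation.",
    "A crew of thieves plans a heist inside shared dreams.",
    "Two friends open a bakery in a small coastal town.",
]

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so the dot product of two rows already equals their cosine similarity.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(overviews)

sim_linear = linear_kernel(tfidf_matrix, tfidf_matrix)
sim_cosine = cosine_similarity(tfidf_matrix, tfidf_matrix)
same = np.allclose(sim_linear, sim_cosine)

# Most similar overview to the first one, excluding the first one itself.
scores = sim_linear[0].copy()
scores[0] = -1.0
best_match = int(np.argmax(scores))
```

Here only the third overview shares a term ("dreams") with the first, so it is the one selected. There is no stemming in this sketch, so "thief" and "thieves" count as different terms.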

🔍 Dataset & EDA

  • Source: two files with film information, merged on a shared id. Columns include: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count.
  • Language distribution: ~93.8% English vs 6.2% others (French/Spanish next most common).
  • Status/rating/budget note: unreleased titles are excluded; films rated above 5 tend to have higher average budgets (a descriptive observation, not a causal claim).
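
The merge and the three EDA observations above can be sketched with pandas as follows. The two frames are tiny invented stand-ins for the source files, and the status values "Released" and "Rumored" are assumptions about how the status column is coded. The sketch also shows where title_x and title_y come from: both inputs carry a title column, so the merge adds its default suffixes.

```python
import pandas as pd

# Toy stand-ins for the two source files; all values are invented.
movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "original_language": ["en", "en", "fr", "en"],
    "status": ["Released", "Released", "Released", "Rumored"],  # assumed coding
    "budget": [100, 20, 40, 5],
    "vote_average": [7.1, 4.8, 6.3, 5.5],
})
credits = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "cast": ["[]", "[]", "[]", "[]"],
})

# Both frames have a "title" column, so merge applies the default
# suffixes and produces title_x (left frame) and title_y (right frame).
df = movies.merge(credits, on="id")

# Keep released titles only.
released = df[df["status"] == "Released"]

# Share of each original language, in percent.
language_share = released["original_language"].value_counts(normalize=True) * 100

# Mean budget for films rated above 5 versus the rest (descriptive only).
rating_band = (released["vote_average"] > 5).map({True: "above_5", False: "5_or_below"})
mean_budget_by_band = released.groupby(rating_band)["budget"].mean()
```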

🧰 Methods

  • Preprocessing (Overview model): drop rows without overview, build TF-IDF vectors, compute pairwise similarity with linear_kernel, and select top-N (heapsort).
  • Preprocessing (Soup model): convert stringified lists to Python lists, normalize tokens, build the soup (keywords + cast + director + genres), compute cosine similarity, and reorder the initial candidates.
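
The soup steps could look roughly like the sketch below. Several details are assumptions rather than facts from the notebook: the metadata cells are taken to be stringified lists of dicts with "name" and "job" keys, the vectorizer is a plain CountVectorizer, and the helper names (names, director, normalize, build_soup) are hypothetical. The rows are invented.

```python
from ast import literal_eval

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows, invented for illustration. The list-of-dicts layout with
# "name" and "job" keys is an assumption about how the metadata is stored.
df = pd.DataFrame({
    "title_x": ["Film A", "Film B", "Film C"],
    "keywords": ['[{"name": "dream"}, {"name": "heist"}]',
                 '[{"name": "dream"}]',
                 '[{"name": "baking"}]'],
    "cast": ['[{"name": "Ann Lee"}]',
             '[{"name": "Ann Lee"}]',
             '[{"name": "Bo Diaz"}]'],
    "crew": ['[{"name": "Sam Roe", "job": "Director"}]',
             '[{"name": "Sam Roe", "job": "Director"}]',
             '[{"name": "Kim Poe", "job": "Director"}]'],
    "genres": ['[{"name": "Science Fiction"}]',
               '[{"name": "Thriller"}]',
               '[{"name": "Comedy"}]'],
})

def names(cell):
    """Parse a stringified list of dicts and return the "name" values."""
    return [item["name"] for item in literal_eval(cell)]

def director(cell):
    """Return the first crew member whose job is Director, or an empty string."""
    for member in literal_eval(cell):
        if member.get("job") == "Director":
            return member["name"]
    return ""

def normalize(token):
    """Lower-case and remove spaces so a multi-word name becomes one token."""
    return token.lower().replace(" ", "")

def build_soup(row):
    """Join keywords, cast, director and genres into one space-separated string."""
    tokens = (names(row["keywords"]) + names(row["cast"])
              + [director(row["crew"])] + names(row["genres"]))
    return " ".join(normalize(t) for t in tokens if t)

df["soup"] = df.apply(build_soup, axis=1)

# Raw counts keep every shared person or genre equally weighted.
soup_matrix = CountVectorizer().fit_transform(df["soup"])
soup_sim = cosine_similarity(soup_matrix, soup_matrix)
```

Removing the spaces inside names keeps "Sam Roe" and a hypothetical "Sam Poe" from matching on the shared first name.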

📦 Project structure

/Proyecto_Final.ipynb      # Jupyter notebook with the full pipeline
/Massive Data Mining Project.pptx  # Slides summary
/img/movies.jpg            # (optional) banner image for README

▶️ How to run

  1. Open Proyecto_Final.ipynb and run the cells in order (requires Python 3.x, scikit-learn, pandas, and numpy).
  2. Provide a movie title to the recommender function to get top-10 similar titles (overview model), then see the re-rank with the soup model.
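
The two-stage flow in step 2 can be sketched as a single function: pick candidates by overview similarity, then reorder only those candidates by soup similarity. The function name recommend_then_rerank and the similarity values are hypothetical; the notebook may organize this differently.

```python
import numpy as np

def recommend_then_rerank(idx, overview_sim, soup_sim, n=10):
    """Pick the n nearest films by overview similarity, then reorder them by soup similarity."""
    n = min(n, overview_sim.shape[0] - 1)
    overview_scores = overview_sim[idx].astype(float)
    overview_scores[idx] = -np.inf  # never recommend the query film itself
    candidates = np.argsort(-overview_scores, kind="stable")[:n]
    order = np.argsort(-soup_sim[idx, candidates], kind="stable")
    return candidates[order].tolist()

# Five hypothetical films; only row 0 of each matrix is filled in with
# invented similarities, since the query below is film 0.
titles = ["Film A", "Film B", "Film C", "Film D", "Film E"]
title_to_index = {title: i for i, title in enumerate(titles)}

overview_sim = np.eye(5)
overview_sim[0] = [1.0, 0.6, 0.5, 0.1, 0.4]
soup_sim = np.eye(5)
soup_sim[0] = [1.0, 0.2, 0.7, 0.9, 0.3]

reranked = recommend_then_rerank(title_to_index["Film A"], overview_sim, soup_sim, n=3)
reranked_titles = [titles[i] for i in reranked]
```

Film D has the highest soup similarity to the query but is not returned, because it never made the overview-based candidate list. That is the difference between re-ranking candidates and ranking the whole catalogue by soup similarity.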

🧪 Example (illustrative)

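# Assumes df, tfidf_matrix and soup_matrix already exist (built earlier in the notebook),
# with one matrix row per row of df, in the same order.
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

titles = df['title_x'].reset_index(drop=True)  # or the normalized title column present in your dataset
title_to_index = {title: i for i, title in enumerate(titles)}  # row position of each title
idx = title_to_index["Inception"]

# Overview-based recommendations (TF-IDF + linear_kernel)
sim_scores = list(enumerate(linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()))
# Highest score first; the query film normally ranks first, so skip position 0.
top10 = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
recommended = [titles.iloc[i] for i, _ in top10]

# Soup-based ranking (cosine similarity on combined features)
sim_scores_soup = list(enumerate(cosine_similarity(soup_matrix[idx], soup_matrix).flatten()))
top10_soup = sorted(sim_scores_soup, key=lambda x: x[1], reverse=True)[1:11]
recommended_soup = [titles.iloc[i] for i, _ in top10_soup]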
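
The example sorts every score and then slices off the first ten. A heap-based selection avoids sorting the whole list, which is in the spirit of the heapsort note under Methods. The sketch below uses heapq.nlargest from the standard library; this is one possible approach and not necessarily what the notebook does, and the helper name top_n is hypothetical.

```python
import heapq

def top_n(sim_scores, query_idx, n=10):
    """Return the n (index, score) pairs with the highest score, skipping the query itself."""
    others = ((i, s) for i, s in enumerate(sim_scores) if i != query_idx)
    return heapq.nlargest(n, others, key=lambda pair: pair[1])

# Invented scores for five films; film 0 is the query.
scores = [1.0, 0.2, 0.9, 0.0, 0.5]
top2 = top_n(scores, query_idx=0, n=2)
```

Skipping the query by index is also a little safer than slicing with [1:11]: if another film has an identical overview and ties with the query, the slice may drop that film and keep the query in the results.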

✅ What you get

  • Two complementary content-based recommenders.
  • Clean EDA insights (language mix, status filters, descriptive rating/budget note).
  • Clear notebook to reproduce the pipeline end-to-end.

⚠️ Notes & limitations

  • Purely content-based (no user ratings/behavior).
  • Quality depends on overview and metadata completeness.
  • Consider deduplication/title normalization to handle remakes and aliases.
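
As a rough illustration of the last point, titles can be normalized before building the lookup, and the lookup can keep every row that shares a normalized title instead of silently keeping one. The normalization rules and both function names here are assumptions chosen for the sketch.

```python
import re
import unicodedata

def normalize_title(title):
    """Fold case and accents, drop a trailing "(YYYY)" and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"\(\d{4}\)\s*$", "", text.casefold())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def build_title_index(titles):
    """Map each normalized title to every row position that shares it."""
    index = {}
    for position, title in enumerate(titles):
        index.setdefault(normalize_title(title), []).append(position)
    return index

# Invented catalogue with an alias pair and a remake pair.
titles = ["Amélie", "amelie ", "King Kong (1933)", "King Kong (2005)"]
title_index = build_title_index(titles)
```

A key that maps to more than one row signals an alias or a remake, so the caller can disambiguate, for example with release_date.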

🚀 Future work

  • Hybrid with collaborative filtering (user/item CF).
  • Neural text embeddings (e.g., transformer encoders) for richer semantics.
  • Bias control (e.g., language/popularity priors) and diversity in top-N.
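
For the diversity item, one commonly used option is maximal marginal relevance (MMR), which greedily trades similarity to the query against similarity to the items already selected. The sketch below is an illustration of that technique on invented numbers, not part of the current notebook; the function name mmr and the lam weight are assumptions.

```python
import numpy as np

def mmr(query_idx, sim, n=10, lam=0.7):
    """Greedy maximal marginal relevance over a square similarity matrix."""
    candidates = [i for i in range(sim.shape[0]) if i != query_idx]
    selected = []
    while candidates and len(selected) < n:
        def score(i):
            # Penalize items that resemble something already selected.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * sim[query_idx, i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Invented similarities: films 1 and 2 are near-duplicates of each other.
sim = np.array([
    [1.00, 0.90, 0.85, 0.50],
    [0.90, 1.00, 0.95, 0.10],
    [0.85, 0.95, 1.00, 0.10],
    [0.50, 0.10, 0.10, 1.00],
])
diverse_top2 = mmr(query_idx=0, sim=sim, n=2, lam=0.5)
```

A plain similarity ranking would return films 1 and 2 here. With lam=0.5 the second slot goes to film 3 instead, because film 2 is almost identical to the film already chosen.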

🇮🇹 Italiano

Progetto di sistema di raccomandazione di film content-based con due approcci:

  • TF-IDF sulla sinossi (overview) e similarità con linear_kernel per restituire i 10 film più simili.
  • “Zuppa” di feature (keywords, cast, regista, generi) con cosine similarity per riordinare/rifinire i risultati.

Dataset & EDA.

Merge di due file sul campo id; colonne principali: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count. Inoltre, ~93,8% dei film sono in inglese (francese e spagnolo seguono) e si escludono titoli non rilasciati; i film con rating >5 mostrano budget medi più alti (descrittivo).

Metodi.

Overview: rimozione record senza testo, TF-IDF, linear_kernel, selezione top-N (heapsort). Zuppa: parsing stringhe→liste, normalizzazione token, costruzione soup, cosine similarity e riordino.

Come eseguire.

Apri Proyecto_Final.ipynb, installa dipendenze (pandas, numpy, scikit-learn) e lancia le celle. Cerca un titolo per ottenere i top-10 e il riordino con la zuppa.
