🎬 Movie Recommendation System — Content-Based

👥 Contributors

Julián Prego · María Remírez · Jesús Roldán

🇺🇸 English

This project implements a content-based movie recommendation system. We provide two recommenders:

Overview-based TF-IDF: recommends the 10 most similar movies using the textual overview and a vector space similarity (Scikit-learn linear_kernel).
Feature “soup”: combines keywords, cast, director, and genres into a single text field and uses cosine similarity to reorder/refine the top results.
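
As a minimal sketch of the overview model, the snippet below runs TF-IDF and linear_kernel on three made-up overviews (the texts and variable names are illustrative assumptions, not the notebook's data). It also shows why linear_kernel is enough: scikit-learn's TfidfVectorizer L2-normalizes each row by default, so the plain dot product already equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy overviews standing in for the dataset's `overview` column (illustrative only).
overviews = [
    "A thief steals secrets by entering the dream of a target.",
    "A hacker discovers that reality is a simulated dream world.",
    "A young lion prince flees his kingdom after his father dies.",
]

# TfidfVectorizer applies L2 normalization to every row by default (norm="l2").
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(overviews)

# On unit-length rows, the dot product is the cosine similarity.
sim_linear = linear_kernel(tfidf_matrix, tfidf_matrix)
sim_cosine = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(np.allclose(sim_linear, sim_cosine))
print(sim_linear.round(2))
```

The first two overviews share the word "dream", so they score above zero against each other, while the third shares no terms with the first and scores zero.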

🔍 Dataset & EDA

Source: two files with film information, merged on a shared id. Columns include: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count.
Language distribution: ~93.8% English vs 6.2% others (French/Spanish next most common).
Status/rating/budget note: unreleased titles are excluded; films rated >5 tend to have higher average budgets (descriptive).
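
The EDA steps above can be sketched with pandas on toy data. The frame names, the values, and the "Released" status label are assumptions made for illustration. The title_x / title_y pair in the column list is what pandas produces by default when both inputs carry a title column.

```python
import pandas as pd

# Toy stand-ins for the two source files (column names follow the README; values are made up).
credits = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Film A", "Film B", "Film C"],
    "cast": ["[]", "[]", "[]"],
})
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Film A", "Film B", "Film C"],
    "original_language": ["en", "en", "fr"],
    "status": ["Released", "Released", "Post Production"],
})

# Both frames have a `title` column, so pandas appends its default
# suffixes and the merged frame contains title_x and title_y.
df = credits.merge(movies, on="id")

# Share of each original language, as fractions that sum to 1.
language_share = df["original_language"].value_counts(normalize=True)

# Keep only released titles (the "Released" label is an assumption about the data).
released = df[df["status"] == "Released"]

print(sorted(df.columns))
print(language_share)
print(len(released))
```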

🧰 Methods

Preprocessing (Overview model): drop rows without overview, build TF-IDF vectors, compute pairwise similarity with linear_kernel, and select top-N (heapsort).
Preprocessing (Soup model): convert stringified lists to Python lists, normalize tokens, build the soup (keywords + cast + director + genres), compute cosine similarity, and reorder the initial candidates.
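
A minimal sketch of the soup preprocessing, under stated assumptions: the stringified fields are lists of dicts with a name key (and a job key for crew), the helper names are hypothetical, and CountVectorizer is a choice made for this sketch since the README does not say which vectorizer the notebook uses.

```python
import ast

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows in an assumed layout: each field is a stringified list of dicts.
df = pd.DataFrame({
    "title_x": ["Film A", "Film B", "Film C"],
    "keywords": ['[{"name": "dream"}, {"name": "heist"}]',
                 '[{"name": "dream"}]',
                 '[{"name": "savanna"}]'],
    "cast": ['[{"name": "Ann Lee"}]', '[{"name": "Ann Lee"}]', '[{"name": "Bo Cruz"}]'],
    "crew": ['[{"job": "Director", "name": "Sam Ortiz"}]',
             '[{"job": "Director", "name": "Sam Ortiz"}]',
             '[{"job": "Director", "name": "Kim Park"}]'],
    "genres": ['[{"name": "Action"}]', '[{"name": "Action"}]', '[{"name": "Family"}]'],
})


def parse_names(text, limit=3):
    """Parse a stringified list of dicts and return up to `limit` names."""
    return [item["name"] for item in ast.literal_eval(text)][:limit]


def find_director(text):
    """Return the first crew member whose job is Director, as a list of 0 or 1 names."""
    return [m["name"] for m in ast.literal_eval(text) if m.get("job") == "Director"][:1]


def normalize(tokens):
    """Lowercase and remove spaces so 'Ann Lee' becomes the single token 'annlee'."""
    return [t.lower().replace(" ", "") for t in tokens]


def build_soup(row):
    """Join keywords, cast, director, and genres into one space-separated string."""
    parts = (parse_names(row["keywords"]) + parse_names(row["cast"])
             + find_director(row["crew"]) + parse_names(row["genres"]))
    return " ".join(normalize(parts))


df["soup"] = df.apply(build_soup, axis=1)

# Raw token counts (an assumption of this sketch), then pairwise cosine similarity.
soup_matrix = CountVectorizer().fit_transform(df["soup"])
soup_sim = cosine_similarity(soup_matrix)

print(df["soup"].tolist())
print(soup_sim.round(2))
```

Removing spaces inside names keeps "Ann Lee" and "Ann Smith" from matching on the shared first name, which is the usual reason for this kind of token normalization.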

📦 Project structure

/Proyecto_Final.ipynb      # Jupyter notebook with the full pipeline
/Massive Data Mining Project.pptx  # Slides summary
/img/movies.jpg            # (optional) banner image for README

▶️ How to run

Open Proyecto_Final.ipynb and run the cells in order (requires Python 3.x, scikit-learn, pandas, and numpy).
Provide a movie title to the recommender function to get top-10 similar titles (overview model), then see the re-rank with the soup model.
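
The recommender call described above could look like the following sketch. The function name recommend, its parameters, and the toy similarity matrix are hypothetical; the notebook's own function may differ. The README mentions heapsort for top-N selection; this sketch swaps in heap-based selection through heapq.nlargest, which avoids sorting every score.

```python
import heapq

import numpy as np


def recommend(title, titles, sim_matrix, n=10):
    """Return the n titles most similar to `title`, best first."""
    idx = titles.index(title)  # raises ValueError when the title is unknown
    scores = sim_matrix[idx]
    # Skip the query movie itself, then keep only the n best (position, score) pairs.
    candidates = ((i, s) for i, s in enumerate(scores) if i != idx)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [titles[i] for i, _ in best]


# Hypothetical titles and a symmetric similarity matrix for four movies.
titles = ["Film A", "Film B", "Film C", "Film D"]
sim_matrix = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])

print(recommend("Film A", titles, sim_matrix, n=2))  # ['Film C', 'Film B']
```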

🧪 Example (illustrative)

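# Assumes df, tfidf_matrix, and soup_matrix were built earlier in the notebook,
# with matrix rows in the same order as the rows of df.
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

titles = df['title_x']  # or the normalized title column present in your dataset
# Map each title to its row position (for duplicate titles the last row wins)
title_to_index = {title: pos for pos, title in enumerate(titles)}
idx = title_to_index["Inception"]

# Overview-based recommendations (TF-IDF + linear_kernel)
sim_scores = list(enumerate(linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()))
# Sort by score, then skip position 0, which is normally the query movie itself
top10 = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
recommended = [titles.iloc[i] for i, _ in top10]  # .iloc: row positions, not index labels

# Soup-based ranking (cosine similarity on combined features)
sim_scores_soup = list(enumerate(cosine_similarity(soup_matrix[idx], soup_matrix).flatten()))
top10_soup = sorted(sim_scores_soup, key=lambda x: x[1], reverse=True)[1:11]
recommended_soup = [titles.iloc[i] for i, _ in top10_soup]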
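
The soup part of the example ranks the whole catalogue, whereas the soup model is described as reordering the initial candidates. One way to do that, as a sketch with a hypothetical rerank helper and made-up scores, is to keep the overview model's candidate positions and sort only those by soup similarity:

```python
import numpy as np


def rerank(candidates, idx, soup_sim):
    """Reorder candidate row positions by soup similarity to movie `idx`, best first."""
    return sorted(candidates, key=lambda i: soup_sim[idx, i], reverse=True)


# Hypothetical soup similarity matrix for four movies.
soup_sim = np.array([
    [1.0, 0.1, 0.3, 0.8],
    [0.1, 1.0, 0.5, 0.2],
    [0.3, 0.5, 1.0, 0.4],
    [0.8, 0.2, 0.4, 1.0],
])

# Row positions returned by the overview model for movie 0, in its order.
overview_candidates = [1, 2, 3]

print(rerank(overview_candidates, 0, soup_sim))  # [3, 2, 1]
```

This keeps the candidate set fixed by the overview text and lets cast, director, keywords, and genres decide only the final order.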

✅ What you get

Two complementary content-based recommenders.
Clean EDA insights (language mix, status filters, descriptive rating/budget note).
Clear notebook to reproduce the pipeline end-to-end.

⚠️ Notes & limitations

Purely content-based (no user ratings/behavior).
Quality depends on overview and metadata completeness.
Consider deduplication/title normalization to handle remakes and aliases.
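
For the title normalization suggested above, a small sketch using only the standard library might look like this. The function name normalize_title and the exact rules (case folding, accent stripping, punctuation removal) are assumptions rather than part of the project.

```python
import re
import unicodedata


def normalize_title(title):
    """Fold case and accents, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    # Drop combining marks so that "é" becomes "e".
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return re.sub(r"\s+", " ", text).strip()


print(normalize_title("  Amélie!  "))             # amelie
print(normalize_title("Spider-Man: Homecoming"))  # spider man homecoming
```

Remakes that share a title would still collapse to the same key, so pairing the normalized title with the release year is one option for telling them apart.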

🚀 Future work

Hybrid with collaborative filtering (user/item CF).
Neural text embeddings (e.g., transformer encoders) for richer semantics.
Bias control (e.g., language/popularity priors) and diversity in top-N.

🇮🇹 Italiano

Progetto di sistema di raccomandazione di film content-based con due approcci:

TF-IDF sulla sinossi (overview) e similarità con linear_kernel per restituire i 10 film più simili.
“Zuppa” di feature (keywords, cast, regista, generi) con cosine similarity per riordinare/rifinire i risultati.

Dataset & EDA.

Merge di due file sul campo id; colonne principali: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count. Inoltre, ~93,8% dei film sono in inglese (francese e spagnolo seguono) e si escludono titoli non rilasciati; i film con rating >5 mostrano budget medi più alti (descrittivo).

Metodi.

Overview: rimozione record senza testo, TF-IDF, linear_kernel, selezione top-N (heapsort). Zuppa: parsing stringhe→liste, normalizzazione token, costruzione soup, cosine similarity e riordino.

Come eseguire.

Apri Proyecto_Final.ipynb, installa dipendenze (pandas, numpy, scikit-learn) e lancia le celle. Cerca un titolo per ottenere i top-10 e il riordino con la zuppa.
