Julián Prego · María Remírez · Jesús Roldán
This project implements a content-based movie recommendation system. We provide two recommenders:
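- Overview-based TF-IDF: recommends the 10 most similar movies by comparing TF-IDF vectors of each film's textual overview with scikit-learn's linear_kernel. With scikit-learn's default L2-normalized TF-IDF vectors, this dot product equals cosine similarity.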
- Feature “soup”: combines keywords, cast, director, and genres into a single text field and uses cosine similarity to reorder/refine the top results.
- Source: two files with film information, merged on a shared id. Columns include: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count.
- Language distribution: ~93.8% English vs 6.2% others (French/Spanish next most common).
- Status/rating/budget note: unreleased titles are excluded; films with a rating above 5 tend to have higher average budgets (a descriptive observation, not a causal claim).
- Preprocessing (Overview model): drop rows without overview, build TF-IDF vectors, compute pairwise similarity with linear_kernel, and select top-N (heapsort).
- Preprocessing (Soup model): convert stringified lists to Python lists, normalize tokens, build the soup (keywords + cast + director + genres), compute cosine similarity, and reorder the initial candidates.
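
The soup preprocessing steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: it assumes each metadata field is a stringified list of dicts with a "name" key (plus a "job" key for crew entries), and the helper names `parse_names`, `get_director`, `normalize`, and `make_soup`, as well as the three-actor cap, are hypothetical.

```python
import ast


def _load(raw):
    """Turn a stringified list into a Python list; return [] for anything unusable."""
    if isinstance(raw, list):
        return raw
    if not isinstance(raw, str):  # e.g. NaN for a missing field
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


def parse_names(raw, limit=None):
    """Extract the 'name' values from a stringified list of dicts."""
    names = [d["name"] for d in _load(raw) if isinstance(d, dict) and "name" in d]
    return names if limit is None else names[:limit]


def get_director(raw_crew):
    """Return the first crew member whose job is 'Director', or '' if none."""
    for member in _load(raw_crew):
        if isinstance(member, dict) and member.get("job") == "Director":
            return member.get("name", "")
    return ""


def normalize(token):
    """Lowercase and remove spaces so multi-word names become a single token."""
    return token.lower().replace(" ", "")


def make_soup(row, cast_limit=3):
    """Combine keywords + cast + director + genres into one text field."""
    director = get_director(row["crew"])
    tokens = (
        parse_names(row["keywords"])
        + parse_names(row["cast"], limit=cast_limit)  # cap is an assumption
        + ([director] if director else [])
        + parse_names(row["genres"])
    )
    return " ".join(normalize(t) for t in tokens)
```

With a pandas DataFrame, a function like this could be applied row by row, for example `df["soup"] = df.apply(make_soup, axis=1)`. Removing spaces keeps two people who share a first name from being treated as the same token.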
```
/Proyecto_Final.ipynb              # Jupyter notebook with the full pipeline
/Massive Data Mining Project.pptx  # Slides summary
/img/movies.jpg                    # (optional) banner image for README
```

- Open Proyecto_Final.ipynb and run the cells in order (Python 3.x, scikit-learn, pandas, numpy).
- Provide a movie title to the recommender function to get top-10 similar titles (overview model), then see the re-rank with the soup model.
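
A recommender function of the kind described here could look roughly like the sketch below: it takes a title, pulls the top candidates from the overview model, and reorders them with the soup model. The names `build_index` and `recommend`, the `n_candidates` parameter, the precomputed `soup` column, and the toy films are all assumptions for illustration, not taken from the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel


def build_index(df, title_col="title_x"):
    """Vectorize overviews and soups; return everything recommend() needs."""
    df = df.dropna(subset=["overview"]).reset_index(drop=True)
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(df["overview"])
    soup_matrix = CountVectorizer().fit_transform(df["soup"])
    title_to_index = {}
    for i, title in enumerate(df[title_col]):
        title_to_index.setdefault(title, i)  # keep the first row for duplicate titles
    return df[title_col], tfidf_matrix, soup_matrix, title_to_index


def recommend(title, titles, tfidf_matrix, soup_matrix, title_to_index, n_candidates=10):
    """Stage 1: overview candidates. Stage 2: re-rank them by soup similarity."""
    idx = title_to_index[title]
    overview_scores = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()
    order = (-overview_scores).argsort(kind="stable")  # best first
    candidates = [int(i) for i in order if i != idx][:n_candidates]
    soup_scores = cosine_similarity(soup_matrix[idx], soup_matrix[candidates]).ravel()
    reranked = [candidates[j] for j in (-soup_scores).argsort(kind="stable")]
    return titles.iloc[reranked].tolist()


# Toy data (hypothetical films) to exercise the two stages.
toy = pd.DataFrame({
    "title_x": ["Dream Thief", "Bank Heist", "Family Secrets", "Paris Kitchen", "No Overview"],
    "overview": [
        "a thief steals secrets through dreams",
        "a thief plans a bank heist",
        "dreams reveal secrets of a family",
        "a chef opens a restaurant in paris",
        None,  # dropped, as in the overview preprocessing step
    ],
    "soup": [
        "heist bank crime directorx",
        "heist bank crime directory",
        "drama family directorz",
        "food comedy directorz",
        "placeholder",
    ],
})
index = build_index(toy)
print(recommend("Dream Thief", *index, n_candidates=2))
```

Keeping the candidate set small means the soup model only refines what the overview model already considers close, which matches the "reorder/refine" role described above.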
```python
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Assumes df, tfidf_matrix, soup_matrix and title_to_index were built earlier in the notebook.
titles = df['title_x']  # or the normalized title column present in your dataset
idx = title_to_index["Inception"]

# Overview-based recommendations (TF-IDF + linear_kernel)
sim_scores = list(enumerate(linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()))
# [1:11] skips the first hit, which is normally the query film itself
top10 = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
recommended = [titles.iloc[i] for i, _ in top10]  # positional lookup matches matrix rows

# Soup-based re-ranking (cosine similarity on combined features)
sim_scores_soup = list(enumerate(cosine_similarity(soup_matrix[idx], soup_matrix).flatten()))
top10_soup = sorted(sim_scores_soup, key=lambda x: x[1], reverse=True)[1:11]
recommended_soup = [titles.iloc[i] for i, _ in top10_soup]
```

- Two complementary content-based recommenders.
- Clean EDA insights (language mix, status filters, descriptive rating/budget note).
- Clear notebook to reproduce the pipeline end-to-end.
- Purely content-based (no user ratings/behavior).
- Quality depends on overview and metadata completeness.
- Consider deduplication/title normalization to handle remakes and aliases.
- Hybrid with collaborative filtering (user/item CF).
- Neural text embeddings (e.g., transformer encoders) for richer semantics.
- Bias control (e.g., language/popularity priors) and diversity in top-N.
- TF-IDF on the synopsis (overview) with linear_kernel similarity to return the 10 most similar films.
- Feature “soup” (keywords, cast, director, genres) with cosine similarity to reorder/refine the results.
Two files are merged on the id field; main columns: id, title_x, cast, crew, budget, genres, homepage, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title_y, vote_average, vote_count. In addition, ~93.8% of the films are in English (French and Spanish follow) and unreleased titles are excluded; films rated >5 show higher average budgets (descriptive).
Overview: remove records without text, TF-IDF, linear_kernel, top-N selection (heapsort). Soup: parse strings into lists, normalize tokens, build the soup, cosine similarity, and reorder.
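
The top-N selection step can be illustrated with a heap. The text only says "heapsort" without naming a function, so this sketch swaps in the standard library's `heapq.nlargest` as one heap-based way to pick the best N scores without sorting the whole list; the `top_n` helper and its `exclude` parameter are hypothetical.

```python
import heapq


def top_n(scores, n=10, exclude=None):
    """Return the n highest (index, score) pairs, best first.

    `exclude` is the index of the query film, which would otherwise
    rank first with a similarity of 1.0 against itself.
    """
    pairs = ((i, s) for i, s in enumerate(scores) if i != exclude)
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Toy similarity row: index 0 is the query film itself.
similarities = [1.0, 0.2, 0.9, 0.5, 0.7]
print(top_n(similarities, n=3, exclude=0))
```

For a catalogue of a few thousand films the difference from a full sort is small, but excluding the query by index is more robust than slicing off the first result when duplicates score the same.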
Open Proyecto_Final.ipynb, install the dependencies (pandas, numpy, scikit-learn), and run the cells. Search for a title to get the top-10 and the soup re-ranking.