osquera/02807_Project

02807 Project

Quick Start with Demo Notebook

The easiest way to explore this project is through our interactive demo notebook:

📓 project_demo.ipynb

This Jupyter notebook demonstrates the complete data pipeline and analysis methods, including:

  • Data download and cleaning
  • Web scraping from Rotten Tomatoes
  • Frequent itemset analysis
  • Graph-based community detection
  • Collaborative filtering with similarity visualization

Running the Demo Notebook

  1. Install dependencies:

    task sync
  2. Open the notebook:

    • Open project_demo.ipynb in VS Code with the Jupyter extension, or
    • Launch Jupyter Notebook/Lab: jupyter notebook project_demo.ipynb
  3. Run cells sequentially to see the full pipeline in action

The notebook includes both full analysis commands and quick subset-based demos that run in minutes instead of hours.

Taskfile Installation

Task is a task runner that automates repetitive commands using a simple YAML configuration file (Taskfile.yml). To install it, follow the instructions at taskfile.dev.

Once installed, you can run tasks defined in Taskfile.yml using the task command.

Running Tasks

To list all available tasks:

task --list

Passing parameters to tasks:

Some tasks accept command-line arguments. Use -- to separate the task name from the arguments; everything after -- is forwarded to the underlying script:

task run-frequent-items -- --min-support 0.15 --min-confidence 0.70

To see available parameters for a task:

task run-frequent-items -- --help
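
As a rough illustration of how a script behind such a task might receive these arguments, here is a minimal argparse sketch. Only the flag names --min-support and --min-confidence come from the command above; the default values, help texts, and the build_parser helper are assumptions and may not match the project's actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the example command."""
    parser = argparse.ArgumentParser(
        description="Frequent itemset analysis (illustrative sketch)"
    )
    # Defaults below are hypothetical, chosen only for this example.
    parser.add_argument("--min-support", type=float, default=0.1,
                        help="minimum support threshold")
    parser.add_argument("--min-confidence", type=float, default=0.5,
                        help="minimum confidence threshold")
    return parser


# Roughly what the script sees for:
#   task run-frequent-items -- --min-support 0.15 --min-confidence 0.70
args = build_parser().parse_args(
    ["--min-support", "0.15", "--min-confidence", "0.70"]
)
print(args.min_support, args.min_confidence)
```

argparse also generates the --help output automatically, which is one plausible way the second command above could list the available parameters.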

Data Management Tasks

Complete data setup (recommended for first-time setup):

task setup-data

Downloads, processes, and cleans all datasets in one command.

Individual data tasks:

task sync

Syncs Python dependencies using uv.

task download-data

Downloads and saves all datasets locally from Kaggle.

task clean-data

Cleans the downloaded datasets by removing nulls, invalid ratings, and mapping review scores to numeric values.
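
The cleaning steps described above could look roughly like the following sketch. It uses only the standard library, and the column names (movie_id, score), the letter-grade scale, and the choice to normalize scores to the range 0 to 1 are all assumptions for illustration; the project's real cleaning script may work differently.

```python
from typing import Optional

# Hypothetical letter-grade scale; the project's real mapping may differ.
LETTER_GRADES = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}


def score_to_numeric(raw: Optional[str]) -> Optional[float]:
    """Map a score such as "3/5" or "B" to a value in [0, 1]; None if invalid."""
    if raw is None or not raw.strip():
        return None
    text = raw.strip().upper()
    if text in LETTER_GRADES:
        return LETTER_GRADES[text]
    if "/" in text:
        num, _, den = text.partition("/")
        try:
            value, scale = float(num), float(den)
        except ValueError:
            return None
        # Reject impossible ratings such as "7/5" or a zero scale.
        if scale <= 0 or not 0 <= value <= scale:
            return None
        return value / scale
    return None


def clean_reviews(rows):
    """Drop rows with a missing movie ID or an invalid score."""
    cleaned = []
    for row in rows:
        score = score_to_numeric(row.get("score"))
        if row.get("movie_id") and score is not None:
            cleaned.append({**row, "score": score})
    return cleaned


rows = [
    {"movie_id": "m1", "score": "3/5"},
    {"movie_id": "m2", "score": "B"},
    {"movie_id": "m3", "score": None},   # null score: removed
    {"movie_id": "m4", "score": "7/5"},  # invalid rating: removed
    {"movie_id": None, "score": "4/5"},  # null ID: removed
]
cleaned = clean_reviews(rows)
print(cleaned)
```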

task scrape-rt-movies

Scrapes movie titles and descriptions from Rotten Tomatoes using the movie IDs from the reviews dataset. Uses 10 concurrent workers with rate limiting and supports resuming if interrupted. Creates rotten_tomatoes_movie_details.csv in data/raw/.
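
The combination of concurrent workers, rate limiting, and resuming can be sketched as follows. This is not the project's scraper: fetch_details is a placeholder that makes no network requests, and the CSV columns, the RateLimiter class, and the min_interval value are assumptions. Only the worker count of 10 and the output file name come from the description above.

```python
import csv
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FIELDS = ["movie_id", "title", "status"]  # hypothetical column layout


class RateLimiter:
    """Space out calls by at least `min_interval` seconds across all threads."""

    def __init__(self, min_interval: float):
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_time = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self._min_interval
        if delay:
            time.sleep(delay)


def fetch_details(movie_id: str) -> dict:
    """Placeholder for the real HTTP request and HTML parsing."""
    return {"movie_id": movie_id, "title": f"Title for {movie_id}", "status": "ok"}


def scrape(movie_ids, out_path, workers=10, min_interval=0.01, fetch=fetch_details):
    """Scrape IDs not yet present in out_path; return how many were attempted."""
    out_path = Path(out_path)
    done = set()
    if out_path.exists():  # resume: skip IDs already written
        with out_path.open(newline="", encoding="utf-8") as f:
            done = {row["movie_id"] for row in csv.DictReader(f)}
    pending = [m for m in movie_ids if m not in done]
    limiter = RateLimiter(min_interval)

    def task(movie_id):
        limiter.wait()
        return fetch(movie_id)

    write_header = not out_path.exists() or out_path.stat().st_size == 0
    with out_path.open("a", newline="", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for row in pool.map(task, pending):
            writer.writerow(row)
            f.flush()  # persist progress so an interrupted run can resume
    return len(pending)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "rotten_tomatoes_movie_details.csv"
    first = scrape(["m1", "m2", "m3"], out)         # fresh run
    second = scrape(["m1", "m2", "m3", "m4"], out)  # resumed run, only m4 is new
print(first, second)
```

Writing and flushing each row as it completes is what makes resuming cheap in this sketch: a rerun only has to read the IDs already on disk.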

task retry-failed-scrapes

Retries scraping movies that failed during the main scraping process. Updates existing entries in the output file instead of creating duplicates: a failed entry is replaced by the successful result when the retry succeeds.
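
One way to update entries in place rather than append duplicates is to key the rows by movie ID, as in this minimal sketch. The status field, the row layout, and the fake_fetch function are hypothetical stand-ins; the real task reads from and writes to the CSV output file.

```python
def retry_failed(rows, fetch):
    """Retry rows whose status is not "ok", replacing them on success."""
    # Keying by movie ID guarantees one entry per movie (no duplicates).
    by_id = {row["movie_id"]: row for row in rows}
    for movie_id, row in list(by_id.items()):
        if row["status"] == "ok":
            continue
        result = fetch(movie_id)
        if result["status"] == "ok":
            by_id[movie_id] = result  # replace the failed entry
    return list(by_id.values())


def fake_fetch(movie_id):
    """Stand-in for the real scraper: only "m2" succeeds on retry."""
    if movie_id == "m2":
        return {"movie_id": movie_id, "title": "Recovered", "status": "ok"}
    return {"movie_id": movie_id, "title": "", "status": "failed"}


rows = [
    {"movie_id": "m1", "title": "First", "status": "ok"},
    {"movie_id": "m2", "title": "", "status": "failed"},
    {"movie_id": "m3", "title": "", "status": "failed"},
]
updated = retry_failed(rows, fake_fetch)
print(updated)
```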

task merge-data

Merges all cleaned datasets into a single comprehensive dataset using normalized movie titles as the merge key. Handles inconsistencies like case variations and year annotations (e.g., "Movie (2020)"). Aggregates reviews and actors into lists. Creates movies_merged.csv in data/merged/.
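
The title normalization and list aggregation can be illustrated with a short sketch. The regular expression, the field names (title, review, actor), and the merge_by_title helper are assumptions made for this example; the project's merge script may normalize titles differently.

```python
import re

# Matches a trailing year annotation such as " (2020)".
_YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def normalize_title(title: str) -> str:
    """Lowercase, drop a trailing "(YYYY)", and collapse whitespace."""
    title = _YEAR_SUFFIX.sub("", title)
    return " ".join(title.lower().split())


def merge_by_title(movies, reviews, cast):
    """Merge three record lists on the normalized title (hypothetical fields)."""
    merged = {}
    for movie in movies:
        key = normalize_title(movie["title"])
        merged.setdefault(key, {"title": movie["title"], "reviews": [], "actors": []})
    for review in reviews:
        key = normalize_title(review["title"])
        if key in merged:
            merged[key]["reviews"].append(review["review"])
    for member in cast:
        key = normalize_title(member["title"])
        if key in merged:
            merged[key]["actors"].append(member["actor"])
    return merged


movies = [{"title": "Movie (2020)"}]
reviews = [
    {"title": "movie", "review": "Great"},
    {"title": "MOVIE (2020)", "review": "Fine"},
]
cast = [{"title": "Movie", "actor": "A. Actor"}]
merged = merge_by_title(movies, reviews, cast)
print(merged)
```

Note that dropping the year makes remakes that share a title collide under one key; whether the project handles that case is not stated here.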

Data Exploration Tasks

Explore all datasets (raw and cleaned):

task explore-data

Provides statistics and descriptions for both raw and cleaned versions of all datasets.

Explore only raw datasets:

task explore-data-raw

Provides statistics and descriptions for raw datasets only.

Explore only cleaned datasets:

task explore-data-clean

Provides statistics and descriptions for cleaned datasets only.

About

02807 Computational Tools for Data Science Project
