02807 Project

Quick Start with Demo Notebook

The easiest way to explore this project is through our interactive demo notebook:

📓 project_demo.ipynb

This Jupyter notebook demonstrates the complete data pipeline and analysis methods, including:

- Data download and cleaning
- Web scraping from Rotten Tomatoes
- Frequent itemset analysis
- Graph-based community detection
- Collaborative filtering with similarity visualization

Running the Demo Notebook

Install dependencies:
```
task sync
```
Open the notebook:
- Open project_demo.ipynb in VS Code with the Jupyter extension, or
- Launch Jupyter Notebook/Lab: jupyter notebook project_demo.ipynb
Run the cells sequentially to see the full pipeline in action.

The notebook includes both full analysis commands and quick subset-based demos that run in minutes instead of hours.

Taskfile Installation

Task is a task runner that automates repetitive tasks from a simple YAML configuration file (Taskfile.yml). To install it, follow the instructions at taskfile.dev.

Once installed, you can run tasks defined in Taskfile.yml using the task command.

Running Tasks

To list all available tasks:

task --list

Passing parameters to tasks:

Some tasks accept command-line arguments. Put -- after the task name; everything that follows it is passed through to the underlying script:

task run-frequent-items -- --min-support 0.15 --min-confidence 0.70

To see available parameters for a task:

task run-frequent-items -- --help
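
As a rough illustration of what the --min-support and --min-confidence thresholds mean, the sketch below computes both measures on a toy dataset. It assumes the script uses the standard association-rule definitions; the function names and the data are made up for this example and are not taken from the project's code.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base

# Toy data (invented): each transaction is the set of movies one reviewer liked.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
    {"A", "B"},
]

min_support, min_confidence = 0.15, 0.70

# Keep only the single-item rules lhs -> rhs that clear both thresholds.
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        s = support({lhs, rhs}, transactions)
        c = confidence({lhs}, {rhs}, transactions)
        if s >= min_support and c >= min_confidence:
            rules.append((lhs, rhs, round(s, 2), round(c, 2)))

for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s}, confidence={c}")
```

Raising --min-support discards rare item combinations, while raising --min-confidence discards rules whose right-hand side does not follow reliably from the left-hand side.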

Data Management Tasks

Complete data setup (recommended for first-time setup):

task setup-data

Downloads, processes, and cleans all datasets in one command.

Individual data tasks:

task sync

Syncs Python dependencies using uv.

task download-data

Downloads and saves all datasets locally from Kaggle.

task clean-data

Cleans the downloaded datasets by removing nulls, invalid ratings, and mapping review scores to numeric values.
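
The cleaning step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's implementation: the column names (movie_id, score), the supported score formats ("3/5", letter grades) and the letter-grade scale are all assumptions.

```python
# Assumed letter-grade scale; the project's actual mapping may differ.
LETTER_GRADES = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}

def score_to_numeric(raw):
    """Map a score such as '3/5' or 'B' to a float in [0, 1]; None if invalid."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    if "/" in text:
        num, _, den = text.partition("/")
        try:
            num, den = float(num), float(den)
        except ValueError:
            return None
        # Reject impossible ratings such as '7/5' or a zero denominator.
        if den <= 0 or not 0 <= num <= den:
            return None
        return num / den
    return LETTER_GRADES.get(text.upper())

def clean_reviews(rows):
    """Drop rows with a null key or invalid score; convert scores to floats."""
    cleaned = []
    for row in rows:
        if not row.get("movie_id"):
            continue
        score = score_to_numeric(row.get("score"))
        if score is None:
            continue
        cleaned.append({**row, "score": score})
    return cleaned

rows = [
    {"movie_id": "m1", "score": "3/5"},
    {"movie_id": "m2", "score": "B"},
    {"movie_id": "m3", "score": "7/5"},  # invalid: numerator exceeds denominator
    {"movie_id": None, "score": "4/5"},  # null key
    {"movie_id": "m4", "score": ""},     # missing score
]
cleaned = clean_reviews(rows)
print(cleaned)
```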

task scrape-rt-movies

Scrapes movie titles and descriptions from Rotten Tomatoes using the movie IDs from the reviews dataset. Uses 10 concurrent workers with rate limiting and supports resuming if interrupted. Creates rotten_tomatoes_movie_details.csv in data/raw/.
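
One way to combine a worker pool, rate limiting, and resume support is sketched below. It is an assumption-laden illustration rather than the project's scraper: fetch_movie is a placeholder that performs no network request, and the CSV columns, the RateLimiter class, and the interval are hypothetical.

```python
import csv
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)

def fetch_movie(movie_id):
    """Placeholder for the real HTTP request and HTML parsing."""
    return {"movie_id": movie_id, "title": f"Title for {movie_id}", "status": "ok"}

def already_scraped(path):
    """Return the IDs already present in the output file (empty if none)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["movie_id"] for row in csv.DictReader(f)}

def scrape(movie_ids, out_path, workers=10, min_interval=0.01):
    done = already_scraped(out_path)          # resume: skip finished IDs
    todo = [m for m in movie_ids if m not in done]
    limiter = RateLimiter(min_interval)
    write_lock = threading.Lock()
    is_new = not os.path.exists(out_path)

    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["movie_id", "title", "status"])
        if is_new:
            writer.writeheader()

        def work(movie_id):
            limiter.wait()
            row = fetch_movie(movie_id)
            with write_lock:                  # one writer at a time
                writer.writerow(row)
                f.flush()                     # keep progress if interrupted

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(work, todo))        # list() surfaces worker exceptions
    return len(todo)

out_path = os.path.join(tempfile.mkdtemp(), "movie_details.csv")
first = scrape(["m1", "m2", "m3"], out_path)
second = scrape(["m1", "m2", "m3", "m4"], out_path)  # only m4 is new
print(first, second)
```

Appending each row as soon as it is fetched is what makes resuming cheap: a second run only needs to read the IDs already in the file.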

task retry-failed-scrapes

Retries scraping movies that failed during the main scraping run. Existing entries in the output file are updated rather than duplicated: a failed entry is replaced with the successful result whenever the retry succeeds.
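
The update-in-place behavior can be illustrated with a small sketch. It assumes each row carries a movie_id and a status field; those names, the retry_failed helper, and the fake fetch function are hypothetical and only serve the example.

```python
def retry_failed(rows, fetch):
    """Re-fetch rows marked 'failed' and replace them without adding duplicates."""
    by_id = {row["movie_id"]: row for row in rows}  # one entry per movie
    for movie_id, row in list(by_id.items()):
        if row["status"] != "failed":
            continue
        result = fetch(movie_id)
        if result["status"] == "ok":
            by_id[movie_id] = result  # overwrite the failed entry
    return list(by_id.values())

def fake_fetch(movie_id):
    """Stand-in for the scraper: succeeds for m2, still fails for m3."""
    if movie_id == "m2":
        return {"movie_id": "m2", "title": "Second Movie", "status": "ok"}
    return {"movie_id": movie_id, "title": "", "status": "failed"}

rows = [
    {"movie_id": "m1", "title": "First Movie", "status": "ok"},
    {"movie_id": "m2", "title": "", "status": "failed"},
    {"movie_id": "m3", "title": "", "status": "failed"},
]
updated = retry_failed(rows, fake_fetch)
print(updated)
```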

task merge-data

Merges all cleaned datasets into a single comprehensive dataset using normalized movie titles as the merge key. Handles inconsistencies like case variations and year annotations (e.g., "Movie (2020)"). Aggregates reviews and actors into lists. Creates movies_merged.csv in data/merged/.
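
A minimal sketch of merging on normalized titles is shown below. The normalization rules (lowercasing, stripping a trailing "(YYYY)", collapsing whitespace), the field names, and the choice to ignore reviews for unknown movies are assumptions made for illustration; the project's merge logic may differ.

```python
import re

YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")

def normalize_title(title):
    """Lowercase, drop a trailing '(YYYY)' annotation, collapse whitespace."""
    title = YEAR_SUFFIX.sub("", title)
    return " ".join(title.lower().split())

def merge(movies, reviews, cast):
    """Build one record per movie, aggregating reviews and actors into lists."""
    merged = {}
    for m in movies:
        key = normalize_title(m["title"])
        merged.setdefault(key, {"title": m["title"], "reviews": [], "actors": []})
    for r in reviews:
        key = normalize_title(r["title"])
        if key in merged:  # assumption: rows without a matching movie are skipped
            merged[key]["reviews"].append(r["text"])
    for c in cast:
        key = normalize_title(c["title"])
        if key in merged:
            merged[key]["actors"].append(c["actor"])
    return merged

movies = [{"title": "Movie (2020)"}]
reviews = [
    {"title": "movie", "text": "Great"},
    {"title": "MOVIE  (2020)", "text": "Fine"},
    {"title": "Other", "text": "Unrelated"},
]
cast = [{"title": "Movie", "actor": "Jane Doe"}]

merged = merge(movies, reviews, cast)
print(merged)
```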

Data Exploration Tasks

Explore all datasets (raw and cleaned):

task explore-data

Provides statistics and descriptions for both raw and cleaned versions of all datasets.

Explore only raw datasets:

task explore-data-raw

Provides statistics and descriptions for raw datasets only.

Explore only cleaned datasets:

task explore-data-clean

Provides statistics and descriptions for cleaned datasets only.
