This repository is a template for the final project of the Big Data course. It provides a structured environment to build an end-to-end pipeline for analyzing US domestic flight data, performing exploratory data analysis (EDA), and building predictive models (e.g., for flight cancellations or delays).
- Python 3.x
- Access to Innopolis University Hadoop Cluster: Credentials for PostgreSQL, Hive, and HDFS.
- Secrets: Create a `secrets/` directory in the project root and add:
  - `secrets/.psql.pass`: contains your PostgreSQL password on a single line.
  - `secrets/.hive.pass`: contains your Hive password on a single line.
- Clone the repository:

  ```bash
  git clone https://github.com/Nazgulitos/BigData-project.git
  cd BigData-project
  ```

- From the project root directory, execute:

  ```bash
  ssh team15@hadoop-01.uni.innopolis.ru -p 22
  cd BigData-project
  python3 -m venv venv
  source ./venv/bin/activate
  pip install -r requirements.txt
  bash main.sh
  ```
- All results and artifacts will be stored in the `output/` directory.
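The two secrets files can also be created with a short script. A minimal sketch (the passwords shown are placeholders, not real credentials):

```python
from pathlib import Path

# Placeholder credentials -- substitute your real PostgreSQL and Hive passwords.
secrets = Path("secrets")
secrets.mkdir(exist_ok=True)
(secrets / ".psql.pass").write_text("my-postgres-password\n")
(secrets / ".hive.pass").write_text("my-hive-password\n")

# Restrict permissions so the passwords stay readable only by you.
for f in secrets.iterdir():
    f.chmod(0o600)
```

Writing each password followed by a single newline keeps the file to one line, as the pipeline scripts expect.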
- `data/`: Contains the dataset files (raw and processed).
- `models/`: Contains the trained Spark ML models.
- `notebooks/`: Jupyter/Zeppelin notebooks for experimentation and learning (not used in the final pipeline).
- `output/`: Stores results like CSVs, text files, and images from the pipeline.
- `scripts/`: Contains all `.sh` and `.py` scripts for pipeline stages.
- `sql/`: Contains all `.sql` (PostgreSQL) and `.hql` (HiveQL) files.
- `requirements.txt`: Python package dependencies.
- `main.sh`: The main script to execute the entire pipeline.
The pipeline performs the following key stages:
- Data Collection: Downloads US flight data from Kaggle and performs initial Python-based preprocessing.
- Relational DB Storage: Builds a PostgreSQL database and ingests the preprocessed data.
- HDFS Ingestion: Uses Sqoop to import data from PostgreSQL into HDFS as AVRO files with Snappy compression.
- Data Warehousing & EDA: Sets up Hive external tables on the HDFS data, performs optimizations, and runs HQL queries for EDA.
- Predictive Data Analytics: Uses Spark ML to train, evaluate, and compare models (e.g., Logistic Regression, Random Forest) for predicting flight outcomes.
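As an illustration of the first stage, the Python-based preprocessing might look roughly like the sketch below. The column names (`FL_DATE`, `DEP_DELAY`, `CANCELLED`) and the cleaning steps are assumptions for illustration, not the project's actual code, and the inline rows stand in for the downloaded Kaggle data:

```python
import pandas as pd

# Toy rows standing in for the Kaggle flight dataset; column names are assumed.
raw = pd.DataFrame({
    "FL_DATE": ["2023-01-01", "2023-01-01", None],
    "DEP_DELAY": [5.0, None, 12.0],
    "CANCELLED": [0, 1, 0],
})

# Typical cleaning: drop rows with no flight date, zero-fill the missing
# departure delay of cancelled flights, and parse dates into proper timestamps.
df = raw.dropna(subset=["FL_DATE"]).copy()
df["DEP_DELAY"] = df["DEP_DELAY"].fillna(0.0)
df["FL_DATE"] = pd.to_datetime(df["FL_DATE"])

# Binary target for the later Spark ML stage.
df["label"] = df["CANCELLED"].astype(int)
```

A cleaned frame like `df` is what would then be loaded into PostgreSQL in the second stage.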
- `main.sh` is immutable: You cannot change the content of `main.sh`. The grader will run this script as-is to assess your project. Ensure your individual scripts are correctly called and function as expected within `main.sh`.
- Notebooks are for learning only: The `notebooks/` folder is for your exploration. Your final pipeline logic must be in `.py` scripts within the `scripts/` folder. The `notebooks/` folder might be deleted during grading to ensure your pipeline doesn't depend on its content.
- Idempotency: Ensure your scripts can be run multiple times without errors (e.g., by dropping tables/directories before creating them).
- Paths: All scripts should be runnable from the project's root directory.
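The idempotency pattern can be illustrated with the stdlib `sqlite3` module; the real pipeline targets PostgreSQL and Hive, but `DROP TABLE IF EXISTS` works the same way there, and the table schema below is only a hypothetical example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def create_flights_table(conn):
    # Drop first so a re-run never fails with "table already exists".
    conn.execute("DROP TABLE IF EXISTS flights")
    conn.execute(
        "CREATE TABLE flights (fl_date TEXT, dep_delay REAL, cancelled INTEGER)"
    )

create_flights_table(conn)
create_flights_table(conn)  # safe to run again: the setup step is idempotent
```

The same idea applies to HDFS directories (`hdfs dfs -rm -r -f` before writing) and Hive tables in your stage scripts.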