This repository is a template for the final project of the Big Data course. It provides a structured environment to build an end-to-end pipeline for analyzing US domestic flight data, performing exploratory data analysis (EDA), and building predictive models (e.g., for flight cancellations or delays).
- Python 3.x
- Access to Innopolis University Hadoop Cluster: Credentials for PostgreSQL, Hive, and HDFS.
- Secrets: Create a `secrets/` directory in the project root and add:
  - `secrets/.psql.pass`: contains your PostgreSQL password on a single line.
  - `secrets/.hive.pass`: contains your Hive password on a single line.
- Clone the repository:

  ```bash
  git clone https://github.com/Nazgulitos/BigData-project.git
  cd BigData-project
  ```

- From the project root directory, execute:

  ```bash
  ssh team15@hadoop-01.uni.innopolis.ru -p 22
  cd BigData-project
  python3 -m venv venv
  source ./venv/bin/activate
  pip install -r requirements.txt
  bash main.sh
  ```
- All results and artifacts will be stored in the `output/` directory.
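The two secrets files can also be created with a short script. A minimal sketch (the passwords shown are placeholders, not real credentials):

```python
from pathlib import Path

# Placeholder credentials -- substitute your real PostgreSQL and Hive passwords.
secrets = Path("secrets")
secrets.mkdir(exist_ok=True)
(secrets / ".psql.pass").write_text("my-postgres-password\n")
(secrets / ".hive.pass").write_text("my-hive-password\n")

# Restrict permissions so the passwords stay readable only by you.
for f in secrets.iterdir():
    f.chmod(0o600)
```

Writing each password followed by a single newline keeps the file to one line, as the pipeline scripts expect.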
- `data/`: Contains the dataset files (raw and processed).
- `models/`: Contains the trained Spark ML models.
- `notebooks/`: Jupyter/Zeppelin notebooks for experimentation and learning (not used in the final pipeline).
- `output/`: Stores results like CSVs, text files, and images from the pipeline.
- `scripts/`: Contains all `.sh` and `.py` scripts for pipeline stages.
- `sql/`: Contains all `.sql` (PostgreSQL) and `.hql` (HiveQL) files.
- `requirements.txt`: Python package dependencies.
- `main.sh`: The main script to execute the entire pipeline.
The pipeline performs the following key stages:
- Data Collection: Downloads US flight data from Kaggle and performs initial Python-based preprocessing.
- Relational DB Storage: Builds a PostgreSQL database and ingests the preprocessed data.
- HDFS Ingestion: Uses Sqoop to import data from PostgreSQL into HDFS as AVRO files with Snappy compression.
- Data Warehousing & EDA: Sets up Hive external tables on the HDFS data, performs optimizations, and runs HQL queries for EDA.
- Predictive Data Analytics: Uses Spark ML to train, evaluate, and compare models (e.g., Logistic Regression, Random Forest) for predicting flight outcomes.
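As an illustration of the first stage, the Python-based preprocessing might look roughly like the sketch below. The column names (`FL_DATE`, `DEP_DELAY`, `CANCELLED`) and the cleaning steps are assumptions for illustration, not the project's actual code, and the inline rows stand in for the downloaded Kaggle data:

```python
import pandas as pd

# Toy rows standing in for the Kaggle flight dataset; column names are assumed.
raw = pd.DataFrame({
    "FL_DATE": ["2023-01-01", "2023-01-01", None],
    "DEP_DELAY": [5.0, None, 12.0],
    "CANCELLED": [0, 1, 0],
})

# Typical cleaning: drop rows with no flight date, zero-fill the missing
# departure delay of cancelled flights, and parse dates into proper timestamps.
df = raw.dropna(subset=["FL_DATE"]).copy()
df["DEP_DELAY"] = df["DEP_DELAY"].fillna(0.0)
df["FL_DATE"] = pd.to_datetime(df["FL_DATE"])

# Binary target for the later Spark ML stage.
df["label"] = df["CANCELLED"].astype(int)
```

A cleaned frame like `df` is what would then be loaded into PostgreSQL in the second stage.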
- `main.sh` is immutable: You cannot change the content of `main.sh`. The grader will run this script as-is to assess your project. Ensure your individual scripts are correctly called and function as expected within `main.sh`.
- Notebooks are for learning only: The `notebooks/` folder is for your exploration. Your final pipeline logic must be in `.py` scripts within the `scripts/` folder. The `notebooks/` folder might be deleted during grading to ensure your pipeline doesn't depend on its content.
- Idempotency: Ensure your scripts can be run multiple times without errors (e.g., by dropping tables/directories before creating them).
- Paths: All scripts should be runnable from the project's root directory.
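The idempotency pattern can be illustrated with the stdlib `sqlite3` module; the real pipeline targets PostgreSQL and Hive, but `DROP TABLE IF EXISTS` works the same way there, and the table schema below is only a hypothetical example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def create_flights_table(conn):
    # Drop first so a re-run never fails with "table already exists".
    conn.execute("DROP TABLE IF EXISTS flights")
    conn.execute(
        "CREATE TABLE flights (fl_date TEXT, dep_delay REAL, cancelled INTEGER)"
    )

create_flights_table(conn)
create_flights_table(conn)  # safe to run again: the setup step is idempotent
```

The same idea applies to HDFS directories (`hdfs dfs -rm -r -f` before writing) and Hive tables in your stage scripts.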