This repository contains an end-to-end workflow that ingests two weeks of ORD departure data, builds a daily Flight Difficulty Score, and packages the core deliverables requested in the United Airlines frontline operations challenge.
- Goal: Highlight flights most likely to experience operational difficulty (≥15 min departure delay) so frontline teams can pre-stage resources.
- Approach: Enriched feature set (turn buffers, passenger mix, baggage pressure, temporal context) feeding a calibrated HistGradientBoosting model with rule-based recall boosters and SHAP explainability.
- Impact: Holdout recall of 97% at 54% precision; an illustrative cost model shows ~$626K/day in savings versus reactive operations. Artifacts include the scored dataset, insights report, visualizations, and SQL/EDA outputs.
```
├── data/
│   ├── raw/           # Original CSV extracts (already provided)
│   ├── interim/       # DuckDB file and any cached tables
│   └── processed/     # Feature tables (Parquet)
├── artifacts/
│   ├── figures/       # PNG visualizations generated during EDA
│   ├── tables/        # Aggregated CSV / JSON outputs (EDA + SQL + insights)
│   └── models/        # Serialized difficulty model artifact
├── sql/               # DuckDB SQL queries executed during the pipeline
├── src/               # Python source modules (ingestion, features, scoring, EDA)
├── test_databaes.csv  # Final flight-level difficulty export
├── reports/report.txt # Narrative summary of findings & recommendations
└── README.md          # This file
```
- Ensure Python 3.12 is available (a `.venv` is already configured inside the Codespace).
- Install dependencies:

  ```
  /workspaces/skyhack/.venv/bin/python -m pip install -r requirements.txt
  ```

  The Codespace provisioning step has already executed the command above; rerun it only if new packages are added.
Run the orchestrated pipeline from the project root:

```
/workspaces/skyhack/.venv/bin/python -m src.pipeline
```

The run performs the following actions:
- Loads all raw CSVs into pandas and DuckDB.
- Engineers flight-level, passenger, baggage, and special-service features—including temporal context such as minutes-since-first departure and bank position.
- Trains a HistGradientBoosting model (with class weighting), applies isotonic calibration, and blends in a rule-based lift to prioritize recall on obvious high-risk flights.
- Generates daily difficulty rankings and classifications (Difficult / Medium / Easy), attaches SHAP driver text and interaction plots, and exports everything to `test_databaes.csv`.
- Produces EDA visualizations, summary statistics, SQL-driven tables, cost-benefit analytics, and actionable insights.
- Saves serialized artifacts (model + calibrator bundle, metrics, recommended actions) for transparency.
If you are evaluating this submission on your own machine (outside the Codespace), follow these steps:
- Clone the repository and enter the project directory:

  ```
  git clone https://github.com/rishi-dtu-official/skyhack.git
  cd skyhack
  ```

- Create and activate a Python 3.12 virtual environment:

  macOS/Linux:

  ```
  python3 -m venv .venv
  source .venv/bin/activate
  ```

  Windows PowerShell:

  ```
  python -m venv .venv
  .venv\Scripts\Activate.ps1
  ```

- Install project dependencies:

  ```
  pip install -r requirements.txt
  ```
- (Optional) Remove heavy artifacts for a clean run:

  ```
  rm -rf artifacts data/processed test_databaes.csv visualisation
  ```
- Execute the end-to-end pipeline:

  ```
  python -m src.pipeline
  ```
- Review outputs:
  - `test_databaes.csv` (flight difficulty export)
  - `reports/report.txt` (narrative summary)
  - `artifacts/figures/` (EDA plots + SHAP interaction)
  - `visualisation/` (calibration curve, confusion matrix, cost comparison, feature bars)
  - `artifacts/tables/` (metrics, feature importances, cost-benefit, SQL results)
- Deactivate the virtual environment when finished:

  ```
  deactivate
  ```
- `test_databaes.csv`: Flight identifiers, feature highlights, raw and blended probabilities, rule trigger flag, SHAP driver text, daily rank, and difficulty class.
- `reports/report.txt`: Narrative report answering all EDA questions, describing the scoring approach, and highlighting operational recommendations.
- `artifacts/figures/*.png`: Visuals referenced in the report (now includes SHAP interaction plots).
- `artifacts/tables/*.csv|json`: Supporting tables including model metrics, SHAP feature importances, cost-benefit analysis, EDA summaries, and DuckDB query outputs.
- `visualisation/*.png`: Calibration curve, confusion matrix, top feature bars, and daily cost comparison to accompany the executive summary.
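For a sense of how the daily rank and difficulty class in the export relate to the blended probability, here is a small pandas sketch. The column names and tercile cut points are illustrative, not the export's actual schema:

```python
import pandas as pd

# Hypothetical miniature of the scored export.
scored = pd.DataFrame({
    "flight_date": ["2024-08-01"] * 3 + ["2024-08-02"] * 2,
    "flight_number": ["UA100", "UA200", "UA300", "UA100", "UA200"],
    "blended_probability": [0.91, 0.40, 0.15, 0.72, 0.30],
})

# Rank within each day (1 = most difficult) and bucket into classes.
scored["daily_rank"] = (
    scored.groupby("flight_date")["blended_probability"]
    .rank(ascending=False, method="first")
    .astype(int)
)
scored["difficulty_class"] = pd.cut(
    scored["blended_probability"],
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["Easy", "Medium", "Difficult"],
)
```

Ranking per day (rather than globally over the two weeks) keeps the "Difficult" list actionable for each morning's staffing decisions.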
- Add new engineered features in `src/feature_engineering.py` and rerun the pipeline.
- Swap the classifier or hyperparameters in `src/scoring.py` to test alternative scoring strategies.
- Place additional SQL queries inside `sql/` and they will be executed automatically on the scored feature table.
- Use the processed Parquet tables (`data/processed/`) to explore ad-hoc analytics or to build dashboards.
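The "drop a query into `sql/` and it runs" hook can be sketched roughly as below. This uses the standard library's `sqlite3` as a stand-in for DuckDB (both expose a similar connection/execute API), and `run_sql_queries` is a hypothetical helper, not the pipeline's actual function:

```python
import sqlite3
import tempfile
from pathlib import Path

def run_sql_queries(sql_dir, con):
    """Execute every .sql file in sql_dir and return each query's rows."""
    return {p.stem: con.execute(p.read_text()).fetchall()
            for p in sorted(Path(sql_dir).glob("*.sql"))}

# Demo: an in-memory table stands in for the scored feature table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (flight TEXT, difficulty REAL)")
con.executemany("INSERT INTO flights VALUES (?, ?)",
                [("UA100", 0.9), ("UA200", 0.2)])

with tempfile.TemporaryDirectory() as d:
    Path(d, "top_flights.sql").write_text(
        "SELECT flight FROM flights WHERE difficulty > 0.5")
    results = run_sql_queries(d, con)
```

Keying results by the file's stem means each query's output can be exported to a matching file under `artifacts/tables/` without any extra configuration.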
- If the pipeline fails because of missing dependencies, rerun the install command listed above.
- DuckDB results are written to `data/interim/skyhack.duckdb`. Remove the file to force a clean rebuild of SQL tables.
- For large dataset experiments, consider enabling lazy reading via DuckDB to reduce memory pressure.