This repository contains the deterministic analytics backbone for SkyOps.
It focuses on operational efficiency (predicting flight delays, identifying crew hotspots) using Southwest Airlines (WN) flight data.
No GenAI is embedded here — this is pure data analytics & ML.
- Delay Prediction: Identify flights at high risk of arrival delays >15 minutes.
- Operational Efficiency: Highlight crew turnaround hotspots where late aircraft drive delays.
- Decision Support: Export Tableau-ready risk scores and crew hotspot reports for leadership dashboards.
- Accepts raw BTS / Kaggle flight delay data (2019–2023).
- Southwest-only (WN) subset is analysed (`flights_2019_2023.csv`).
- Required columns:
FL_DATE, AIRLINE_CODE, FL_NUMBER, ORIGIN, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, TAXI_OUT, TAXI_IN, CRS_ARR_TIME, ARR_TIME, ARR_DELAY, CANCELLED, DIVERTED, DISTANCE, DELAY_DUE_WEATHER, DELAY_DUE_NAS, DELAY_DUE_CARRIER, DELAY_DUE_LATE_AIRCRAFT
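
The input contract above can be checked before any analysis runs. The following is a minimal sketch, assuming the data is already loaded into a pandas DataFrame; the function name `select_wn_flights` is hypothetical and not part of this repository.

```python
import pandas as pd

# Column names copied from the required-columns list above.
REQUIRED_COLUMNS = [
    "FL_DATE", "AIRLINE_CODE", "FL_NUMBER", "ORIGIN", "DEST",
    "CRS_DEP_TIME", "DEP_TIME", "DEP_DELAY", "TAXI_OUT", "TAXI_IN",
    "CRS_ARR_TIME", "ARR_TIME", "ARR_DELAY", "CANCELLED", "DIVERTED",
    "DISTANCE", "DELAY_DUE_WEATHER", "DELAY_DUE_NAS",
    "DELAY_DUE_CARRIER", "DELAY_DUE_LATE_AIRCRAFT",
]


def select_wn_flights(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the schema, then keep only Southwest (WN) rows.

    Hypothetical helper for illustration; the repository's own loader
    may behave differently.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Copy so later feature engineering does not modify the raw frame.
    return df[df["AIRLINE_CODE"] == "WN"].copy()
```

Failing fast on a missing column tends to be easier to debug than a `KeyError` raised deep inside feature engineering.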
- Missing value profiling.
- Delay cause breakdown (Carrier vs Weather vs NAS vs Late Aircraft).
- Outlier detection (e.g., 300+ min TaxiOut → clipped).
- Time-based trends (delays by hour, route, weekday).
- Outputs plots under `reports/eda/`.
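
Three of the EDA steps listed above can be sketched in pandas as follows. This is an illustration only: the helper names are hypothetical, and the 300-minute clipping cap is an assumption read from the example in the list, not a confirmed setting of the pipeline.

```python
import pandas as pd

CAUSE_COLUMNS = [
    "DELAY_DUE_CARRIER", "DELAY_DUE_WEATHER",
    "DELAY_DUE_NAS", "DELAY_DUE_LATE_AIRCRAFT",
]


def missing_value_profile(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, worst columns first."""
    return df.isna().mean().sort_values(ascending=False)


def clip_taxi_out(df: pd.DataFrame, upper: float = 300.0) -> pd.DataFrame:
    """Cap extreme TAXI_OUT values at `upper` minutes (assumed threshold)."""
    out = df.copy()
    out["TAXI_OUT"] = out["TAXI_OUT"].clip(upper=upper)
    return out


def delay_cause_share(df: pd.DataFrame) -> pd.Series:
    """Share of total attributed delay minutes per cause."""
    totals = df[CAUSE_COLUMNS].fillna(0).sum()
    return totals / totals.sum()
```

Clipping keeps the outlier rows in the dataset while limiting their influence on averages; dropping them instead would be an equally defensible choice.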
- `ROUTE_TE` → target encoding of ORIGIN–DEST delay history.
- `ROLL_TAXIOUT_7D` → 7-day rolling TaxiOut average (airport congestion).
- `DEP_HOUR` / `ARR_HOUR` → derived from CRS hhmm times.
- `IS_PEAK` flag → captures AM/PM bank effects.
- `TURNAROUND_MIN` → minutes between aircraft arrival & next departure (when `TAIL_NUM` is available).
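
A rough sketch of how the hour, peak, and rolling taxi-out features might be derived is shown below. Several details are assumptions rather than the repository's actual logic: the peak-hour windows in `PEAK_HOURS`, the function names, and the choice to roll over the 7 most recent days that have flights (rather than a strict 7-calendar-day window). `FL_DATE` is assumed to be parsed to a datetime type already.

```python
import pandas as pd

# Assumed AM/PM bank hours; the real IS_PEAK definition may differ.
PEAK_HOURS = [6, 7, 8, 16, 17, 18]


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive DEP_HOUR, ARR_HOUR and IS_PEAK from scheduled hhmm times."""
    out = df.copy()
    # 1435 -> 14; the modulo maps a scheduled time of 2400 to hour 0.
    out["DEP_HOUR"] = (out["CRS_DEP_TIME"] // 100) % 24
    out["ARR_HOUR"] = (out["CRS_ARR_TIME"] // 100) % 24
    out["IS_PEAK"] = out["DEP_HOUR"].isin(PEAK_HOURS).astype(int)
    return out


def add_roll_taxiout_7d(df: pd.DataFrame) -> pd.DataFrame:
    """Attach a per-origin rolling mean of daily average TAXI_OUT."""
    daily = (
        df.groupby(["ORIGIN", "FL_DATE"], as_index=False)["TAXI_OUT"]
        .mean()
        .sort_values(["ORIGIN", "FL_DATE"])
    )
    # Window of 7 observed days per airport, including the current day.
    daily["ROLL_TAXIOUT_7D"] = daily.groupby("ORIGIN")["TAXI_OUT"].transform(
        lambda s: s.rolling(window=7, min_periods=1).mean()
    )
    return df.merge(
        daily[["ORIGIN", "FL_DATE", "ROLL_TAXIOUT_7D"]],
        on=["ORIGIN", "FL_DATE"],
        how="left",
    )
```

Because the window in this sketch includes the current day, the feature contains same-day taxi-out information; shifting the rolled series by one day would be the stricter choice if leakage is a concern.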
- Binary classification target: `ArrDelay > 15`.
- Options:
  - Scikit-learn GradientBoostingClassifier (Python pipeline).
  - XGBoost (`--model xgb` for stronger performance).
  - Spark ML GBTClassifier (for 5M+ row datasets).
- Time-based train/test split (prevents leakage).
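
The target definition and the time-based split described above can be illustrated with a short pandas sketch. The function names are hypothetical, and the cutoff value simply mirrors the `--test_start` style of argument used by the Spark training script; how the repository implements the split internally is not shown in this README.

```python
import pandas as pd


def make_target(df: pd.DataFrame) -> pd.Series:
    """Binary label: 1 if arrival delay exceeds 15 minutes, else 0.

    Rows with a missing ARR_DELAY (e.g. cancelled flights) evaluate to 0
    here; dropping them before training may be preferable.
    """
    return (df["ARR_DELAY"] > 15).astype(int)


def time_based_split(df: pd.DataFrame, test_start: str):
    """Train on flights before `test_start`, test on flights on/after it."""
    # FL_DATE is day-first (DD-MM-YYYY) in this dataset.
    dates = pd.to_datetime(df["FL_DATE"], dayfirst=True)
    cutoff = pd.Timestamp(test_start)
    return df[dates < cutoff], df[dates >= cutoff]
```

Splitting on date rather than at random keeps every training flight strictly earlier than every test flight, which is what prevents future information from leaking into the model.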
- `reports/metrics.json` → ROC-AUC, PR-AUC, Recall@Top-20%, OTP lift.
- `reports/skyops_scores.csv` → flight-level risk scores (for Tableau).
- `reports/crew_hotspots.csv` → crew hotspot flights (high-risk + short-turn or late-aircraft propagation).
- `reports/feature_importances.csv` → model interpretability.
- `reports/figures/*.png` → delay trends, feature distributions, model curves.
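
For readers unfamiliar with Recall@Top-20%, one common reading is: of all flights that were actually delayed, what fraction falls within the 20% of flights the model scored as riskiest? The sketch below implements that reading with NumPy. It is an assumed definition for illustration; the exact formula behind the value in `metrics.json` is not stated here, and `recall_at_top_fraction` is a hypothetical name.

```python
import numpy as np


def recall_at_top_fraction(y_true, scores, fraction: float = 0.20) -> float:
    """Share of all positives captured in the top `fraction` of scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Number of flights in the top slice (at least one).
    n_top = max(1, int(np.ceil(fraction * len(scores))))
    # Indices of the highest-scoring flights; stable sort keeps ties in order.
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    positives = y_true.sum()
    if positives == 0:
        return float("nan")  # Recall is undefined without any positives.
    return float(y_true[top_idx].sum() / positives)
```

This metric suits an operations setting where only a limited share of flights can receive attention, since it measures how many true delays that limited review would catch.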
- `pip install -r requirements.txt`
- `python spark/run_eda_spark.py --data data/raw/flights_2019_2023.csv`
- `python scripts/run_train.py --data data/raw/flights_2019_2023.csv --airport DAL --airline WN`
- `python spark/spark_train.py --data data/raw/flights_2019_2023.csv --airport DAL --airline WN --test_start 2023-07-01`
- `python scripts/visualize_results.py --use_spark --reports_dir reports/spark --out_dir reports/figures`

- Uses Spark SQL + Window functions for heavy aggregations.
- FL_DATE is parsed as DD-MM-YYYY (dayfirst=True).
- On the flights_2019_2023 dataset, which lacks TAIL_NUM, crew hotspots fall back to flights with LateAircraft delays.
- If TAIL_NUM is available, true short-turnaround + high-risk detection is enabled.
- Exports CSVs/Parquet for Tableau dashboards.
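
The crew hotspot fallback noted above could look roughly like the following. Everything specific in this sketch is an assumption for illustration: the `RISK_SCORE` column name, the 0.5 risk threshold, the 45-minute short-turn cutoff, and the function name. Only the branching idea (use turnaround when it is available, otherwise fall back to late-aircraft delays) comes from the notes.

```python
import pandas as pd


def flag_crew_hotspots(
    df: pd.DataFrame,
    risk_threshold: float = 0.5,   # assumed cutoff for "high risk"
    short_turn_min: float = 45.0,  # assumed cutoff for a short turnaround
) -> pd.DataFrame:
    """Return the flights considered crew hotspots."""
    if "TURNAROUND_MIN" in df.columns:
        # Tail numbers were available: high risk combined with a short turn.
        high_risk = df["RISK_SCORE"] >= risk_threshold
        mask = high_risk & (df["TURNAROUND_MIN"] <= short_turn_min)
    else:
        # Fallback: any flight with delay minutes attributed to a late aircraft.
        mask = df["DELAY_DUE_LATE_AIRCRAFT"].fillna(0) > 0
    return df[mask]
```

Rows with a missing `TURNAROUND_MIN` are never flagged in the first branch, because a comparison against a missing value evaluates to false.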
- EDA: % delayed flights, cancellation rate, delay distributions.
- Metrics: ROC-AUC ~0.80+, Recall@Top-20% ~0.85.
- Crew Hotspots: e.g., 400+ flights flagged at DAL in July 2023 due to late aircraft propagation.
- Delay histogram.
- Delay causes by airport.
- ROC / PR curves.
- Python 3.10+
- pandas, numpy, scikit-learn, matplotlib
- PySpark (for large-scale pipeline)
$env:PYSPARK_PYTHON="C:\Users\<you>\anaconda3\python.exe"
$env:PYSPARK_DRIVER_PYTHON="C:\Users\<you>\anaconda3\python.exe"
- Operational Efficiency: Predicts delays, helping allocate crews and resources proactively.
- Safety & Reliability: Identifies propagation risks (late aircraft → next flight delays).
- Decision-Making: Tableau-ready outputs let leadership see real-time trends and hotspots.
- Scalability: Works on local datasets (Python) and enterprise-scale BTS data (Spark).