Anjula-valluru/SkyOps_AI-MultiAgentAirlineOps

SkyOps Core — Southwest Airlines Delay Risk & Crew Hotspots

This repository contains the deterministic analytics backbone for SkyOps.
It focuses on operational efficiency (predicting flight delays, identifying crew hotspots) using Southwest Airlines (WN) flight data.
No GenAI is embedded here — this is pure data analytics & ML.


Project Goals

  • Delay Prediction: Identify flights at high risk of arrival delays >15 minutes.
  • Operational Efficiency: Highlight crew turnaround hotspots where late aircraft drive delays.
  • Decision Support: Export Tableau-ready risk scores and crew hotspot reports for leadership dashboards.

End-to-End Pipeline

1. Data Ingestion

  • Accepts raw BTS / Kaggle flight delay data (2019–2023).
  • Only the Southwest (WN) subset is analysed (flights_2019_2023.csv).
  • Required columns:
    FL_DATE, AIRLINE_CODE, FL_NUMBER, ORIGIN, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, TAXI_OUT, TAXI_IN, CRS_ARR_TIME, ARR_TIME, ARR_DELAY, CANCELLED, DIVERTED, DISTANCE, DELAY_DUE_WEATHER, DELAY_DUE_NAS, DELAY_DUE_CARRIER, DELAY_DUE_LATE_AIRCRAFT
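
A minimal sketch of the ingestion check described above, using only the standard library. The column list is copied from this README; the helper names missing_columns and read_airline_rows are hypothetical and are not taken from the repository's scripts.

```python
import csv
import io

# Column list copied from the "Required columns" bullet above.
REQUIRED_COLUMNS = [
    "FL_DATE", "AIRLINE_CODE", "FL_NUMBER", "ORIGIN", "DEST",
    "CRS_DEP_TIME", "DEP_TIME", "DEP_DELAY", "TAXI_OUT", "TAXI_IN",
    "CRS_ARR_TIME", "ARR_TIME", "ARR_DELAY", "CANCELLED", "DIVERTED",
    "DISTANCE", "DELAY_DUE_WEATHER", "DELAY_DUE_NAS",
    "DELAY_DUE_CARRIER", "DELAY_DUE_LATE_AIRCRAFT",
]


def missing_columns(header):
    """Return the required columns absent from a CSV header, in README order."""
    present = set(header)
    return [col for col in REQUIRED_COLUMNS if col not in present]


def read_airline_rows(handle, airline="WN"):
    """Validate the header, then yield only rows for the given airline code."""
    reader = csv.DictReader(handle)
    missing = missing_columns(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        if row["AIRLINE_CODE"] == airline:
            yield row


# Usage with an in-memory stand-in for the real CSV file (values are dummies).
_header = ",".join(REQUIRED_COLUMNS)
_wn = ",".join("WN" if c == "AIRLINE_CODE" else "0" for c in REQUIRED_COLUMNS)
_dl = ",".join("DL" if c == "AIRLINE_CODE" else "0" for c in REQUIRED_COLUMNS)
sample = io.StringIO("\n".join([_header, _wn, _dl]))
wn_rows = list(read_airline_rows(sample))
print(len(wn_rows))  # prints 1
```

Failing fast on a missing column keeps later steps from silently producing empty features.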

2. Exploratory Data Analysis (EDA)

  • Missing value profiling.
  • Delay cause breakdown (Carrier vs Weather vs NAS vs Late Aircraft).
  • Outlier detection (e.g., TAXI_OUT values of 300+ minutes are clipped).
  • Time-based trends (delays by hour, route, weekday).
  • Outputs plots under reports/eda/.
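
The EDA steps above could look roughly like this in pandas. This is a sketch under assumptions: the function names are hypothetical, and the 300-minute cap is taken from the example in the outlier bullet rather than from the repository's code.

```python
import pandas as pd

CAUSE_COLUMNS = [
    "DELAY_DUE_CARRIER", "DELAY_DUE_WEATHER",
    "DELAY_DUE_NAS", "DELAY_DUE_LATE_AIRCRAFT",
]


def missing_profile(df):
    """Share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def clip_taxi_out(df, upper=300.0):
    """Return a copy with TAXI_OUT capped at `upper` minutes (assumed cutoff)."""
    out = df.copy()
    out["TAXI_OUT"] = out["TAXI_OUT"].clip(upper=upper)
    return out


def delay_cause_minutes(df):
    """Total delay minutes attributed to each cause (missing values are skipped)."""
    return df[CAUSE_COLUMNS].sum()


# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "TAXI_OUT": [12.0, 450.0, None],
    "DELAY_DUE_CARRIER": [0.0, 30.0, None],
    "DELAY_DUE_WEATHER": [0.0, 0.0, None],
    "DELAY_DUE_NAS": [5.0, 0.0, None],
    "DELAY_DUE_LATE_AIRCRAFT": [0.0, 40.0, None],
})
clipped = clip_taxi_out(demo)
```

Clipping rather than dropping keeps the flight in the dataset while limiting the influence of extreme taxi times.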

3. Feature Engineering

  • ROUTE_TE → target encoding of ORIGIN–DEST delay history.
  • ROLL_TAXIOUT_7D → 7-day rolling TAXI_OUT average (a proxy for airport congestion).
  • DEP_HOUR / ARR_HOUR → derived from CRS hhmm times.
  • IS_PEAK flag → captures AM/PM bank effects.
  • TURNAROUND_MIN → minutes between aircraft arrival & next departure (when TAIL_NUM is available).
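
As an illustration of the features listed above, here is a hedged pandas sketch. Several details are assumptions rather than the repository's actual choices: the peak-hour set, the IS_DELAYED target column name, computing the rolling average from daily means over prior days only, and fitting the route encoding on training rows only. FL_DATE is assumed to be already parsed to datetime.

```python
import pandas as pd


def add_time_features(df, peak_hours=(6, 7, 8, 16, 17, 18)):
    """Derive DEP_HOUR, ARR_HOUR and IS_PEAK from scheduled hhmm integers."""
    out = df.copy()
    # 630 -> 6, 1215 -> 12; the modulo maps a 2400 timestamp to hour 0.
    out["DEP_HOUR"] = (out["CRS_DEP_TIME"] // 100) % 24
    out["ARR_HOUR"] = (out["CRS_ARR_TIME"] // 100) % 24
    out["IS_PEAK"] = out["DEP_HOUR"].isin(peak_hours).astype(int)
    return out


def add_rolling_taxi_out(df):
    """Attach ROLL_TAXIOUT_7D: mean of daily TAXI_OUT means at ORIGIN
    over the 7 days before each flight date (current day excluded)."""
    daily = (df.groupby(["ORIGIN", "FL_DATE"], as_index=False)["TAXI_OUT"]
               .mean()
               .sort_values(["FL_DATE", "ORIGIN"]))
    rolled = (daily.set_index("FL_DATE")
                   .groupby("ORIGIN")["TAXI_OUT"]
                   .rolling("7D", closed="left").mean()
                   .rename("ROLL_TAXIOUT_7D")
                   .reset_index())
    return df.merge(rolled, on=["ORIGIN", "FL_DATE"], how="left")


def fit_route_te(train, target="IS_DELAYED"):
    """Mean delay rate per ORIGIN-DEST route, learned on training rows only."""
    route = train["ORIGIN"] + "-" + train["DEST"]
    return train[target].groupby(route).mean(), train[target].mean()


def apply_route_te(df, route_means, global_mean):
    """Attach ROUTE_TE; routes unseen in training get the global delay rate."""
    out = df.copy()
    route = out["ORIGIN"] + "-" + out["DEST"]
    out["ROUTE_TE"] = route.map(route_means).fillna(global_mean)
    return out
```

Excluding the current day from the rolling window and fitting the encoding on training rows are both ways to keep information about the flight's own outcome out of its features.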

4. Modeling

  • Binary classification target: ARR_DELAY > 15 minutes.
  • Options:
    • Scikit-learn GradientBoostingClassifier (Python pipeline).
    • XGBoost (--model xgb for stronger performance).
    • Spark ML GBTClassifier (for 5M+ row datasets).
  • Time-based train/test split (prevents leakage).
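
The target definition and time-based split above can be sketched as follows. The helper names and the IS_DELAYED column are hypothetical, and dropping rows with a missing ARR_DELAY (for example cancelled flights) is an assumption about how such rows are handled. The cutoff date mirrors the --test_start value shown in Quick Start.

```python
import pandas as pd


def add_target(df, threshold=15):
    """Label flights with ARR_DELAY above `threshold` minutes as delayed.

    Rows without an ARR_DELAY are dropped first (assumption), since a
    missing value would otherwise be labelled as on time.
    """
    out = df.dropna(subset=["ARR_DELAY"]).copy()
    out["IS_DELAYED"] = (out["ARR_DELAY"] > threshold).astype(int)
    return out


def time_split(df, test_start, date_col="FL_DATE"):
    """Rows before `test_start` go to train; rows on or after it go to test."""
    cutoff = pd.Timestamp(test_start)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test


# Made-up rows, for illustration only.
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2023-06-30", "2023-07-01", "2023-07-02"]),
    "ARR_DELAY": [20.0, 15.0, None],
})
labelled = add_target(flights)
train, test = time_split(labelled, "2023-07-01")
```

Splitting on a date rather than at random means the model is always evaluated on flights that occur after everything it was trained on.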

5. Outputs

  • reports/metrics.json → ROC-AUC, PR-AUC, Recall@Top-20%, OTP (on-time performance) lift.
  • reports/skyops_scores.csv → flight-level risk scores (for Tableau).
  • reports/crew_hotspots.csv → crew hotspot flights (high-risk + short-turn or late-aircraft propagation).
  • reports/feature_importances.csv → model interpretability.
  • reports/figures/*.png → delay trends, feature distributions, model curves.
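
Recall@Top-20% is not defined in this README. One common reading, assumed here, is the share of all truly delayed flights that land in the 20% of flights with the highest risk scores. A minimal sketch of that reading, with a hypothetical function name:

```python
import math


def recall_at_top_fraction(y_true, scores, fraction=0.20):
    """Share of all positives found in the top `fraction` of scores."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    total_pos = sum(y_true)
    if total_pos == 0:
        return float("nan")  # recall is undefined without positives
    # Number of flights in the top slice, at least one.
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / total_pos


# Ten made-up flights, two of them delayed.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(recall_at_top_fraction(labels, risk))  # prints 0.5
```

In the example, the top 20% is two flights and contains one of the two delayed flights, so the metric is 0.5.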

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Run EDA

python spark/run_eda_spark.py --data data/raw/flights_2019_2023.csv

3. Train & Score (Python version)

python scripts/run_train.py --data data/raw/flights_2019_2023.csv --airport DAL --airline WN

4. Train & Score (Spark version – large datasets)

python spark/spark_train.py --data data/raw/flights_2019_2023.csv --airport DAL --airline WN --test_start 2023-07-01

5. Visualize Results

python scripts/visualize_results.py --use_spark --reports_dir reports/spark --out_dir reports/figures

Spark Notes

  • Uses Spark SQL + Window functions for heavy aggregations.
  • FL_DATE is parsed as DD-MM-YYYY (dayfirst=True).
  • On the flights_2019_2023 dataset, which lacks TAIL_NUM, crew hotspots fall back to flights with late-aircraft delays.
  • If TAIL_NUM is available, true short-turnaround + high-risk detection is enabled.
  • Exports CSVs/Parquet for Tableau dashboards.
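
The hotspot fallback described above can be expressed as a small rule. This is a sketch only: the function name and both thresholds (a 0.5 risk cutoff and a 45-minute short turn) are hypothetical placeholders, not values taken from the repository.

```python
from typing import Optional


def is_crew_hotspot(risk_score: float,
                    late_aircraft_delay: float,
                    turnaround_min: Optional[float] = None,
                    risk_cutoff: float = 0.5,
                    short_turn_min: float = 45.0) -> bool:
    """Flag a flight as a crew hotspot.

    With a turnaround value (TAIL_NUM available): high risk AND short turn.
    Without one: fall back to any positive late-aircraft delay.
    """
    if turnaround_min is not None:
        return risk_score >= risk_cutoff and turnaround_min < short_turn_min
    return late_aircraft_delay > 0


# With TAIL_NUM: a risky flight on a 30-minute turn is flagged.
print(is_crew_hotspot(0.8, 0.0, turnaround_min=30.0))  # prints True
# Without TAIL_NUM: only the late-aircraft delay decides.
print(is_crew_hotspot(0.2, 25.0))  # prints True
```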

Example Outputs

  • EDA: % delayed flights, cancellation rate, delay distributions.
  • Metrics: ROC-AUC ~0.80+, Recall@Top-20% ~0.85.
  • Crew Hotspots: e.g., 400+ flights flagged at DAL in July 2023 due to late aircraft propagation.

Visualizations:

  • Delay histogram.
  • Delay causes by airport.
  • ROC / PR curves.

Environment

  • Python 3.10+
  • pandas, numpy, scikit-learn, matplotlib
  • PySpark (for large-scale pipeline)

On Windows (PowerShell), point Spark at the Anaconda Python interpreter:

  • $env:PYSPARK_PYTHON="C:\Users\<you>\anaconda3\python.exe"
  • $env:PYSPARK_DRIVER_PYTHON="C:\Users\<you>\anaconda3\python.exe"

Why This Matters to Southwest in Real Time

  • Operational Efficiency: Predicts delays, helping allocate crews and resources proactively.
  • Safety & Reliability: Identifies propagation risks (late aircraft → next flight delays).
  • Decision-Making: Tableau-ready outputs let leadership see real-time trends and hotspots.
  • Scalability: Works on local datasets (Python) and enterprise-scale BTS data (Spark).
