This repository contains the source code for a specialized Automated Machine Learning (AutoML) pipeline developed to forecast the daily log-returns of the Dow Jones Industrial Average (DJI).
Designed as part of a Master's Thesis in Theoretical Economics, this research framework integrates macroeconomic theory with computational intelligence. The pipeline aggregates data from over 50 historical constituents of the Dow Jones, major macroeconomic indicators, and qualitative geopolitical events to construct a high-dimensional feature space for empirical asset pricing models.
The model utilizes a heterogeneous dataset constructed from the following sources:
- Target Variable: Daily Log-Returns of the Dow Jones Industrial Average (`^DJI`).
- Corporate Data: Historical OHLCV (Open, High, Low, Close, Volume) data for 50+ historical and current constituents of the Dow Jones, sourced via Yahoo Finance with a "warm-up" period to ensure moving average consistency.
- Macroeconomic Indicators: Key economic drivers sourced from the Federal Reserve Economic Data (FRED), including Federal Funds Rate, CPI, GDP, and Unemployment Rate.
- Global Market Indices: Auxiliary financial variables including VIX, Gold, Oil (WTI), DXY, and major global indices (DAX, FTSE, Nikkei).
- Qualitative Events: A bespoke dataset of binary dummy variables representing major exogenous shocks, such as U.S. Presidential Elections, natural disasters, and significant political stability indices.
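As an illustration of the target variable, the sketch below computes daily log-returns, r_t = ln(P_t / P_{t-1}), from a list of closing levels. The function name `log_returns` and the sample prices are hypothetical and are not taken from the repository; the actual pipeline may compute the target differently.

```python
import math


def log_returns(prices):
    """Return r_t = ln(P_t / P_{t-1}) for consecutive prices.

    The first observation has no predecessor, so the output is one
    element shorter than the input.
    """
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Made-up closing levels, for illustration only.
closes = [100.0, 101.0, 99.5, 100.2]
returns = log_returns(closes)
```

Log-returns are additive over time, so the sum of the daily values equals the log-return over the whole window.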
The pipeline follows a strict econometric and engineering workflow:
- Missing Value Imputation: Hierarchical handling of missing data using out-of-index logic and forward-filling, strictly avoiding look-ahead bias.
- Outlier Management: Winsorization at the 0.5% and 99.5% quantiles is applied to variables with extreme kurtosis, limiting the influence of distributional tails without removing observations.
- Stationarity Tests: Automated Augmented Dickey-Fuller (ADF) tests are run on all predictors to check for unit roots and guard against spurious regression results.
- Memory Optimization: Aggressive downcasting of numerical datatypes (e.g., `float64` to `float32`) to enable high-dimensional processing in memory-constrained environments.
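Two of the preprocessing steps above, forward-filling and Winsorization, can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the helper names (`forward_fill`, `quantile`, `winsorize`) are hypothetical, the quantile uses simple linear interpolation, and the repository itself may rely on library routines instead. The ADF tests and the downcasting step are not covered here.

```python
def forward_fill(values):
    """Replace None with the most recent observed value.

    Only past observations are used, so no future information leaks
    backwards. Leading gaps stay None because nothing precedes them.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def winsorize(values, lower=0.005, upper=0.995):
    """Clip values to the [lower, upper] quantile range; nothing is dropped."""
    s = sorted(values)
    lo_bound, hi_bound = quantile(s, lower), quantile(s, upper)
    return [min(max(v, lo_bound), hi_bound) for v in values]


# Made-up series with one extreme observation, for illustration only.
raw = [float(i) for i in range(1000)] + [1e6]
clipped = winsorize(raw)
```

The sketch computes the clipping bounds from whatever sample it is given. In a forecasting setting, one would presumably estimate the bounds on the training window only, so that the clipping itself does not introduce look-ahead bias.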
Technical indicators are generated for each constituent stock to capture momentum, volatility, and trend signals:
- Moving Averages (SMA, EMA)
- Relative Strength Index (RSI)
- Moving Average Convergence Divergence (MACD)
- Bollinger Bands and Average True Range (ATR)
- Volume-weighted metrics (OBV, MFI)
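A few of these indicators can be sketched without external libraries. The example below is an assumption-laden illustration, not the repository's implementation: the function names are hypothetical, the EMA is seeded with the first price and uses the common smoothing factor 2 / (span + 1), and the MACD line uses the conventional 12/26 spans.

```python
def sma(prices, window):
    """Simple moving average; None until `window` observations exist."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)  # warm-up: not enough history yet
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def ema(prices, span):
    """Exponential moving average, seeded with the first price."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out


def macd_line(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]


# Made-up, steadily rising prices, for illustration only.
rising = [float(i) for i in range(1, 61)]
macd = macd_line(rising)
```

The `None` entries at the start of the SMA output show why a warm-up period is useful: window-based indicators are undefined until enough history has accumulated.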
- Splitting Strategy: A strict chronological split is used: Training (70%), Validation (15%), and Testing (15%).
- Baseline Model: A Multivariate Linear Regression model is established as the econometric benchmark.
- AutoML Engine: The H2O.ai framework is utilized to train and cross-validate a wide range of algorithms (GBM, DRF, GLM) within a 1-hour runtime limit (`max_runtime_secs=3600`), optimizing for Root Mean Squared Error (RMSE).
- Statistical Significance: The Diebold-Mariano Test is implemented to statistically compare the predictive accuracy of the optimal H2O model against the linear baseline and other competing models.
- Explainable AI (XAI): Post-hoc interpretation is conducted using SHAP (SHapley Additive exPlanations) values to quantify feature importance and non-linear relationships.
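The chronological split and the Diebold-Mariano comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the loss is squared error, the forecast horizon is one step (so the variance of the loss differential is used without autocovariance terms), and the p-value relies on the asymptotic standard normal distribution. The repository's implementation may differ on any of these points.

```python
import math


def chronological_split(rows, train=0.70, valid=0.15):
    """Split ordered observations into train/validation/test without shuffling."""
    n = len(rows)
    i = round(n * train)
    j = i + round(n * valid)
    return rows[:i], rows[i:j], rows[j:]


def diebold_mariano(actual, pred_a, pred_b):
    """One-step Diebold-Mariano statistic under squared-error loss.

    A positive statistic means model A has the larger average loss.
    Returns (statistic, two-sided p-value).
    """
    # Loss differential d_t = e_a,t^2 - e_b,t^2
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n  # must be > 0
    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal CDF
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(stat) / math.sqrt(2.0))))
    return stat, p_value


# Made-up returns and forecasts, for illustration only.
y_true = [0.01, -0.02, 0.015, -0.005, 0.0, 0.02]
naive = [0.0] * 6  # hypothetical "always zero" forecast
model = [0.009, -0.018, 0.012, -0.004, 0.001, 0.017]  # hypothetical model forecast
dm_stat, dm_p = diebold_mariano(y_true, naive, model)

train_set, valid_set, test_set = chronological_split(list(range(100)))
```

Because the split preserves order, every validation and test observation lies strictly after the training window, which is the property that prevents look-ahead bias during evaluation.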
This script is specifically optimized for Google Colab. It includes an initialization module that automatically handles the installation of all dependencies (Java, H2O, technical analysis libraries).
Usage:
- Click the "Open in Colab" badge above (or open the `.ipynb` file).
- Execute the cells sequentially. The script is self-contained and requires no external configuration files.
The execution generates the following artifacts:
- Data Files: `Thesis_Data_File.csv`, `Stationarity_Results.csv`, `Correlation_Results.csv`.
- Model Metrics: `Baseline_Model_Performance.csv`, `H2O_AutoML_Leaderboard.csv`, `Diebold_Mariano_Test_Results.csv`.
- Visualizations: `Scatter_Plot_Actual_vs_Predicted.png`, `Feature_Importance.png`, `SHAP_Summary_Plot.png`.
Copyright (C) 2025 Mohammad Rasoul Mostafavi Marian
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file for more details.
Author: Mohammad Rasoul Mostafavi Marian, M.Sc. in Theoretical Economics | Researcher in Quantitative Finance


