- Overview
- Key Features (detailed)
- Architecture Overview
- Technical Design Details
- Performance Considerations
- Limitations & Assumptions
- Installation & Reproducible Environment
- Usage Examples (CLI / code snippets)
- Configuration
- Contributing
- License
- Appendix / TODOs
This repository contains a modular pipeline for stock return prediction research. It covers raw data extraction from APIs, preprocessing and sequencing time-series data, a rich set of financial feature-engineering utilities (momentum, liquidity, valuation and risk measures), and several model architectures and training utilities (LSTM, Transformer, Mixture-of-Experts, and MC Dropout layers). The codebase is structured to encourage experimentation: swap feature generators, change model architectures, or plug in new data sources with minimal changes.
Each feature description below includes what it does, how it works (implementation notes), and why it matters.
- Data extraction (`src/data_extraction`)
  - What: Provides an API client and extraction scripts to download stock price series, volume, market caps, and macroeconomic indicators.
  - How: `api_client.py` encapsulates HTTP calls and response normalization. `main.py` orchestrates scheduled extraction tasks and writes raw output to disk using `src/utils/disk_io.py`. A normalization sketch follows this feature list.
  - Why: Reliable, normalized raw data is the foundation for reproducible experiments and avoids ad-hoc scraping logic in downstream code.
- Preprocessing pipeline (`src/data_preprocessing`)
  - What: A set of modules for validating, cleaning, converting, filtering, generating sequences, and finalizing datasets used for training and evaluation.
  - How: Contains specialized scripts:
    - `checker.py` validates dataset integrity (missingness, date continuity).
    - `converters.py` handles unit conversions and aligns periodicities (daily → monthly/weekly aggregation).
    - `filtering.py` contains stock-level and dataset-level filters (e.g., minimum liquidity, exchange inclusion).
    - `sequencing.py` constructs fixed-length rolling windows (timestep sequences) for model input and aligns labels with prediction horizons.
    - `finalize.py` composes feature matrices and splits datasets into train/val/test folds.
  - Why: Ensures a reproducible, auditable path from raw data to model-ready tensors and separates data hygiene from modeling logic.
- Feature engineering (`src/feature_engineering`)
  - What: Implements financial features used in empirical asset pricing (momentum, momentum variants, liquidity measures, volatility, betas, valuation ratios).
  - How: The folder contains `calculations/` with focused implementations (e.g., `momentum.py`, `liquidity.py`, `ratios.py`, `risk.py`) and a `generator/generator.py` that composes features into a single pipeline.
  - Why: Good features capture domain knowledge and significantly improve the model's signal-to-noise ratio. The modular implementation lets you add and benchmark features easily.
- Modeling (`src/modeling`)
  - What: Model architectures and training utilities, including LSTM, Transformer, simple dense networks, an MC Dropout layer, and a Mixture-of-Experts adaptation.
  - How: Implementations are split across `architectures/` (model definitions), `layers/` (custom layers such as `mc_dropout.py`), and `moe/` (mixture-of-experts implementations). `model_builder.py` provides factory/helper functions to instantiate models.
  - Why: Enables quick experiments with different sequence models and uncertainty estimation (MC Dropout) and supports ensemble/expert approaches (MoE) to capture heterogeneous stock behaviors.
- Utilities and visualization (`src/utils` and `graphs/`)
  - What: Helper utilities for metrics, plotting, I/O, and small experiment orchestration.
  - How: `disk_io.py` centralizes save/load semantics (CSV/Parquet/npz), `metrics.py` implements evaluation metrics, and `plotter.py` and `modeling/utils/visualization.py` create baseline figures saved under `graphs/`.
  - Why: Reusable utilities reduce duplicated code, provide consistent experiment outputs, and speed up analysis.
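As referenced in the data extraction entry above, here is a minimal sketch of the kind of response normalization `api_client.py` performs. The function name and the raw payload field names are illustrative assumptions, not the repository's actual API:

```python
# Hypothetical sketch: normalize a JSON price payload into a canonical
# (date, ticker)-indexed DataFrame. Raw field names ("t", "o", ...) are
# assumptions for illustration only.
from typing import Any

import pandas as pd

def normalize_price_payload(payload: list[dict[str, Any]], ticker: str) -> pd.DataFrame:
    df = pd.DataFrame(payload)
    df = df.rename(columns={"t": "date", "o": "open", "h": "high",
                            "l": "low", "c": "close", "v": "volume"})
    df["date"] = pd.to_datetime(df["date"])
    df["ticker"] = ticker
    # Canonical column order; market_cap is kept only when the source provides it.
    cols = ["open", "high", "low", "close", "volume"]
    if "market_cap" in df.columns:
        cols.append("market_cap")
    return df.set_index(["date", "ticker"])[cols].sort_index()
```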
This section explains the high-level architecture, dataflow, responsibilities of main modules, and design patterns used.
High-level dataflow
- Extraction: `src/data_extraction` collects raw time-series and macro data and writes them to `DATA_DIR/raw/`.
- Preprocessing: `src/data_preprocessing` reads raw files, validates them, converts frequencies, filters stocks, generates feature columns, and sequences the data into model-ready arrays. Intermediate artifacts are saved under `DATA_DIR/processed/`.
- Feature engineering: `src/feature_engineering` functions are invoked (from preprocessing or a separate step) to compute domain features per firm-time step. Results are merged into the processed dataset.
- Modeling: `src/modeling` loads processed datasets, constructs model architectures, and trains models. Trained models and logs are persisted in `MODEL_DIR` and `graphs/`.
- Evaluation: Predictions are evaluated using `src/utils/metrics.py` and visualized.
Roles and responsibilities (key files)
- `src/data_extraction/api_client.py`
  - Role: Provides a thin wrapper for external API calls and handles rate-limiting, retries, and normalization of responses to canonical DataFrame formats. Functions typically return pandas DataFrames keyed by date and ticker.
- `src/data_extraction/main.py`
  - Role: CLI/entry-point for scheduled extraction jobs. Calls the `api_client` and writes raw outputs using `src/utils/disk_io`.
- `src/data_preprocessing/converters.py`
  - Role: Aggregation utilities (daily → monthly), alignment of timestamps, and conversion of raw financial statement formats into numeric tables.
- `src/data_preprocessing/sequencing.py`
  - Role: Build rolling windows / sequences of fixed length. The core helper creates 3D arrays of shape [batch, timesteps, features]. The README appendix includes the expected nested list/array structure used by training loops.
- `src/feature_engineering/calculations/momentum.py` (and other calculators)
  - Role: Implement momentum measures (1-, 12-, 36-month mom, chmom, indmom), max daily returns, and sector-adjusted momentum. Functions expect time-indexed price arrays and return aligned series.
- `src/modeling/architectures/lstm.py`
  - Role: Defines an LSTM-based Keras/PyTorch model for sequence regression/classification. (TODO: check exact framework — both TF and PyTorch are acceptable; inspect code to confirm.)
- `src/modeling/architectures/transformer.py`
  - Role: Transformer-style sequence model for longer-range dependencies.
- `src/modeling/architectures/nn.py` and `model_builder.py`
  - Role: Lightweight dense models and factory functions to instantiate different architectures with standardized input shapes and output heads.
- `src/modeling/layers/mc_dropout.py`
  - Role: Specialized dropout layer that stays active at inference time to provide Monte Carlo uncertainty estimates.
- `src/modeling/moe/mixture_of_experts_adapt.py`
  - Role: Implements mixture-of-experts logic to combine specialist submodels; likely includes gating networks and expert routing.
Design patterns and engineering decisions
- Modular pipeline: The project follows a pipeline pattern that separates extraction, transformation, and modeling. This enables re-running individual steps and better unit testing.
- Factory / builder pattern: `model_builder.py` centralizes model creation to allow consistent hyperparameter wiring across experiments.
- Single responsibility: Each module provides one logical responsibility: calculators only compute features, converters only transform frequency/units, and modeling modules only define networks and layers.
- Persist intermediate artifacts: The codebase favors writing processed data to disk (Parquet/CSV) for reproducibility and to avoid expensive recomputation.
graph TD
subgraph Extraction
A[api_client.py] --> B[raw data files]
end
subgraph Preprocessing
B --> C[converters.py]
C --> D[checker.py]
D --> E[filtering.py]
E --> F[sequencing.py]
F --> G[finalize.py]
end
subgraph FeatureEngineering
H[calculations/*] --> I[generator/generator.py]
I --> G
end
subgraph Modeling
G --> J[model_builder.py]
J --> K[architectures/*]
K --> L[lstm.py]
K --> M[transformer.py]
L --> N[layers/mc_dropout.py]
K --> O[moe/mixture_of_experts_adapt.py]
end
subgraph Utils
U[utils/*] --- G
U --- K
U --- B
end
G --> P["MODEL_DIR - checkpoints and artifacts"]
K --> Q["graphs/ - figures"]
This section explains the main algorithms and workflows implemented in the codebase.
- Raw extraction outputs are stored per-source and per-ticker as tabular files (CSV/Parquet). Each time-series file uses an index or column named `date` and a `ticker` identifier where applicable.
- The API client normalizes response payloads to pandas DataFrames with consistent column names: `open`, `high`, `low`, `close`, `volume`, `market_cap` (when available).
Key steps:
- Validation: `checker.py` ensures date ranges are consistent and flags missing days or unexpectedly sparse series. It raises or logs warnings based on thresholds.
- Frequency conversion: `converters.py` provides functions to create monthly and weekly aggregations from daily EOD data (see the sketch after this list). Typical operations:
  - monthly close: last trading day's `close` per month
  - monthly volume: sum of `volume`
  - dollar volume: average price × volume aggregation
- Filters: `filtering.py` removes stocks that fail liquidity thresholds or have insufficient history. Filters are typically parameterized in `src/config/settings.py`.
- Feature computation: The `feature_engineering` calculators are called to attach engineered features to each (ticker, date) row.
- Sequencing: `sequencing.py` produces training windows: for a chosen `window_size` (e.g., 12 months) it creates sequences X of shape [N, window_size, F] and labels y of shape [N, 1] corresponding to forward returns or classification bins.
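A minimal sketch of the monthly aggregation described in the frequency-conversion step, assuming a single-ticker daily DataFrame with a DatetimeIndex and `close`/`volume` columns; the exact helpers in `converters.py` may differ:

```python
import pandas as pd

def to_monthly(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: one ticker, DatetimeIndex, columns `close` and `volume`."""
    monthly = pd.DataFrame({
        # last trading day's close per month ("M" = month-end; newer pandas prefers "ME")
        "close": daily["close"].resample("M").last(),
        # total monthly share volume
        "volume": daily["volume"].resample("M").sum(),
        # average daily dollar volume within the month
        "dolvol": (daily["close"] * daily["volume"]).resample("M").mean(),
    })
    return monthly.dropna(how="all")
```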
Label alignment: Labels are aligned carefully to avoid lookahead bias. For example, to predict next-month return, the label for a sequence ending at month t uses returns computed from t+1.
Edge-case handling: The pipeline drops sequences with missing values beyond a configurable threshold, and optionally fills short gaps with forward/backward fill or interpolation as configured.
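The windowing and causal label alignment can be pictured with a small sketch (a schematic, not the repository's exact `sequencing.py` code), assuming a per-ticker monthly feature matrix and a monthly return series:

```python
import numpy as np

def make_sequences(features: np.ndarray, returns: np.ndarray, window_size: int = 12):
    """features: [T, F] monthly feature matrix; returns: [T] monthly returns.
    Produces X of shape [N, window_size, F] and y of shape [N, 1], where the
    label for a window ending at month t is the return realized at t+1."""
    X, y = [], []
    # Stop one month early so every window has a t+1 label (no lookahead).
    for end in range(window_size - 1, len(features) - 1):
        X.append(features[end - window_size + 1: end + 1])  # months t-window_size+1 .. t
        y.append(returns[end + 1])                           # forward return at t+1
    return np.asarray(X), np.asarray(y).reshape(-1, 1)
```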
Major feature groups (implemented in src/feature_engineering/calculations):
- Momentum measures
  - Implementation: rolling returns over several horizons (1m, 12m, 36m), change-in-momentum (chmom), and industry-adjusted momentum (indmom). Functions operate on monthly aggregated returns and use vectorized pandas/numpy operations (see the momentum sketch after this list).
  - Importance: Momentum is a persistent cross-sectional predictor in equity returns research.
- Liquidity measures
  - Implementation: dollar volume (`dolvol`), turnover (`turn`), zero-trading days (`zerotrade`), and log market value (`mve`). Aggregations and rolling statistics are implemented with groupby-rolling semantics.
  - Importance: Liquidity is correlated with return expectations and helps filter microcaps and illiquid stocks.
- Risk measures
  - Implementation: idiosyncratic volatility (`idiovol`) computed as the residual standard deviation from regressions on market returns over rolling windows; beta and beta-squared computed with rolling OLS on weekly returns (a rolling-beta sketch follows the vectorization note below).
  - Importance: Controls for risk exposures and enables risk-adjusted modeling.
- Valuation & fundamentals
  - Implementation: price-to-earnings-like signals (`ep_sp`), earnings growth (`agr`), with functions handling annual and quarterly inputs.
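As referenced in the momentum entry above, a hedged sketch of a 12-month momentum calculation for one ticker follows. The 12-2 skip-month convention shown here is common in the literature but is not necessarily the repository's exact definition:

```python
import numpy as np
import pandas as pd

def momentum(monthly_close: pd.Series, horizon: int = 12, skip: int = 1) -> pd.Series:
    """Rolling momentum over `horizon` months, skipping the most recent `skip`
    months (set skip=0 for a plain trailing return). Input: month-end closes
    for one ticker; output: a series aligned to the input index."""
    log_ret = np.log(monthly_close).diff()
    # Sum log returns over the window, then shift past the skipped months.
    mom = log_ret.rolling(horizon - skip).sum().shift(skip)
    return np.expm1(mom)  # back to simple-return space
```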
Vectorization & batching: Calculations are written to operate on pandas Series/DataFrame columns and accept both single-ticker series and batched DataFrames. The generator composes feature columns into a final wide table.
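For the risk measures, rolling covariance/variance gives beta without refitting a full OLS at every step, and the residual spread approximates idiosyncratic volatility. This is a lightweight single-factor sketch under those assumptions, not the repository's exact routine:

```python
import pandas as pd

def rolling_beta_idiovol(stock_ret: pd.Series, mkt_ret: pd.Series, window: int = 52):
    """Weekly return series for one stock and the market; returns (beta, idiovol).
    beta_t = Cov(r_s, r_m) / Var(r_m) over the trailing window; idiovol_t is the
    rolling std of the residual r_s - beta_t * r_m (intercept ignored, and the
    time-varying beta is applied pointwise, so this only approximates the
    rolling-OLS residual std)."""
    cov = stock_ret.rolling(window).cov(mkt_ret)
    var = mkt_ret.rolling(window).var()
    beta = cov / var
    resid = stock_ret - beta * mkt_ret
    idiovol = resid.rolling(window).std()
    return beta, idiovol
```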
Model construction
`model_builder.py` exposes a small API to create models with a consistent signature. Typical inputs:
- `input_shape` (timesteps, features)
- `output_dim` (1 for regression or the number of classes)
- `hparams` (dropout, hidden sizes, learning rate)
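A hypothetical call illustrating that signature (the factory name `build_model`, the `architecture` argument, and the hparam keys are assumptions; check `model_builder.py` for the real interface):

```python
# Illustrative only: the factory name and keyword arguments are assumed.
from src.modeling.model_builder import build_model

model = build_model(
    architecture="lstm",                 # or "transformer", "nn", "moe"
    input_shape=(12, 64),                # 12 timesteps, 64 features
    output_dim=1,                        # single regression head
    hparams={"hidden_size": 128, "dropout": 0.2, "learning_rate": 1e-3},
)
```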
Training loops
- The repo includes lightweight training scaffolds (TODO: exact training loop entrypoint). Training loops:
  - Load dataset artifacts (`npz`/`npy` or memory-mapped arrays).
  - Create data loaders / iterators that yield batches of X, y.
  - Configure optimizer, loss (e.g., MSE for regression; cross-entropy for classification), and a scheduler (TODO: confirm exact scheduler implementation).
  - Run epochs with per-epoch validation evaluation and early stopping criteria.
Loss functions & optimization
- Default losses are standard regression/classification losses. The codebase supports uncertainty-aware predictions using MC Dropout: run multiple stochastic forward passes at inference (keeping dropout active) and aggregate mean and variance.
Inference
- For point predictions, models support a `predict(X)` API that returns forecasted returns.
- For uncertainty estimates, the `mc_dropout` layer is used at inference time with multiple stochastic passes (see the sketch below).
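A framework-light sketch of that aggregation; it assumes the model's `predict` remains stochastic because the MC Dropout layer stays active at inference (adapt the call to the actual model API):

```python
import numpy as np

def mc_dropout_predict(model, X, n_passes: int = 50):
    """Run repeated stochastic forward passes and aggregate them into a mean
    forecast and a variance-based uncertainty estimate."""
    preds = np.stack([np.asarray(model.predict(X)) for _ in range(n_passes)])
    return preds.mean(axis=0), preds.var(axis=0)
```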
Ensembling / MoE
- The Mixture-of-Experts implementation provides a gating network that routes inputs to specialist experts. This is useful for handling heterogeneous cross-sectional behavior (e.g., sector-specific dynamics); a conceptual sketch follows.
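Conceptually (this is not the repository's `mixture_of_experts_adapt.py` code), a soft gating network weights expert outputs per input:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_combine(x, experts, gate_weights):
    """x: [batch, features]; experts: list of callables mapping x -> [batch, 1];
    gate_weights: [features, n_experts] linear gating parameters (illustrative).
    Returns the gate-weighted combination of expert predictions."""
    gates = softmax(x @ gate_weights)                         # [batch, n_experts]
    expert_out = np.stack([e(x) for e in experts], axis=-1)   # [batch, 1, n_experts]
    return (expert_out * gates[:, None, :]).sum(axis=-1)      # [batch, 1]
```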
- Evaluation metrics live in `src/utils/metrics.py` and include standard MSE/RMSE and custom finance metrics (e.g., information coefficient, rank correlation, decile portfolio returns).
- Visualization functions can plot distributions, correlation matrices, feature importance, and sample prediction-vs-actual charts saved to `graphs/`.
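As a reference point for the information coefficient mentioned above, one common definition is the cross-sectional Spearman rank correlation between predictions and realized returns, computed per date and then averaged; `metrics.py` may define it differently:

```python
import pandas as pd

def information_coefficient(df: pd.DataFrame) -> float:
    """df columns: date, y_pred, y_true (one row per stock-date).
    Computes the rank correlation within each date, then averages across dates."""
    per_date = df.groupby("date")[["y_pred", "y_true"]].apply(
        lambda g: g["y_pred"].corr(g["y_true"], method="spearman")
    )
    return float(per_date.mean())
```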
- Data persistence: Intermediate artifacts are saved to disk (Parquet/npz) to avoid re-computation, especially for expensive feature calculations.
- Vectorized calculations: Feature calculators use pandas/numpy vectorized ops and groupby-rolling semantics to avoid Python-level loops.
- Batch-friendly design: `sequencing.py` produces contiguous arrays ready for fast batch ingestion into frameworks (TF/PyTorch).
- IO choices: Prefer Parquet for large tabular artifacts to reduce IO overhead and memory usage.
- Parallelization: TODO: if needed, add optional multiprocessing or Dask support for feature computation across tickers.
Practical tips
- When computing rolling regressions (idiovol, beta) prefer windowed matrix operations or incremental OLS to reduce recomputation.
- Use memory-mapped numpy arrays for very large datasets to avoid memory blow-ups.
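A minimal illustration of the memory-mapping tip, assuming sequence arrays were saved with `np.save` (the path is a placeholder):

```python
import numpy as np

# Open the array lazily: pages are read from disk on access instead of
# materializing the whole array in RAM.
X = np.load("DATA_DIR/processed/X_train.npy", mmap_mode="r")

# Slicing pulls only the requested batch into memory.
batch = np.asarray(X[0:256])
```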
- Data source assumptions: The pipeline assumes external APIs provide consistent daily EOD data with standard columns; mismatches require connector updates.
- Missing values: The pipeline currently drops or fills missing values according to simple heuristics; more advanced imputation is optional.
- Timezones and trading calendars: Code assumes a single trading calendar; multi-market or cross-listing requires calendar-aware alignment.
- Framework agnostic: The modeling code may assume Keras or PyTorch in different files — confirm the chosen framework in the codebase before running training (TODO).
These instructions create a reproducible Python environment on Windows (PowerShell). Adjust for Linux/macOS as needed.
- Clone repository and change into it
git clone https://github.com/AdamAdham/Stock-Return-Prediction.git
cd Stock-Return-Prediction
- Create and activate virtual environment (recommended)
python -m venv .venv
.\.venv\Scripts\Activate.ps1
- Install core dependencies
TODO: Add requirements.txt with exact pinned versions. The list below is a minimal starting point:
pip install --upgrade pip
pip install pandas numpy scipy scikit-learn matplotlib seaborn jupyter
# If using TensorFlow (CPU)
pip install tensorflow
# Or PyTorch (CPU) - choose one
pip install torch torchvision torchaudio
- (Optional) For GPU training
- Follow TensorFlow / PyTorch official docs to install GPU-enabled builds and CUDA/cuDNN matching your GPU and drivers.
Guidelines for contributors:
- Fork → branch (feature/ or fix/ prefix) → implement → tests → PR.
- Write unit tests covering new features (place them in `tests/`). Aim for deterministic tests by using small synthetic datasets (see the sketch after this list).
- Keep data extraction logic side-effect free where possible; prefer returning DataFrames from functions and centralizing writes in `disk_io.py`.
- Document assumptions and add docstrings for complex functions, especially those that implement rolling regressions or causal label alignment.
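An example of the deterministic, synthetic-data testing style suggested above (pytest-style; `momentum` here refers to the hypothetical sketch in the feature-engineering section, not a confirmed repository function):

```python
import numpy as np
import pandas as pd

def test_momentum_on_constant_growth():
    # Synthetic monthly closes growing 1% per month: the 12-2 momentum should be
    # identical for every complete window, making the assertion deterministic.
    idx = pd.date_range("2020-01-31", periods=36, freq="M")
    closes = pd.Series(100 * 1.01 ** np.arange(36), index=idx)
    mom = momentum(closes, horizon=12, skip=1)   # hypothetical feature function
    expected = 1.01 ** 11 - 1                    # 11 monthly steps in a 12-2 window
    assert np.allclose(mom.dropna(), expected)
```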
Code style
- Follow PEP8 for Python. Use `black` and `flake8` for automated formatting and linting.
Developer tips
- Use the modular structure: extend `src/feature_engineering/generator` to add new engineered features.
- Add new model definitions in `src/modeling/architectures` and register them in `model_builder.py`.
- Keep data extraction idempotent: make raw data downloads reproducible and safe to re-run.