This repository contains the code for the paper titled "Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting" by Chutian Ma, Grigorii Pomazkin, Giacinto Paolo Saggese, and Paul Smith.
This project introduces the Forecast Accuracy and Coherence (AC) Score, a novel metric for evaluating probabilistic multi-horizon forecasts that accounts for both accuracy and stability. Unlike traditional metrics that evaluate each horizon in isolation, the AC Score jointly measures multi-horizon accuracy and how consistently models predict the same future events as the forecast origin advances.
Our AC-optimized SARI models achieve:
- 91.1% reduction in forecast volatility for the same target timestamps
- Up to 26% median improvement in medium- to long-horizon accuracy
- Modest one-step-ahead accuracy trade-off (7.5% on average)
```
.
├── notebooks/                   # Jupyter notebooks for analysis
├── outcomes/                    # Output directory for results
├── helpers/                     # Helper functions
├── forecast_metric_utility.py   # Training scripts
├── differentiable_arima.py      # Differentiable SARIMA implementation
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
- Clone this repository:

  ```
  git clone https://github.com/causify-ai/beyond_accuracy.git
  cd beyond_accuracy
  ```

- Create and activate a virtual environment:

  ```
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

The main dependencies include:

- `torch` - PyTorch for the differentiable model implementation
- `pmdarima` - auto-ARIMA for hyperparameter selection
- `statsmodels` - traditional SARIMA baseline models
- `pandas` - data manipulation
- `numpy` - numerical computations
- `matplotlib` - visualization
- `fev` - interface for benchmark datasets, including M4

Full dependencies are listed in `requirements.txt`.
The experiments use the M4 Hourly benchmark dataset.
To reproduce the main results from the paper:
```
python run_experiment.py --batch_size 3 --num_threads 8 --dst_dir "./outcomes/...(subfolder name)"
```

This will:
- Load the M4 Hourly dataset and divide it into batches
- Execute the following steps in parallel:
  - Split each series into 60% training / 40% test
  - Use auto-ARIMA to select hyperparameters
  - Train traditional MLE-based SARI models (baseline)
  - Train AC-optimized SARI models
  - Generate out-of-sample forecasts
  - Compute evaluation metrics
- Save results to the specified `dst_dir` folder
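The batching and splitting steps above can be sketched as follows. This is illustrative helper code under assumed conventions (a chronological 60/40 split and fixed-size batches), not the actual logic in `run_experiment.py`:

```python
import numpy as np

def make_batches(series_list, batch_size):
    """Divide the list of series into fixed-size batches for parallel processing."""
    return [series_list[i:i + batch_size] for i in range(0, len(series_list), batch_size)]

def split_series(y, train_frac=0.6):
    """Chronological split: first 60% of observations for training, last 40% for testing."""
    n_train = int(round(len(y) * train_frac))
    return y[:n_train], y[n_train:]

# Stand-in data: 7 synthetic series of 120 hourly observations each.
series = [np.random.default_rng(s).normal(size=120) for s in range(7)]
batches = make_batches(series, batch_size=3)   # 3 batches of sizes 3, 3, 1
train, test = split_series(series[0])          # 72 training / 48 test points
```

The chronological (rather than random) split matters for time series: the test block must follow the training block in time so that out-of-sample forecasts are genuinely out of sample.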
Note: The full run takes approximately 12-20 hours on an 8-core CPU. To run a partial experiment, additional arguments can be passed to limit its size. For example, `--num_test 50` will randomly select 50 time series from the dataset (default seed: 42).
The Forecast AC Score combines two components:
- Accuracy term: Multi-horizon energy score with horizon-specific weights
- Stability term: Energy distance between forecasts targeting the same timestamp
The metric is implemented as:
```
AC_score = Accuracy + λ × Stability
```
where λ is the stability multiplier (default: 0.5).
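A sample-based sketch of these components is below. The function names are illustrative, not the repository's API; it assumes both terms are estimated from matrices of forecast samples (rows are sample paths, columns are horizons), using the standard plug-in estimators for the energy score and energy distance:

```python
import numpy as np

def energy_score(samples, obs):
    """Plug-in energy score: E||X - y|| - 0.5 * E||X - X'|| over forecast samples."""
    samples = np.asarray(samples, dtype=float)
    term_acc = np.mean(np.linalg.norm(samples - obs, axis=-1))
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    return term_acc - 0.5 * np.mean(pairwise)

def energy_distance(x, y):
    """Energy distance between two forecast samples targeting the same timestamps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cross = np.mean(np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1))
    within_x = np.mean(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1))
    within_y = np.mean(np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1))
    return 2 * cross - within_x - within_y

def ac_score(accuracy, stability, lam=0.5):
    """Combine the two terms: AC_score = Accuracy + λ × Stability."""
    return accuracy + lam * stability

# Usage: 50 sample paths over a 3-step horizon from two consecutive origins.
rng = np.random.default_rng(0)
fcst_old, fcst_new = rng.normal(size=(50, 3)), rng.normal(size=(50, 3))
score = ac_score(energy_score(fcst_new, np.zeros(3)),
                 energy_distance(fcst_old, fcst_new))
```

Note that the energy distance between two identical samples is zero, so a perfectly stable forecaster incurs no stability penalty.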
The SARI model is implemented in PyTorch with:
- Autoregressive and seasonal autoregressive coefficients as learnable parameters
- Initialization from auto-ARIMA hyperparameter search
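To make the "learnable coefficients" idea concrete, here is a minimal sketch of an autoregressive predictor with AR and seasonal AR terms as `nn.Parameter`s. It is deliberately simplified (no differencing, no likelihood, illustrative class name) and is not the implementation in `differentiable_arima.py`:

```python
import torch
import torch.nn as nn

class ARSketch(nn.Module):
    """Toy AR(p) + seasonal AR(P) one-step predictor with learnable coefficients."""

    def __init__(self, p, P, m):
        super().__init__()
        self.p, self.P, self.m = p, P, m
        self.phi = nn.Parameter(torch.zeros(p))   # non-seasonal AR coefficients
        self.Phi = nn.Parameter(torch.zeros(P))   # seasonal AR coefficients

    def forward(self, y):
        """One-step-ahead prediction from a 1-D history tensor y."""
        pred = sum(self.phi[i] * y[-1 - i] for i in range(self.p))
        pred = pred + sum(self.Phi[j] * y[-(j + 1) * self.m] for j in range(self.P))
        return pred

model = ARSketch(p=2, P=1, m=24)        # m=24 matches an hourly seasonal period
pred = model(torch.arange(48.0))        # differentiable w.r.t. phi and Phi
```

Because the coefficients are registered parameters, gradients of any differentiable loss (such as the AC Score) flow back to them through standard autograd, which is what allows direct AC optimization rather than MLE fitting.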
By default, we use linear decay weights:

```
w(h) = 1 - h/H
```

which emphasize shorter horizons while maintaining awareness of longer horizons.
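The weight schedule is simple to compute; the sketch below assumes horizons indexed h = 1..H (the paper's exact indexing convention may differ):

```python
import numpy as np

H = 48                    # e.g. a 48-step forecast horizon
h = np.arange(1, H + 1)   # horizons 1..H
w = 1 - h / H             # linear decay: largest weight at h=1, zero at h=H
```

Under this convention the weights decrease strictly with the horizon, so the one-step-ahead error contributes most to the accuracy term.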
If you use this code in your research, please cite:
```
@article{ma2026beyond,
  title={Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting},
  author={Ma, Chutian and Pomazkin, Grigorii and Saggese, Giacinto Paolo and Smith, Paul},
  journal={arXiv preprint arXiv:2601.10863},
  year={2026}
}
```

Python version 3.12 or higher is required. See `requirements.txt` for required packages.
For questions or issues, please open an issue on GitHub or contact:
- Chutian Ma: c.ma@causify.ai
- Grigorii Pomazkin: g.pomazkin@causify.ai
- Giacinto Paolo Saggese: gp@causify.ai
- Paul Smith: paul@causify.ai
This project is licensed under the Apache License 2.0.