Merged
31 commits
48fcb9c
fix np.array error
xiaolong0728 Jul 10, 2025
ad5814b
fix test
xiaolong0728 Jul 10, 2025
0527822
remove print
xiaolong0728 Jul 10, 2025
6bfd9b5
allow transform multiple targets
xiaolong0728 Jul 11, 2025
419c20e
remove comments
xiaolong0728 Jul 15, 2025
cf76d48
add mse
xiaolong0728 Jul 15, 2025
762be86
update functions
xiaolong0728 Jul 15, 2025
bbb4add
remove comment
xiaolong0728 Jul 16, 2025
026a5c7
more metrics to be done
xiaolong0728 Jul 16, 2025
e82889d
add adr 004 evaluation input schema
xiaolong0728 Jul 17, 2025
43f10c8
add mse to default metric list
xiaolong0728 Jul 17, 2025
01962a8
no config
xiaolong0728 Jul 18, 2025
541c338
add adr 004
xiaolong0728 Jul 18, 2025
aecb943
add adr 005
xiaolong0728 Jul 18, 2025
c796184
update ADR 005
xiaolong0728 Jul 18, 2025
79ac74f
update ADR 002
xiaolong0728 Jul 18, 2025
3291a8d
update adr
xiaolong0728 Jul 21, 2025
fe6d971
add eval dictionary generator
xiaolong0728 Jul 21, 2025
a49b90e
update adr
xiaolong0728 Jul 21, 2025
803198d
revert
xiaolong0728 Jul 21, 2025
f08d90c
add msle
xiaolong0728 Jul 21, 2025
ff1b415
update generator
xiaolong0728 Jul 21, 2025
72568de
update logic
xiaolong0728 Jul 28, 2025
36d2eb7
move mean prediction to metric
xiaolong0728 Jul 28, 2025
1028a39
rename to y_hat_bar
xiaolong0728 Jul 28, 2025
f9829a4
update index querying
xiaolong0728 Jul 28, 2025
a192141
update name
xiaolong0728 Jul 28, 2025
cc9637b
update quickstart
xiaolong0728 Jul 30, 2025
76343a9
add post analysis
xiaolong0728 Jul 30, 2025
43e9f75
update match_actual_pred to deal with missing countries and duplicated
xiaolong0728 Jul 31, 2025
7270606
refactor index matching in EvaluationManager to improve handling of a…
xiaolong0728 Aug 1, 2025
92 changes: 55 additions & 37 deletions documentation/ADRs/002_evaluation_strategy.md
@@ -4,81 +4,99 @@
|---------------------|-------------------|
| Subject | Evaluation Strategy |
| ADR Number | 002 |
| Status | Accepted|
| Author | Mihai, Xiaolong|
| Date | 31.10.2024 |
| Status | Proposed |
| Author | Xiaolong, Mihai|
| Date | 16.07.2025 |

## Context
The primary output of VIEWS is a panel of forecasts, which consist of temporal sequences of predictions for each observation at the corresponding level of analysis (LOA). These forecasts span a defined forecasting window and can represent values such as predicted fatalities, probabilities, quantiles, or sample vectors.
To ensure reliable and realistic model performance assessment, our forecasting framework supports both **offline** and **online** evaluation strategies. These strategies serve complementary purposes: offline evaluation simulates the forecasting process retrospectively, while online evaluation assesses actual deployed forecasts against observed data.

The machine learning models generating these predictions are trained on historical time-series data, supplemented by covariates that may be either known in advance (future features) or only available for training data (shifted features). Given the variety of models used, evaluation routines must remain model- and paradigm-agnostic to ensure consistency across different methodologies.
Both strategies are designed to work with time-series predictions and support multi-step forecast horizons, ensuring robustness across temporal scales and use cases.


## Decision
The evaluation strategy must be structured to assess predictive performance comprehensively.
We adopt a dual evaluation approach consisting of:
1. **Offline Evaluation:** Evaluating a model's performance on historical data, before deployment.

2. **Online Evaluation:** The ongoing process of evaluating a deployed model's performance as new, real-world data becomes available.


### Points of Definition:

1. *Time*: All time points and horizons mentioned below are in **outcome space**, also known as $Y$-space : this means that they refer to the time point of the (forecasted or observed) outcome. This is especially important for such models where the feature-space and outcome-space are shifted and refer to different time points.
- **Rolling-Origin Holdout:** A robust backtesting strategy that simulates a real-world forecasting scenario by generating forecasts from multiple, rolling time origins.

2. *Temporal resolution*: The temporal resolution of VIEWS is the calendar-month. These are referred in VIEWS by an ordinal (Julian) month identifier (`month_id`) which is a serial numeric identifier with a reference epoch (month 0) of December 1979. For control purposes, January 2024 is month 529. VIEWS does not define behavior and does not have the ability to store data prior to the reference epoch (with negative `month_id`). Conflict history data, which marks the earliest possible meaningful start of the training time-series, is available from `month_id==109`.
- **Forecast Steps:** The time increment between predictions within a **sequence** of forecasts (further referred to as steps).

3. *Forecasting Steps* (further referred to as steps) is defined as the 1-indexed number of months from the start of a forecast time-series.
- **Sequence:** An ordered set of data points indexed by time.

### General Evaluation Strategy
### Diagram
![path](../img/approach.png)

The general evaluation strategy involves *training* one model on a time-series that goes up to the training horizon $H_0$. This sequence is then used to predict a number of sequences (time-series). The first such sequence goes from $H_{0+1}$ to $H_{0+36}$, thus containing 36 forecasted values -- i.e. 36 months. The next one goes from $H_{0+2}$ to $H_{0+37}$. This is repeated until we reach a constant stop-point $k$ such that the last sequence forecasted is $H_{0+k+1}$ to $H_{0+k+36}$.
### Offline Evaluation
We adopt a **rolling-origin holdout evaluation strategy** for all offline (backtesting) evaluations.

Normally, it is up to the modeller whether the model performs *expanding window* or *rolling window* evaluation, since *how* prediction is carried out all evaluations are of the *expanding window forecasting* type, i.e. the training window.
The offline evaluation strategy proceeds as follows (a schematic sketch of the rolling windows follows this list):
1. **A single model** is trained on historical data up to the training cutoff $H_0$.
2. Using this trained model object, a forecast is generated for the next **36 months**:
   - Sequence 1: $H_{0+1}$ → $H_{0+36}$
3. The origin is then rolled forward by one month, and another forecast is generated:
   - Sequence 2: $H_{0+2}$ → $H_{0+37}$
4. This process continues until a fixed number of sequences **k** is reached.
5. In our standardized offline evaluation, **12 forecast sequences** are used (i.e., k = 12).
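
A minimal sketch of how these rolling windows could be enumerated, assuming monthly `month_id` indexing, an example training cutoff, and a hypothetical `model.predict` helper; this is illustrative, not the pipeline implementation.

```python
# Illustrative sketch of rolling-origin holdout windows (not the pipeline implementation).
# h0 is the last training month_id; model.predict(first, last) is a hypothetical helper
# returning a forecast for the inclusive month range.

HORIZON = 36  # forecast steps per sequence
K = 12        # number of rolling origins in the standard offline evaluation

def rolling_origin_windows(h0: int, horizon: int = HORIZON, k: int = K):
    """Yield (sequence_number, first_month, last_month) for each rolling origin."""
    for i in range(1, k + 1):
        first = h0 + i               # sequence i starts i months after the training cutoff
        last = first + horizon - 1   # inclusive end of the 36-month horizon
        yield i, first, last

# Example with a hypothetical training cutoff at month_id 492:
for seq, first, last in rolling_origin_windows(h0=492):
    print(f"Sequence {seq}: month_id {first}..{last}")
    # forecast = model.predict(first, last)  # single trained model reused at every origin
```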

#### Live evaluation
It is important to note that **offline evaluation is not a true forecast**. Instead, it is a simulation using historical data from the **Validation Partition** to approximate forecasting performance under realistic, rolling deployment conditions. (See [ADR TBD] for the data partitioning strategy.)

For **live** evaluation, we suggest doing this in the same way as has been done for VIEWS2020/FCDO (_confirm with HH and/or Mike_), i.e. predict to k=12, resulting in *12* time series over a prediction window of *48* months. We call this the evaluation partition end $H_{e,live}$. This gives a prediction horizon of 48 months, thus $H_{47}$ in our notation.

Note that this is **not** the final version of online evaluation.
### Online Evaluation
Online evaluation reflects **true forecasting** and is based on the **Forecasting Partition**.

#### Offline evaluation
Suppose the latest available data point is $H_{36}$. Over time, the system would have generated the following forecast sequences:
- Sequence 1: forecast for $H_{1}$ → $H_{36}$, generated at time **t = 0**
- Sequence 2: forecast for $H_{2}$ → $H_{37}$, generated at **t = 1**
- ...
- Sequence 36: forecast for $H_{36}$ → $H_{71}$, generated at **t = 35**

For **offline** model evaluation, we suggest doing this in a way that simulates production over a longer time-span. For this, a new model is trained at every **twelve** months interval, thus resetting $H_0$ at months $H_{0+0}, H_{0+12}, H_{0+24}, \dots H_{0+12r}$ where $12r=H_e$.
At time $H_{36}$, we evaluate all forecasts made for $H_{36}$, i.e., the predictions from each of these 36 sequences are compared to the true value observed at $H_{36}$.

The default way is to set $H_{e_eval}$ to 48 months, meaning we only train the model once at $H_0$. This will result in **12** time series. We call it **standard** evaluation.
This provides a comprehensive view of how well the deployed model performs across multiple forecast origins and steps.
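
A toy sketch of this alignment step, using simplified dictionaries rather than the real prediction storage (all values are made up):

```python
# Illustrative only: gather every stored forecast that targets a given month and
# score it against the observed value. Data structures here are simplified stand-ins.

observed = {36: 41.0}  # hypothetical observed outcome at month H_36

# Each sequence maps target month -> predicted value; sequence s was generated at t = s - 1.
sequences = {
    1: {m: 40.0 for m in range(1, 37)},    # forecast for H_1 .. H_36
    2: {m: 42.5 for m in range(2, 38)},    # forecast for H_2 .. H_37
    36: {m: 39.0 for m in range(36, 72)},  # forecast for H_36 .. H_71
}

target_month = 36
squared_errors = {
    seq: (preds[target_month] - observed[target_month]) ** 2
    for seq, preds in sequences.items()
    if target_month in preds
}
print(squared_errors)  # one squared error per sequence that covers H_36
```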

We also propose the following practical approaches:
## Consequences

1. A **long** evaluation where we set $H_{e_eval}$ to 72 months. This will result in *36* predicted time-series.

2. A **complete** evaluation system, the longest one, where we set $H_0$ at 36 months of data (157 for models depending on UCDP GED), and iterate until the end of data (currently, the final $H_0$ will be 529).
**Positive Effects:**
- Reflects realistic deployment and monitoring conditions.

For comparability and abstraction of seasonality (which is inherent in both the DGP as well as the conflict data we rely on, due to their definition), $H_0$ should always be December or June (this also adds convenience).
- Allows for evaluation across multiple forecast origins and time horizons.

The three approaches have trade-offs besides increasing computational complexity. Since conflict is not a stationary process, evaluation carried for long time-periods will prefer models that predict whatever stationary components exist in the DGP (and thus in the time-series). For example these may include salient factors such GDP, HDI, infant mortality etc.. Evaluation on such very long time-spans may substantially penalize models that predict more current event, due shorter term causes that were not so salient in the past. Examples of these may be the change in the taboo on inter-state war after 2014 and 2022 with Russia invading Ukraine.
- Improves robustness by capturing temporal variation in model performance.


**Negative Effects:**
- Requires careful alignment of sequences and forecast windows.

## Consequences
- May introduce computational overhead due to repeated evaluation across multiple origins.

**Positive Effects:**
- Standardized evaluation across models, ensuring comparability.
- Models must be capable of generalizing across slightly shifted input windows.

- Clear separation of live and offline evaluation, facilitating both operational monitoring and research validation.

**Negative Effects:**
- Increased computational demands for long and complete evaluations.
## Rationale
The dual evaluation setup strikes a balance between experimentation and real-world monitoring:

- Potential complexity in managing multiple evaluation strategies.
- **Offline evaluation** provides a controlled and reproducible environment for backtesting.
- **Online evaluation** reflects actual model behavior in production.

For further technical details:
- See [ADR 004 – Evaluation Input Schema](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/004_evaluation_input_schema.md)
- See [ADR 003 – Metric Calculation](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md)

- Additional infrastructure requirements.

## Rationale
By structuring evaluation routines to be agnostic of the modeling approach, the framework ensures consistency in assessing predictive performance. Using multiple evaluation methodologies balances computational feasibility with robustness in performance assessment.

### Considerations
- Computational cost vs. granularity of evaluation results.
- Sequence length (currently 36 months) may need to be adjusted for different use cases (e.g., quarterly or annual models).

- The number of sequences (k) can be tuned depending on evaluation budget or forecast range.

- Trade-offs between short-term and long-term predictive performance.
- Consider future support for probabilistic or uncertainty-aware forecasts in the same rolling evaluation framework.

- Ensuring reproducibility and scalability of evaluation routines.


## Feedback and Suggestions
55 changes: 55 additions & 0 deletions documentation/ADRs/004_evaluation_input_schema.md
@@ -0,0 +1,55 @@
# Evaluation Input Schema

| ADR Info | Details |
|---------------------|-------------------|
| Subject | Evaluation Input Schema |
| ADR Number | 004 |
| Status | Proposed |
| Author | Xiaolong |
| Date | 16.06.2025 |

## Context
In our modeling pipeline, a consistent and flexible evaluation framework is essential for comparing model performance.


## Decision

We adopt the `views-evaluation` package to standardize the evaluation of model predictions. The core component of this package is the `EvaluationManager` class, which is initialized with a **list of evaluation metrics**.

The `evaluate` method accepts the following inputs:
1. A DataFrame of actual values,
2. A list of prediction DataFrames,
3. The target variable name,
4. The model config.

Both the actuals and prediction DataFrames must use a multi-index of `(month_id, country_id/priogrid_gid)` and contain a column for the target variable. In the actuals DataFrame, this column must be named exactly as the target. In each prediction DataFrame, the predicted column must be named `f'pred_{target}'`.
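
For illustration, a minimal actuals/prediction pair at the country-month level that satisfies these index and naming conventions (the target name `ln_ged_sb` and all values are made up):

```python
import pandas as pd

target = "ln_ged_sb"  # hypothetical target name

# Multi-index (month_id, country_id); the actuals column is named exactly as the target.
idx = pd.MultiIndex.from_product([[493, 494], [57, 120]], names=["month_id", "country_id"])
actuals = pd.DataFrame({target: [0.0, 1.2, 0.3, 0.9]}, index=idx)

# Each prediction DataFrame uses the same index structure and a column named f"pred_{target}".
preds = pd.DataFrame({f"pred_{target}": [0.1, 1.0, 0.4, 0.8]}, index=idx)
```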

The number of prediction DataFrames is flexible. However, the standard practice is to evaluate **12 sequences**. When more than two predictions are provided, the evaluation will behave similarly to a **rolling origin evaluation** with a **fixed holdout size of 1**. For further reference, see the [ADR 002](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/002_evaluation_strategy.md) on rolling origin methodology.

The class automatically determines the evaluation type (point or uncertainty) and aligns `month_id` values between the actuals and each prediction. By default, the evaluation is performed **month-wise**, **step-wise**, and **time-series-wise** (more information in [ADR 003](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md)).
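
Continuing from the example above, a rough illustration of the `month_id` alignment and a month-wise/overall MSE computation in plain pandas; this mirrors the behavior described here, not the package's actual code:

```python
# Illustrative alignment: keep only index entries present in both frames,
# then compute MSE overall and per month_id (plain pandas, not the real implementation).
shifted_idx = pd.MultiIndex.from_product([[494, 495], [57, 120]],
                                         names=["month_id", "country_id"])
pred_shifted = pd.DataFrame({f"pred_{target}": [0.9, 1.1, 0.2, 0.5]}, index=shifted_idx)

common = actuals.index.intersection(pred_shifted.index)
errors = (pred_shifted.loc[common, f"pred_{target}"] - actuals.loc[common, target]) ** 2

print(errors.mean())                      # overall MSE on the aligned months
print(errors.groupby("month_id").mean())  # month-wise MSE
```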


## Consequences

**Positive Effects:**

- Standardized evaluation across all models.

**Negative Effects:**

- Requires strict adherence to index and column naming conventions.

## Rationale

Using the `views-evaluation` package enforces consistency and reproducibility in model evaluation. The built-in support for rolling origin evaluation reflects a realistic scenario for time-series forecasting where the model is updated or evaluated sequentially. Its flexible design aligns with our workflow, where multiple prediction sets across multiple horizons are common.


### Considerations

- Other evaluation types, such as correlation matrices, may be requested in the future. These might not be compatible with the current architecture or evaluation strategy of the `views-evaluation` package.

- Consider accepting `config` as input instead of separate `target` and `steps` arguments. This would improve consistency because these parameters are already defined in config. It would allow for more flexible or partial evaluation workflows (e.g., when only one or two evaluation strategies are desired).

## Feedback and Suggestions
Any feedback or suggestions are welcome.

110 changes: 110 additions & 0 deletions documentation/ADRs/005_evaluation_output_schema.md
@@ -0,0 +1,110 @@
# Evaluation Output Schema

| ADR Info | Details |
|---------------------|-------------------|
| Subject | Evaluation Output Schema |
| ADR Number | 005 |
| Status | Proposed |
| Author | Xiaolong |
| Date | 16.06.2025 |

## Context
As part of our model evaluation workflow, we generate comprehensive reports summarizing model performance across a range of metrics and time periods. These reports are intended primarily for comparing ensemble models against their constituent models and baselines.

## Decision

We define a standard output schema for model evaluation reports using two formats:

1. **JSON file** – machine-readable output storing structured evaluation data.
2. **HTML file** – human-readable report with charts, tables, and summaries.

These files are stored in the `reports/` directory for each model within `views-models`.

To prevent a circular dependency between `views-evaluation` and `views-pipeline-core`, the `views-evaluation` package returns the evaluation dictionary, which `views-pipeline-core` then saves as a JSON file.

### Schema Overview (JSON)
Each report follows a standardized JSON structure that includes:
````
{
  "Target": "target",
  "Forecast Type": "point",
  "Level of Analysis": "cm",
  "Data Partition": "validation",
  "Training Period": [121, 492],
  "Testing Period": [493, 540],
  "Forecast Horizon": 36,
  "Number of Rolling Origins": 12,
  "Evaluation Results": [
    {
      "Type": "Ensemble",
      "Model Name": "ensemble_model",
      "MSE": mse_e,
      "MSLE": msle_e,
      "mean prediction": mp_e
    },
    {
      "Type": "Constituent",
      "Model Name": "constituent_a",
      "MSE": mse_a,
      "MSLE": msle_a,
      "mean prediction": mp_a
    },
    {
      "Type": "Constituent",
      "Model Name": "constituent_b",
      "MSE": mse_b,
      "MSLE": msle_b,
      "mean prediction": mp_b
    }
    ...
  ]
}
````
Here, the metric values (`mse_e`, `msle_e`, `mp_e`, etc.) are placeholders for the computed results of each model listed under `Evaluation Results`.

The output file is named using the following naming convention:
```
eval_validation_{conflict_type}_{timestamp}.json
```
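
A minimal sketch of how `views-pipeline-core` might persist the returned dictionary under this convention; the function name, arguments, and timestamp format below are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_evaluation_report(eval_dict: dict, reports_dir: Path, conflict_type: str) -> Path:
    """Hypothetical helper: persist the dictionary returned by views-evaluation
    to the reports/ directory using the naming convention above."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    out_path = reports_dir / f"eval_validation_{conflict_type}_{timestamp}.json"
    reports_dir.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as f:
        json.dump(eval_dict, f, indent=2)
    return out_path

# Example call with made-up values:
# save_evaluation_report({"Target": "ln_ged_sb", "Forecast Type": "point"}, Path("reports"), "sb")
```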



## Consequences

**Positive Effects:**

- Avoids circular dependency between `views-evaluation` and `views-pipeline-core`.

- Provides consistent input for both HTML rendering and potential downstream systems (e.g., dashboards, APIs).

- Facilitates modularity and separation of concerns.


**Negative Effects:**

- Requires tight coordination between both packages to maintain schema compatibility.

- Some redundancy between evaluation and report generation may occur.

- May require schema migrations as new report sections are added.



## Rationale

Saving reports within `views-pipeline-core` ensures full control over rendering, formatting, and contextual customization (e.g., comparing different model families). By letting `views-evaluation` focus strictly on metrics and alignment logic, we maintain cleaner package boundaries.


### Considerations

- This schema may evolve as we introduce new types of evaluation (e.g., correlation matrix).

- Reports are currently only generated for **ensemble models**, as comparison against constituent models is the primary use case.

- Future extensibility (e.g., visual version diffs) should be considered when evolving the format.



## Feedback and Suggestions
Any feedback or suggestions are welcome.
