new ADRs for our model governance protocol #52
Sonhaf-prio wants to merge 3 commits into development from
Conversation
| A shadow ensemble **$S_i$** must:
| - Run for **≥ 3 consecutive months** and outperform **P** in the **Performance Comparison** and the **Diversity Requirement**
We are not replacing ensembles based on online evaluation, so the performance comparison should not be here.
I still think we need a stability requirement, i.e. the shadow ensemble needs to run at least 2 months without irregularities.
Regarding the online eval, good catch; I added a "once it is implemented" disclaimer.
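As an illustration only, a minimal sketch of how such a stability check could be operationalized once the window is agreed on; the `MonthlyRun` record and the 2-month default are assumptions, not part of the ADR:

```python
from dataclasses import dataclass

# Hypothetical record of one monthly run of a shadow ensemble; the ADR does not
# prescribe any particular log format.
@dataclass
class MonthlyRun:
    month: str            # e.g. "2025-03"
    irregularities: int   # number of irregularities flagged for that month

def meets_stability_requirement(runs: list[MonthlyRun], min_months: int = 2) -> bool:
    """Check that the most recent `min_months` runs had no irregularities.

    The thread above discusses both 2 and 3 months; `min_months` is a placeholder.
    """
    if len(runs) < min_months:
        return False
    recent = sorted(runs, key=lambda r: r.month)[-min_months:]
    return all(r.irregularities == 0 for r in recent)
```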
| - How much better on **MSE**?
| - How large a difference in average prediction $\hat{\bar{y}}$ counts as “less conservative” in a meaningful way?
| - Same for diversity, if relevant.
I am not sure about the wisdom of a minimum improvement threshold, and I think we need a bit of experience to arrive at a solution. Since we have two criteria (performance and diversity), having demanding thresholds for both might make change impossible. At the same time, I am nervous about MSE. To me, the best remedy is to increase the size of the validation partition: use 36 sequences, not only 12. Then we are more certain that our metrics are stable and that a minuscule improvement is still an improvement.
I added these points under possible future extensions.
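For intuition on why more sequences help: if the metric values across evaluation sequences were roughly independent with spread $\sigma$, the standard error of the mean metric would scale as $\sigma/\sqrt{n}$, so moving from 12 to 36 sequences would shrink it by a factor of about $\sqrt{3} \approx 1.7$. (The independence assumption is mine, not a claim made in the ADR.)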
| - The distributions of predictions in comparison to the distribution of historical actuals.
| Both of these should be added to the evaluation reports!
Add about here:
* Replacement of shadow ensembles if the production ensemble is replaced
If a production ensemble is replaced by a shadow ensemble, the old production ensemble is relegated to a shadow ensemble. The new production ensemble becomes the point of departure for all new shadow ensembles.
[We need to think about how to structure this, and how to get this logic into the model inclusion ADR. Maybe a good structure is that we have three primary shadow ensembles with different conservativity measures in place, and that these three are replaced when we replace the production ensemble. We can still keep old shadow ensembles in production for a while, say 12 months?]
I added the first part under the General Notes section. The second point in [] belongs in ADR 031, and I will address it there. We should absolutely keep the replaced production ensemble in shadow mode (even if only as a rollback). But I am not sure whether we can automatically replace all 3 shadow ensembles when we switch the production ensemble, as this requires time for experimentation; especially in busy periods, it might take a while.
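Purely as an illustration of the relegation rule described above, a minimal sketch with a hypothetical in-memory registry (the structure and names are assumptions, not part of the ADR):

```python
# Hypothetical registry of ensembles and their roles; the ADR specifies the
# role transition, not how it is stored.
registry = {
    "production": "ensemble_current",
    "shadows": ["shadow_a", "shadow_b", "shadow_c"],
}

def promote_shadow(registry: dict, shadow_name: str) -> dict:
    """Promote a shadow ensemble to production and relegate the old production
    ensemble to shadow mode (kept at least as a rollback option)."""
    if shadow_name not in registry["shadows"]:
        raise ValueError(f"{shadow_name} is not a registered shadow ensemble")
    old_production = registry["production"]
    remaining = [s for s in registry["shadows"] if s != shadow_name]
    # The replaced production ensemble stays available as a shadow ensemble.
    return {"production": shadow_name, "shadows": remaining + [old_production]}
```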
| Both of these should be added to the evaluation reports!
| ## Consequences
This is the general template of the ADR; I interpreted it as the consequences of this new ADR.
| This protocol represents a first MVP as the baseline for demotion and promotion guidelines. These guidelines are vital for VIEWS as an organization to ensure that changes are traceable, explainable, consistent and reproducible. Especially in the context of an operationalized pipeline, it is vital to have clear rules to set the tone for model development. The proposed metrics, additional health checks and the overall protocol should be reviewed monthly by an expert team to discuss positive and negative effects. These meetings are the basis for future updates of the protocol and its components and will rely heavily on the evaluation reports (both offline and online evaluation) to identify potential unwanted effects. Decisions to update any of the components need to be documented in ADRs.
| ## Potential Future Extensions
| One could also imagine 2 production ensembles: young buck vs. old faithful. Young buck always contains the best predictions in terms of performance metrics, while old faithful is more consistent over time. **Rule 3 — Stability Requirement** would differ between the two: for young buck, the 3 months stated above could be enough, while for old faithful we would require at least 12 months. This is an interesting addition but would require further discussion.
I remain not fully convinced. Perhaps a pragmatic compromise is to keep the model that is the current production ensemble as the "old faithful" from now to the end of 2027, and let the new ensemble evolve in parallel. Then we can do an interesting retrospective evaluation two years from now, and demonstrate progress?
I updated the comment to emphasize that we currently have only one production ensemble and that, once it is replaced, we will run it in shadow mode in parallel. The young buck vs. old faithful idea is a possible future extension that came out of the discussions with Mike this summer. Ultimately, any change to the current protocol has to be discussed by the expert team, so this is not a definite future extension; I have included it for completeness.
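If that extension were ever adopted, the only parameter the quoted text changes is the stability window, e.g. (the 3- and 12-month values come from the text above; everything else is hypothetical):

```python
# Illustrative only: stability windows (in months) for the hypothetical
# young buck vs. old faithful setup discussed above.
STABILITY_REQUIREMENT_MONTHS = {
    "young_buck": 3,     # tracks the best current performance metrics
    "old_faithful": 12,  # prioritizes consistency over time
}
```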
| - CRPS for S is lower than for P
| - MIS for S is equal to or lower than for P (to do: HH wanted to talk with Mike about whether the second metric should be MIS or Log Score)
| Suggestion (HH and Simon?):
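For reference, a minimal sketch of how the quoted CRPS/MIS comparison between a shadow ensemble S and production P could be computed from sample-based predictive distributions; the use of properscoring and of 90% intervals are assumptions, not part of the ADR:

```python
import numpy as np
import properscoring as ps  # assumed dependency; any sample-based CRPS implementation works

def mean_interval_score(y, lower, upper, alpha=0.10):
    """Mean interval score (MIS) for central (1 - alpha) prediction intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    width = upper - lower
    below = (2 / alpha) * (lower - y) * (y < lower)
    above = (2 / alpha) * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def compare_shadow_to_production(y_true, samples_s, samples_p, alpha=0.10):
    """Evaluate the two quoted criteria for shadow S against production P.

    `samples_s` and `samples_p` hold predictive draws, shape (n_observations, n_draws).
    """
    crps_s = float(np.mean(ps.crps_ensemble(y_true, samples_s)))
    crps_p = float(np.mean(ps.crps_ensemble(y_true, samples_p)))
    lo_s, hi_s = np.quantile(samples_s, [alpha / 2, 1 - alpha / 2], axis=1)
    lo_p, hi_p = np.quantile(samples_p, [alpha / 2, 1 - alpha / 2], axis=1)
    mis_s = mean_interval_score(y_true, lo_s, hi_s, alpha)
    mis_p = mean_interval_score(y_true, lo_p, hi_p, alpha)
    return {"crps_lower": crps_s < crps_p, "mis_not_higher": mis_s <= mis_p}
```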
I agree with the intention here, that we should take care to avoid conservative models. However, we may have better tools for this, e.g. tests that compare the aggregate prediction distributions to the aggregate observed distributions; distributions that are good at covering the right-hand tails should also be less conservative.
I replaced it with your comment and included 2 possible metrics for the comparison of distributions. We could have 2 variants: (1) compare the whole distribution, and (2) focus on the right-hand tail (which requires us to define a quantile threshold). KS and Wasserstein are my initial suggestions, but I am open to other ideas.
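A minimal sketch of those two variants, assuming flattened arrays of predicted values and historical actuals; the 0.90 tail quantile is a placeholder that would still need to be agreed on:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def distribution_comparison(predictions, actuals, tail_quantile=0.90):
    """Compare the distribution of predictions to the distribution of historical actuals.

    Variant 1 compares the whole distributions; variant 2 restricts both samples to the
    right-hand tail above a quantile of the observed distribution (assumed non-empty).
    """
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)

    # Variant 1: whole distribution.
    full = {
        "ks_statistic": ks_2samp(predictions, actuals).statistic,
        "wasserstein": wasserstein_distance(predictions, actuals),
    }

    # Variant 2: right-hand tail above the chosen quantile of the actuals.
    threshold = np.quantile(actuals, tail_quantile)
    pred_tail = predictions[predictions >= threshold]
    actual_tail = actuals[actuals >= threshold]
    tail = {
        "ks_statistic": ks_2samp(pred_tail, actual_tail).statistic,
        "wasserstein": wasserstein_distance(pred_tail, actual_tail),
    }
    return {"full": full, "tail": tail}
```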
| 3. Retain the strengths of our existing models while supporting stable, incremental innovation.
| 4. Mitigate systematic under-prediction in a consistent and reliable manner.
This is strictly speaking covered by point 1 (evaluation metrics).
I agree; however, I wanted to stress the issue of under-prediction in this protocol, as I believe it to be one of the most important issues we should focus on. But if you feel strongly about it, I can delete point 4 ;)
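As a simple illustration of how systematic under-prediction could be tracked in the evaluation reports, one could report the mean signed error alongside the other metrics; this particular diagnostic is my assumption, not something the protocol specifies:

```python
import numpy as np

def mean_signed_error(y_true, y_pred):
    """Mean of (predicted - observed); persistently negative values across
    evaluation sequences would indicate systematic under-prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred - y_true))
```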
| **Rules**:
| 1. Ensure that all input data is part of the current data ingestion protocol. If a feature requires an additional ingestion routine but does not improve the metrics above, consider dropping it to facilitate operational routines.
This needs to be rewritten: first, it is unclear which "metrics above" are referred to; moreover, if a model does not improve the shadow ensemble according to the metric, there is no dilemma. If it is worthy of inclusion in a shadow ensemble, we will have to do a cost-benefit analysis. That should be informed by its contribution to the shadow ensemble's performance.
I updated the second sentence to better reflect the dilemma.
| - Different maximum number of models
| - Different values for $\tau_m$:
|   - 0 is acceptable for a non-conservative ensemble
|   - A possible definition of $\tau_m$ is in terms of standard deviations of the metrics across the 12 (or so) time series/sequences that go into evaluation.
I have not been able to think of a better alternative; suggestions?
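For reference, a minimal sketch of the standard-deviation-based definition quoted above; the scaling factor `k` and the data layout are placeholders, not part of the proposal:

```python
import numpy as np

def tau_m(metric_per_sequence, k=1.0):
    """Threshold tau_m for metric m, defined as k sample standard deviations of the
    metric across the evaluation sequences (e.g. the 12 time series mentioned above).
    k = 0 recovers the non-conservative case from the quoted text."""
    values = np.asarray(metric_per_sequence, dtype=float)
    return float(k * values.std(ddof=1))

# A shadow ensemble would then need to beat production by more than tau_m, e.g.
# mse_production - mse_shadow > tau_m(per_sequence_mse_values, k=1.0)
```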
hhegre left a comment:
I have added comments and suggested some changes.
Have there been any updates to this? If not, @Sonhaf-prio rebase and merge.
As discussed at the VIEWS workshop in November, I created 3 ADR drafts based on @hhegre's slides and the discussions we had. Please review the proposed ADRs. Some places mention either @hhegre or @Polichinel for additional feedback/input.
Things to keep in mind while reviewing: