
new ADRs for our model governance protocol #52

Open
Sonhaf-prio wants to merge 3 commits into development from adr/model_governance

Conversation

@Sonhaf-prio
Contributor

As discussed in the VIEWS workshop in November, I created 3 ADR drafts based on @hhegre's slides and the discussions we had. Please review the proposed ADRs. Some places mention either @hhegre or @Polichinel for additional feedback/input.

Things to keep in mind while reviewing:

  • Do the ADRs correspond to the logic we have agreed upon?
  • Is anything missing?
  • Are there updates regarding the metrics, especially the diversity score and the use of the second metric for probabilistic forecasts (MIS or Log Score)?


A shadow ensemble **$S_i$** must:

- Run for **≥ 3 consecutive months** and outperform **P** in the **Performance Comparison** and the **Diversity Requirement**

We are not replacing ensembles based on online evaluation, so the performance comparison should not be here.

Contributor Author
@Sonhaf-prio Dec 17, 2025

I still think we need a stability requirement, i.e. the shadow ensemble needs to run for at least 2 months without irregularities.

Regarding the online eval, good catch; I added a "once it is implemented" disclaimer.

- How much better on **MSE**?
- How large a difference in average prediction $\hat{\bar{y}}$ counts as “less conservative” in a meaningful way?
- Same for diversity, if relevant.


I am not sure about the wisdom of a minimum improvement threshold, and think we need a bit of experience to arrive at a solution. Since we have two criteria (performance and diversity) having demanding thresholds here might make change impossible. At the same time, I am nervous about MSE. To me, the best remedy is to increase the size of the validation partition - use 36 sequences, not only 12. Then we are more certain that our metrics are stable and that minuscule improvement is still improvement.

Contributor Author

I added these points in possible future extensions

- The distributions of predictions in comparison to the distribution of historical actuals.

Both of these should be added to the evaluation reports!


Add about here:

*Replacement of shadow ensembles if the production ensemble is replaced*

If a production ensemble is replaced by a shadow ensemble, the old production ensemble is relegated to a shadow ensemble. The new production ensemble becomes the point of departure for all new shadow ensembles.

[We need to think about how to structure this, and how to get this logic into the model inclusion ADR. Maybe a good structure is that we have three primary shadow ensembles with different conservativity measures in place, and that these three are replaced when we replace the production ensemble. We can still keep old shadow ensembles in production for a while, say 12 months?]

Contributor Author

I added the first part under the General Notes section. The second point in [] belongs to ADR 031, and I will address it there. We should absolutely keep the replaced production ensemble in shadow mode (even if only as a rollback). But I am not sure if we can automatically replace all 3 shadow ensembles when we switch the production ensemble, as this takes time for experimentation; especially in busy times, this might take a while.

Both of these should be added to the evaluation reports!


## Consequences

Consequences of what?

Contributor Author

This is the general template of the ADR. I interpreted it as the consequences of this new ADR.

This protocol is a first MVP and serves as the baseline for demotion and promotion guidelines. These guidelines are vital for VIEWS as an organization to ensure that changes are traceable, explainable, consistent and reproducible. Especially in the context of an operationalized pipeline, it is essential to have clear rules that set the tone for model development. The proposed metrics, additional health checks and the overall protocol should be reviewed monthly by an expert team to discuss positive and negative effects. These meetings are the basis for future updates of the protocol and its components and will rely heavily on the evaluation reports (both offline and online evaluation) to identify potential unwanted effects. Decisions to update any of the components need to be documented in ADRs.

## Potential Future Extensions
One could also imagine 2 production ensembles: young buck vs old faithful. Young buck always contains the best predictions in terms of performance metrics, while old faithful is more consistent over time. The **Rule 3 — Stability Requirement** would differ between the two: for young buck, the 3 months stated above could be enough, while for old faithful we would require at least 12 months. This is an interesting addition but would require further discussion.

I remain not fully convinced. Perhaps a pragmatic compromise is to keep the model that is the current production ensemble as the "old faithful" from now to the end of 2027, and let the new ensemble evolve in parallel. Then we can do an interesting retrospective evaluation two years from now, and demonstrate progress?

Contributor Author

I updated the comment to emphasize that we currently only have one production ensemble, and that once it is replaced, we will run it in shadow mode in parallel. The young buck vs old faithful idea is a possible future extension that came out of the discussions with Mike this summer. Ultimately, any change to the current protocol has to be discussed by the expert team, so this is not a definite future extension; I have included it rather for completeness.

- CRPS for S is lower than for P
- MIS for S is equal to or lower than for P (to do: HH wanted to talk with Mike about whether the second metric should be MIS or Log Score)
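
As a purely illustrative sketch (not part of the ADR), this is how such a check could be computed, assuming ensemble samples are available for both S and P; using `properscoring` for CRPS and a 90% central interval for MIS are assumptions on my part, and all function and variable names are placeholders:

```python
import numpy as np
import properscoring as ps  # assumption: properscoring is used for CRPS

def mean_interval_score(y, lower, upper, alpha=0.1):
    """Mean Interval Score for central (1 - alpha) prediction intervals."""
    penalty_below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    penalty_above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return np.mean((upper - lower) + penalty_below + penalty_above)

def shadow_beats_production(actuals, samples_s, samples_p, alpha=0.1):
    """Sketch of the rule: CRPS strictly lower for S, MIS no worse than for P.

    actuals: shape (n,); samples_s / samples_p: shape (n, n_members) ensemble draws.
    """
    crps_s = ps.crps_ensemble(actuals, samples_s).mean()
    crps_p = ps.crps_ensemble(actuals, samples_p).mean()

    def mis(samples):
        return mean_interval_score(
            actuals,
            np.quantile(samples, alpha / 2, axis=1),
            np.quantile(samples, 1 - alpha / 2, axis=1),
            alpha,
        )

    return crps_s < crps_p and mis(samples_s) <= mis(samples_p)
```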

Suggestion (HH and Simon?):

I agree with the intention here, that we should take care to avoid conservative models. However, we may have better tools for this, e.g. tests to compare the aggregate prediction distributions to the aggregate observed distributions; distributions that are good at covering the right-hand tails should also be less conservative.

Contributor Author

I replaced it with your comment and included 2 possible metrics for the comparison of distributions. We could have 2 variants: 1. compare the whole distribution, and 2. focus on the right-hand tail (which requires us to define a quantile threshold). KS and Wasserstein are my initial suggestions, but I am open to other ideas.
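
For concreteness, a minimal sketch of the two variants using `scipy.stats`; the 0.90 tail quantile and the function/variable names are illustrative placeholders, not agreed thresholds:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def distribution_checks(predictions, actuals, tail_quantile=0.90):
    """Variant 1: compare the full distributions; variant 2: right-hand tails only."""
    # Variant 1: whole-distribution comparison (KS test and Wasserstein distance).
    ks_result = ks_2samp(predictions, actuals)
    wd_full = wasserstein_distance(predictions, actuals)

    # Variant 2: keep only values above a quantile of the actuals (placeholder cut-off).
    cutoff = np.quantile(actuals, tail_quantile)
    pred_tail = predictions[predictions >= cutoff]
    actual_tail = actuals[actuals >= cutoff]
    wd_tail = (wasserstein_distance(pred_tail, actual_tail)
               if pred_tail.size and actual_tail.size else np.nan)

    return {
        "ks_statistic": ks_result.statistic,
        "ks_pvalue": ks_result.pvalue,
        "wasserstein_full": wd_full,
        "wasserstein_tail": wd_tail,
    }
```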


3. Retain the strengths of our existing models while supporting stable, incremental innovation.

4. Mitigate systematic under-prediction in a consistent and reliable manner.

This is, strictly speaking, covered by point 1 (evaluation metrics).

Contributor Author

I agree; however, I wanted to stress the issue of under-prediction in this protocol, as I believe it to be one of the most important issues we should focus on. But if you feel strongly about it, I can delete point 4 ;)


**Rules**:

1. Ensure that all input data is part of the current data ingestion protocol. If a feature requires an additional ingestion routine, but does not improve the metrics above, consider dropping it to facilitate operational routines.

This needs to be rewritten: first, it is unclear which "metrics above" are referred to; moreover, if a model does not improve the shadow ensemble according to the metric, there is no dilemma. If it is worthy of inclusion in a shadow ensemble, we will have to do a cost-benefit analysis. That should be informed by its contribution to the shadow ensemble's performance.

Contributor Author

I updated the second sentence to better reflect the dilemma.

- Different maximum number of models
- Different values for $\tau_m$:
  - 0 is acceptable for a non-conservative ensemble
  - A possible definition of $\tau_m$ is in terms of standard deviations of the metrics across the 12 (or so) time series/sequences that go into evaluation.
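
As a non-authoritative sketch of that last definition, assuming one metric value per evaluation sequence (e.g. 12 or 36 of them); the function names and the one-standard-deviation default `k=1.0` are illustrative only:

```python
import numpy as np

def tau_m_from_sequences(metric_per_sequence, k=1.0):
    """tau_m as k sample standard deviations of the metric across
    the evaluation sequences (k = 1.0 is an illustrative default)."""
    values = np.asarray(metric_per_sequence, dtype=float)
    return k * values.std(ddof=1)

def improvement_is_meaningful(metric_p, metric_s, per_sequence_metric_p, k=1.0):
    """For a lower-is-better metric, S must beat P by more than tau_m."""
    return (metric_p - metric_s) > tau_m_from_sequences(per_sequence_metric_p, k)
```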

I have not been able to think of any better alternative - suggestions?

@hhegre left a comment

I have added comments and suggested some changes.

@smellycloud
Collaborator

Have there been any updates to this? If not, @Sonhaf-prio, rebase and merge.
