new ADRs for our model governance protocol #52
Sonhaf-prio wants to merge 3 commits into development from
Conversation
| A shadow ensemble **$S_i$** must:
| - Run for **≥ 3 consecutive months** and outperform **P** in the **Performance Comparison** and the **Diversity Requirement**
We are not replacing ensembles based on online evaluation, so the performance comparison should not be here.
I still think we need a stability requirement, i.e. the shadow ensemble needs to run at least 2 months without irregularities.
Regarding the online eval, good catch; I added a "once it is implemented" disclaimer.
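As an illustration only, a minimal sketch of how such a stability check could be operationalized once the window is agreed on; the `MonthlyRun` record and the 2-month default are assumptions, not part of the ADR:

```python
from dataclasses import dataclass

# Hypothetical record of one monthly run of a shadow ensemble; the ADR does not
# prescribe any particular log format.
@dataclass
class MonthlyRun:
    month: str            # e.g. "2025-03"
    irregularities: int   # number of irregularities flagged for that month

def meets_stability_requirement(runs: list[MonthlyRun], min_months: int = 2) -> bool:
    """Check that the most recent `min_months` runs had no irregularities.

    The thread above discusses both 2 and 3 months; `min_months` is a placeholder.
    """
    if len(runs) < min_months:
        return False
    recent = sorted(runs, key=lambda r: r.month)[-min_months:]
    return all(r.irregularities == 0 for r in recent)
```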
| - How much better on **MSE**?
| - How large a difference in average prediction $\hat{\bar{y}}$ counts as “less conservative” in a meaningful way?
| - Same for diversity, if relevant.
I am not sure about the wisdom of a minimum improvement threshold, and I think we need a bit of experience to arrive at a solution. Since we have two criteria (performance and diversity), having demanding thresholds for both might make change impossible. At the same time, I am nervous about MSE. To me, the best remedy is to increase the size of the validation partition: use 36 sequences, not only 12. Then we are more certain that our metrics are stable and that a minuscule improvement is still an improvement.
I added these points under possible future extensions.
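For intuition on why more sequences help: if the metric values across evaluation sequences were roughly independent with spread $\sigma$, the standard error of the mean metric would scale as $\sigma/\sqrt{n}$, so moving from 12 to 36 sequences would shrink it by a factor of about $\sqrt{3} \approx 1.7$. (The independence assumption is mine, not a claim made in the ADR.)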
| - The distributions of predictions in comparison to the distribution of historical actuals.
| Both of these should be added to the evaluation reports!
Add about here:
* Replacement of shadow ensembles if the production ensemble is replaced
If a production ensemble is replaced by a shadow ensemble, the old production ensemble is relegated to a shadow ensemble. The new production ensemble becomes the point of departure for all new shadow ensembles.
[We need to think about how to structure this, and how to get this logic into the model inclusion ADR. Maybe a good structure is that we have three primary shadow ensembles with different conservativity measures in place, and that these three are replaced when we replace the production ensemble. We can still keep old shadow ensembles in production for a while, say 12 months?]
I added the first part under the General Notes section. The second point in [] belongs in ADR 031, and I will address it there. We should absolutely keep the replaced production ensemble in shadow mode (even if only as a rollback). But I am not sure whether we can automatically replace all 3 shadow ensembles when we switch the production ensemble, as this requires time for experimentation; especially in busy periods, it might take a while.
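Purely as an illustration of the relegation rule described above, a minimal sketch with a hypothetical in-memory registry (the structure and names are assumptions, not part of the ADR):

```python
# Hypothetical registry of ensembles and their roles; the ADR specifies the
# role transition, not how it is stored.
registry = {
    "production": "ensemble_current",
    "shadows": ["shadow_a", "shadow_b", "shadow_c"],
}

def promote_shadow(registry: dict, shadow_name: str) -> dict:
    """Promote a shadow ensemble to production and relegate the old production
    ensemble to shadow mode (kept at least as a rollback option)."""
    if shadow_name not in registry["shadows"]:
        raise ValueError(f"{shadow_name} is not a registered shadow ensemble")
    old_production = registry["production"]
    remaining = [s for s in registry["shadows"] if s != shadow_name]
    # The replaced production ensemble stays available as a shadow ensemble.
    return {"production": shadow_name, "shadows": remaining + [old_production]}
```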
| Both of these should be added to the evaluation reports!
| ## Consequences
This is the general template of the ADR; I interpreted it as the consequences of this new ADR.
| This protocol represents a first MVP as the baseline for demotion and promotion guidelines. These guidelines are vital for VIEWS as an organization to ensure that changes are traceable, explainable, consistent and reproducible. Especially in the context of an operationalized pipeline, it is vital to have clear rules to set the tone for model development. The proposed metrics, additional health checks and the overall protocol should be reviewed monthly by an expert team to discuss positive and negative effects. These meetings are the basis for future updates of the protocol and its components and will rely heavily on the evaluation reports (both offline and online evaluation) to identify potential unwanted effects. Decisions to update any of the components need to be documented in ADRs.
| ## Potential Future Extensions
| One could also imagine 2 production ensembles: young buck vs. old faithful. Young buck always contains the best predictions in terms of performance metrics, while old faithful is more consistent over time. **Rule 3 — Stability Requirement** would differ between the two: for young buck, the 3 months stated above could be enough, while for old faithful we would require at least 12 months. This is an interesting addition but would require further discussion.
I remain not fully convinced. Perhaps a pragmatic compromise is to keep the model that is the current production ensemble as the "old faithful" from now to the end of 2027, and let the new ensemble evolve in parallel. Then we can do an interesting retrospective evaluation two years from now, and demonstrate progress?
I updated the comment to emphasize that we currently have only one production ensemble and that, once it is replaced, we will run it in shadow mode in parallel. The young buck vs. old faithful idea is a possible future extension that came out of the discussions with Mike this summer. Ultimately, any change to the current protocol has to be discussed by the expert team, so this is not a definite future extension; I have included it for completeness.
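If that extension were ever adopted, the only parameter the quoted text changes is the stability window, e.g. (the 3- and 12-month values come from the text above; everything else is hypothetical):

```python
# Illustrative only: stability windows (in months) for the hypothetical
# young buck vs. old faithful setup discussed above.
STABILITY_REQUIREMENT_MONTHS = {
    "young_buck": 3,     # tracks the best current performance metrics
    "old_faithful": 12,  # prioritizes consistency over time
}
```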
| - CRPS for S is lower than for P
| - MIS for S is equal to or lower than for P (to do: HH wanted to talk with Mike about whether the second metric should be MIS or Log Score)
| Suggestion (HH and Simon?):
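For reference, a minimal sketch of how the quoted CRPS/MIS comparison between a shadow ensemble S and production P could be computed from sample-based predictive distributions; the use of properscoring and of 90% intervals are assumptions, not part of the ADR:

```python
import numpy as np
import properscoring as ps  # assumed dependency; any sample-based CRPS implementation works

def mean_interval_score(y, lower, upper, alpha=0.10):
    """Mean interval score (MIS) for central (1 - alpha) prediction intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    width = upper - lower
    below = (2 / alpha) * (lower - y) * (y < lower)
    above = (2 / alpha) * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def compare_shadow_to_production(y_true, samples_s, samples_p, alpha=0.10):
    """Evaluate the two quoted criteria for shadow S against production P.

    `samples_s` and `samples_p` hold predictive draws, shape (n_observations, n_draws).
    """
    crps_s = float(np.mean(ps.crps_ensemble(y_true, samples_s)))
    crps_p = float(np.mean(ps.crps_ensemble(y_true, samples_p)))
    lo_s, hi_s = np.quantile(samples_s, [alpha / 2, 1 - alpha / 2], axis=1)
    lo_p, hi_p = np.quantile(samples_p, [alpha / 2, 1 - alpha / 2], axis=1)
    mis_s = mean_interval_score(y_true, lo_s, hi_s, alpha)
    mis_p = mean_interval_score(y_true, lo_p, hi_p, alpha)
    return {"crps_lower": crps_s < crps_p, "mis_not_higher": mis_s <= mis_p}
```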
I agree with the intention here, that we should take care to avoid conservative models. However, we may have better tools for this, e.g. tests that compare the aggregate prediction distributions to the aggregate observed distributions; distributions that are good at covering the right-hand tails should also be less conservative.
I replaced it with your comment and included 2 possible metrics for the comparison of distributions. We could have 2 variants: (1) compare the whole distribution, and (2) focus on the right-hand tail (which requires us to define a quantile threshold). KS and Wasserstein are my initial suggestions, but I am open to other ideas.
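A minimal sketch of those two variants, assuming flattened arrays of predicted values and historical actuals; the 0.90 tail quantile is a placeholder that would still need to be agreed on:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def distribution_comparison(predictions, actuals, tail_quantile=0.90):
    """Compare the distribution of predictions to the distribution of historical actuals.

    Variant 1 compares the whole distributions; variant 2 restricts both samples to the
    right-hand tail above a quantile of the observed distribution (assumed non-empty).
    """
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)

    # Variant 1: whole distribution.
    full = {
        "ks_statistic": ks_2samp(predictions, actuals).statistic,
        "wasserstein": wasserstein_distance(predictions, actuals),
    }

    # Variant 2: right-hand tail above the chosen quantile of the actuals.
    threshold = np.quantile(actuals, tail_quantile)
    pred_tail = predictions[predictions >= threshold]
    actual_tail = actuals[actuals >= threshold]
    tail = {
        "ks_statistic": ks_2samp(pred_tail, actual_tail).statistic,
        "wasserstein": wasserstein_distance(pred_tail, actual_tail),
    }
    return {"full": full, "tail": tail}
```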
| 3. Retain the strengths of our existing models while supporting stable, incremental innovation.
| 4. Mitigate systematic under-prediction in a consistent and reliable manner.
This is strictly speaking covered by point 1 (evaluation metrics).
I agree; however, I wanted to stress the issue of under-prediction in this protocol, as I believe it to be one of the most important issues we should focus on. But if you feel strongly about it, I can delete point 4 ;)
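As a simple illustration of how systematic under-prediction could be tracked in the evaluation reports, one could report the mean signed error alongside the other metrics; this particular diagnostic is my assumption, not something the protocol specifies:

```python
import numpy as np

def mean_signed_error(y_true, y_pred):
    """Mean of (predicted - observed); persistently negative values across
    evaluation sequences would indicate systematic under-prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred - y_true))
```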
| **Rules**:
| 1. Ensure that all input data is part of the current data ingestion protocol. If a feature requires an additional ingestion routine but does not improve the metrics above, consider dropping it to facilitate operational routines.
This needs to be rewritten: first, it is unclear which "metrics above" are referred to; moreover, if a model does not improve the shadow ensemble according to the metric, there is no dilemma. If it is worthy of inclusion in a shadow ensemble, we will have to do a cost-benefit analysis. That should be informed by its contribution to the shadow ensemble's performance.
I updated the second sentence to better reflect the dilemma.
| - Different maximum number of models
| - Different values for $\tau_m$:
|   - 0 is acceptable for a non-conservative ensemble
|   - A possible definition of $\tau_m$ is in terms of standard deviations of the metrics across the 12 (or so) time series/sequences that go into evaluation.
I have not been able to think of a better alternative; suggestions?
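For reference, a minimal sketch of the standard-deviation-based definition quoted above; the scaling factor `k` and the data layout are placeholders, not part of the proposal:

```python
import numpy as np

def tau_m(metric_per_sequence, k=1.0):
    """Threshold tau_m for metric m, defined as k sample standard deviations of the
    metric across the evaluation sequences (e.g. the 12 time series mentioned above).
    k = 0 recovers the non-conservative case from the quoted text."""
    values = np.asarray(metric_per_sequence, dtype=float)
    return float(k * values.std(ddof=1))

# A shadow ensemble would then need to beat production by more than tau_m, e.g.
# mse_production - mse_shadow > tau_m(per_sequence_mse_values, k=1.0)
```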
hhegre left a comment:
I have added comments and suggested some changes.
Have there been any updates to this? If not, @Sonhaf-prio rebase and merge.
As discussed at the VIEWS workshop in November, I created 3 ADR drafts based on @hhegre's slides and the discussions we had. Please review the proposed ADRs. Some places mention either @hhegre or @Polichinel for additional feedback/input.
Things to keep in mind while reviewing: