New scoring metrics for "cross_validation_across_time" #2
sianna338 wants to merge 11 commits into skjerns:main from
Conversation
Summary of Changes
Hello @sianna338, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the cross_validation_across_time function by introducing a flexible mechanism for evaluating classifier performance. Instead of being limited to accuracy, users can now specify a wide range of scoring metrics, either from scikit-learn or custom-defined, providing greater analytical depth and adaptability for various machine learning tasks.
Highlights
- Enhanced Scoring Metrics: The cross_validation_across_time function now supports various scoring algorithms for classifier performance evaluation.
- Flexible Metric Specification: Users can specify metrics using a string keyword referring to sklearn.metrics functions (e.g., "average_precision_score") or by providing a custom user-defined scoring function.
- New Parameters: New parameters metric, metric_kwargs, and proba have been added to control the scoring process, allowing for more granular control over how metrics are applied and what kind of input they receive.
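To make the new interface concrete, here is a rough usage sketch. Only the `metric`, `metric_kwargs`, and `proba` parameters come from this pull request; the import path, the remaining argument names, and the example data shapes are assumptions made for illustration, not the library's confirmed API.

```python
# Hypothetical usage sketch: only metric, metric_kwargs and proba are from this PR;
# the import path and the other argument names are assumed for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from decoding import cross_validation_across_time  # assumed module path

data_x = np.random.randn(60, 30, 101)   # trials x channels x timepoints (made-up shape)
data_y = np.random.randint(0, 2, 60)    # binary labels
clf = LogisticRegression()

# default behaviour: per-timepoint accuracy
acc = cross_validation_across_time(data_x, data_y, clf=clf)

# any sklearn.metrics function by name, with extra keyword arguments
auc = cross_validation_across_time(
    data_x, data_y, clf=clf,
    metric="roc_auc_score",
    metric_kwargs={"average": "macro"},
    proba=True,   # hand class probabilities to the metric
)
```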
Code Review
This pull request introduces the ability to use custom scoring metrics in cross_validation_across_time. The changes are a good step forward, but I've found a few critical issues in the implementation. The new logic for handling custom metrics has several bugs related to using the correct data for calculation and handling the new proba flag. Additionally, using a mutable dictionary as a default argument for metric_kwargs is unsafe. My review includes suggestions to fix these issues to ensure the new feature is robust and correct.
decoding.py
Outdated
```diff
  preds = np.argmax(probas, -1)

- accuracy = (preds == test_y[:, None]).mean(axis=0)

+ if metric == "accuracy":
+     accuracy = (preds == test_y[:, None]).mean(axis=0)
+ else:
+     # resolve metric function
+     if isinstance(metric, str):
+         if not hasattr(sk_metrics, metric):
+             raise ValueError(f"sklearn.metrics has no function named '{metric}'")
+         func = getattr(sk_metrics, metric)
+     elif callable(metric):
+         func = metric
+     else:
+         raise TypeError("metric must be 'accuracy', a sklearn.metrics name (str), or a callable.")
+
+     sig = inspect.signature(func)
+     # add any extra parameters that are not preds and data_y
+     if metric_kwargs:
+         kwargs = {}
+         for k, v in metric_kwargs.items():
+             if k in sig.parameters:
+                 kwargs[k] = v
+
+     # need to loop over timepoints
+     accuracy = np.zeros(time_max)
+     for t in list(range(0, time_max)):
+         accuracy[t] = func(data_y, all_probas[:,t], **kwargs)
```
This block for calculating metrics contains several critical issues:
- When `proba=False`, `probas` contains predictions, but `np.argmax` is still called, which is incorrect.
- The custom metric calculation is incorrect: it uses the full `data_y` and `all_probas` instead of the current fold's `test_y` and `probas`.
- There's a potential `NameError` for `kwargs` if `metric_kwargs` is not provided.

Additionally, the unchanged line `all_probas[idxs_test] = probas` (line 274) will fail with `proba=False` due to a shape mismatch and needs to be fixed separately.
Here is a suggested refactoring for this block to address the issues within the changed lines:
```python
if proba:
    preds = np.argmax(probas, -1)
else:
    preds = probas

if metric == "accuracy":
    accuracy = (preds == test_y[:, None]).mean(axis=0)
else:
    # resolve metric function
    if isinstance(metric, str):
        if not hasattr(sk_metrics, metric):
            raise ValueError(f"sklearn.metrics has no function named '{metric}'")
        func = getattr(sk_metrics, metric)
    elif callable(metric):
        func = metric
    else:
        raise TypeError("metric must be 'accuracy', a sklearn.metrics name (str), or a callable.")

    # add any extra parameters that are not preds and data_y
    kwargs = {}
    if metric_kwargs:
        sig = inspect.signature(func)
        for k, v in metric_kwargs.items():
            if k in sig.parameters:
                kwargs[k] = v

    # need to loop over timepoints
    accuracy = np.zeros(time_max)
    for t in range(time_max):
        accuracy[t] = func(test_y, probas[:,t], **kwargs)
```
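For readers unfamiliar with the resolution step in the sketch above, here is a self-contained illustration of the same idea: a string is looked up in `sklearn.metrics`, and entries of `metric_kwargs` that the resolved function does not accept are dropped via `inspect.signature`. The helper name `resolve_metric` is made up for this example and is not part of the PR.

```python
# Standalone sketch of the metric-resolution logic: a string is looked up in
# sklearn.metrics, and only kwargs accepted by the resolved function are kept.
import inspect
import numpy as np
import sklearn.metrics as sk_metrics

def resolve_metric(metric, metric_kwargs=None):
    if isinstance(metric, str):
        if not hasattr(sk_metrics, metric):
            raise ValueError(f"sklearn.metrics has no function named '{metric}'")
        func = getattr(sk_metrics, metric)
    elif callable(metric):
        func = metric
    else:
        raise TypeError("metric must be a sklearn.metrics name (str) or a callable.")
    sig = inspect.signature(func)
    kwargs = {k: v for k, v in (metric_kwargs or {}).items() if k in sig.parameters}
    return func, kwargs

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9])
func, kwargs = resolve_metric("roc_auc_score", {"average": "macro", "unknown_arg": 1})
print(func(y_true, y_prob, **kwargs))  # unknown_arg is silently dropped
```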
decoding.py
Outdated
```diff
  tmin=-0.1, tmax=0.5, sfreq=100,
- return_probas=False,
+ return_probas=False, metric='accuracy',
+ metric_kwargs={}, proba=True,
```
Using a mutable object like a dictionary as a default argument is discouraged. If the dictionary is modified, the change will persist across calls. It's safer to default to None and create a new dictionary inside the function if needed. You would then need to add if metric_kwargs is None: metric_kwargs = {} at the beginning of the function body.
```diff
- metric_kwargs={}, proba=True,
+ metric_kwargs=None, proba=True,
```
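As a standalone illustration of the pitfall described in the comment above (the function names here are invented for the example):

```python
# Minimal illustration of the mutable-default pitfall (hypothetical functions).
def score_with_default(value, kwargs={}):   # one dict is shared across all calls
    kwargs.setdefault("seen", 0)
    kwargs["seen"] += 1
    return kwargs["seen"]

print(score_with_default(1))  # 1
print(score_with_default(2))  # 2 -- the same dict persisted between calls

def score_with_none(value, kwargs=None):    # the safer pattern suggested above
    if kwargs is None:
        kwargs = {}
    kwargs.setdefault("seen", 0)
    kwargs["seen"] += 1
    return kwargs["seen"]

print(score_with_none(1))  # 1
print(score_with_none(2))  # 1 -- a fresh dict on every call
```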
add error message when given metric_kwargs don't match the input arguments required by the scoring function
Co-authored-by: Simon Kern <14980558+skjerns@users.noreply.github.com>
input to function, default scoring function (top_1_accuracy)
@sianna338 let me know when I should review again, and please comment and/or mark the conversations above as "resolved", so I know that they have been taken care of
…ion for the classifier needs
/gemini review
/gemini summary |
Code Review
This pull request introduces a flexible scoring mechanism to the cross_validation_across_time function, which is a great enhancement. However, the current implementation contains several critical bugs in the scoring logic within the cross-validation loop, which will lead to incorrect results. Specifically, the scoring is performed on the entire dataset instead of the hold-out test set for each fold. There are also some API design issues, such as a now-unused parameter, and minor code quality issues like duplicate imports. I've provided detailed comments and suggestions to address these critical issues and improve the overall code quality.
```python
if not needs_probas:
    preds_idx = np.argmax(all_results, axis=2)
    preds_lbl = labels[preds_idx]

# compute scoring metric
score = np.zeros(time_max)
for t in list(range(0, time_max)):
    if needs_probas:
        score[t] = func(data_y, all_results[:,t], **metric_kwargs)
    else:
        score[t] = func(data_y, preds_lbl[:,t], **metric_kwargs)
```
This block of code contains several critical bugs that will lead to incorrect results or runtime errors:
- UnboundVariable: `preds_lbl` is defined within an `if not needs_probas:` block but is used later in a separate `else` block. If `needs_probas` is `True`, this will raise an `UnboundLocalError`.
- Incorrect Data Scope for Predictions: Predictions are generated using `np.argmax(all_results, ...)`. The `all_results` array contains data from all folds processed so far, not just the current test fold. Predictions should be based on `results_swp`, which holds the probabilities for the current fold.
- Incorrect Scoring Data: The scoring function `func` is called with `data_y` (all labels) and `all_results`. Scoring in cross-validation must be performed on the hold-out test set for the current fold, using `data_y[idxs_test]` and the corresponding predictions/probabilities from `results_swp`.
The suggested change restructures the logic to fix these issues by correctly scoping data to the current fold.
```diff
- if not needs_probas:
-     preds_idx = np.argmax(all_results, axis=2)
-     preds_lbl = labels[preds_idx]
- # compute scoring metric
- score = np.zeros(time_max)
- for t in list(range(0, time_max)):
-     if needs_probas:
-         score[t] = func(data_y, all_results[:,t], **metric_kwargs)
-     else:
-         score[t] = func(data_y, preds_lbl[:,t], **metric_kwargs)
+ test_y = data_y[idxs_test]
+ score = np.zeros(time_max)
+ if needs_probas:
+     # Score using probabilities for each time point
+     for t in range(time_max):
+         score[t] = func(test_y, results_swp[:, t], **metric_kwargs)
+ else:
+     # Convert probabilities to label predictions once
+     preds_idx = np.argmax(results_swp, axis=2)
+     preds_lbl = labels[preds_idx]
+     # Score using predictions for each time point
+     for t in range(time_max):
+         score[t] = func(test_y, preds_lbl[:, t], **metric_kwargs)
```
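A self-contained sketch of the principle behind this suggestion, independent of the repository's code: each cross-validation fold is scored only against its own hold-out labels.

```python
# Standalone sketch of fold-scoped scoring (not the repository's implementation):
# every fold is scored against its own hold-out labels, never the full label vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

scores = []
for idxs_train, idxs_test in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[idxs_train], y[idxs_train])
    preds = clf.predict(X[idxs_test])
    # score against the current fold's hold-out labels only
    scores.append(accuracy_score(y[idxs_test], preds))

print(np.mean(scores))
```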
Activity
The function cross_validation_across_time can now use different scoring algorithms to determine classifier performance.
The parameter "metric" is added which accepts either a string keyword, refering to one of the scoring functions from sklearn.metrics, or alternatively a user-defined scoring function. Default is "accuracy".