fix(pre-launch science review): changing config, default behaviour fo… #138
Conversation
…r aggregation strategy, introducing unknown ratio measure, splitting also considers \n BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields
BALANCED_ACCURACY_SCORE = "balanced_accuracy_score"
PRECISION_SCORE = "precision_score"
RECALL_SCORE = "recall_score"
UNKNOWN_RATIO = "unknown_ratio"
We can't add new scores right now. We are very close to launch, and this could break integration tests for us and for clients.
Can we please stick to bug fixes and refactoring only? New scores should be added post-launch.
Sure, we can add this one later on.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Please add type annotations
I think type annotations are tricky in these kinds of use cases, where the argument is best described as a "1D array-like" (see the sklearn docs, for example).
I suppose we could do something like Union[pandas.DataFrame, pandas.Series, numpy.ndarray, List[Any]], but that would be quite clunky, and not even necessarily exhaustive.
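For illustration, one possible shape for such an annotation, as a minimal sketch (the alias name ArrayLike1D and its exact membership are assumptions, not part of this PR):

    from typing import Any, List, Union

    import numpy as np
    import pandas as pd

    # Hypothetical alias for a "1D array-like" argument; membership is an assumption.
    ArrayLike1D = Union[pd.Series, np.ndarray, List[Any]]

    def unknown_ratio(y_pred: ArrayLike1D) -> float:
        ...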
valid_labels: Optional[List[str]] = None
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
If you update this, you have to update ClassificationAccuracySemanticRobustnessConfig as well.
Do we need a variable for binary_average_strategy="binary"? It can be hardcoded, since there are no other options for the binary case.
@keerthanvasist OK, will do.
@polaschwoebel No, all the other options are valid for the binary case as well, and they result in different computations, so they still make sense.
I see. I doubt that others will be used much -- "binary" is very much the standard behavior as we discussed offline -- but if we can add this new parameter to the config without too much trouble it's of course more flexible.
I very much agree with Pola; "binary" is pretty much the only option that makes sense here. I suppose giving the user freedom to choose other options is fine, since we've configured the default appropriately. If they choose something like "micro" and are confused why precision and recall are always the same, that's a self-inflicted customer error.
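To make the "precision and recall are always the same" point concrete, a standalone scikit-learn sketch (not code from this PR): with micro averaging the two metrics collapse to the same value, while "binary" scores only the positive class.

    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 0, 0]

    # "binary" scores only the positive class: precision 1.0, recall 0.5.
    print(precision_score(y_true, y_pred, average="binary"),
          recall_score(y_true, y_pred, average="binary"))

    # "micro" counts every prediction, so both come out to 0.75 here; for
    # single-label classification, micro precision and micro recall always coincide.
    print(precision_score(y_true, y_pred, average="micro"),
          recall_score(y_true, y_pred, average="micro"))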
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
positive_label: str = "1"
I am also concerned about this default. I am worried it won't play well with valid_labels. Let's discuss on Monday.
Ok. What are you concerned about?
I agree with Keerthan, positive_label should be one of the valid_labels.
Yes, true, but the problem is that if we autodetect the valid labels, I don't think the order of the resulting list is going to be deterministic (especially when we do downsampling). So if we pick, say, the first one, the chosen label might depend on the random seed, which is also not great.
What I can do is leave positive_label here, and if positive_label is not in valid_labels, pick the first of valid_labels and print a warning. Would that work?
I think so!
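A minimal sketch of the fallback being proposed, assuming a free-standing helper (the function name and exact warning text are illustrative, not taken from the PR):

    import logging
    from typing import List

    logger = logging.getLogger(__name__)

    def resolve_positive_label(positive_label: str, valid_labels: List[str]) -> str:
        # Keep the configured positive_label when it is valid; otherwise fall back
        # to the first valid label and warn, as discussed above.
        if positive_label in valid_labels:
            return positive_label
        logger.warning(
            "positive_label %r is not in valid_labels %s; falling back to %r.",
            positive_label, valid_labels, valid_labels[0],
        )
        return valid_labels[0]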
keerthanvasist left a comment
More comments to come. We should discuss the changes, and we should not merge this new score now.
valid_labels = [label.lower().strip() for label in valid_labels]

response_words = model_output.split(" ")
response_words = re.split(r"[\n\s+]", model_output)
This is not tested as far as I can tell, and I am unsure what it does and how it relates to batching. Can you add a unit test, please?
This is a standard regular expression. It matches newlines (\n) or one or more spaces (\s+). The [...] indicates set matching (which is the same as a logical OR among characters).
So we are anticipating new lines within a single answer? Are you imagining a case where the model replies something like this?
The answer is
2
I still think a unit test could be a good way to show what we are matching here; it could just be one more test case in test_convert_model_output_to_label.
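A sketch of what that extra test case could look like, exercising only the new split pattern (the surrounding test module and the converter's own assertions are not reproduced here):

    import re

    def test_split_handles_newlines():
        # With the new pattern, newlines separate tokens as well as spaces, so an
        # output like "The answer is\n2" still yields "2" as a candidate token.
        assert re.split(r"[\n\s+]", "The answer is\n2") == ["The", "answer", "is", "2"]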
Hi all, we need to converge on this. We can:
Let me know what you prefer.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Perhaps rename this to unknown_fraction instead?
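For readers following along, a plausible sketch of what the measure computes (the "unknown" sentinel and the exact counting are assumptions, not necessarily what the PR implements):

    from typing import Any, List

    UNKNOWN = "unknown"  # assumed sentinel for outputs that map to no valid label

    def unknown_ratio(y_pred: List[Any]) -> float:
        # Fraction of predictions that could not be mapped to a valid label.
        return sum(1 for pred in y_pred if pred == UNKNOWN) / len(y_pred)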
)
return eval_outputs

def _get_score(self, y_true, y_pred, eval_fn: Callable[..., float]) -> float:
Please add unit tests for _get_score for the various cases (binary recall, binary precision, multiclass recall, multiclass precision, etc.).
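One rough shape such tests could take, written directly against scikit-learn for illustration (the real tests would go through _get_score and its eval_fn argument, whose call signature is not reproduced here):

    import pytest
    from sklearn.metrics import precision_score, recall_score

    @pytest.mark.parametrize(
        "y_true, y_pred, eval_fn, kwargs, expected",
        [
            # binary precision vs binary recall differ on the same data
            ([1, 1, 1, 0], [1, 0, 0, 0], precision_score, {"average": "binary"}, 1.0),
            ([1, 1, 1, 0], [1, 0, 0, 0], recall_score, {"average": "binary"}, 1 / 3),
            # multiclass with micro averaging
            ([0, 1, 2, 2], [0, 2, 2, 2], precision_score, {"average": "micro"}, 0.75),
            ([0, 1, 2, 2], [0, 2, 2, 2], recall_score, {"average": "micro"}, 0.75),
        ],
    )
    def test_scores(y_true, y_pred, eval_fn, kwargs, expected):
        assert eval_fn(y_true, y_pred, **kwargs) == pytest.approx(expected)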
EvalScore(name=UNKNOWN_RATIO, value=0.0),
]

DATASET_SCORES_WO_CONFIG = [
Can you explain what this represents? I see that you replaced all instances of DATASET_SCORES with DATASET_SCORES_WO_CONFIG. Shouldn't we still have some tests that use DATASET_SCORES? Also, could you please come up with a case where the expected precision and recall scores are different?
changing config, default behavior for aggregation strategy, introducing unknown ratio measure, splitting also considers \n
BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields and renaming one
Description of changes:
- Use re.split for the default converter function. Reason: some models do not put white spaces before/after a new line, causing parsing to miss many valid outputs.
- Add binary_average_strategy and positive_label to handle the binary classification case. I changed the default of multiclass_average_strategy, as micro is non-standard.
- Introduce the unknown ratio measure.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.