
Conversation

@franluca (Contributor) commented Nov 24, 2023:

Change config and default aggregation-strategy behavior, introduce an unknown-ratio measure, and also split on \n

BREAKING CHANGES: ClassificationAccuracyConfig fields change: two fields are added and one is renamed

Description of changes:

  • (minor) Renamed the column that holds the model output after the converter function. Reason: the previous name was non-standard and misleading.
  • (minor) The default converter function now also splits on \n, via re.split. Reason: some models do not put whitespace before/after a newline, so parsing missed many valid outputs.
  • (major) Reworked the aggregation strategies for precision and recall. Reason: the previous default behavior was non-standard and ignored the difference between the binary and multiclass settings (see the sketch after this list).
    • Changes to ClassificationAccuracyConfig: the cleanest way to implement the correct behaviour is to change the config. I added binary_average_strategy and positive_label to handle the binary classification case, and I changed the default of multiclass_average_strategy, since micro is non-standard.
    • Precision and recall are now computed only on the subset of records that are not classified as unknown.
  • (major) Added a further metric that computes the ratio of unknown labels, to complement the new precision and recall computation. It is fine if this measure is not shown in the report for now; we can follow up on this.
  • (minor) Changed some docstrings.
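
A quick scikit-learn illustration (made-up data, not from this PR) of why micro is a poor multiclass default: it collapses precision and recall into the same number.

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["a", "a", "a", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]

# Micro averaging reduces multiclass precision and recall to the same
# value (overall accuracy), hiding the difference between the two metrics.
print(precision_score(y_true, y_pred, average="micro"))     # 0.6
print(recall_score(y_true, y_pred, average="micro"))        # 0.6

# Weighted averaging keeps the two metrics distinct.
print(precision_score(y_true, y_pred, average="weighted"))  # ~0.87
print(recall_score(y_true, y_pred, average="weighted"))     # 0.6
```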

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

```python
BALANCED_ACCURACY_SCORE = "balanced_accuracy_score"
PRECISION_SCORE = "precision_score"
RECALL_SCORE = "recall_score"
UNKNOWN_RATIO = "unknown_ratio"
```
Contributor:

We can't add new scores right now; we are very close to launch, and this could break integration tests for us and for clients.

Can we please stick to bug fixes and refactoring only? New scores should be added post-launch.

@franluca (Contributor Author):

Sure, we can put this one in later on.

```python
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"


def unknown_ratio(y_pred) -> float:
```
Contributor:

Please add type annotations.

@danielezhu (Contributor) commented Jan 12, 2024:

I think type annotations are tricky in these kinds of use cases, where the argument is best described as a "1D array-like" (see the sklearn docs, for example).

I suppose we could do something like Union[pandas.DataFrame, pandas.Series, numpy.ndarray, List[Any]], but that would be quite clunky, and not necessarily exhaustive.
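
For what it's worth, one possible shape for the annotation and the metric. This is only a sketch: ArrayLike1D and the "unknown" sentinel are assumptions, not names from this PR.

```python
from typing import Any, List, Union

import numpy as np
import pandas as pd

# Hypothetical alias for the clunky union discussed above.
ArrayLike1D = Union[pd.Series, np.ndarray, List[Any]]

# Assumed sentinel; the converter's actual fallback label may differ.
UNKNOWN_LABEL = "unknown"


def unknown_ratio(y_pred: ArrayLike1D) -> float:
    """Return the fraction of predictions that were not mapped to a valid label."""
    labels = list(y_pred)
    return sum(label == UNKNOWN_LABEL for label in labels) / len(labels)
```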

```python
valid_labels: Optional[List[str]] = None
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
```
Contributor:

If you update this, you have to update ClassificationAccuracySemanticRobustnessConfig as well.

Contributor:

Do we need a variable for binary_average_strategy="binary"? It could be hardcoded, since there are no other options for the binary case.

@franluca (Contributor Author):

@keerthanvasist OK, will do.
@polaschwoebel No, all the other options are valid for the binary case as well and result in different computations, so they still make sense. (See the illustration below.)
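
For illustration, a quick scikit-learn check (made-up data) showing that the strategies do diverge on binary data:

```python
from sklearn.metrics import precision_score

y_true = ["1", "1", "0", "0"]
y_pred = ["1", "0", "0", "0"]

# "binary" scores only the positive class.
print(precision_score(y_true, y_pred, average="binary", pos_label="1"))  # 1.0

# "macro" averages over both classes, giving a different number.
print(precision_score(y_true, y_pred, average="macro"))                  # ~0.83
```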

Contributor:

I see. I doubt that the others will be used much, since "binary" is very much the standard behavior, as we discussed offline. But if we can add this new parameter to the config without too much trouble, it is of course more flexible.

Contributor:

I very much agree with Pola; "binary" is pretty much the only option that makes sense here. I suppose giving the user freedom to choose other options is fine, since we've configured the default appropriately. If they choose something like "micro" and are confused why precision and recall are always the same, that's a self-inflicted customer error.

```python
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
positive_label: str = "1"
```
Contributor:

I am also concerned about this default; I am worried it won't play well with valid_labels. Let's discuss on Monday.

@franluca (Contributor Author):

Ok. What are you concerned about?

Contributor:

I agree with Keerthan, positive_label should be one of the valid_labels.

@franluca (Contributor Author):

Yes, true, but the problem is that if we autodetect the valid labels, I don't think the order of the resulting list is going to be deterministic (especially when we do downsampling). So if we pick, say, the first one, the chosen label might depend on the random seed, which is also not great.

What I can do is leave positive_label here and, if positive_label is not in valid_labels, pick the first element of valid_labels and print a warning (see the sketch below). Would this work?

Contributor:

I think so!
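
A minimal sketch of that fallback (the function name and logging details are assumptions, not code from this PR):

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def resolve_positive_label(positive_label: str, valid_labels: List[str]) -> str:
    """Fall back to the first valid label when positive_label is not among them."""
    if positive_label in valid_labels:
        return positive_label
    logger.warning(
        "positive_label %r is not in valid_labels %s; falling back to %r.",
        positive_label, valid_labels, valid_labels[0],
    )
    return valid_labels[0]
```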

@keerthanvasist (Contributor) left a comment:

More comments to come. We should discuss the changes, and we should not merge this new score now.

```diff
 valid_labels = [label.lower().strip() for label in valid_labels]

-response_words = model_output.split(" ")
+response_words = re.split(r"[\n\s+]", model_output)
```
Contributor:

This is not tested as far as I can tell, and I'm unsure what it does and how it relates to batching. Can you add a unit test, please?

@franluca (Contributor Author):

This is a standard regular expression. The square brackets denote a character class, i.e. a logical OR over the characters inside, so the string is split at newlines \n or any other whitespace \s. (Note that inside a character class, + matches a literal plus sign rather than acting as a quantifier.)

Contributor:

So we are anticipating newlines within a single answer? Are you imagining a case where the model replies with something like this?

```
The answer is
2
```

I still think a unit test would be a good way to show what we are matching here; it could just be one more case in test_convert_model_output_to_label.
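
Something like this extra case might do, added to the existing test module where the converter is already imported. A sketch only; it assumes convert_model_output_to_label returns the matched valid label:

```python
def test_convert_model_output_to_label_with_newline():
    # No space before the newline, so the old model_output.split(" ")
    # would have missed the label; splitting on whitespace catches it.
    assert convert_model_output_to_label("The answer is\n2", ["0", "1", "2"]) == "2"
```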

@franluca (Contributor Author):

Hi all, we need to converge on this. We can:

  1. Leave everything as is and wait until next week to make changes.
  2. I remove the new score and add the default handling of positive_label as described above (and propagate the changes to semantic robustness).
  3. The engineers pick this up. The main point is that micro is not a good default: we want binary when the task is binary classification and weighted when it is multiclass classification.

Let me know what you prefer.

```python
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"


def unknown_ratio(y_pred) -> float:
```
Contributor:

Perhaps rename this to unknown_fraction instead?

```python
        )
        return eval_outputs

    def _get_score(self, y_true, y_pred, eval_fn: Callable[..., float]) -> float:
```
Contributor:

Please add unit tests for _get_score covering the various cases (binary recall, binary precision, multiclass recall, multiclass precision, etc.).
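
To make those cases concrete, here is a sketch of the dispatch such tests would cover. The signature and config-field access are assumptions based on this diff, not the actual implementation:

```python
from typing import Callable, List

from sklearn.metrics import precision_score, recall_score


def _get_score(y_true, y_pred, eval_fn: Callable[..., float],
               config, valid_labels: List[str]) -> float:
    if eval_fn in (precision_score, recall_score):
        if len(valid_labels) == 2:
            # Binary task: use the binary strategy and the configured positive label.
            return eval_fn(y_true, y_pred,
                           average=config.binary_average_strategy,
                           pos_label=config.positive_label)
        # Multiclass task: use the configured multiclass strategy.
        return eval_fn(y_true, y_pred, average=config.multiclass_average_strategy)
    # Other metrics (e.g. balanced accuracy) take no averaging argument.
    return eval_fn(y_true, y_pred)
```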

```python
    EvalScore(name=UNKNOWN_RATIO, value=0.0),
]

DATASET_SCORES_WO_CONFIG = [
```
Contributor:

Can you explain what this represents? I see that you replaced all instances of DATASET_SCORES with DATASET_SCORES_WO_CONFIG. Shouldn't we still have some tests that use DATASET_SCORES? Also, could you please come up with a case where the expected precision and recall scores are different?
