Conversation

@sfluegel05
Collaborator

To use the chebai-models for the chebifier-ensemble, I have made some minor changes:

  • Split ratios can now be set independently for the validation and test sets. This allows, e.g., a larger validation set, which we need to calculate metrics or (potentially) do some ensemble training.
  • Prediction smoothing is now more general. We use this in the ensemble to ensure all predictions are consistent.
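
As a rough illustration of what independent split ratios mean, here is a minimal sketch in plain Python. It is not the chebai implementation: the function name `split_indices`, its defaults, and the simple shuffle-and-cut strategy are assumptions made for this example, and it ignores any stratification the real data module may perform. Both ratios are treated as fractions of the whole dataset.

```python
import random


def split_indices(n_samples, test_split=0.1, validation_split=0.05, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts.

    Hypothetical helper: both ratios are fractions of the *whole* dataset,
    so changing one does not change the size of the other.
    """
    if test_split < 0 or validation_split < 0:
        raise ValueError("split ratios must be non-negative")
    if test_split + validation_split >= 1:
        raise ValueError("test_split + validation_split must be below 1")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_samples * test_split)
    n_val = round(n_samples * validation_split)

    test = indices[:n_test]
    validation = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # everything that is left over
    return train, validation, test


train, validation, test = split_indices(1000)
print(len(train), len(validation), len(test))
```

With this design, enlarging the validation set only shrinks the training set; the test set size stays where it was.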

@sfluegel05 merged commit 1e2a043 into dev on Jul 10, 2025
2 of 6 checks passed
@sfluegel05 deleted the feature-better-split-ratios branch on July 10, 2025, 10:28

@aditya0by0
Member

@sfluegel05,

Before this PR, with the default values,

  • Test split: 12.75% of the data
  • Validation split: ~2.309% of the data
  • Train split: ~84.94% of the data

After this PR, with the new default values,

  • Test split: 10% of the data
  • Validation split: ~5.004% of the data
  • Train split: ~84.9999% of the data

Shouldn’t we adjust the new default values to maintain the same data distribution as before the change?

@sfluegel05
Collaborator Author

Well, changing the default values kind of was my intention. I don't think there has ever been an actual reason for the validation set being 15% of 15%. I can't think of a scenario where it is important to have the same split sizes but where you are not re-using the actual splits.
The advantage of the new ratio is that you get more validation data (for ChEBI50, the smallest classes have only 50 members, meaning 2.3% equates to about 1 positive sample in the validation set, which is not very informative).
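
The arithmetic behind these numbers can be sketched as follows. The function names are hypothetical, and the old scheme is reconstructed from the description in this thread (a 15% hold-out, of which 15% becomes validation and the rest test). The nominal old validation share is 0.15 × 0.15 = 2.25%, close to the ~2.309% measured above; the gap presumably comes from how the actual splitter assigns samples.

```python
def old_fractions(train_split=0.85, validation_of_holdout=0.15):
    """Old scheme as described in the thread: validation is a share of the hold-out."""
    holdout = 1.0 - train_split
    validation = holdout * validation_of_holdout  # "15% of 15%"
    test = holdout - validation
    return {"train": train_split, "validation": validation, "test": test}


def new_fractions(test_split=0.10, validation_split=0.05):
    """New scheme: both ratios are fractions of the whole dataset."""
    train = 1.0 - test_split - validation_split
    return {"train": train, "validation": validation_split, "test": test_split}


def expected_positives(class_size, fraction):
    """Expected number of members of one class that land in a split."""
    return class_size * fraction


for name, fractions in (("old", old_fractions()), ("new", new_fractions())):
    val = fractions["validation"]
    positives = expected_positives(50, val)
    print(f"{name}: validation share {val:.4f}, "
          f"about {positives:.2f} positives for a 50-member class")
```

For a 50-member class, the expected number of validation positives goes from roughly 1 under the old defaults to 2.5 under the new ones.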

@aditya0by0
Member

#99 (comment)
#97 (comment)

The unit-test failures originate from this PR, which was merged into dev even though its tests were failing at the time. The root cause is that self.train_split was replaced with self.test_split and self.validation_split.
The test case used self.train_split to set a specific data distribution, so it breaks after the parameter changes introduced in this PR.
I’ll adjust the tests to use the new parameters accordingly.

@aditya0by0 restored the feature-better-split-ratios branch on July 11, 2025, 14:03

@sfluegel05
Collaborator Author

Thanks, that is my bad. Somehow I missed that the tests were failing for this PR.

@sfluegel05 changed the title from "Split ration and inconsistency removal for ensemble" to "Split ratio and inconsistency removal for ensemble" on Jul 24, 2025