Split ratio and inconsistency removal for ensemble #102

sfluegel05 · 2025-07-10T10:28:17Z

To use the chebai-models for the chebifier-ensemble, I have made some minor changes:

Split ratios can now be set independently for the validation and test sets. This allows, for example, a larger validation set, which we need to calculate metrics or (potentially) do some ensemble training.
Prediction smoothing is now more general. We use it in the ensemble to ensure that all predictions are consistent.

aditya0by0 · 2025-07-10T19:13:14Z

@sfluegel05,

Before this PR, with the default values,

Test split: 12.75% of the data
Validation split: ~2.309% of the data
Train split: ~84.94% of the data

After this PR, with the new default values,

Test split: 10% of the data
Validation split: ~5.004% of the data
Train split: ~84.9999% of the data

Shouldn’t we adjust the new default values to maintain the same data distribution as before the change?

sfluegel05 · 2025-07-11T09:20:18Z

Well, changing the default values kind of was my intention. I don't think there has ever been an actual reason for the validation set being 15% of 15%. I can't think of a scenario where it is important to have the same split sizes but where you are not re-using the actual splits.
The advantage of the new ratio is that you get more validation data. For ChEBI50, the smallest classes have only 50 members, so 2.3% equates to about 1 positive sample in the validation set, which is not very informative.
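
For illustration, the arithmetic behind these numbers can be sketched as follows. This is only a rough sketch, not the chebai implementation: the function names are hypothetical, the old scheme is assumed to hold out 15% and give 15% of that part to validation ("15% of 15%"), and the new defaults are assumed to be 10% test and 5% validation, as reported above. The measured sizes in the thread (~2.309%, ~5.004%) differ slightly from these nominal values.

```python
def old_nominal_split(train_split: float = 0.85) -> dict[str, float]:
    """Nominal fractions under the old single-parameter scheme (assumed).

    The held-out part (1 - train_split) is divided again with the same
    ratio, so validation ends up as "15% of 15%" for train_split = 0.85.
    """
    held_out = 1.0 - train_split
    validation = held_out * (1.0 - train_split)
    test = held_out - validation
    return {"train": train_split, "validation": validation, "test": test}


def new_nominal_split(
    test_split: float = 0.10, validation_split: float = 0.05
) -> dict[str, float]:
    """Nominal fractions when test and validation ratios are set independently."""
    train = 1.0 - test_split - validation_split
    return {"train": train, "validation": validation_split, "test": test_split}


def expected_positives(class_size: int, fraction: float) -> float:
    """Expected number of members of one class that land in a split."""
    return class_size * fraction


if __name__ == "__main__":
    old = old_nominal_split()
    new = new_nominal_split()
    for name, split in (("old", old), ("new", new)):
        rounded = {key: round(value, 4) for key, value in split.items()}
        # Smallest ChEBI50 classes have 50 members.
        positives = expected_positives(50, split["validation"])
        print(name, rounded, "validation positives:", round(positives, 3))
```

Under these assumptions, a class with 50 members contributes roughly one positive validation sample with the old ratio and between two and three with the new one.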

aditya0by0 · 2025-07-11T13:13:13Z

#99 (comment)
#97 (comment)

The failing unit tests originate from this PR, which was merged into dev even though its test cases were failing at the time. The root cause is that self.train_split was replaced with self.test_split and self.validation_split.
In the test case, self.train_split was used to set a specific data distribution, so the test breaks after the parameter changes introduced in this PR.
I'll adjust the tests to use the new parameters accordingly.

sfluegel05 · 2025-07-14T12:08:41Z

Thanks, that is my bad. Somehow I missed that the tests were failing for this PR.

sfluegel05 added 4 commits June 24, 2025 17:19

add test and validation split parameters

77e96e8

fix test splits

7668858

update documentation for generate_class_properties.py

5e0b683

update inconsistency removal for ensemble

71521df

sfluegel05 merged commit 1e2a043 into dev Jul 10, 2025
2 of 6 checks passed

sfluegel05 deleted the feature-better-split-ratios branch July 10, 2025 10:28

aditya0by0 restored the feature-better-split-ratios branch July 11, 2025 14:03

aditya0by0 mentioned this pull request Jul 11, 2025

Adapt Unittest for new splits parameters #103

Merged

sfluegel05 changed the title ~~Split ration and inconsistency removal for ensemble~~ Split ratio and inconsistency removal for ensemble Jul 24, 2025
