Collaborator

@vidvath7 vidvath7 commented Sep 20, 2024

chebi_augmentation.yml

aug_data:
Set this parameter to True to enable the generation of an augmented dataset. When set to False, only the regular dataset will be generated.

augment_data_batch_size:
Configure this parameter to define the batch size for processing the augmented data.pkl file.

num_smiles_variations:
Use this parameter to specify the maximum number of SMILES variations to generate for each compound. This helps control the diversity of the generated dataset.
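As a rough illustration of how the three parameters above fit together, here is a minimal sketch that validates them once they have been read from the YAML file. The `AugmentationConfig` class, the `config_from_mapping` helper, and the default values are hypothetical and not taken from this PR; only the three key names come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentationConfig:
    """Hypothetical container mirroring the three keys described above."""

    aug_data: bool = False                  # False: only the regular dataset is generated
    augment_data_batch_size: int = 10_000   # assumed default, not from the PR
    num_smiles_variations: int = 5          # assumed default, not from the PR

    def __post_init__(self):
        if not isinstance(self.aug_data, bool):
            raise TypeError("aug_data must be True or False")
        if self.augment_data_batch_size < 1:
            raise ValueError("augment_data_batch_size must be at least 1")
        if self.num_smiles_variations < 1:
            raise ValueError("num_smiles_variations must be at least 1")


def config_from_mapping(raw: dict) -> AugmentationConfig:
    """Build a config from an already-parsed mapping, rejecting unknown keys."""
    known = {"aug_data", "augment_data_batch_size", "num_smiles_variations"}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"unknown augmentation keys: {sorted(unknown)}")
    return AugmentationConfig(**raw)
```

Rejecting unknown keys is a design choice in this sketch: it catches misspelled keys in the config file early.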

chebi.py:

AugmentedDataExtractor Class
This class inherits from the main ChEBI data extractor and specializes in generating augmented datasets. It supports custom configuration of the batch size and the number of SMILES variations.

augment_data():
Verifies the existence of the original data.pkl file and, if found, generates and saves the augmented data.pkl file in the specified augmented directory.
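The flow described for augment_data() could look roughly like the sketch below. It is not the code from this PR. It assumes, purely for illustration, that data.pkl holds a pickled list of dict records with a "SMILES" key, that original records are kept alongside their variations, and that the variation generator is passed in as a callable; `iter_batches` and `make_variations` are hypothetical names.

```python
import pickle
from pathlib import Path
from typing import Callable, Iterator, List, Optional


def iter_batches(records: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def augment_data(
    original_dir: Path,
    augmented_dir: Path,
    make_variations: Callable[[str], List[str]],
    batch_size: int = 1000,
) -> Optional[Path]:
    """Write an augmented data.pkl; return its path, or None if the source is missing."""
    source = Path(original_dir) / "data.pkl"
    if not source.is_file():
        return None  # nothing to augment without the original file

    with source.open("rb") as fh:
        records = pickle.load(fh)  # assumed: list of dicts with a "SMILES" key

    augmented = []
    for batch in iter_batches(records, batch_size):
        for record in batch:
            augmented.append(record)  # assumption: keep the original entry
            for variant in make_variations(record["SMILES"]):
                augmented.append({**record, "SMILES": variant})

    target_dir = Path(augmented_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "data.pkl"
    with target.open("wb") as fh:
        pickle.dump(augmented, fh)
    return target
```

Passing the generator as a callable keeps the file handling independent of any particular cheminformatics library.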

generate_smiles_variations():
Produces SMILES variations based on different configurations like rooted atoms and randomization.
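One way to produce rooted and randomized SMILES strings is RDKit's `Chem.MolToSmiles` with its `rootedAtAtom` and `doRandom` options. The sketch below assumes RDKit is the toolkit in use, which the description does not state; the function signature, the `max_attempts` cap, and the choice to exclude the input string from the result are illustrative assumptions.

```python
from rdkit import Chem


def generate_smiles_variations(smiles: str, max_variations: int,
                               max_attempts: int = 50) -> list:
    """Return up to max_variations distinct SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # unparsable input: no variations

    variations = []
    seen = {smiles}  # assumption: do not return the input string itself

    def add(candidate: str) -> None:
        if candidate not in seen and len(variations) < max_variations:
            seen.add(candidate)
            variations.append(candidate)

    # Rooted variations: start the SMILES string at each atom in turn.
    for atom_idx in range(mol.GetNumAtoms()):
        add(Chem.MolToSmiles(mol, rootedAtAtom=atom_idx, canonical=False))

    # Randomized variations: let RDKit pick a random atom ordering.
    for _ in range(max_attempts):
        if len(variations) >= max_variations:
            break
        add(Chem.MolToSmiles(mol, doRandom=True, canonical=False))

    return variations
```

Small molecules have only a few distinct SMILES strings, so the function may return fewer than `max_variations` entries; the attempt cap keeps the randomized loop from running indefinitely in that case.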

setup_processed():
Prepares processed data for the augmented ChEBI dataset by transforming and saving it in the required format.

@sfluegel05 sfluegel05 marked this pull request as draft September 20, 2024 08:41
@sfluegel05
Collaborator

This has gotten wildly out of date, but it would actually be a useful feature. At this point, it is probably easier to re-implement it based on the current dev branch.

@aditya0by0 could you give this a try? If you need more background information, we can also discuss this at our next meeting.

@sfluegel05 sfluegel05 mentioned this pull request Jul 29, 2025
@aditya0by0 aditya0by0 closed this Jul 29, 2025
@aditya0by0 aditya0by0 deleted the feature/data-augmentation branch July 29, 2025 17:35