We really appreciate the excellent work, but there is one point that we find a bit confusing.
The TOML file used for training on PBMC looks like this:
# example_config.toml
# Dataset paths - maps dataset names to their directories
[datasets]
parse = "pbmc/"
# Training specifications
# All cell types in a dataset automatically go into training (excluding zeroshot/fewshot overrides)
[training]
parse = "train"
# Zeroshot specifications - entire cell types go to val or test
[zeroshot]
# Fewshot specifications - explicit perturbation lists
[fewshot]
[fewshot."parse.Donor1"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']
[fewshot."parse.Donor4"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']
[fewshot."parse.Donor9"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']
[fewshot."parse.Donor12"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']
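To make the key structure concrete, here is a minimal sketch of how a TOML parser reads the fewshot section. It uses a shortened copy of the config (the test lists are truncated for brevity) and the standard-library tomllib module. The helper split_fewshot_key is hypothetical and only illustrates one plausible way to separate the dataset name from the group name; it is not taken from the library.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport exposing the same API

# Shortened copy of the config above (test lists truncated for brevity).
CONFIG = """
[datasets]
parse = "pbmc/"

[training]
parse = "train"

[zeroshot]

[fewshot]
[fewshot."parse.Donor1"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF']

[fewshot."parse.Donor4"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF']
"""

config = tomllib.loads(CONFIG)

# A quoted key such as "parse.Donor1" is one key that contains a dot,
# not a nested table parse -> Donor1.
fewshot_keys = list(config["fewshot"])


def split_fewshot_key(key: str) -> tuple[str, str]:
    """Hypothetical helper: split 'dataset.group' on the first dot only."""
    dataset, group = key.split(".", 1)
    return dataset, group


groups = [split_fewshot_key(k) for k in fewshot_keys]

print(fewshot_keys)
print(groups)
```

Under this assumed splitting rule, the group names for the parse dataset are the donor labels, which is the premise of the question below.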
The _process_celltype method in PerturbationDataModule looks like this:
def _process_celltype(
    self,
    ds: PerturbationDataset,
    celltype: str,
    ct_indices: np.ndarray,
    ctrl_indices: np.ndarray,
    pert_indices: np.ndarray,
    cache,
    dataset_name: str,
    zeroshot_celltypes: dict[str, str],
    fewshot_celltypes: dict[str, dict[str, list[str]]],
    is_training_dataset: bool,
) -> dict[str, int]:
    """Process a single cell type and return counts for each split."""
    counts = {"train": 0, "val": 0, "test": 0}
    if celltype in zeroshot_celltypes:
        ...
    elif celltype in fewshot_celltypes:
        # Fewshot: split perturbations according to config
        pert_config = fewshot_celltypes[celltype]
        split_counts = self._split_fewshot_celltype(
            ds, pert_indices, ctrl_indices, cache, pert_config
        )
        for split, count in split_counts.items():
            counts[split] += count
    elif is_training_dataset:
        ...
    return counts
The keys of the fewshot_celltypes dict will be [Donor1, Donor4, Donor9, Donor12], so it seems that celltype in fewshot_celltypes will always be False. How should we understand this?
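The concern can be reproduced in isolation. The sketch below copies only the branch order of _process_celltype into a hypothetical route function and assumes the dict keys are the donor labels from the config. The cell-type labels "CD4 T" and "B" are made up for illustration. The snippet above does not show whether the caller passes biological cell types or donor labels as celltype, so both cases are shown.

```python
# Keys as described in the question: donor labels taken from the config.
# The perturbation lists are shortened; only the keys matter here.
fewshot_celltypes = {
    "Donor1": {"val": ["G-CSF"], "test": ["C5a"]},
    "Donor4": {"val": ["G-CSF"], "test": ["C5a"]},
    "Donor9": {"val": ["G-CSF"], "test": ["C5a"]},
    "Donor12": {"val": ["G-CSF"], "test": ["C5a"]},
}
zeroshot_celltypes: dict[str, str] = {}  # the [zeroshot] table is empty


def route(
    celltype: str,
    zeroshot: dict[str, str],
    fewshot: dict[str, dict[str, list[str]]],
    is_training_dataset: bool,
) -> str:
    """Hypothetical stand-in that mirrors only the if/elif order."""
    if celltype in zeroshot:
        return "zeroshot"
    elif celltype in fewshot:
        return "fewshot"
    elif is_training_dataset:
        return "train"
    return "skipped"


# Case A (assumption): celltype holds biological cell-type labels.
# No label matches a donor key, so the fewshot branch is never taken.
case_a = [
    route(ct, zeroshot_celltypes, fewshot_celltypes, True)
    for ct in ["CD4 T", "B"]
]

# Case B (assumption): celltype holds donor labels.
# Every label matches a key, so the fewshot branch is taken.
case_b = [
    route(ct, zeroshot_celltypes, fewshot_celltypes, True)
    for ct in ["Donor1", "Donor12"]
]

print(case_a)
print(case_b)
```

In case A all cells of the training dataset fall through to the training branch and the val and test lists are silently ignored, which is the behavior the question is worried about. In case B the fewshot configuration takes effect, so the answer depends on which column supplies the celltype values.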