Question about the function `_process_celltype`

We really appreciate the excellent work, but there is one point we find a bit confusing.

The TOML file we use for training on PBMC looks like this:
```
# example_config.toml
# Dataset paths - maps dataset names to their directories
[datasets]
parse = "pbmc/"

# Training specifications
# All cell types in a dataset automatically go into training (excluding zeroshot/fewshot overrides)
[training]
parse = "train"

# Zeroshot specifications - entire cell types go to val or test
[zeroshot]

# Fewshot specifications - explicit perturbation lists
[fewshot]
[fewshot."parse.Donor1"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']

[fewshot."parse.Donor4"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']

[fewshot."parse.Donor9"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']

[fewshot."parse.Donor12"]
val = ['G-CSF', 'IFN-beta', 'M-CSF', 'IGF-1']
test = ['C5a', 'TWEAK', 'LIF', 'IL-17C', 'FGF-beta', 'BAFF', 'FLT3L', 'IFN-lambda2', '4-1BBL', 'IL-17E', 'IL-2', 'IL-19', 'PRL', 'IL-16', 'IL-7', 'GM-CSF', 'IL-17A', 'IL-13', 'IL-36-alpha', 'PSPN', 'EPO', 'IL-9', 'IL-31', 'IL-11', 'LT-alpha1-beta2', 'TL1A', 'IL-32-beta', 'IFN-omega', 'IL-8', 'Noggin', 'IFN-epsilon', 'IL-17F', 'Leptin', 'IL-33', 'IL-35', 'OSM', 'IL-34', 'HGF', 'IL-1Ra', 'LIGHT', 'CD27L', 'SCF', 'IL-1-alpha', 'TGF-beta1', 'CD30L', 'IL-24', 'IFN-gamma', 'IFN-alpha1', 'IFN-lambda1', 'GITRL', 'VEGF', 'ADSF', 'IL-15', 'EGF', 'IL-3', 'IL-17B', 'IL-6', 'IL-4', 'IL-10', 'CT-1', 'IFN-lambda3', 'TRAIL']

```
The function `_process_celltype` in `PerturbationDataModule` looks like this:
```
    def _process_celltype(
        self,
        ds: PerturbationDataset,
        celltype: str,
        ct_indices: np.ndarray,
        ctrl_indices: np.ndarray,
        pert_indices: np.ndarray,
        cache,
        dataset_name: str,
        zeroshot_celltypes: dict[str, str],
        fewshot_celltypes: dict[str, dict[str, list[str]]],
        is_training_dataset: bool,
    ) -> dict[str, int]:
        """Process a single cell type and return counts for each split."""
        counts = {"train": 0, "val": 0, "test": 0}

        if celltype in zeroshot_celltypes:
            ...

        elif celltype in fewshot_celltypes:
            # Fewshot: split perturbations according to config
            pert_config = fewshot_celltypes[celltype]
            split_counts = self._split_fewshot_celltype(
                ds, pert_indices, ctrl_indices, cache, pert_config
            )
            for split, count in split_counts.items():
                counts[split] += count

        elif is_training_dataset:
            ...

        return counts
```
The keys of the dict `fewshot_celltypes` will be `["Donor1", "Donor4", "Donor9", "Donor12"]`, which are donor IDs rather than cell type names, so `celltype in fewshot_celltypes` will always be `False`. How should we understand this?
