GEMGen is a generative model for phenotype-based drug discovery.
It consists of two core components:
- Generator: generates small-molecule candidates conditioned on cell type information and lists of up- and down-regulated genes from differential expression analysis.
- Scorer: evaluates how well a generated small molecule matches the given phenotypic input.
The framework is designed for flexible inference and can be easily adapted to different biological contexts and datasets.
The preprint of our paper is now publicly available on bioRxiv.
Read it here: https://www.biorxiv.org/content/10.64898/2026.01.03.697483v2.full.pdf
conda create -n gemgen python=3.10 -y
conda activate gemgen
GEMGen depends on vllm==0.7.3, whose installation already pulls in most of the required Python packages (such as torch and transformers). You only need to install pandas and rdkit in addition.
We recommend installing all dependencies via requirements.txt:
pip install -r requirements.txt
Alternatively, you may install them manually:
pip install vllm==0.7.3
pip install pandas==2.3.3
pip install rdkit==2024.3.1
Note: This model only supports inference on NVIDIA GPUs. A CUDA-enabled environment with a compatible NVIDIA GPU is required.
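Since inference requires an NVIDIA GPU, a quick preflight check before running the demo scripts can save time. The sketch below is not part of GEMGen; it assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH and only tells you whether a GPU is listed, not whether the installed CUDA and torch builds are compatible. The function name `nvidia_gpu_visible` is hypothetical.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Driver tools not installed or not on PATH.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU visible:", nvidia_gpu_visible())
```

If this prints `False`, the generator and scorer demos are unlikely to run on that machine.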
Pretrained model checkpoints are required for both the generator and the scorer.
- Download the checkpoint folder from [Zenodo website] (will be released soon).
- Place the checkpoint directory at a desired local path.
- Update the corresponding checkpoint paths in the demo scripts:
  - demo/run_generator.sh
  - demo/run_scorer.sh
GEMGen expects structured input data describing phenotypic perturbations.
- Generator inputs
  - Cell type information
  - Lists of up-regulated and down-regulated genes
  - Prompt templates (see data/generator_prompts.txt and data/templates.txt)
- Scorer inputs
  - Generated molecules (e.g., SMILES)
  - Corresponding phenotypic conditions
  - Example format is provided in data/scorer_test_demo.tsv
Please refer to the files in the data/ directory for templates and example data formats.
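As an illustration of how a cell type and two gene lists turn into a generator prompt, here is a minimal sketch. The template wording is copied from one of the example prompts shown in this README; the placeholder names (`cell_type`, `up`, `down`) and the function `build_prompt` are hypothetical, and the actual templates shipped in data/templates.txt may use a different format.

```python
from typing import Sequence

# Assumption: wording taken from an example prompt in this README.
# The placeholder names are illustrative, not GEMGen's own.
TEMPLATE = (
    "In a cell population characterized as {cell_type}, formulate a compound "
    "that will upregulate the expression of {up} and downregulate the "
    "expression of {down}."
)


def build_prompt(
    cell_type: str,
    up_genes: Sequence[str],
    down_genes: Sequence[str],
    template: str = TEMPLATE,
) -> str:
    """Fill a prompt template with a cell type and comma-separated gene lists."""
    return template.format(
        cell_type=cell_type,
        up=", ".join(up_genes),
        down=", ".join(down_genes),
    )


if __name__ == "__main__":
    print(build_prompt("K562", ["PRG2", "CEP85L"], ["USP32", "MYD88"]))
```

For real runs, prefer the provided files in data/ so that prompts match what the model saw during training.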
git clone https://github.com/DLS5-Omics/GEMGen.git
cd GEMGen
The generator produces candidate small molecules conditioned on phenotypic inputs.
bash demo/run_generator.sh
The generator writes a single JSON file containing a list of results, where each entry corresponds to one input prompt from the multi-line prompts file. Specifically, the output is a list of lists, and each inner list has two elements:
- prompt (str): The original input prompt (exactly as read from the input prompts file).
- generated_smiles (List[str]): A list of SMILES strings representing the small molecules generated by the model in response to that prompt. The number of generated SMILES per prompt is determined by the user-specified sampling count parameter.
Example Structure:
[
  [
    "With the goal of altering gene expression patterns in K562, construct a compound that increases the expression of SAMD4B, MEIOB, ... and decreases the expression of ATPSCKMT, ...,MFHAS1.",
    [
      "CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC",
      "COC1=CC(=CC(=C1OC)OC)C2=C3C(=NC=N2)C=CC=C3N"
    ]
  ],
  [
    "In a cell population characterized as K562, formulate a compound that will upregulate the expression of PRG2, CEP85L, ..., KIAA0355 and downregulate the expression of USP32, ..., MYD88.",
    [
      "CC(=O)Nc1ccc(cc1)S(=O)(=O)N",
      "C1COCCN1C(=O)C(=O)N"
    ]
  ]
]
]
The evaluation step filters out duplicate and invalid molecules generated by the model. Optionally, it can also compute chemical metrics for the generated small molecules and their similarity to real molecules.
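To make the output structure and the filtering idea concrete, the sketch below loads a generator output file and keeps only unique, parseable molecules per prompt. It is not the code behind demo/run_evaluation.sh; the function names are hypothetical, and using RDKit canonical SMILES as the duplicate key is an assumption about what counts as a duplicate.

```python
import json
from typing import Callable, Dict, List, Optional, Sequence


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return the RDKit canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def load_results(path: str) -> List[list]:
    """Read the generator output: a list of [prompt, generated_smiles] pairs."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def filter_results(
    results: Sequence[Sequence],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> Dict[str, List[str]]:
    """Map each prompt to its unique, valid molecules in first-seen order."""
    filtered: Dict[str, List[str]] = {}
    for prompt, smiles_list in results:
        kept = filtered.setdefault(prompt, [])
        for smi in smiles_list:
            canon = canonicalize(smi)
            # Drop strings that fail to parse and molecules already kept.
            if canon is not None and canon not in kept:
                kept.append(canon)
    return filtered
```

A typical call would be `filter_results(load_results("results/generated_molecules.json"))`. The `canonicalize` argument is injectable so that a different validity or normalization rule can be swapped in.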
bash demo/run_evaluation.sh
- Input: The JSON file generated by the generator (e.g., results/generated_molecules.json).
- Output: A CSV file where each row corresponds to an input prompt and its associated valid generated molecules, along with evaluation metrics (if calculated).
The scorer evaluates the consistency between generated molecules and the input phenotypic signatures.
bash demo/run_scorer.sh
Typical outputs include:
- A list of matching scores between molecules and phenotypic inputs
A typical workflow:
- Prepare phenotypic input data (cell type + gene regulation).
- Run the generator to produce candidate molecules.
- Filter and post-process generated molecules (remove duplicates, invalid structures, etc.).
- Feed filtered molecules into the scorer.
- Select high-scoring molecules for downstream analysis or validation.
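The final selection step can be as simple as ranking molecules by their matching score. The sketch below assumes the scorer output has already been read into `(smiles, score)` pairs and that a higher score means a better match; both are assumptions, since the exact scorer output format is not described here. The function name `top_k` is hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k(
    scored: Sequence[Tuple[str, float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (smiles, score) pairs, best first."""
    # Assumption: a larger score indicates a closer phenotype match.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Placeholder identifiers and scores, for illustration only.
    candidates = [("mol_A", 0.2), ("mol_B", 0.9), ("mol_C", 0.5)]
    print(top_k(candidates, 2))
```

A fixed score threshold is an alternative to a fixed `k` when the number of acceptable candidates is not known in advance.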
This project is released under the terms of the license provided in the LICENSE file.