DLS5-Omics/GEMGen

Phenotype-Guided In Silico Molecular Generation Using Large Language Models

GEMGen is a generative model for phenotype-based drug discovery.
It consists of two core components:

  • Generator: generates small-molecule candidates conditioned on cell type information and lists of up- and down-regulated genes from differential expression analysis.
  • Scorer: evaluates how well a generated small molecule matches the given phenotypic input.

The framework is designed for flexible inference and can be easily adapted to different biological contexts and datasets.

Preprint Now Available on bioRxiv

The preprint of our paper is now publicly available on bioRxiv.
Read it here: https://www.biorxiv.org/content/10.64898/2026.01.03.697483v2.full.pdf

Installation

1. Create a Conda Environment

conda create -n gemgen python=3.10 -y
conda activate gemgen

2. Install Dependencies

GEMGen depends on vllm==0.7.3, which pulls in most of the required Python packages (such as torch and transformers). The only additional packages you need to install are pandas and rdkit.

We recommend installing all dependencies via requirements.txt:

pip install -r requirements.txt

Alternatively, you may install them manually:

pip install vllm==0.7.3
pip install pandas==2.3.3
pip install rdkit==2024.3.1

Note: GEMGen supports inference only on NVIDIA GPUs. A CUDA-enabled environment with a compatible NVIDIA GPU is required.


Model Checkpoints

Pretrained model checkpoints are required for both the generator and the scorer.

  1. Download the checkpoint folder from [Zenodo website] (will be released soon).

  2. Place the checkpoint directory at a desired local path.

  3. Update the corresponding checkpoint paths in the demo scripts:

    • demo/run_generator.sh
    • demo/run_scorer.sh

Data Preparation

GEMGen expects structured input data describing phenotypic perturbations.

  • Generator inputs

    • Cell type information
    • Lists of up-regulated and down-regulated genes
    • Prompt templates (see data/generator_prompts.txt and data/templates.txt)
  • Scorer inputs

    • Generated molecules (e.g., SMILES)
    • Corresponding phenotypic conditions
    • Example format is provided in data/scorer_test_demo.tsv

Please refer to the files in the data/ directory for templates and example data formats.
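
As a rough illustration of how the generator inputs fit together, the sketch below fills a prompt template with a cell type and two gene lists. The template string is an assumption modeled on the wording of the example prompts in this README; the actual templates are the ones in data/templates.txt, and `build_prompt` is a hypothetical helper, not part of GEMGen.

```python
from typing import Sequence

# Hypothetical template, modeled on the example prompts in this README.
# The real templates shipped with GEMGen are in data/templates.txt.
TEMPLATE = (
    "With the goal of altering gene expression patterns in {cell_type}, "
    "construct a compound that increases the expression of {up_genes} "
    "and decreases the expression of {down_genes}."
)


def build_prompt(
    cell_type: str,
    up_genes: Sequence[str],
    down_genes: Sequence[str],
    template: str = TEMPLATE,
) -> str:
    """Fill a prompt template with a cell type and comma-separated gene lists."""
    return template.format(
        cell_type=cell_type,
        up_genes=", ".join(up_genes),
        down_genes=", ".join(down_genes),
    )


# One prompt per line is what the generator reads from the prompts file.
prompt = build_prompt("K562", ["SAMD4B", "MEIOB"], ["ATPSCKMT", "MFHAS1"])
print(prompt)
```

Writing one such prompt per line to a text file gives a file in the same multi-line shape that the generator reads.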


Usage

1. Clone the Repository

git clone https://github.com/DLS5-Omics/GEMGen.git
cd GEMGen

2. Run the Generator

The generator produces candidate small molecules conditioned on phenotypic inputs.

bash demo/run_generator.sh

The generator writes a single JSON file containing a list of results, with one entry per line of the input prompts file. Each entry is itself a list with two elements:

  1. prompt (str): The original input prompt (exactly as read from the input prompts file).
  2. generated_smiles (List[str]): A list of SMILES strings representing the small molecules generated by the model in response to that prompt. The number of generated SMILES per prompt is determined by the user-specified sampling count parameter.

Example Structure:

[
  [
    "With the goal of altering gene expression patterns in K562, construct a compound that increases the expression of SAMD4B, MEIOB, ... and decreases the expression of ATPSCKMT, ...,MFHAS1.",
    [
      "CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC",
      "COC1=CC(=CC(=C1OC)OC)C2=C3C(=NC=N2)C=CC=C3N"
    ]
  ],
  [
    "In a cell population characterized as K562, formulate a compound that will upregulate the expression of PRG2, CEP85L, ..., KIAA0355 and downregulate the expression of USP32, ..., MYD88.",
    [
      "CC(=O)Nc1ccc(cc1)S(=O)(=O)N",
      "C1COCCN1C(=O)C(=O)N"
    ]
  ]
]
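
The structure above can be read back with the standard library alone. The following is a minimal sketch, assuming the file has exactly the list-of-pairs layout shown; `parse_generated` and `load_generated` are hypothetical helper names, not functions provided by GEMGen.

```python
import json
from typing import Dict, List


def parse_generated(results: list) -> Dict[str, List[str]]:
    """Map each prompt to the SMILES strings generated for it.

    `results` is the decoded JSON: a list of [prompt, generated_smiles] pairs.
    """
    by_prompt: Dict[str, List[str]] = {}
    for entry in results:
        if len(entry) != 2:
            raise ValueError(f"expected [prompt, smiles_list], got {entry!r}")
        prompt, smiles_list = entry
        # Merge the lists if the same prompt appears on more than one line.
        by_prompt.setdefault(prompt, []).extend(smiles_list)
    return by_prompt


def load_generated(path: str) -> Dict[str, List[str]]:
    """Read a generator output file and index it by prompt."""
    with open(path, encoding="utf-8") as fh:
        return parse_generated(json.load(fh))
```

Calling `load_generated` on the generator's output file returns a dictionary keyed by prompt, which is convenient for counting how many molecules were sampled per prompt.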

3. Run the Evaluation

The evaluation step removes duplicate and invalid molecules from the generator output. You can optionally compute chemical metrics for the generated molecules, as well as their similarity to real molecules.

bash demo/run_evaluation.sh

  • Input: The JSON file generated by the generator (e.g., results/generated_molecules.json).

  • Output: A CSV file where each row corresponds to an input prompt and its associated valid generated molecules, along with evaluation metrics (if calculated).
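
To make the filtering idea concrete, here is a minimal sketch of validity and duplicate removal using RDKit canonical SMILES. It is not the code behind demo/run_evaluation.sh, whose exact criteria are not described here; the function names are hypothetical, and treating two molecules as duplicates when their canonical SMILES match is an assumption.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Imported lazily so filter_valid_unique can be used with another canonicalizer.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def filter_valid_unique(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> List[str]:
    """Drop unparsable and empty SMILES, then keep the first copy of each molecule.

    Assumption: two strings are duplicates when their canonical forms are equal.
    """
    seen = set()
    kept: List[str] = []
    for smi in smiles_list:
        if not smi.strip():
            continue  # skip empty strings, which a parser may accept as an empty molecule
        canon = canonicalize(smi)
        if canon is None or canon in seen:
            continue
        seen.add(canon)
        kept.append(canon)
    return kept
```

Passing the `generated_smiles` list of one prompt to `filter_valid_unique` yields the canonical, de-duplicated subset for that prompt, in the order the molecules were first generated.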

4. Run the Scorer

The scorer evaluates the consistency between generated molecules and the input phenotypic signatures.

bash demo/run_scorer.sh

Typical outputs include:

  • A list of matching scores between molecules and phenotypic inputs

Example Workflow

  1. Prepare phenotypic input data (cell type + gene regulation).
  2. Run the generator to produce candidate molecules.
  3. Filter and post-process generated molecules (remove duplicates, invalid structures, etc.).
  4. Feed filtered molecules into the scorer.
  5. Select high-scoring molecules for downstream analysis or validation.
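
The last step of the workflow can be sketched as a simple ranking. This assumes the scorer's matching scores have already been loaded as one number per molecule and that a higher score means a better match; neither the score scale nor the output file format is specified in this README, and `top_k` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def top_k(
    smiles: Sequence[str], scores: Sequence[float], k: int = 10
) -> List[Tuple[str, float]]:
    """Return the k (smiles, score) pairs with the highest scores.

    Assumption: a higher matching score indicates a better match.
    """
    if len(smiles) != len(scores):
        raise ValueError("smiles and scores must have the same length")
    ranked = sorted(zip(smiles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Because Python's sort is stable, molecules with equal scores keep their original relative order.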

License

This project is released under the terms of the license provided in the LICENSE file.
