Similarity & COverage Prioritization and Evaluation for DNA-Encoded Libraries (SCOPE-DEL)
Notebooks, data, and results for GTM-driven chemography to prioritize DNA-Encoded Libraries (DELs) by balancing similarity and chemical-space coverage.
- Repository structure
- Quick start (PDM)
- Data
- Reproducing figures & tables
- Results
- Contributing
- License
- Acknowledgments
SCOPE-DEL/
├── notebooks/
│   └── Analysis_results.ipynb   # Metrics, comparisons, figure export
├── data/
│   ├── raw/                     # Input sources (unmodified)
│   └── processed/               # Cleaned data ready for modeling
├── results/
│   └── dels_100/                # Metrics
├── docs/
│   └── assets/
│       └── SCOPE-del-logo.png
├── pyproject.toml               # PDM project config
├── pdm.lock                     # Locked dependency graph (generated)
├── LICENSE                      # MIT
└── README.md
Requires Python 3.10+ and PDM 2. If you don’t have PDM:
pipx install pdm # recommended
# or: pip install -U pdm
- Create and activate the PDM virtualenv
# inside the repo root
pdm venv create -i python3.10 # or your preferred interpreter
pdm use -f .venv/bin/python # on Windows: .venv\Scripts\python.exe
- Install dependencies
pdm install
Notes:
- RDKit wheels are platform-dependent. The project uses rdkit/rdkit-pypi. If installation fails on your OS, see RDKit’s wheel/conda guidance or switch the fingerprinting backend in the notebooks.
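Before opening the notebooks, it can help to confirm that RDKit is importable from the PDM environment. The snippet below is a minimal sketch, not part of this repository: the helper name `check_backend` is hypothetical, and it only tests whether a module can be found, not whether it works correctly.

```python
import importlib.util


def check_backend(module_name: str = "rdkit") -> bool:
    """Return True if `module_name` can be located in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if check_backend("rdkit"):
        print("RDKit is available.")
    else:
        print("RDKit not found; see RDKit's wheel/conda guidance.")
```

Saved to a file of your choice, it can be run inside the project environment with `pdm run python <file>`.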
- Run the notebooks
# (optional) register a Jupyter kernel named “SCOPE-del”
pdm run python -m ipykernel install --user --name SCOPE-del
# start Jupyter
pdm run jupyter lab # or: pdm run jupyter notebook
- Recreate exports
  - Run notebooks/DataPreprocessing.ipynb to populate data/processed/.
  - Run notebooks/GTM_GMM_optimization_benchmark.ipynb for GTM map selection.
  - Run notebooks/Analysis_results.ipynb to produce figures/tables in results/.
Handy commands
# freeze dependencies for external runners/CI
pdm export -f requirements -o requirements.txt --without-hashes
# run quality tools if configured
pdm run black . && pdm run isort .

- data/raw/ contains references (or instructions) for obtaining large sources not committed to the repo.
- data/processed/ is generated by DataPreprocessing.ipynb (fingerprints, deduped sets, splits).
- For reproducibility, the preprocessing notebook documents exact retrieval, cleaning, and descriptor parameters.
Analysis_results.ipynb exports:
- Correlation heatmaps comparing GTM-space vs fingerprint-space metrics
- Coverage-vs-similarity scatter plots
- Density/class landscapes
- Summary CSVs with overlaps, EF@k, and other selection metrics
Artifacts are written to results/figures/ and results/tables/ with filenames matching manuscript labels.
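For readers unfamiliar with EF@k, the sketch below shows one common definition of the enrichment factor: the hit rate among the k top-scored items divided by the hit rate over the whole set. This is an assumption for illustration; the exact definition used in the notebooks may differ, and the function name and the toy data are hypothetical.

```python
from typing import Sequence


def enrichment_factor_at_k(
    scores: Sequence[float], labels: Sequence[int], k: int
) -> float:
    """EF@k = (hit rate among the k top-scored items) / (overall hit rate).

    Assumes a higher score means a better rank and labels are 1 (hit) or 0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of items")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("EF@k is undefined when there are no hits")
    # Rank item indices by descending score (ties keep input order).
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_in_top_k = sum(labels[i] for i in ranked[:k])
    return (hits_in_top_k / k) / (total_hits / n)


if __name__ == "__main__":
    # Toy data: 10 items, 2 hits, both ranked at the top.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
    toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(enrichment_factor_at_k(toy_scores, toy_labels, k=2))
```

An EF@k above 1 means the top of the ranking holds more hits than a random selection of the same size would be expected to.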
Key takeaways reproduced by the notebooks:
- GTM-derived, centroid-oriented metrics serve as practical proxies for pairwise fingerprint measures in DEL selection.
- Visual GTM landscapes help balance similarity to a reference with broader chemical-space coverage.
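One way to check whether one metric is a usable proxy for another is to compare how they rank the same libraries. The sketch below computes a Spearman rank correlation in pure Python. It is an illustration only: this README does not state which correlation measure the notebooks use, so Spearman is an assumption here, and the function names are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Given one GTM-space score and one fingerprint-space score per library, a value close to 1 would indicate that both metrics order the libraries almost identically.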
Issues and PRs are welcome. Please:
- Keep notebook outputs deterministic where possible.
- Note OS/Python/PDM versions when reporting issues.
- Discuss large additions (new DEL panels, alternative metrics, new GTM maps) in an issue first.
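To make version reporting easier, the OS, Python, and PDM versions can be collected with a short script. This is a sketch rather than a tool shipped with the repository: the function name `environment_report` is hypothetical, and it assumes `pdm` is on your PATH (it reports a placeholder otherwise).

```python
import platform
import shutil
import subprocess


def environment_report() -> dict[str, str]:
    """Collect OS, Python, and PDM versions for an issue report."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "pdm": "not found on PATH",
    }
    pdm = shutil.which("pdm")
    if pdm is not None:
        try:
            result = subprocess.run(
                [pdm, "--version"],
                capture_output=True,
                text=True,
                timeout=30,
                check=False,
            )
            # Fall back to stderr or a placeholder if stdout is empty.
            report["pdm"] = (
                result.stdout.strip() or result.stderr.strip() or "unknown"
            )
        except (OSError, subprocess.TimeoutExpired):
            report["pdm"] = "could not be queried"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The printed lines can be pasted directly into an issue.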
This project is licensed under the MIT License. See LICENSE for details.
- Alexey Orlov aorlov@unistra.fr - contributed to the development of ChemographyKit; performed machine learning, interpreted the data, and contributed to manuscript writing.
- Dragos Horvath dhorvath@unistra.fr - provided overarching guidance, conceived and planned the research, and supervised the overall project.
- Alexandre Varnek varnek@unistra.fr - provided overarching guidance, conceived and planned the research, and supervised the overall project.
- Louis Plyer louis.plyer@unistra.fr - contributed to the development of ChemographyKit; performed machine learning, interpreted the data, and contributed to manuscript writing.
- Tagir Akhmetshin tagirshin@gmail.com - contributed to the development of ChemographyKit.
- Erik Yeghyan varnek@unistra.fr - prepared the ChEMBL datasets used in this study.
- Fanny Bonachera varnek@unistra.fr - prepared the ChEMBL datasets used in this study.
