UU Applied Pharmaceutical Bioinformatics Final Exam - May 2025
You are given a dataset of small molecules and their biological activity (Ki, in nM). Your task is to explore the data, build regression and classification models, evaluate them, and reflect on your modelling process and results.
Submit a PDF report. Code is optional but encouraged as an appendix or separate file.
- Data Exploration
- Load and inspect the dataset.
- Analyse the distribution of Ki and decide whether to apply a log-transformation (e.g., log10(Ki)). Justify your choice and apply it if appropriate.
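
One way to support the log-transformation decision is to compare the skewness of the raw and transformed values. The sketch below uses only the Python standard library; the Ki values and the `skewness` helper are hypothetical placeholders, not part of the exam dataset.

```python
import math
import statistics


def skewness(values):
    """Population skewness (third standardised moment) of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical Ki values in nM, spanning several orders of magnitude.
# Replace with the activity column of the real dataset.
ki_nm = [0.8, 3.5, 12.0, 45.0, 150.0, 900.0, 4200.0, 25000.0, 100000.0]

# log10 is only defined for strictly positive values.
assert all(v > 0 for v in ki_nm)
log_ki = [math.log10(v) for v in ki_nm]

raw_skew = skewness(ki_nm)
log_skew = skewness(log_ki)
print(f"skewness of raw Ki:    {raw_skew:.2f}")
print(f"skewness of log10(Ki): {log_skew:.2f}")
```

A skewness much closer to zero after the transform is one argument for modelling log10(Ki) rather than raw Ki; a histogram of both versions would make the same point visually.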
- Descriptor Preparation
- Generate suitable molecular descriptors (e.g., ECFP4, RDKit descriptors).
- Justify your descriptor choices.
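
As an illustration of descriptor generation, the following sketch computes ECFP4-style fingerprints and a few physicochemical descriptors with RDKit. It assumes RDKit and NumPy are installed and that structures are available as SMILES; the three SMILES strings, the 2048-bit size and the three chosen descriptors are placeholders picked for the example.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# Hypothetical SMILES strings; in practice these come from the dataset.
smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# MolFromSmiles returns None for unparsable input, so check before continuing.
assert all(m is not None for m in mols)

# ECFP4 corresponds to a Morgan fingerprint with radius 2 (diameter 4).
# The 2048-bit length is an assumption, not a requirement of the task.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
X_fp = np.array([gen.GetFingerprintAsNumPy(m) for m in mols])

# A small, interpretable set of RDKit descriptors:
# molecular weight, Crippen logP, topological polar surface area.
X_desc = np.array(
    [[Descriptors.MolWt(m), Descriptors.MolLogP(m), Descriptors.TPSA(m)] for m in mols]
)

print("fingerprint matrix:", X_fp.shape)
print("descriptor matrix: ", X_desc.shape)
```

Fingerprints encode substructure presence and tend to suit tree-based models, while the continuous descriptors are easier to interpret but usually need scaling before use with distance-based or linear models.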
- Descriptor Space Visualization
- Use at least one dimensionality reduction method (e.g., PCA, t-SNE, UMAP).
- Highlight clustering or separation of actives/inactives.
- Comment on how structurally diverse the dataset appears based on clustering or spread in the projected space.
- Optionally compare multiple methods.
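
A minimal sketch of the projection step using PCA from scikit-learn is shown below. The descriptor matrix and the active/inactive labels are random placeholders, so no real clustering should be expected from this data; with real descriptors the same code yields the 2D coordinates to plot.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder data: 100 compounds x 256 binary fingerprint bits.
# Replace X with the real descriptor matrix and is_active with real labels.
X = rng.integers(0, 2, size=(100, 256)).astype(float)
is_active = rng.integers(0, 2, size=100).astype(bool)

# Project onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_

# Distance between class centroids is a crude summary of separation
# in the projected space.
active_centroid = coords[is_active].mean(axis=0)
inactive_centroid = coords[~is_active].mean(axis=0)
centroid_distance = float(np.linalg.norm(active_centroid - inactive_centroid))

print(f"variance explained by PC1, PC2: {explained[0]:.3f}, {explained[1]:.3f}")
print(f"distance between class centroids: {centroid_distance:.3f}")
```

The `coords` array can be passed to a scatter plot coloured by `is_active`. A low explained variance for the first two components is itself worth reporting, since it limits how much the 2D picture says about structural diversity.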
- Modelling
- Use and compare at least three machine learning methods for each task.
- Regression
- Build and evaluate regression models using 10-fold cross-validation.
- Report R² and MSE.
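
The regression evaluation could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions: the feature matrix and target are synthetic, and Ridge, random forest and SVR are just one possible choice of three methods with untuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for descriptors (X) and log10(Ki) (y).
X = rng.normal(size=(200, 20))
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# Three example regressors; hyperparameters are illustrative defaults.
models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(kernel="rbf", C=1.0),
}

# Shuffled 10-fold CV with a fixed seed so all models see the same folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=("r2", "neg_mean_squared_error")
    )
    results[name] = {
        "r2": float(scores["test_r2"].mean()),
        # scikit-learn reports negated MSE so that higher is better; flip the sign.
        "mse": float(-scores["test_neg_mean_squared_error"].mean()),
    }
    print(f"{name:>12}: R2 = {results[name]['r2']:.3f}, MSE = {results[name]['mse']:.3f}")
```

Reusing one `KFold` object across models keeps the comparison fair, because every method is scored on identical train/test splits.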
- Classification
- Choose at least two thresholds for active/inactive classification. Make one threshold chemically meaningful (e.g., Ki < 1000 nM) and choose the other to give a less realistic but balanced class split.
- For each threshold:
  - Train three models using descriptors.
  - Evaluate with 10-fold cross-validation (CV).
  - Report AUC, precision, recall, F1-score.
  - Reflect on how thresholds affect class balance, interpretability and usefulness.
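
The per-threshold workflow can be sketched as below with scikit-learn. Everything here is illustrative: the descriptors and Ki values are synthetic, the median split is one possible way to obtain a balanced setup, and the three classifiers are example choices with untuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic descriptors and Ki values (nM) loosely tied to the first feature.
X = rng.normal(size=(300, 20))
ki_nm = 10 ** (3.5 + X[:, 0] + rng.normal(scale=0.5, size=300))

# One chemically motivated cut-off and one that balances the classes.
thresholds = {
    "Ki < 1000 nM": 1000.0,
    "median split": float(np.median(ki_nm)),
}

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

# Stratified folds keep the active/inactive ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
metrics = ("roc_auc", "precision", "recall", "f1")

balance = {}
results = {}
for label, cutoff in thresholds.items():
    y = (ki_nm < cutoff).astype(int)  # 1 = active (lower Ki means stronger binding)
    balance[label] = float(y.mean())
    results[label] = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
        results[label][name] = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
    print(f"{label}: fraction active = {balance[label]:.2f}")
    for name, res in results[label].items():
        print("   ", name, {m: round(v, 3) for m, v in res.items()})
```

Comparing `balance` across thresholds alongside precision and recall helps show how class imbalance, rather than model quality alone, can drive differences in the reported metrics.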
- Discussion
- Compare model performance and metric outcomes.
- Discuss effects of descriptor relevance, threshold impact, and class imbalance.
- Report Format
- Recommended structure:
- Introduction
- Methods
- Results
- Discussion
- Conclusion
- Suggested length: 4-6 pages.
- Submission
- Submit your PDF report.
- Code is optional but encouraged.