A reverse TF-centric machine learning framework that classifies peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression data.
- Background
- Overview of the mechanisms influencing chromatin accessibility
- Workflow
- Code Availability
- Reproducibility
- License
- Contributors
Omics Codeathon General Application - October 2025
Organized by the African Society for Bioinformatics and Computational Biology (ASBCB) with support from the NIH Office of Data Science Strategy.
Single-cell chromatin accessibility sequencing (scATAC-seq) enables genome-wide profiling of regulatory elements at single-cell resolution. Traditional pipelines identify accessible regions first and only then infer TF activity, which limits a comprehensive understanding of the regulatory programs driving cellular identity. This study develops a reverse TF-centric machine learning framework to classify peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression profiles. Our approach addresses data quality challenges through optimized preprocessing, implements class balancing via SMOTE, and employs ensemble ML methods for robust classification. The resulting computational pipeline enhances single-cell analysis capabilities and provides a systematic approach for discovering TF regulatory networks in immune cell populations.
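The class-balancing step mentioned above can be illustrated with a minimal, pure-Python sketch of the SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is an illustration under stated assumptions, not the project's implementation; the function name `smote_oversample`, its parameters, and the toy feature vectors are hypothetical, and a real analysis would more likely rely on an established library implementation.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy example: three minority cells described by two made-up feature scores.
minority_cells = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_cells = smote_oversample(minority_cells, n_new=4, k=2)
```

Because every synthetic point lies between two real minority samples, the balanced training set contains artificial data, which is why SMOTE should be applied to the training split only and never to held-out test cells.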
Figure 1. Workflow of the methods employed in this study
To get started with the scATACtf pipeline, please refer to our step-by-step workflow guide:
This guide will walk you through the complete analysis from data acquisition to visualization.
graph LR
A[Raw 10X Data] --> B[Quality Control]
B --> C[Feature Engineering & integration]
C --> D[ML Building]
D --> E[Evaluation Metrics]
E --> F[Validation & interpretation]
All scripts for the scATACtf project (Python & R) are available in the repository:
Browse the scripts: Scripts Running
- Public dataset: PBMC from a Healthy Donor (10k, 10x Genomics)
Cell types retained:
- B cells
- Monocytes
- NK cells
- T cells
Excluded rare cell types (<10 samples):
- HSC-G-CSF
- Pre-B cells CD34-
Final dataset after filtering: approximately 1,400 cells across 4 cell types
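The rare-cell filter described above (dropping cell types with fewer than 10 cells) can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `drop_rare_cell_types`, the `min_cells` parameter, and the label counts are all hypothetical.

```python
from collections import Counter


def drop_rare_cell_types(labels, min_cells=10):
    """Return indices of cells whose label occurs at least min_cells times."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_cells]


# Toy labels: counts are made up for illustration, not taken from the dataset.
labels = ["T cells"] * 12 + ["B cells"] * 10 + ["HSC-G-CSF"] * 3
kept = drop_rare_cell_types(labels)
kept_types = sorted({labels[i] for i in kept})
```

The returned indices can then be used to subset both the feature matrix and the label vector so that the two stay aligned.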
This table summarizes the performance of the top-performing machine learning models across the three implemented analytical frameworks used in the scATAC-tf study.
| Framework | Best Model(s) | Accuracy (%) | Key Strengths | Key Weaknesses |
|---|---|---|---|---|
| scATACtf (4 cell types, balanced) | XGBoost | 98.28 | • Highest overall accuracy • Excellent feature discrimination • Strong biological marker identification | • Requires SMOTE balancing (artificial data) • Computationally intensive |
| CF_scATAC-tf5 (4 cell types, rare-cells filtration) | Logistic Regression | 97.49 | • Robust to natural class imbalance • Fast training • "Good Fit" status | • Slightly lower accuracy than scATAC-tf • Linear assumptions may miss complex patterns |
| scATAC-tf5 (6 cell types) | Neural Network / Logistic Regression | 96.80 / 96.60 | • Successfully classifies rare populations • Maintains "Good Fit" despite extreme imbalance | • Lower F-scores for rare populations • High statistical uncertainty for rare cells • Rare-cell results require careful interpretation |
- XGBoost provides the highest classification accuracy when dealing with balanced data (Framework 1).
- Logistic Regression demonstrates superior robustness and generalization ("Good Fit") when the focus is on natural, unfiltered class distributions (Frameworks 2 and 3).
- The models maintained high performance, successfully classifying up to 6 cell types, even those with rare samples (Framework 3).
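The gap between high overall accuracy and lower F-scores for rare populations can be illustrated with a small worked example. The sketch below is a plain-Python calculation on made-up predictions, not output from the project's models; the helper names `accuracy` and `per_class_f1` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up predictions: 98 common cells and 2 rare cells, one rare cell missed.
y_true = ["T"] * 98 + ["rare"] * 2
y_pred = ["T"] * 98 + ["rare", "T"]
acc = accuracy(y_true, y_pred)    # 99 of 100 cells correct
f1 = per_class_f1(y_true, y_pred)  # rare class: 2*1 / (2*1 + 0 + 1)
```

In this toy case a single misclassified rare cell leaves accuracy at 0.99 but drops the rare-class F1 to about 0.67, which is why per-class scores for populations with very few cells carry high uncertainty.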
| Language | Key Packages |
|---|---|
| Python | PyTorch, scikit-learn, pandas |
| R | Seurat, Signac |
| Step | Recommended Resources |
|---|---|
| Pre-processing | iMac: 3.6 GHz 10-Core Intel Core i9, 64 GB RAM, 10 GB storage |
| Modeling & Scripts | NVIDIA A100-SXM4-80GB GPU (CUDA 12.2, Driver 535.247.01) |
- All scripts use a fixed random seed (42) for reproducibility.
- All package versions (R and Python) are specified for this project.
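A fixed seed of 42 could be applied along the following lines. This is a hedged sketch rather than the repository's actual code: the helper name `set_global_seed` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def set_global_seed(seed=42):
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # Legacy global NumPy RNG.
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)  # Seeds PyTorch's default generator.
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


# Re-seeding reproduces the same sequence of random draws.
set_global_seed(42)
first = [random.random() for _ in range(3)]
set_global_seed(42)
second = [random.random() for _ in range(3)]
```

Seeding alone does not guarantee bit-identical results on GPUs, since some operations are non-deterministic by default, so pinned package versions remain important.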
To report an issue please use the issues page (https://github.com/omicscodeathon/scatactf/issues). Please check existing issues before submitting a new one.
You can offer to help with the further development of this project by making pull requests on this repo. To do so, fork this repository and make the proposed changes. Once completed and tested, submit a pull request to this repo.
| Name | Affiliation | Role |
|---|---|---|
| Rana Hamed | Student, School of Computing and Data Science, Badya University, Cairo, Egypt | Team Lead - Project Management |
| Syrus Semawule | African Center of Excellence in Bioinformatics and Data Intensive Sciences, The Infectious Disease Institute, Makerere University, Kampala, Uganda | Bioinformatician - Data Processing & Biological Annotation |
| Emmanuel Aroma | Department of Immunology and Molecular Biology, School of Biomedical Sciences, Makerere University, Kampala, Uganda | Bioinformatician - ML Modeling & Pipeline Control |
| Toheeb Jumah | Department of Human Anatomy, Faculty of Basic Medical Sciences, College of Medical Sciences, Ahmadu Bello University, Zaria, Nigeria | Bioinformatician - Manuscript Writing & ML Modeling |
| Olaitan I. Awe | African Society for Bioinformatics and Computational Biology (ASBCB), Cape Town, South Africa | Project Advisor |
- Rana Hamed Abu-Zeid: ranahamed2111@gmail.com
- Syrus Semawule: semawulesyrus@gmail.com
- Emmanuel Aroma: emmatitusaroma@gmail.com
- Toheeb Jumah: jumahtoheeb@gmail.com
- Olaitan I. Awe, Ph.D.: laitanawe@gmail.com
We thank the NIH Office of Data Science Strategy for their support before and during the October 2025 Omics Codeathon, co-organized with the African Society for Bioinformatics and Computational Biology (ASBCB).
We also thank Dr. Awe for his ongoing guidance and all collaborators who contributed to this project.
This project reflects a collaborative effort towards advancing integrative bioinformatics methods, and we look forward to its continued development and impact within the scientific community.



