scATACtf:

A reverse transcription factor (TF)-centric machine learning framework that classifies peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression data.

1. Background

Single-cell chromatin accessibility sequencing (scATAC-seq) enables genome-wide profiling of regulatory elements at single-cell resolution. Traditional pipelines identify accessible regions first and then infer TF activity, which limits a comprehensive understanding of the regulatory programs driving cellular identity. This study develops a reverse TF-centric machine learning framework to classify peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression profiles. Our approach addresses data quality challenges through optimized preprocessing, balances classes with SMOTE, and employs ensemble ML methods for robust classification. The resulting computational pipeline enhances single-cell analysis capabilities and provides a systematic approach for discovering TF regulatory networks in immune cell populations.
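The class-balancing idea mentioned above can be illustrated with a simplified, SMOTE-style interpolation. This is only a sketch and not the project's code: the function name `smote_like_oversample`, the toy feature values, and the parameter `k` are assumptions made for illustration, and real analyses would more likely rely on a library implementation.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example: 3 "rare" cells described by two made-up feature scores.
rare_cells = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.95]]
new_cells = smote_like_oversample(rare_cells, n_new=5)
print(len(rare_cells) + len(new_cells))  # 8
```

Because every synthetic point lies on a segment between two observed cells, the oversampled class stays inside the region already covered by real data. The added points are still artificial, so they should be used for training only and never for evaluation.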

2. Overview of the mechanisms influencing chromatin accessibility

3. Workflow

Figure 1. Workflow of the methods employed in this study

Detailed Workflow

To get started with the scATACtf pipeline, please refer to our step-by-step workflow guide:

👉 Pipeline Workflow Guide

This guide will walk you through the complete analysis from data acquisition to visualization.

Pipeline Architecture

graph LR
    A[Raw 10X Data] --> B[Quality Control]
    B --> C[Feature Engineering & Integration]
    C --> D[ML Model Building]
    D --> E[Evaluation Metrics]
    E --> F[Validation & Interpretation]

4. Code Availability

All scripts for the scATACtf project (Python & R) are available in the repository:

👉 Browse the scripts: Scripts Running

Demonstration Data

Public dataset: PBMC from a Healthy Donor (10k, 10x Genomics)

The main analysis includes the following cell types:

Cell types retained:

B cells
Monocytes
NK cells
T cells

Excluded rare cell types (<10 samples):

HSC-G-CSF
Pre-B cells CD34-

Final dataset after filtering: ≈ 1,400 cells across 4 cell types
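The rare-cell filtering rule described above (dropping cell types with fewer than 10 cells) can be sketched as follows. The function name `filter_rare_cell_types` and the per-type counts in the toy example are hypothetical and do not come from the project data; only the threshold of 10 and the cell type names are taken from this README.

```python
from collections import Counter


def filter_rare_cell_types(labels, min_cells=10):
    """Return the indices of cells whose type has at least min_cells members,
    plus the sorted list of cell types that were dropped."""
    counts = Counter(labels)
    kept_types = {t for t, n in counts.items() if n >= min_cells}
    kept_idx = [i for i, t in enumerate(labels) if t in kept_types]
    dropped = sorted(set(counts) - kept_types)
    return kept_idx, dropped


# Made-up label vector; the real dataset has roughly 1,400 cells after filtering.
labels = (
    ["T cells"] * 40
    + ["B cells"] * 15
    + ["Monocytes"] * 30
    + ["NK cells"] * 12
    + ["HSC-G-CSF"] * 3
    + ["Pre-B cells CD34-"] * 2
)
kept_idx, dropped = filter_rare_cell_types(labels)
print(len(kept_idx))  # 97
print(dropped)        # ['HSC-G-CSF', 'Pre-B cells CD34-']
```

Returning indices rather than labels makes it straightforward to subset the matching feature matrix with the same selection.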

scATAC-tf: Model Performance Comparison Across Analytical Frameworks

This table summarizes the top-performing machine learning models in each of the three analytical frameworks implemented in the scATAC-tf study.

Framework	Best Model(s)	Accuracy (%)	Key Strengths	Key Weaknesses
scATACtf (4 cell types, balanced)	XGBoost	98.28	Highest overall accuracy; excellent feature discrimination; strong biological marker identification	Requires SMOTE balancing (artificial data); computationally intensive
CF_scATAC-tf5 (4 cell types, rare-cells filtration)	Logistic Regression	97.49	Robust to natural class imbalance; fast training; "Good Fit" status	Slightly lower accuracy than scATACtf; linear assumptions may miss complex patterns
scATAC-tf5 (6 cell types)	Neural Network / Logistic Regression	96.80 / 96.60	Successfully classifies rare populations; maintains "Good Fit" despite extreme imbalance	Lower F-scores for rare populations; high statistical uncertainty for rare cells; rare-cell results require careful interpretation

Key Takeaways

XGBoost provides the highest classification accuracy (98.28%) on SMOTE-balanced data (Framework 1).
Logistic Regression demonstrates superior robustness and generalization ("Good Fit") when classes keep their natural, non-oversampled distributions (Frameworks 2 and 3).
The models maintained high performance when classifying up to 6 cell types, including rare populations (Framework 3).
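The gap between overall accuracy and rare-population F-scores noted above follows directly from how the two metrics are computed. The sketch below uses made-up predictions (not project results) and hypothetical helper names `accuracy` and `per_class_f1` to show that a few misclassified rare cells barely move accuracy but lower that class's F1 noticeably.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy data: 95 abundant cells all correct, 5 rare cells of which 2 are missed.
y_true = ["T cells"] * 95 + ["rare"] * 5
y_pred = ["T cells"] * 95 + ["rare"] * 3 + ["T cells"] * 2

f1 = per_class_f1(y_true, y_pred)
print(accuracy(y_true, y_pred))  # 0.98
print(round(f1["rare"], 2))      # 0.75
```

With only five rare cells, a single additional error would shift the rare-class F1 substantially, which is why results for rare populations carry high statistical uncertainty.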

Computational Framework

Language	Key Packages
Python	PyTorch, scikit-learn, pandas
R	Seurat, Signac

Computational Resources

Step	Recommended Resources
Pre-processing	iMac: 3.6 GHz 10-Core Intel Core i9, 64 GB RAM, 10 GB storage
Modeling & Scripts	NVIDIA A100-SXM4-80GB GPU (CUDA 12.2, Driver 535.247.01)

5. Reproducibility

Random Seeds

All scripts use a fixed random seed (42) for reproducibility.
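A minimal sketch of how a fixed seed might be applied at the start of a Python script is shown below. The helper name `set_global_seed` is hypothetical and is not taken from the repository; NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def set_global_seed(seed=42):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)  # True
```

Seeding makes the random draws repeatable, but some GPU operations can remain non-deterministic, so small run-to-run differences are still possible on accelerator hardware.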

Packages & dependencies:

All package versions (R and Python) are specified for this project.

6. License

License :

Reporting Issues

To report an issue please use the issues page (https://github.com/omicscodeathon/scatactf/issues). Please check existing issues before submitting a new one.

Contribute to Project

You can offer to help with the further development of this project by making pull requests on this repo. To do so, fork this repository and make the proposed changes. Once completed and tested, submit a pull request to this repo.

7. Contributors

Name	Affiliation	Role
Rana Hamed	Student, School of Computing and Data Science, Badya University, Cairo, Egypt	Team Lead – Project Management
Syrus Semawule	African Center of Excellence in Bioinformatics and Data Intensive Sciences, The Infectious Disease Institute, Makerere University, Kampala, Uganda	Bioinformatician – Data Processing & Biological Annotation
Emmanuel Aroma	Department of Immunology and Molecular Biology, School of Biomedical Sciences, Makerere University, Kampala, Uganda	Bioinformatician – ML Modeling & Pipeline Control
Toheeb Jumah	Department of Human Anatomy, Faculty of Basic Medical Sciences, College of Medical Sciences, Ahmadu Bello University, Zaria, Nigeria	Bioinformatician – Manuscript Writing & ML Modeling
Olaitan I. Awe	African Society for Bioinformatics and Computational Biology (ASBCB), Cape Town, South Africa	Project Advisor

📧 Rana Hamed Abu-Zeid : ranahamed2111@gmail.com
📧 Syrus Semawule : semawulesyrus@gmail.com
📧 Emmanuel Aroma : emmatitusaroma@gmail.com
📧 Toheeb Jumah : jumahtoheeb@gmail.com
📧 Olaitan I. Awe, Ph.D. : laitanawe@gmail.com

Acknowledgments

We thank the NIH Office of Data Science Strategy for their support before and during the October 2025 Omics Codeathon, co-organized with the African Society for Bioinformatics and Computational Biology (ASBCB).
We also thank Dr. Awe for his ongoing guidance and all collaborators who contributed to this project.

This project reflects a collaborative effort towards advancing integrative bioinformatics methods, and we look forward to its continued development and impact within the scientific community.

