
omicscodeathon/scatactf


scATACtf:

A reverse TF-centric machine learning framework that classifies peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression data.

License: MIT

scATAC-tf logo

Table of Contents

  1. Background
  2. Overview of the mechanisms influencing chromatin accessibility
  3. Workflow
  4. Code Availability
  5. Reproducibility
  6. License
  7. Contributors


Omics Codeathon General Application - October 2025
Organized by the African Society for Bioinformatics and Computational Biology (ASBCB) with support from the NIH Office of Data Science Strategy.


1. Background

Single-cell chromatin accessibility sequencing (scATAC-seq) enables genome-wide profiling of regulatory elements at single-cell resolution. Traditional pipelines identify accessible regions first and only then infer TF activity, which limits a comprehensive understanding of the regulatory programs driving cellular identity. This study develops a reverse TF-centric machine learning framework to classify peripheral blood mononuclear cells (PBMCs) using integrated chromatin accessibility and gene expression profiles. Our approach addresses data quality challenges through optimized preprocessing, implements class balancing via SMOTE, and employs ensemble ML methods for robust classification. The resulting computational pipeline enhances single-cell analysis capabilities and provides a systematic approach for discovering TF regulatory networks in immune cell populations.
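The class-balancing step mentioned above relies on SMOTE, which creates synthetic minority-class samples by interpolating between a sample and one of its nearest same-class neighbours. The following is a minimal pure-Python sketch of that idea, not the project's implementation: the function name `smote_like_oversample`, the toy 2-D points, and `k=3` are assumptions made for illustration. In practice an established implementation such as `imblearn.over_sampling.SMOTE` would normally be used.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).

    Hypothetical helper for illustration; requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors (invented values, not study data).
minority_cells = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.2], [0.3, 1.1]]
new_cells = smote_like_oversample(minority_cells, n_new=6)
print(len(new_cells), "synthetic samples generated")
```

Because every synthetic point is a convex combination of two real samples, it always lies on the segment between them. This is also why the results table lists "artificial data" as a weakness of the balanced framework.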


2. Overview of the mechanisms influencing chromatin accessibility

scATAC-tf

3. Workflow

scATAC-tf

Figure 1. Workflow of the methods employed in this study

Detailed Workflow

To get started with the scATACtf pipeline, please refer to our step-by-step workflow guide:

👉 Pipeline Workflow Guide

This guide will walk you through the complete analysis from data acquisition to visualization.


Pipeline Architecture

graph LR
    A[Raw 10X Data] --> B[Quality Control]
    B --> C[Feature Engineering & integration]
    C --> D[ML Building]
    D --> E[Evaluation Metrics]
    E --> F[Validation & interpretation]

4. Code Availability

All scripts for the scATACtf project (Python & R) are available in the repository:

👉 Browse the scripts: Scripts Running


Demonstration Data


The main analysis includes the following cell types:

Cell types retained:

  • B cells
  • Monocytes
  • NK cells
  • T cells

Excluded rare cell types (<10 samples):

  • HSC-G-CSF
  • Pre-B cells CD34-

Final dataset after filtering: ≈ 1,400 cells across 4 cell types
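The rare-cell filter described above (dropping cell types with fewer than 10 cells) can be sketched as follows. This is an illustrative sketch rather than the repository's code: the helper name `drop_rare_cell_types` and the toy label counts are assumptions, and the real pipeline may apply the filter inside Seurat/Signac or pandas instead.

```python
from collections import Counter


def drop_rare_cell_types(labels, min_cells=10):
    """Return indices of cells whose type has at least min_cells members,
    together with the set of retained cell types.

    Hypothetical helper; the threshold mirrors the "<10 samples" rule above.
    """
    counts = Counter(labels)
    kept_types = {t for t, n in counts.items() if n >= min_cells}
    keep_idx = [i for i, t in enumerate(labels) if t in kept_types]
    return keep_idx, kept_types


# Toy annotation vector (counts invented for illustration, not study data).
labels = (
    ["T cells"] * 40
    + ["B cells"] * 25
    + ["Monocytes"] * 20
    + ["NK cells"] * 12
    + ["HSC-G-CSF"] * 3
    + ["Pre-B cells CD34-"] * 2
)
keep_idx, kept_types = drop_rare_cell_types(labels)
print(f"kept {len(keep_idx)} of {len(labels)} cells:", sorted(kept_types))
```

Returning indices rather than filtered labels lets the same mask be applied to the feature matrix so that cells and labels stay aligned.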


scATAC-tf: Model Performance Comparison Across Analytical Frameworks

This table summarizes the performance of the top-performing machine learning models across the three implemented analytical frameworks used in the scATAC-tf study.

| Framework | Best Model(s) | Accuracy (%) | Key Strengths | Key Weaknesses |
|---|---|---|---|---|
| scATACtf (4 cell types, balanced) | XGBoost | 98.28 | Highest overall accuracy; excellent feature discrimination; strong biological marker identification | Requires SMOTE balancing (artificial data); computationally intensive |
| CF_scATAC-tf5 (4 cell types, rare cell types filtered out) | Logistic Regression | 97.49 | Robust to natural class imbalance; fast training; "Good Fit" status | Slightly lower accuracy than scATAC-tf; linear assumptions may miss complex patterns |
| scATAC-tf5 (6 cell types) | Neural Network / Logistic Regression | 96.80 / 96.60 | Successfully classifies rare populations; maintains "Good Fit" despite extreme imbalance | Lower F-scores for rare populations; high statistical uncertainty for rare cells; rare-cell results require careful interpretation |

scATAC-tf


Key Takeaways

  • XGBoost provides the highest classification accuracy (98.28%) on the SMOTE-balanced data (Framework 1, scATACtf).
  • Logistic Regression shows the most robust generalization ("Good Fit") when the natural, unbalanced class distribution is kept (Frameworks 2 and 3).
  • Performance stays high when all 6 cell types are classified, including the two rare types with fewer than 10 cells each, although F-scores for those rare populations are lower and should be interpreted with care (Framework 3).
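To see why overall accuracy can remain high while F-scores for rare populations drop, the short sketch below computes accuracy and per-class F1 on a toy imbalanced label set. The labels and predictions are invented for illustration and are not results from this study; `per_class_f1` is a hypothetical helper, and `sklearn.metrics.f1_score` with `average=None` would normally be used for the same computation.

```python
def per_class_f1(y_true, y_pred):
    """Compute the F1 score of each class from true and predicted labels."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: 95 common cells, 5 rare cells; 3 rare cells are misclassified.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 95 + ["common"] * 3 + ["rare"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
print({cls: round(score, 3) for cls, score in f1.items()})
```

In this toy case the accuracy is 0.97 while the rare-class F1 is only about 0.57, which is why per-class metrics are worth reporting alongside accuracy whenever rare cell types are retained.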

Computational Framework

| Language | Key Packages |
|---|---|
| Python | PyTorch, scikit-learn, pandas |
| R | Seurat, Signac |

Computational Resources

| Step | Recommended Resources |
|---|---|
| Pre-processing | iMac: 3.6 GHz 10-core Intel Core i9, 64 GB RAM, 10 GB storage |
| Modeling & Scripts | NVIDIA A100-SXM4-80GB GPU (CUDA 12.2, Driver 535.247.01) |


5. Reproducibility

Random Seeds

All scripts use a fixed random seed (42) for reproducibility.
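A seed-fixing helper along these lines is one common way to apply a single seed across libraries. This is a sketch under assumptions, not the repository's code: the function name `set_seed` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def set_seed(seed=42):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding reproduces the same random draws.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Note that a fixed seed alone does not guarantee bit-identical results on a GPU, since some CUDA kernels are non-deterministic unless deterministic algorithms are explicitly requested.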

Packages & dependencies:

All package versions (R and Python) are specified for this project.
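On the Python side, the installed versions can be compared against the specified ones with a short check like the one below. It is a sketch only: the helper name `installed_versions` is hypothetical, and the package list is taken from the Computational Framework table rather than from the project's actual requirements file.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (PyTorch is distributed as "torch").
report = installed_versions(["torch", "scikit-learn", "pandas"])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```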


6. License

License: MIT

Reporting Issues

To report an issue please use the issues page (https://github.com/omicscodeathon/scatactf/issues). Please check existing issues before submitting a new one.

Contribute to Project

You can offer to help with the further development of this project by making pull requests on this repo. To do so, fork this repository and make the proposed changes. Once completed and tested, submit a pull request to this repo.

7. Contributors

| Name | Affiliation | Role |
|---|---|---|
| Rana Hamed | Student, School of Computing and Data Science, Badya University, Cairo, Egypt | Team Lead – Project Management |
| Syrus Semawule | African Center of Excellence in Bioinformatics and Data Intensive Sciences, The Infectious Disease Institute, Makerere University, Kampala, Uganda | Bioinformatician – Data Processing & Biological Annotation |
| Emmanuel Aroma | Department of Immunology and Molecular Biology, School of Biomedical Sciences, Makerere University, Kampala, Uganda | Bioinformatician – ML Modeling & Pipeline Control |
| Toheeb Jumah | Department of Human Anatomy, Faculty of Basic Medical Sciences, College of Medical Sciences, Ahmadu Bello University, Zaria, Nigeria | Bioinformatician – Manuscript Writing & ML Modeling |
| Olaitan I. Awe | African Society for Bioinformatics and Computational Biology (ASBCB), Cape Town, South Africa | Project Advisor |

📧 Rana Hamed Abu-Zeid: ranahamed2111@gmail.com
📧 Syrus Semawule: semawulesyrus@gmail.com
📧 Emmanuel Aroma: emmatitusaroma@gmail.com
📧 Toheeb Jumah: jumahtoheeb@gmail.com
📧 Olaitan I. Awe, Ph.D.: laitanawe@gmail.com


Acknowledgments

We thank the NIH Office of Data Science Strategy for their support before and during the October 2025 Omics Codeathon, co-organized with the African Society for Bioinformatics and Computational Biology (ASBCB).
We also thank Dr. Awe for his ongoing guidance and all collaborators who contributed to this project.


This project reflects a collaborative effort towards advancing integrative bioinformatics methods, and we look forward to its continued development and impact within the scientific community.

About

Reverse TF-Centric Modeling of Gene Regulation from scATAC-seq Data
