CM-ScBLL

A machine learning tool for predicting B-cell Lymphoblastic Leukemia (BLL) subtypes from single-cell RNA sequencing data.

Project Overview

CM-ScBLL is designed to predict BLL subtypes from single-cell RNA sequencing data using a neural network model. The project provides both a web interface (Streamlit) and a command-line interface (CLI) for making predictions.

Key Features

Predict BLL subtypes from single-cell gene expression data
Process bulk cell samples and generate prediction distributions
Visualize subtype predictions with interactive plots
Export prediction results for further analysis

Installation

Prerequisites

Python 3.9+
Git

Clone the Repository

git clone -b main https://github.com/ChildrensMercyResearchInstitute/cm-scbll
cd cm-scbll

Create Virtual Environment

# Create virtual environment
python3 -m venv sc_env

# Activate virtual environment
# macOS/Linux
source sc_env/bin/activate

# Windows
sc_env\Scripts\activate

Install Dependencies

pip install -r requirements.txt

Input Data Format

Count Data File Format

A CSV file with cells as rows and genes as columns:

The first column should contain cell identifiers, with "samples" as its column name
Remaining columns should contain gene expression values (gene symbols as column names)
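
As a minimal sketch of this layout, the snippet below writes a tiny count matrix with Python's standard csv module. The cell identifiers, gene symbols, and counts are invented placeholders, and the "samples" header is taken from the description above; the files in sample_data remain the authoritative reference.

```python
import csv
import io

# Hypothetical toy data: two cells, three genes (all values are invented).
genes = ["CD19", "PAX5", "EBF1"]
counts = {
    "cell_0001": [5, 0, 12],
    "cell_0002": [0, 3, 7],
}

def write_count_csv(handle, genes, counts):
    """Write cells as rows and genes as columns, cell IDs in the first column."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["samples"] + genes)  # header: ID column, then gene symbols
    for cell_id, values in counts.items():
        writer.writerow([cell_id] + values)

buffer = io.StringIO()
write_count_csv(buffer, genes, counts)
count_csv_text = buffer.getvalue()
print(count_csv_text)
```

To produce a file on disk instead, pass a handle from open(path, "w", newline="") in place of the in-memory buffer.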

Sample Sheet Format

A CSV file mapping cell identifiers to sample IDs:

cell_sample: Cell identifier matching the count data file
sample or patient: Sample or patient identifier (the column may be named either sample or patient)
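
Because the two files are joined on the cell identifier, it can help to confirm that they agree before running a prediction. The following is a sketch only, not part of the project: the helper names and the in-memory CSV contents are hypothetical, and it assumes the column names "samples" and "cell_sample" described above.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two input files.
count_csv = "samples,CD19,PAX5\ncell_0001,5,0\ncell_0002,0,3\ncell_0003,1,1\n"
sheet_csv = (
    "cell_sample,sample\n"
    "cell_0001,patient_A\n"
    "cell_0002,patient_A\n"
    "cell_0004,patient_B\n"
)

def read_cell_ids(text, id_column):
    """Return the set of cell identifiers found in one column of a CSV."""
    return {row[id_column] for row in csv.DictReader(io.StringIO(text))}

def compare_cell_ids(count_text, sheet_text):
    """Report cells that appear in only one of the two files."""
    count_ids = read_cell_ids(count_text, "samples")
    sheet_ids = read_cell_ids(sheet_text, "cell_sample")
    return {
        "missing_from_sheet": sorted(count_ids - sheet_ids),
        "missing_from_counts": sorted(sheet_ids - count_ids),
    }

report = compare_cell_ids(count_csv, sheet_csv)
print(report)
```

An empty list under both keys would indicate that every cell in the count data has a sample assignment and vice versa.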

Note on Sample Data

Please refer to the sample_data directory for example input files. These files demonstrate the expected formats for count data and sample sheets, which can be used as templates for your own data.

Usage

Web Interface (Streamlit)

Launch the web application with:

streamlit run src/app/app.py

The Streamlit interface provides:

File upload forms for count data and sample sheets
Interactive visualization of prediction distributions
Sample selection for detailed analysis

Command-Line Interface (CLI)

For batch processing or integration into pipelines, use the CLI version:

python src/app/cli_app.py --count-data PATH_TO_COUNT_DATA.csv --sample-sheet PATH_TO_SAMPLE_SHEET.csv

CLI Options

--count-data: Path to count data CSV file with cells and gene expression values (required)
--sample-sheet: Path to sample sheet CSV file mapping cells to samples (required)
--output-dir: Directory to save results (default: ./output)
--models-dir: Directory containing model files (default: ./data_utils)
--sample: Analyze only a specific sample
--save-csv: Save predictions to a CSV file

Example:

python src/app/cli_app.py --count-data ./sample_data/sample_countdata.csv --sample-sheet ./sample_data/sample_names.csv --save-csv --output-dir output
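
For pipeline integration, the same invocation can be assembled from Python. This is a hedged sketch: build_cli_command is a hypothetical helper, it uses only the options listed above, and it assumes that --sample takes a sample identifier as its value and that --save-csv is a plain switch, as the example command suggests.

```python
import shlex
import sys

def build_cli_command(count_data, sample_sheet, output_dir=None,
                      sample=None, save_csv=False):
    """Assemble the argument list for cli_app.py from the documented options."""
    command = [
        sys.executable, "src/app/cli_app.py",
        "--count-data", count_data,
        "--sample-sheet", sample_sheet,
    ]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if sample is not None:
        # Assumption: --sample expects a single sample identifier.
        command += ["--sample", sample]
    if save_csv:
        command.append("--save-csv")
    return command

command = build_cli_command(
    "./sample_data/sample_countdata.csv",
    "./sample_data/sample_names.csv",
    output_dir="output",
    save_csv=True,
)
print(shlex.join(command[1:]))
# To actually run it from the repository root:
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.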

Project Structure

├── data_utils/                          # Directory containing model and utility files
│   ├── nn_test_10kf/                    # Folder with neural network model files
│   ├── label_encoder_10kf.joblib        # Label encoder for subtype predictions
│   ├── MaxAbsScalar_10kf.joblib         # Pre-trained scaler for data normalization
│   └── top_feature_all_genes_10000.txt  # List of top 10,000 genes used in the model
├── output/                              # Directory for storing prediction outputs
│   └── *.png                            # Visualization files for prediction results
├── sample_data/                         # Directory with example input files for testing
│   ├── sample_countdata.csv             # Example count data file
│   └── sample_names.csv                 # Example sample sheet file
├── sc_env/                              # Python virtual environment directory
├── src/                                 # Source code directory
│   ├── *.ipynb                          # Jupyter notebooks for data analysis and model training
│   └── app/                             # Application code directory
│       ├── app.py                       # Streamlit web interface script
│       ├── cli_app.py                   # Command-line interface script
│       └── utils.py                     # Utility functions for data processing and predictions
├── LICENSE                              # License file (GNU General Public License v3.0)
├── README.md                            # Project documentation file
└── requirements.txt                     # Python dependencies file

How It Works

Data Processing:
- Loads single-cell RNA sequencing data and sample annotations
- Filters for the genes included in the model

Prediction Pipeline:
- Normalizes gene expression data using a fitted MaxAbsScaler
- Applies the neural network model to predict cell subtypes
- Maps predictions back to samples

Visualization:
- Generates distribution plots of predicted subtypes per sample
- Highlights the dominant subtype prediction
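
The step of mapping per-cell predictions back to samples can be sketched as follows. This is an illustration under stated assumptions, not the code in utils.py: the function names, subtype labels, and cell and patient identifiers are all hypothetical, and the per-cell labels stand in for real model output.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell predictions (placeholder labels, not real model output).
cell_predictions = {
    "cell_0001": "subtype_A",
    "cell_0002": "subtype_A",
    "cell_0003": "subtype_B",
    "cell_0004": "subtype_B",
    "cell_0005": "subtype_B",
}
# Cell-to-sample mapping as it would come from the sample sheet.
cell_to_sample = {
    "cell_0001": "patient_1",
    "cell_0002": "patient_1",
    "cell_0003": "patient_1",
    "cell_0004": "patient_2",
    "cell_0005": "patient_2",
}

def subtype_distribution(cell_predictions, cell_to_sample):
    """Return, per sample, the fraction of cells assigned to each subtype."""
    counts = defaultdict(Counter)
    for cell_id, subtype in cell_predictions.items():
        counts[cell_to_sample[cell_id]][subtype] += 1
    return {
        sample: {label: n / sum(c.values()) for label, n in c.items()}
        for sample, c in counts.items()
    }

def dominant_subtype(distribution):
    """Pick the subtype with the largest share of cells in one sample."""
    return max(distribution, key=distribution.get)

distributions = subtype_distribution(cell_predictions, cell_to_sample)
for sample, dist in distributions.items():
    print(sample, dist, "dominant:", dominant_subtype(dist))
```

The per-sample fractions are the kind of quantity a distribution plot would display, with the dominant subtype being the largest bar.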

Model Details

The neural network model was trained on labeled single-cell RNA sequencing data from BLL samples. The model uses:

10,000 selected genes as features
MaxAbsScaler for data normalization
A multi-layer neural network architecture
Output classes representing 13 BLL subtypes
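
To illustrate what the feature selection and max-abs normalization do to an input matrix, here is a small pure-Python sketch. It does not load the project's saved scaler or gene list; the gene names and values are invented, the helper names are hypothetical, and filling genes absent from the input with zero is an assumption made for this example rather than documented project behavior.

```python
def select_genes(header, rows, model_genes):
    """Reorder columns to the model's gene order.

    Assumption for this sketch: genes missing from the input are filled with 0.
    """
    index = {gene: j for j, gene in enumerate(header)}
    return [
        [row[index[g]] if g in index else 0.0 for g in model_genes]
        for row in rows
    ]

def fit_max_abs(rows):
    """Learn the maximum absolute value of each column from training rows."""
    n_features = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    # All-zero columns keep a scale of 1 so the division below is safe.
    return [s if s != 0 else 1.0 for s in scales]

def transform_max_abs(rows, scales):
    """Divide every value by the fitted scale of its column."""
    return [[value / s for value, s in zip(row, scales)] for row in rows]

# Hypothetical model gene order and training matrix (columns follow model_genes).
model_genes = ["GENE_A", "GENE_B", "GENE_C"]
training_rows = [[2.0, 0.0, -4.0], [1.0, 0.0, 8.0]]
scales = fit_max_abs(training_rows)

# New input with a different column order, one extra gene, and GENE_B missing.
header = ["GENE_C", "GENE_A", "GENE_X"]
new_rows = [[4.0, 1.0, 9.0]]

selected = select_genes(header, new_rows, model_genes)
scaled = transform_max_abs(selected, scales)
print(scales, selected, scaled)
```

Max-abs scaling divides each feature by a constant and never shifts it, so zero counts stay zero, which suits sparse single-cell matrices. Values seen only at prediction time can still fall outside [-1, 1] because the scales come from the training data.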

Development

Data Preparation and Model Training

Jupyter notebooks in the src directory document the data preparation, feature selection, and model training process:

data_prep.ipynb: Data preprocessing
model.ipynb: Testing scikit-learn models
NN_model_features.ipynb: Feature selection
NN_model.ipynb: Neural network model development
shap_features.ipynb: Feature importance analysis using SHAP

Authors

Tarun Mamidi
- Email: tmamidi@cmh.edu

Acknowledgements

Funding

NIH

License

GNU General Public License v3.0
