ChildrensMercyResearchInstitute/cm-scbll

CM-ScBLL

A machine learning tool for predicting B-cell Lymphoblastic Leukemia (BLL) subtypes from single-cell RNA sequencing data.

Project Overview

CM-ScBLL applies a neural network model to single-cell RNA sequencing data to predict the BLL subtype of each cell. It offers both a web interface (Streamlit) and a command-line interface (CLI) for making predictions.

Key Features

  • Predict BLL subtypes from single-cell gene expression data
  • Process bulk cell samples and generate prediction distributions
  • Visualize subtype predictions with interactive plots
  • Export prediction results for further analysis

Installation

Prerequisites

  • Python 3.9+
  • Git

Clone the Repository

git clone -b main https://github.com/ChildrensMercyResearchInstitute/cm-scbll
cd cm-scbll

Create Virtual Environment

# Create virtual environment
python3 -m venv sc_env

# Activate virtual environment
# macOS/Linux
source sc_env/bin/activate

# Windows
sc_env\Scripts\activate

Install Dependencies

pip install -r requirements.txt

Input Data Format

Count Data File Format

A CSV file with cells as rows and genes as columns:

  • The first column holds the cell identifiers and uses samples as its column name
  • The remaining columns hold gene expression values, with gene symbols as column names

Sample Sheet Format

A CSV file mapping each cell identifier to a sample ID, with the following columns:

  • cell_sample: Cell identifier matching the count data file
  • sample or patient: Sample/patient identifier
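
The two formats above can be checked before running a prediction. The following is a minimal sketch using only the Python standard library, not the tool's own loader. The cell IDs, sample IDs, and counts are made up, the helper names are hypothetical, and the strict check on the samples column name is an assumption based on the format description.

```python
import csv
import io

# Illustrative in-memory files; cell IDs, sample IDs, and counts are made up.
COUNT_DATA = """samples,CD19,PAX5,EBF1
cell_001,5,0,2
cell_002,0,3,1
cell_003,7,1,0
"""

SAMPLE_SHEET = """cell_sample,sample
cell_001,patient_A
cell_002,patient_A
cell_003,patient_B
"""


def read_count_data(text):
    """Return (gene_names, {cell_id: [values]}) from count-data CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Assumption: the first column is named "samples", as the format section suggests.
    if header[0] != "samples":
        raise ValueError("first column must be named 'samples'")
    genes = header[1:]
    cells = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return genes, cells


def read_sample_sheet(text):
    """Return {cell_id: sample_id}; accepts a 'sample' or a 'patient' column."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    id_column = "sample" if "sample" in fields else "patient"
    if "cell_sample" not in fields or id_column not in fields:
        raise ValueError("sample sheet needs 'cell_sample' and 'sample' or 'patient'")
    return {row["cell_sample"]: row[id_column] for row in reader}


genes, cells = read_count_data(COUNT_DATA)
cell_to_sample = read_sample_sheet(SAMPLE_SHEET)

# Cells present in the count data but absent from the sample sheet.
unmapped = sorted(set(cells) - set(cell_to_sample))
print(genes, len(cells), unmapped)
```

An empty unmapped list means every cell in the count data has a sample assignment.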

Note on Sample Data

Please refer to the sample_data directory for example input files. These files demonstrate the expected formats for count data and sample sheets, which can be used as templates for your own data.

Usage

Web Interface (Streamlit)

Launch the web application with:

streamlit run src/app/app.py

The Streamlit interface provides:

  • File upload forms for count data and sample sheets
  • Interactive visualization of prediction distributions
  • Sample selection for detailed analysis

Command-Line Interface (CLI)

For batch processing or integration into pipelines, use the CLI version:

python src/app/cli_app.py --count-data PATH_TO_COUNT_DATA.csv --sample-sheet PATH_TO_SAMPLE_SHEET.csv

CLI Options

  • --count-data: Path to count data CSV file with cells and gene expression values (required)
  • --sample-sheet: Path to sample sheet CSV file mapping cells to samples (required)
  • --output-dir: Directory to save results (default: ./output)
  • --models-dir: Directory containing model files (default: ./data_utils)
  • --sample: Analyze only a specific sample
  • --save-csv: Save predictions to a CSV file
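
To make the required flags and defaults concrete, the options listed above can be mirrored in a small argparse parser. This is an illustrative sketch only and not the code in cli_app.py: the function name, description, and help strings are assumptions, and only the documented flags and defaults are used.

```python
import argparse


def build_parser():
    # Mirrors only the options documented above; help text is an assumption.
    parser = argparse.ArgumentParser(description="Predict BLL subtypes (illustrative parser)")
    parser.add_argument("--count-data", required=True, help="count data CSV")
    parser.add_argument("--sample-sheet", required=True, help="sample sheet CSV")
    parser.add_argument("--output-dir", default="./output", help="directory for results")
    parser.add_argument("--models-dir", default="./data_utils", help="directory with model files")
    parser.add_argument("--sample", default=None, help="analyze only this sample")
    parser.add_argument("--save-csv", action="store_true", help="save predictions to CSV")
    return parser


# Parse a hypothetical argument list instead of sys.argv.
args = build_parser().parse_args(
    ["--count-data", "counts.csv", "--sample-sheet", "samples.csv", "--save-csv"]
)
print(args.count_data, args.output_dir, args.models_dir, args.sample, args.save_csv)
```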

Example:

python src/app/cli_app.py --count-data ./sample_data/sample_countdata.csv --sample-sheet ./sample_data/sample_names.csv --save-csv --output-dir output
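
For pipeline integration, the same command can be assembled and launched from Python. The sketch below assumes it is run from the repository root and that the CLI exits with a non-zero status on failure; build_command and run_prediction are hypothetical helper names, and only the flags documented above are passed.

```python
import subprocess
import sys


def build_command(count_data, sample_sheet, output_dir="output", sample=None, save_csv=True):
    """Assemble the CLI call as an argument list (no shell quoting needed)."""
    cmd = [
        sys.executable, "src/app/cli_app.py",
        "--count-data", count_data,
        "--sample-sheet", sample_sheet,
        "--output-dir", output_dir,
    ]
    if sample is not None:
        cmd += ["--sample", sample]
    if save_csv:
        cmd.append("--save-csv")
    return cmd


def run_prediction(**kwargs):
    # check=True raises CalledProcessError if the process exits with a non-zero code.
    return subprocess.run(build_command(**kwargs), check=True)


cmd = build_command("./sample_data/sample_countdata.csv", "./sample_data/sample_names.csv")
print(" ".join(cmd[1:]))
```

Calling run_prediction(count_data=..., sample_sheet=...) executes the command; the block above only builds and prints it.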

Project Structure

├── data_utils/                          # Directory containing model and utility files
│   ├── nn_test_10kf/                    # Folder with neural network model files
│   ├── label_encoder_10kf.joblib        # Label encoder for subtype predictions
│   ├── MaxAbsScalar_10kf.joblib         # Pre-trained scaler for data normalization
│   └── top_feature_all_genes_10000.txt  # List of top 10,000 genes used in the model
├── output/                              # Directory for storing prediction outputs
│   └── *.png                            # Visualization files for prediction results
├── sample_data/                         # Directory with example input files for testing
│   ├── sample_countdata.csv             # Example count data file
│   └── sample_names.csv                 # Example sample sheet file
├── sc_env/                              # Python virtual environment directory
├── src/                                 # Source code directory
│   ├── *.ipynb                          # Jupyter notebooks for data analysis and model training
│   └── app/                             # Application code directory
│       ├── app.py                       # Streamlit web interface script
│       ├── cli_app.py                   # Command-line interface script
│       └── utils.py                     # Utility functions for data processing and predictions
├── LICENSE                              # License file (GNU General Public License v3.0)
├── ReadMe.md                            # Project documentation file
└── requirements.txt                     # Python dependencies file

How It Works

  1. Data Processing:

    • Loads single-cell RNA sequencing data and sample annotations
    • Filters for the genes included in the model

  2. Prediction Pipeline:

    • Normalizes gene expression data using a fitted MaxAbsScaler
    • Applies the neural network model to predict a subtype for each cell
    • Maps the per-cell predictions back to their samples

  3. Visualization:

    • Generates distribution plots of predicted subtypes per sample
    • Highlights the dominant subtype prediction
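
The last two stages, mapping per-cell predictions to samples and picking the dominant subtype, can be sketched as follows. This is not the code in utils.py: the subtype labels and IDs are placeholders, the summarize helper is hypothetical, and "dominant" is assumed to mean the most frequent predicted subtype within a sample.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell predictions; subtype labels are placeholders.
cell_predictions = {
    "cell_001": "subtype_1",
    "cell_002": "subtype_1",
    "cell_003": "subtype_2",
    "cell_004": "subtype_2",
    "cell_005": "subtype_2",
}
cell_to_sample = {
    "cell_001": "patient_A",
    "cell_002": "patient_A",
    "cell_003": "patient_A",
    "cell_004": "patient_B",
    "cell_005": "patient_B",
}


def summarize(cell_predictions, cell_to_sample):
    """Return per-sample subtype fractions and the most frequent subtype."""
    counts = defaultdict(Counter)
    for cell, subtype in cell_predictions.items():
        counts[cell_to_sample[cell]][subtype] += 1
    summary = {}
    for sample, counter in counts.items():
        total = sum(counter.values())
        fractions = {subtype: n / total for subtype, n in counter.items()}
        dominant = counter.most_common(1)[0][0]
        summary[sample] = {"fractions": fractions, "dominant": dominant}
    return summary


summary = summarize(cell_predictions, cell_to_sample)
for sample, info in sorted(summary.items()):
    print(sample, info["dominant"], info["fractions"])
```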

Model Details

The neural network model was trained on labeled single-cell RNA sequencing data from BLL samples. The model uses:

  • 10,000 selected genes as features
  • MaxAbsScaler for data normalization
  • A multi-layer neural network architecture
  • Output classes representing 13 BLL subtypes
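
To illustrate the first two items, the sketch below aligns one cell's expression values to a fixed gene list and applies max-abs scaling in plain Python. It reproduces the arithmetic of scikit-learn's MaxAbsScaler (each feature is divided by the maximum absolute value seen when the scaler was fitted) but is not the project's code. The three-gene list stands in for the 10,000-gene feature file, the values are made up, and filling genes missing from the input with zero is an assumption.

```python
def align_to_features(cell_values, model_genes, fill=0.0):
    """Order one cell's values by model_genes; absent genes get `fill` (an assumption)."""
    return [cell_values.get(gene, fill) for gene in model_genes]


def fit_max_abs(rows):
    """Per-column maximum absolute value; a zero maximum is replaced by 1.0."""
    scales = [max(abs(v) for v in column) for column in zip(*rows)]
    return [s if s != 0 else 1.0 for s in scales]


def transform(rows, scales):
    """Divide every value by the scale of its column."""
    return [[v / s for v, s in zip(row, scales)] for row in rows]


model_genes = ["CD19", "PAX5", "EBF1"]  # stand-in for the 10,000-gene list
training = [[4.0, 0.0, 2.0], [8.0, 0.0, 1.0]]  # made-up training matrix
scales = fit_max_abs(training)

# A new cell with one model gene missing (PAX5) and one extra gene (GAPDH).
new_cell = {"CD19": 2.0, "EBF1": 1.0, "GAPDH": 9.0}
row = align_to_features(new_cell, model_genes)
scaled = transform([row], scales)[0]
print(scales, scaled)
```

Genes outside the model's feature list are dropped by the alignment step, so the scaled vector always has one entry per model gene, in a fixed order.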

Development

Data Preparation and Model Training

Jupyter notebooks in the src directory document the data preparation, feature selection, and model training process:

  • data_prep.ipynb: Data preprocessing
  • model.ipynb: Testing scikit-learn models
  • NN_model_features.ipynb: Feature selection
  • NN_model.ipynb: Neural network model development
  • shap_features.ipynb: Feature importance analysis using SHAP

Authors

Acknowledgements

Funding

  • NIH

License

GNU General Public License v3.0
