A machine learning tool for predicting B-cell Lymphoblastic Leukemia (BLL) subtypes from single-cell RNA sequencing data.
CM-ScBLL uses a neural network model to predict a BLL subtype for each cell in a single-cell RNA sequencing dataset. The project provides both a web interface (Streamlit) and a command-line interface (CLI) for making predictions.
- Predict BLL subtypes from single-cell gene expression data
- Process bulk cell samples and generate prediction distributions
- Visualize subtype predictions with interactive plots
- Export prediction results for further analysis
- Python 3.9+
- Git
git clone -b main https://github.com/ChildrensMercyResearchInstitute/cm-scbll
cd cm-scbll
# Create virtual environment
python3 -m venv sc_env
# Activate virtual environment
# macOS/Linux
source sc_env/bin/activate
# Windows
sc_env\Scripts\activate
pip install -r requirements.txt

A CSV file with cells as rows and genes as columns:
- First column should contain cell identifiers (`samples` as the column name)
- Remaining columns should contain gene expression values (gene symbols as column names)
A CSV file mapping cell identifiers to sample IDs:
- `cell_sample`: Cell identifier matching the count data file
- `sample` or `patient`: Sample/patient identifier
Please refer to the sample_data directory for example input files. These files demonstrate the expected formats for count data and sample sheets, which can be used as templates for your own data.
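As a rough illustration of the two input formats, the sketch below checks a count data file and a sample sheet against the column rules described above. It uses only the Python standard library. The function names (`read_header_and_ids`, `check_inputs`), the gene columns, and the toy values are made up for this example and are not part of CM-ScBLL; the tool's own validation may differ.

```python
import csv
import io


def read_header_and_ids(csv_text):
    """Return (header row, list of first-column values) for CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return header, [row[0] for row in body]


def check_inputs(count_csv, sheet_csv):
    """Collect format problems based on the layout described in this README."""
    problems = []

    # Count data: first column holds cell identifiers and is named 'samples'.
    count_header, count_cells = read_header_and_ids(count_csv)
    if count_header[0] != "samples":
        problems.append("count data: first column should be named 'samples'")

    # Sample sheet: needs 'cell_sample' plus either 'sample' or 'patient'.
    sheet_reader = csv.DictReader(io.StringIO(sheet_csv))
    sheet_rows = list(sheet_reader)
    fields = sheet_reader.fieldnames or []
    if "cell_sample" not in fields:
        problems.append("sample sheet: missing 'cell_sample' column")
    if not {"sample", "patient"} & set(fields):
        problems.append("sample sheet: missing 'sample' or 'patient' column")

    # Every cell in the count data should appear in the sample sheet.
    known_cells = {row.get("cell_sample") for row in sheet_rows}
    missing = [cell for cell in count_cells if cell not in known_cells]
    if missing:
        problems.append(f"{len(missing)} cell(s) missing from the sample sheet")
    return problems


# Toy inputs (invented values, for illustration only).
count_csv = "samples,CD19,PAX5\ncell_1,3,0\ncell_2,1,5\n"
sheet_csv = "cell_sample,sample\ncell_1,patient_A\ncell_2,patient_A\n"
print(check_inputs(count_csv, sheet_csv))
```

An empty list means the toy files follow the documented layout; each string in a non-empty list names one mismatch.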
Launch the web application with:
streamlit run src/app/app.py

The Streamlit interface provides:
- File upload forms for count data and sample sheets
- Interactive visualization of prediction distributions
- Sample selection for detailed analysis
For batch processing or integration into pipelines, use the CLI version:
python src/app/cli_app.py --count-data PATH_TO_COUNT_DATA.csv --sample-sheet PATH_TO_SAMPLE_SHEET.csv

- `--count-data`: Path to count data CSV file with cells and gene expression values (required)
- `--sample-sheet`: Path to sample sheet CSV file mapping cells to samples (required)
- `--output-dir`: Directory to save results (default: `./output`)
- `--models-dir`: Directory containing model files (default: `./data_utils`)
- `--sample`: Analyze only a specific sample
- `--save-csv`: Save predictions to a CSV file
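When the CLI is called from another Python pipeline, it can help to assemble the argument list in one place. The helper below is a hypothetical sketch, not part of the repository: `build_cli_command` is an invented name, and it assumes that `--save-csv` is a flag without a value and that `--sample` takes a sample identifier.

```python
import sys


def build_cli_command(count_data, sample_sheet, output_dir=None,
                      models_dir=None, sample=None, save_csv=False):
    """Build an argument list for cli_app.py from the documented options."""
    cmd = [
        sys.executable, "src/app/cli_app.py",
        "--count-data", count_data,
        "--sample-sheet", sample_sheet,
    ]
    # Optional arguments are only added when the caller sets them,
    # so the CLI's own defaults apply otherwise.
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if models_dir is not None:
        cmd += ["--models-dir", models_dir]
    if sample is not None:
        cmd += ["--sample", sample]  # assumption: takes a sample identifier
    if save_csv:
        cmd.append("--save-csv")  # assumption: boolean flag, no value
    return cmd


cmd = build_cli_command(
    "./sample_data/sample_countdata.csv",
    "./sample_data/sample_names.csv",
    output_dir="output",
    save_csv=True,
)
print(" ".join(cmd[1:]))
# To execute it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.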
Example:
python src/app/cli_app.py --count-data ./sample_data/sample_countdata.csv --sample-sheet ./sample_data/sample_names.csv --save-csv --output-dir output

├── data_utils/ # Directory containing model and utility files
│ ├── nn_test_10kf/ # Folder with neural network model files
│ ├── label_encoder_10kf.joblib # Label encoder for subtype predictions
│ ├── MaxAbsScalar_10kf.joblib # Pre-trained scaler for data normalization
│ └── top_feature_all_genes_10000.txt # List of top 10,000 genes used in the model
├── output/ # Directory for storing prediction outputs
│ └── *.png # Visualization files for prediction results
├── sample_data/ # Directory with example input files for testing
│ ├── sample_countdata.csv # Example count data file
│ └── sample_names.csv # Example sample sheet file
├── sc_env/ # Python virtual environment directory
├── src/ # Source code directory
│ ├── *.ipynb # Jupyter notebooks for data analysis and model training
│ └── app/ # Application code directory
│     ├── app.py # Streamlit web interface script
│     ├── cli_app.py # Command-line interface script
│     └── utils.py # Utility functions for data processing and predictions
├── LICENSE # License file (GNU General Public License v3.0)
├── ReadMe.md # Project documentation file
└── requirements.txt # Python dependencies file
- Data Processing:
  - Loads single-cell RNA sequencing data and sample annotations
  - Filters for the genes included in the model
- Prediction Pipeline:
  - Normalizes gene expression data using a fitted MaxAbsScaler
  - Applies the neural network model to predict cell subtypes
  - Maps predictions back to samples
- Visualization:
  - Generates distribution plots of predicted subtypes per sample
  - Highlights the dominant subtype prediction
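The step that maps per-cell predictions back to samples can be sketched as a simple count per sample. The code below is an illustrative assumption, not the logic in `utils.py`: `summarize_by_sample` is an invented name, and the subtype labels and cell identifiers are placeholders.

```python
from collections import Counter, defaultdict


def summarize_by_sample(cell_predictions, cell_to_sample):
    """Turn per-cell subtype calls into per-sample fractions and a dominant call."""
    counts = defaultdict(Counter)
    for cell, subtype in cell_predictions.items():
        counts[cell_to_sample[cell]][subtype] += 1

    summary = {}
    for sample, counter in counts.items():
        total = sum(counter.values())
        summary[sample] = {
            # Fraction of the sample's cells assigned to each subtype.
            "distribution": {s: n / total for s, n in counter.items()},
            # Subtype with the most cells in this sample.
            "dominant": counter.most_common(1)[0][0],
        }
    return summary


# Placeholder predictions and sample sheet mapping (invented for illustration).
cell_predictions = {
    "cell_1": "subtype_A",
    "cell_2": "subtype_A",
    "cell_3": "subtype_B",
    "cell_4": "subtype_B",
}
cell_to_sample = {
    "cell_1": "patient_X",
    "cell_2": "patient_X",
    "cell_3": "patient_X",
    "cell_4": "patient_Y",
}
summary = summarize_by_sample(cell_predictions, cell_to_sample)
print(summary["patient_X"]["dominant"])
```

The per-sample fractions are the kind of values a distribution plot would show, with the dominant subtype being the largest bar.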
The neural network model was trained on labeled single-cell RNA sequencing data from BLL samples. The model uses:
- 10,000 selected genes as features
- MaxAbsScaler for data normalization
- A multi-layer neural network architecture
- Output classes representing 13 BLL subtypes
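To make the feature and normalization steps concrete, the sketch below aligns one cell's expression values to a fixed gene list and applies max-abs scaling, which divides each feature by the largest absolute value seen for that feature during fitting. This is a plain-Python illustration under stated assumptions: the function names and gene names are invented, filling absent genes with zero is an assumption about how missing features might be handled, and the project itself loads a pre-fitted scaler from `data_utils` rather than fitting one at prediction time.

```python
def align_to_features(cell_expression, feature_genes, fill_value=0.0):
    """Order a cell's values by the model's gene list; extra genes are dropped."""
    # Assumption: genes absent from the input are filled with fill_value.
    return [float(cell_expression.get(gene, fill_value)) for gene in feature_genes]


def fit_max_abs(rows):
    """Compute one scale per feature: the maximum absolute value in that column."""
    n_features = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    # A column of zeros would divide by zero, so its scale is set to 1.
    return [s if s != 0 else 1.0 for s in scales]


def transform_max_abs(row, scales):
    """Divide each feature by its fitted scale, giving values in [-1, 1] on fitted data."""
    return [value / scale for value, scale in zip(row, scales)]


# Placeholder gene list and training values (invented for illustration).
feature_genes = ["GENE_A", "GENE_B", "GENE_C"]
training_rows = [[2.0, 0.0, 4.0], [4.0, 0.0, -8.0]]
scales = fit_max_abs(training_rows)

new_cell = {"GENE_A": 1, "GENE_C": 4, "GENE_X": 9}  # GENE_X is not a model feature
aligned = align_to_features(new_cell, feature_genes)
scaled = transform_max_abs(aligned, scales)
print(scaled)
```

Max-abs scaling does not shift the data, so zero counts stay zero, which suits sparse expression matrices.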
Jupyter notebooks in the src directory document the data preparation, feature selection, and model training process:
- `data_prep.ipynb`: Data preprocessing
- `model.ipynb`: Testing scikit-learn models
- `NN_model_features.ipynb`: Feature selection
- `NN_model.ipynb`: Neural network model development
- `shap_features.ipynb`: Feature importance analysis using SHAP
- Tarun Mamidi
- Email: tmamidi@cmh.edu
- NIH
GNU General Public License v3.0
