*This repository contains the code for the paper
ScisTreeCNA is a GPU-accelerated Python package designed for reconstructing Cell Lineage Trees (CLTs) from single-cell DNA sequencing data. It addresses a key challenge by integrating both single nucleotide variants (SNVs) and copy number aberrations (CNAs) within a unified probabilistic framework. This approach provides accurate and scalable inference for modern single-cell datasets containing both SNV and CNA information.
Note: If you do not have copy-number data and want to infer a cell lineage tree from SNVs only, please use ScisTree2.
ScisTreeCNA relies exclusively on CUDA to achieve high-performance probabilistic reconstruction. CPU-only environments are not supported.
However, anyone can try ScisTreeCNA on Google Colab with free T4 GPU access. To get started with Colab, first make your own copy of the notebook, then modify that copy to run on new data.
| Component | Requirement | Note |
|---|---|---|
| Operating System | Linux/macOS/Windows | |
| GPU | NVIDIA CUDA GPU | Compute Capability 3.0 or higher is required. |
| Toolkit | CUDA Runtime | Essential for GPU operation and can typically be installed alongside CuPy (see Installation). |
| Python & Pip | Python | |
| Conda | Miniconda/Anaconda | |
*We have successfully tested it on Linux, macOS, and Windows.
This method simplifies dependency management by allowing conda to install the matched CuPy and CUDA Runtime Libraries together.
- You can either create a fresh environment or use an existing one.

      conda create -n scistreecna python=3.12
      conda activate scistreecna

- Install CuPy along with CUDA Runtime

      # Example: Install CuPy with CUDA 12.8 Runtime
      conda install -c conda-forge cupy cuda-version=12.8

  Note: Installing CuPy with `conda` automatically manages and installs the specific CUDA runtime libraries required for ScisTreeCNA to operate, greatly simplifying the setup. Check for more details at the CuPy official website.

- Install ScisTreeCNA

  Clone the repository and install the package using `pip`. Since CUDA dependencies are already handled by CuPy, use the standard local installation:

      git clone https://github.com/haotianzh/ScisTreeCNA.git
      cd ScisTreeCNA
      pip install .
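
Because ScisTreeCNA requires a working CUDA setup, it can help to confirm that CuPy sees a GPU before running inference. The following is a minimal sketch, not part of ScisTreeCNA: `check_gpu` is a hypothetical helper name, and it relies only on standard CuPy calls (`cupy.cuda.runtime.getDeviceCount`, `cupy.cuda.runtime.runtimeGetVersion`, and `cupy.cuda.Device.compute_capability`).

```python
def check_gpu():
    """Return a small report on whether CuPy can reach a CUDA device.

    Hypothetical helper for illustration; it is not part of ScisTreeCNA.
    """
    try:
        import cupy as cp
    except ImportError as exc:
        return {"ok": False, "reason": f"CuPy is not importable: {exc}"}

    try:
        n_devices = cp.cuda.runtime.getDeviceCount()
        if n_devices == 0:
            return {"ok": False, "reason": "no CUDA device is visible"}
        # Run a tiny computation on the GPU to confirm the runtime works.
        total = int(cp.arange(10).sum())
        return {
            "ok": total == 45,
            "devices": n_devices,
            # Integer-encoded runtime version, e.g. 12080 for CUDA 12.8.
            "cuda_runtime": cp.cuda.runtime.runtimeGetVersion(),
            # String such as "75" for compute capability 7.5.
            "compute_capability": cp.cuda.Device(0).compute_capability,
        }
    except Exception as exc:  # e.g. missing driver or no visible GPU
        return {"ok": False, "reason": f"{type(exc).__name__}: {exc}"}


report = check_gpu()
print(report)
```

If `ok` is `False`, the `reason` field usually points to a missing driver, a CuPy build that does not match the installed CUDA version, or a machine without an NVIDIA GPU.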
If you already have the NVIDIA CUDA Toolkit installed on your system (or in your current conda environment) and only need to install ScisTreeCNA together with a CuPy build that is compatible with your existing setup, you can try:
- Clone ScisTreeCNA

      git clone https://github.com/haotianzh/ScisTreeCNA.git
      cd ScisTreeCNA

- Install ScisTreeCNA with Specific CuPy Dependency

  Use the `pip install .[extra]` syntax, replacing the extra name with your CUDA major version to ensure compatibility:

  - For CUDA 11.x:

        pip install .[cuda11x]

  - For CUDA 12.x:

        pip install .[cuda12x]

  - For CUDA 13.x:

        pip install .[cuda13x]
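
The choice of extra depends only on the CUDA major version, which `nvcc --version` reports for an installed toolkit. As a small illustration, the mapping can be written down as follows; `cupy_extra_for` is a hypothetical helper, not something shipped with ScisTreeCNA, and it covers only the three extras listed above.

```python
def cupy_extra_for(cuda_version: str) -> str:
    """Map a CUDA version string such as '12.8' to the matching pip extra.

    Hypothetical helper for illustration; only the extras documented in
    this README (cuda11x, cuda12x, cuda13x) are recognised.
    """
    major = cuda_version.strip().split(".")[0]
    supported = {"11": "cuda11x", "12": "cuda12x", "13": "cuda13x"}
    if major not in supported:
        raise ValueError(f"No documented extra for CUDA {cuda_version!r}")
    return supported[major]


# Print the install command for a CUDA 12.8 toolkit.
print(f'pip install ".[{cupy_extra_for("12.8")}]"')
```

The printed command quotes the `.[cuda12x]` argument because some shells, such as zsh, otherwise interpret the square brackets as a glob pattern.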
Once installed, ScisTreeCNA can be used directly from your Python environment.
To run the inference, you must prepare an input .csv file where rows represent sites and columns represent cells. For each (cell, site) pair, the entry is a string in the format #ref|#alt|#cn, where:
- #ref: read count of the reference (wild-type) allele
- #alt: read count of the mutant allele
- #cn: observed copy number (either absolute copy number — recommended — or relative copy state)
Missing values should be encoded as:
- .|.|#cn: if read counts are missing but the copy number is available
- #ref|#alt|.: if only the copy number is missing
- .|.|.: if both read counts and copy number are missing
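
To make the entry format concrete, here is a minimal sketch of parsing and writing a single `#ref|#alt|#cn` string, with `.` mapped to `None`. The names `Entry`, `parse_entry`, and `format_entry` are hypothetical and not part of the ScisTreeCNA API (the package's own loader is `scna.util.read_csv`). The sketch also assumes that all three fields are integers when present.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One (cell, site) observation; None marks a missing value."""
    ref: Optional[int]  # reference (wild-type) read count
    alt: Optional[int]  # mutant read count
    cn: Optional[int]   # observed copy number (assumed to be an integer)


def _field(token: str) -> Optional[int]:
    """Convert one field to int, treating '.' as missing."""
    token = token.strip()
    return None if token == "." else int(token)


def parse_entry(text: str) -> Entry:
    """Parse a string such as '12|3|2' or '.|.|2' into an Entry."""
    parts = text.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 'ref|alt|cn', got {text!r}")
    return Entry(*(_field(p) for p in parts))


def format_entry(entry: Entry) -> str:
    """Write an Entry back in the 'ref|alt|cn' format, using '.' for missing."""
    return "|".join("." if value is None else str(value) for value in entry)


for raw in ["12|3|2", ".|.|2", "5|0|.", ".|.|."]:
    entry = parse_entry(raw)
    print(raw, "->", entry, "->", format_entry(entry))
```

Each string round-trips unchanged, which is a convenient check when generating input files from separate read-count and copy-number tables.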
Example input files are provided in the examples/ directory for reference.
This example demonstrates loading example data and running the tree reconstruction function.

    import scistreecna as scna

    # scna.set_cuda_device(1)  # set to gpu:1

    # load example data
    reads, cell_names, site_names = scna.util.read_csv('./examples/test_data_reads.csv')

    # run inference
    scistreecna_tree, scistreecna_geno = scna.infer(
        reads,
        cell_names=cell_names,  # cell names
        ado=0.1,                # allelic dropout rate
        seq_error=0.01,         # sequencing error
        cn_noise=0.05,          # copy number noise
        cn_min=1,               # minimum copy number (>=1)
        cn_max=5,               # maximum copy number
        tree_batch_size=128,    # number of trees evaluated in parallel
        node_batch_size=256,    # number of nodes evaluated in parallel
        verbose=True,           # print logs
    )
    print(scistreecna_tree)  # print inferred tree
    print(scistreecna_geno)  # print imputed binary genotype

More usage examples can be found in our tutorials.
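
When choosing `cn_min` and `cn_max` for a new dataset, one possible starting point is the range of copy numbers that actually appear in the input. The sketch below is a heuristic illustration only, not a recommendation from the ScisTreeCNA authors. `observed_cn_range` is a hypothetical helper that works on raw `#ref|#alt|#cn` strings and assumes integer copy numbers.

```python
def observed_cn_range(entries):
    """Return (min, max) of the copy numbers present, or None if all are missing.

    `entries` is any iterable of 'ref|alt|cn' strings; '.' marks a missing value.
    Hypothetical helper for illustration; not part of ScisTreeCNA.
    """
    values = []
    for text in entries:
        cn = text.split("|")[2].strip()
        if cn != ".":
            values.append(int(cn))
    if not values:
        return None
    return min(values), max(values)


entries = ["10|2|2", ".|.|3", "7|0|.", "4|4|1"]
print(observed_cn_range(entries))  # (1, 3)
```

Since the example above documents `cn_min` as at least 1, an observed minimum below 1 would need to be raised to 1 before being passed on.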
We also provide a CLI tool that accepts a .csv file as input and saves the inferred tree and imputed genotypes to a user-specified output location.
To run inference on the example data using the CLI:

    scistreecna --input ./examples/test_data_reads.csv --output test_output

Detailed usage:

    usage: scistreecna [-h] --input INPUT [--output OUTPUT] [--cn-min CN_MIN] [--cn-max CN_MAX] [--ado ADO] [--seq-error SEQ_ERROR] [--af AF] [--max-iter MAX_ITER] [--cn-noise CN_NOISE] [--tree-batch TREE_BATCH] [--node-batch NODE_BATCH] [--verbose]

    CLI for ScisTreeCNA inference.

    options:
      -h, --help                  show this help message and exit
      --input INPUT, -i INPUT     Path to input reads file (see https://github.com/haotianzh/ScisTreeCNA/blob/main/examples/test_data_reads.csv)
      --output OUTPUT, -o OUTPUT  Prefix for output files. Saves as {prefix}_tree.txt and {prefix}_genotype.txt (default: 'output')
      --cn-min CN_MIN             Minimum copy number (default: 1)
      --cn-max CN_MAX             Maximum copy number (default: 5)
      --ado ADO                   Allelic dropout rate (default: 0.1)
      --seq-error SEQ_ERROR       Sequencing error rate (default: 0.01)
      --af AF                     Allele Frequency (default: 0.5)
      --max-iter MAX_ITER         Maximal local search iteration (default: infinity)
      --cn-noise CN_NOISE         Copy number noise (default: 0.05)
      --tree-batch TREE_BATCH     Tree batch size (default: 64)
      --node-batch NODE_BATCH     Node batch size (default: 64)
      --verbose                   Enable verbose logs (default: False)
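
For scripted or batch runs, the CLI can be driven from Python. The following sketch builds the command line from the flags listed in the help text above and derives the two output file names from the prefix. `build_command` and `expected_outputs` are hypothetical helpers written for this illustration; they are not part of the package.

```python
import shlex

# Flags taken from the help text above, written as Python keyword names.
DOCUMENTED_OPTIONS = {
    "cn_min", "cn_max", "ado", "seq_error", "af",
    "max_iter", "cn_noise", "tree_batch", "node_batch", "verbose",
}


def build_command(input_path, output_prefix="output", **options):
    """Build an argument list for the `scistreecna` CLI (hypothetical helper)."""
    cmd = ["scistreecna", "--input", str(input_path), "--output", str(output_prefix)]
    for name, value in options.items():
        if name not in DOCUMENTED_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        flag = "--" + name.replace("_", "-")
        if name == "verbose":
            # --verbose is a switch without a value.
            if value:
                cmd.append(flag)
        elif value is not None:
            cmd += [flag, str(value)]
    return cmd


def expected_outputs(output_prefix="output"):
    """Return the tree and genotype paths the CLI is documented to write."""
    return f"{output_prefix}_tree.txt", f"{output_prefix}_genotype.txt"


cmd = build_command("./examples/test_data_reads.csv", "test_output",
                    cn_max=6, verbose=True)
print(shlex.join(cmd))
print(expected_outputs("test_output"))
# To actually run inference: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.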
The accompanying simulator, scsim, for generating reads with copy-number gains and losses is available here.
If you have questions or run into problems, please open an issue in the GitHub repository.
