PUREE is a compact and fast method for predicting tumor purity (cancer cell fraction) from bulk gene expression data. The methodology and the validation process are described in the publication "PUREE: accurate pan-cancer tumor purity estimation from gene expression data".
As input, PUREE requires a gene expression matrix in .csv or .tsv format. The values can be in essentially any normalization space, from FPKM and TPM to counts and microarray data. The matrix should preferably be oriented with samples as rows and genes as columns ([samples, features] shape), although the method will try to check the orientation. Gene identifiers can be passed as either ENSEMBL IDs (default) or HGNC symbols.
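Many expression datasets are stored with genes as rows, so the matrix may need transposing first. The following is a minimal sketch of that step using pandas; the toy values, the sample names, and the choice to leave the index column unnamed are illustrative assumptions, not requirements stated by PUREE.

```python
import io

import pandas as pd

# Hypothetical toy matrix stored genes x samples (illustrative values only).
genes_by_samples = pd.DataFrame(
    {"sample_1": [10.0, 1.0, 3.5], "sample_2": [0.0, 10.0, 2.0]},
    index=["ENSG00000141510", "ENSG00000146648", "ENSG00000012048"],
)

# Transpose so that rows are samples and columns are genes: [samples, features].
samples_by_genes = genes_by_samples.T

# Serialize as tab-separated text. An in-memory buffer keeps the sketch free of
# side effects; pass a file path such as "data.tsv" to write to disk instead.
buffer = io.StringIO()
samples_by_genes.to_csv(buffer, sep="\t")
tsv_text = buffer.getvalue()
print(tsv_text)
```

The resulting table has one row per sample and one column per gene identifier, which matches the preferred orientation described above.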
As output, PUREE returns a tumor purity value for every sample in the input gene expression matrix.
To run PUREE, you will need access to a UNIX-like command line and a Python 3.8+ environment. You may also need to install several dependencies.
The dependencies list for PUREE consists of the following Python packages:
scikit-learn==1.1.2
numpy==1.23.2
pandas==1.4.3
joblib==1.1.0
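To see which of these pinned versions are already present in the current environment, a small check along the following lines may help. It is a sketch built on the standard-library `importlib.metadata` module (available from Python 3.8); the helper names `installed_versions` and `report_mismatches` are hypothetical and not part of PUREE.

```python
from importlib import metadata

# Versions pinned in the dependency list above.
PINNED = {
    "scikit-learn": "1.1.2",
    "numpy": "1.23.2",
    "pandas": "1.4.3",
    "joblib": "1.1.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report_mismatches(pinned, found):
    """Return {package: (pinned, installed)} for packages whose versions differ."""
    return {
        name: (wanted, found.get(name))
        for name, wanted in pinned.items()
        if found.get(name) != wanted
    }


# Print every package whose installed version differs from the pinned one.
for name, (wanted, actual) in report_mismatches(PINNED, installed_versions(PINNED)).items():
    print(f"{name}: pinned {wanted}, installed {actual}")
```

An empty report means the environment matches the pinned versions exactly. A mismatch is not necessarily a problem, but it indicates which packages to look at first if warnings appear.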
You can install the dependencies manually one by one, use a pre-configured conda environment, or run the command below to install all of them at once with pip:
pip3 install -r requirements.txt

To download PUREE, clone the GitHub repository to your machine (you might need to install the GitHub CLI first):
gh repo clone skandlab/PUREE # downloads the package
cd PUREE # moves into the installation directory

To run PUREE from the command line, run the following command from inside the installation directory:
# run from inside the installation directory
python3 predict_purity.py --data_path input_matrix_path \
--output output_path \
[--gene_identifier_type {HGNC,ENSEMBL}] # optional argument to specify gene id nomenclature; default: ENSEMBL

Note: PUREE was tested in a Python 3 environment on Ubuntu 18 and Windows 10, and should generally run on all platforms. If you are using older versions of the dependencies, you might see non-fatal warnings, but the method will still return correct values.
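When PUREE is one step in a larger Python pipeline, the command-line call shown above can be assembled and launched with the standard `subprocess` module. The sketch below uses only the flags documented here; the helper names `build_puree_command` and `run_puree` and the `install_dir` parameter are hypothetical, and using `sys.executable` in place of `python3` is an assumption about how you want to select the interpreter.

```python
import subprocess
import sys


def build_puree_command(data_path, output_path, gene_identifier_type=None):
    """Assemble the argument list for predict_purity.py (hypothetical helper)."""
    command = [
        sys.executable,  # current interpreter; stands in for `python3`
        "predict_purity.py",
        "--data_path", str(data_path),
        "--output", str(output_path),
    ]
    if gene_identifier_type is not None:
        if gene_identifier_type not in ("ENSEMBL", "HGNC"):
            raise ValueError("gene_identifier_type must be 'ENSEMBL' or 'HGNC'")
        command += ["--gene_identifier_type", gene_identifier_type]
    return command


def run_puree(data_path, output_path, gene_identifier_type=None, install_dir="."):
    """Run PUREE from inside install_dir and raise if the process fails.

    Relative data and output paths are resolved against install_dir,
    because that is the working directory of the child process.
    """
    command = build_puree_command(data_path, output_path, gene_identifier_type)
    return subprocess.run(command, cwd=install_dir, check=True)


# Show the command without executing it.
print(build_puree_command("data.csv", "purities.tsv", "HGNC"))

# Example call (requires a cloned PUREE repository at the given location):
# run_puree("data.csv", "purities.tsv", "HGNC", install_dir="PUREE")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.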
Example commands to run PUREE:
# for ENSEMBL IDs
python3 predict_purity.py --data_path data.csv \
--output purities.tsv

# for HGNC symbols
python3 predict_purity.py --data_path data.csv \
--output purities.tsv \
--gene_identifier_type HGNC

The /tests directory contains a simple test dataset (based on a few samples of TCGA PCPG data) and the precomputed purities to check whether your installation works as intended. Predict the test purities using the following:
python3 predict_purity.py --data_path tests/expression.tsv \
--output tests/purities_predicted.tsv

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Deprecated, August 2025
PUREE is a compact and fast method for predicting tumor purity (cancer cell fraction) from bulk gene expression data. The methodology and the validation process are described in the paper "PUREE: accurate pan-cancer tumor purity estimation from gene expression data".
The PUREE class is a wrapper class that exposes functionality to interact with the PUREE code. The class allows you to monitor the health of the backend, submit a file for processing, and get the logs and the output.
This package is tested on Python versions above 3.8. In addition, it has the following dependencies:
pandas==1.5.3
requests==2.28.2
To install PUREE, run the following in the command line:
git clone https://github.com/skandlab/PUREE
cd PUREE
python3 setup.py bdist_wheel
pip install dist/PUREE-0.1.0-py3-none-any.whl --force-reinstall # this can be installed in the environment of your choice

Now you can use PUREE in your Python environment:
from puree import *
p = PUREE()
purities_and_logs = p.get_output(test_data_path, gene_id_nomenclature)

where
| variable | description |
|---|---|
| test_data_path | string; path to the gene expression matrix in .csv or .tsv |
| gene_id_nomenclature | string; gene ids nomenclature: 'ENSEMBL' or 'HGNC' |
PUREE expects a gene expression matrix as input, in any normalization space, preferably oriented with genes as columns and samples as rows. The gene identifiers can be passed as either ENSEMBL IDs or HGNC gene symbols.
More specifically, the expected input schematically looks like this:
| | gene_id_1 | gene_id_2 |
|---|---|---|
| sample_id_1 | 10 | 1 |
| sample_id_2 | 0 | 10 |
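Because the identifier nomenclature has to be stated explicitly ('ENSEMBL' or 'HGNC'), it can be convenient to infer it from the column names of the matrix. The sketch below relies on the fact that human Ensembl gene IDs have the form `ENSG` followed by 11 digits, optionally with a version suffix. Treating everything else as HGNC symbols is a simplifying assumption, and `guess_gene_identifier_type` is a hypothetical helper, not part of PUREE.

```python
import re

# Human Ensembl gene ID: "ENSG" + 11 digits, optionally followed by ".<version>".
ENSEMBL_GENE_PATTERN = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def guess_gene_identifier_type(gene_ids, threshold=0.5):
    """Return 'ENSEMBL' if more than `threshold` of the IDs look like Ensembl
    gene IDs, otherwise 'HGNC' (assumed fallback, not verified against HGNC)."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        raise ValueError("no gene identifiers given")
    n_ensembl = sum(bool(ENSEMBL_GENE_PATTERN.match(str(g))) for g in gene_ids)
    return "ENSEMBL" if n_ensembl / len(gene_ids) > threshold else "HGNC"


print(guess_gene_identifier_type(["ENSG00000141510", "ENSG00000146648.17"]))
print(guess_gene_identifier_type(["TP53", "EGFR"]))
```

The returned string can be used as the `gene_id_nomenclature` argument, or as the value of `--gene_identifier_type` on the command line.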
PUREE returns a .tsv file with tumor purities in the first column. The output and the logs are stored in a dictionary: {"output": purity_dataframe, "logs": PUREE_logs}.
Note: the sample names will be anonymized. However, the purities are returned in the same order as the samples in the input.
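Since the order of the samples is preserved, the original sample names can be restored by position. The following is a minimal sketch; it assumes that the `"output"` entry is a pandas DataFrame with one row per sample and that the input matrix was oriented with samples as rows. The helper `attach_sample_names` and the column name `purity` in the stand-in result are hypothetical.

```python
import pandas as pd


def attach_sample_names(result, input_matrix):
    """Re-label the purity table with the input's sample names, by position."""
    purities = result["output"]
    if len(purities) != len(input_matrix):
        raise ValueError("number of purities does not match number of samples")
    named = purities.copy()
    named.index = input_matrix.index  # relies on the preserved sample order
    return named


# Stand-in objects for illustration; a real `result` would come from get_output.
input_matrix = pd.DataFrame(
    {"gene_id_1": [10, 0], "gene_id_2": [1, 10]},
    index=["sample_id_1", "sample_id_2"],
)
result = {"output": pd.DataFrame({"purity": [0.8, 0.3]}), "logs": ""}

named_purities = attach_sample_names(result, input_matrix)
print(named_purities)
```

The length check guards against silently mislabeling samples if the input happened to be oriented with genes as rows.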
