Java implementation of MUTClass algorithm for the identification of targeted panels of genes able to classify cancer patients. The Java package includes the following 3 software:
- DMGSFinder: Extract from a mutation matrix a panel of mutated genes which maximizes (or minimizes) the differential coverage between two classes of samples.
- MUTClassCV: Run a cross-validation test of MUTClass algorithm on a mutation matrix
- MUTClass: Train MUTClass on a mutation matrix A and test the algorithm on a mutation matrix B in order to classify samples of B.
DMGSFinder is a greedy algorithm for maximizing (or minimizing) the differential coverage with respect to a set of positive samples and a set of negative samples. DMGSFinder is the algorithm used by MUTClass to solve the k-MaxDiffCov (or k-MinDiffCov) problem
Usage:
java -cp ./out DMGSFinder -m <mutationsFile> -d <listDriverGenes> -k <panelSize> -min -o <resultsFile>
REQUIRED PARAMETERS:
-m Mutation matrix file
OPTIONAL PARAMETERS:
-d List of driver genes (default=matrix file with classes)
-k Size of gene panel (default=10)
-min Solve the k-MinDiffCov problem (default=solve k-MaxDiffCov problem)
-o Results file (default=print results to standard output)
Files format:
MUTATION MATRIX FILE
The first line contains the names of the samples. If no list of driver genes is provided, the second line must contain sample classes. The following lines contain the name of the gene followed by a list of numbers denoting the mutation frequency of the gene in each sample (0 means that a gene is mutated in the sample, any other value means that the gene is mutated in the sample). Values, sample names (and sample classes, if provided) are separated by tabs (\t).
Example without sample classes:
Sample1 Sample2 Sample3 Sample4
Gene1 0 0 2 1
Gene2 1 0 3 0
Example with sample classes:
Sample1 Sample2 Sample3 Sample4
P P N P
Gene1 0 0 2 1
Gene2 1 0 3 0
RESULTS FILE
The results file contains the panel found and the following statistics separated by tab character (\t):
- Average coverage in the set of positive samples;
- Average coverage in the set of negative samples;
- Differential coverage.
Example:
DMGS Average positive coverage Average negative coverage Differential coverage
[DOCK11, LOC100506083, TMEM138, IL1R2, CASP4, IL31RA, PLEKHA6, SPATA31C1, FER1L6, TANC2] 74.39% 19.0% 55.39%
Examples:
- Find a panel with 10 genes which maximizes the differential coverage, starting from a set of driver genes D = {BRCA1,BRCA2}. The positive set will be the set of samples where at least one gene among BRCA1 and BRCA2 is mutated, the negative set will be the one in which neither BRCA1 nor BRCA2 are mutated.
java -cp ./out DMGSFinder -m Data/BRCA_snp_gene_matrix.txt -d BRCA1,BRCA2
- Find a panel with 20 genes which minimizes the differential coverage, starting from a mutation matrix file with sample classes and save final results to "results.txt"
java -cp ./out DMGSFinder -m Data/BRCA_snp_gene_matrix_with_classes.txt -min -k 20 -o results.txt
MUTClassCV performs a cross-validation test of MUTClass algorithm on a mutation matrix
Usage:
java -cp ./out MUTClassCV -m <mutationsFile> -d <listDriverGenes> -kmax <positivePanelSize> -kmin <negativePanelSize> -cv <crossValidationIterations> -f <crossValidationFolds>
REQUIRED PARAMETERS:
-m Mutation matrix file
OPTIONAL PARAMETERS:
-d List of driver genes (default=matrix file with classes)
-kmax Size of positive gene panel (default=10)
-kmin Size of negative gene panel (default=10)
-cv Number of iterations of cross validation (default=10)
-f Number of folds for cross validation (default=5)
Files format:
MUTATION MATRIX FILE
Same mutation matrix file format described for DMGSFinder. If no list of driver genes is provided, sample classes must be provided in the mutation matrix file.
Output:
A list of performance statistics returned by test cross-validation test, including true positives, true negatives, false positives, false negatives, unclassified samples, precision, recall, FPR, FNR, specificity, accuracy and F1 score.
Example:
Run 5-fold 10 times repeated cross-validation on BRCA mutation matrix with sample classes, with kmax=5 and kmin=50.
java -cp ./out MUTClassCV -m Data/BRCA_snp_gene_matrix_with_classes.txt -kmax 5 -kmin 50 -cv 10 -f 5
Train MUTClass on a mutation matrix A and test the algorithm on a mutation matrix B in order to classify samples of B.
Usage:
java -cp ./out MUTClass -train <trainingSetFile> -test <testSetFile> -d <listDriverGenes> -kmax <positivePanelSize> -kmin <negativePanelSize> -o <resultsFile>
REQUIRED PARAMETERS:
-train Training mutation matrix file
-test Test mutation matrix file
OPTIONAL PARAMETERS:
-d List of driver genes (default=training matrix file with classes)
-kmax Size of positive gene panel (default=5)
-kmin Size of negative gene panel (default=50)
-o Output result file (default='results.txt')
Files format:
TRAINING MUTATION MATRIX FILE
Same mutation matrix file format described for DMGSFinder. If no list of driver genes is provided, sample classes must be provided in the mutation matrix file.
TEST MUTATION MATRIX FILE
A mutation matrix file with no sample classes in the same format described for DMGSFinder.
OUTPUT RESULT FILE
A text file listing the positive panel, the negative panel and the predicted classes for each sample in the test mutation matrix.
Example:
Positive panel:
[ACAB1, BRAF, DOCK11, LOC100506083, TMEM138]
Negative panel:
[IL1R2, CASP4, IL31RA, PLEKHA6, SPATA31C1]
Predicted classes:
Sample1 P
Sample2 N
Sample3 N
Sample4 P
Sample5 P
Example:
Train MUTClass with kmax=10 and kmin=20 on BRCA mutation matrix with sample classes and test it on PCAWG-BRCA mutation matrix.
java -cp ./out MUTClass -train Data/BRCA_snp_gene_matrix_with_classes.txt -test Data/PCAWG-BRCA_gene_matrix.txt -kmax 10 -kmin 20