Description
I attempted to use the convert-to-CFF helper script provided, but its output does not match the expected format, and the wiki appears to be outdated on how to use the tool. The convert_cff helper script returns a bit of a mess: sample names are cut off, columns are merged incorrectly, and it doesn't have all the columns that are "mandatory" for the CFF format (everything from t_gene1 onward seems to be missing).
I made my own script to exactly match the format on the wiki:
cff_format <- c("chr1", "pos1", "strand1", "chr2", "pos2", "strand2", "library", "sample_name", "sample_type", "disease", "tool", "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2")
However, when I run MetaFusion.sh, the "reformat" step turns my "strand1" and "strand2" columns into NA columns, and when that output is passed on to the "renamed" step, I get EMPTY files. I also get a whole bunch of errors:
except: raise ValueError("CFF Column pos1 value " + tmp[1] + " is not a valid integer\nInvalid entry: " + cff_line)
ValueError: CFF Column pos1 value pos1 is not a valid integer
Invalid entry: chr1 pos1 NA chr2 pos2 NA library sample_name sample_type disease tool split_cnt span_cnt t_gene1 t_area1 t_gene2 t_area2
Annotate cff, extract sequence surrounding breakpoint
2345953 annotations from /juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/reference_files/ens_known_genes.renamed.ENSG.bed loaded.
29.4318819046 sec. elapsed.
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('CKS1B', 'chr1', 'f'), ('CKS1B', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('MIR4461', 'chr5', 'f'), ('MIR4461', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('C2orf27A', 'chr2', 'f'), ('C2orf27A', 'chr2', 'r')])
[.....x500]
MetaFusion.sh: line 116: [: -eq: unary operator expected
MetaFusion.sh: line 121: [: -eq: unary operator expected
MetaFusion.sh: line 127: [: -eq: unary operator expected
Merge cff by genes and breakpoints
Traceback (most recent call last):
File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 41, in <module>
df = intersect_fusions_by_breakpoints()
File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 20, in intersect_fusions_by_breakpoints
fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range
Error in read.table(fid_intersection_file, header = TRUE, stringsAsFactors = F) :
no lines available in input
Execution halted
Traceback (most recent call last):
File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/generate_cluster_file.py", line 93, in <module>
fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range
After the "reann" step, my CFF file is completely empty, and MetaFusion then runs on all the empty files. I have successfully run your test CFF files through MetaFusion, but I cannot get a real example working.
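To rule out formatting problems on my side, a check along these lines could be run on the CFF file before MetaFusion.sh. This is only a sketch under my own assumptions, not MetaFusion's actual validation: I am assuming 17 tab-separated columns in the wiki order and integer pos1/pos2, and the function name check_cff_lines is hypothetical. It deliberately does not check strand or NA values, since I do not know what is allowed there.

```python
# Column order copied from the wiki; treating it as fixed is my assumption.
CFF_COLUMNS = [
    "chr1", "pos1", "strand1", "chr2", "pos2", "strand2",
    "library", "sample_name", "sample_type", "disease", "tool",
    "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2",
]


def check_cff_lines(lines):
    """Return a list of (line_number, message) problems found in CFF lines.

    Hypothetical helper, not part of MetaFusion.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # A row that repeats the column names is a header, not a fusion call.
        if fields == CFF_COLUMNS:
            problems.append((number, "header row present; it would be parsed as data"))
            continue
        if len(fields) != len(CFF_COLUMNS):
            problems.append(
                (number, "expected %d fields, found %d" % (len(CFF_COLUMNS), len(fields)))
            )
            continue
        record = dict(zip(CFF_COLUMNS, fields))
        # The traceback above complains about pos1, so check both positions.
        for name in ("pos1", "pos2"):
            try:
                int(record[name])
            except ValueError:
                problems.append(
                    (number, "%s value %r is not a valid integer" % (name, record[name]))
                )
    return problems
```

Calling check_cff_lines(open(path)) on my file (path being wherever the CFF lives) would list any row that looks like a header or has a non-integer position, which is the situation the "pos1 value pos1" error seems to describe.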
Would it be possible to update the wiki to explain the exact CFF format: whether NAs are allowed, the data type of each column (int, string, etc.), and whether "disease" is important for the analysis? At the moment we are putting NAs in the disease slot.
I'm assuming that I am NOT supposed to have a header in a CFF file and that the columns MUST be in the order I specified above?
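To make that assumption concrete, this is roughly the layout I would write out: one tab-separated row per fusion call, columns in the wiki order, and no header line. It is a sketch of my understanding only. The function names format_cff_row and format_cff are hypothetical, and whether MetaFusion actually requires this layout is exactly what I am asking.

```python
# Column order copied from the wiki; the fixed order and the absence of a
# header are my assumptions, not confirmed MetaFusion requirements.
CFF_COLUMNS = [
    "chr1", "pos1", "strand1", "chr2", "pos2", "strand2",
    "library", "sample_name", "sample_type", "disease", "tool",
    "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2",
]


def format_cff_row(record):
    """Format one fusion call (a dict keyed by column name) as a CFF row.

    Raises KeyError if a column is missing, rather than guessing a filler.
    """
    return "\t".join(str(record[name]) for name in CFF_COLUMNS)


def format_cff(records):
    """Join rows with newlines; no header line is written."""
    return "".join(format_cff_row(record) + "\n" for record in records)
```

Raising on a missing column instead of silently writing NA is deliberate here, since I do not yet know which columns tolerate NA.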