CFF file format #6

@pintoa1-mskcc

I tried the convert-to-CFF helper script provided, but its output does not match the expected format, and the wiki appears to be outdated on how to use the tool. The convert_cff helper script returns a bit of a mess: sample names are cut off, columns are merged incorrectly, and it doesn't have all the columns that are "mandatory" for CFF format (everything from t_gene1 onward seems to be missing).

I made my own script to exactly match the format on the wiki:
cff_format <- c("chr1", "pos1", "strand1", "chr2", "pos2", "strand2", "library", "sample_name", "sample_type", "disease", "tool", "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2")
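As a minimal sketch of that layout (not the wiki's or MetaFusion's own code), the same 17 columns can be written as tab-separated rows in Python. It assumes the file should carry no header row, which is a guess based on the ValueError in the log below, where the header line is parsed as data. The `write_cff` helper and every value in the example record are hypothetical placeholders, not validated MetaFusion inputs.

```python
import csv
import io

# Column order copied from the cff_format vector above.
CFF_COLUMNS = [
    "chr1", "pos1", "strand1", "chr2", "pos2", "strand2",
    "library", "sample_name", "sample_type", "disease", "tool",
    "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2",
]


def write_cff(records, handle):
    """Write dict records as tab-separated rows in CFF_COLUMNS order.

    No header row is written (assumption: the pipeline treats every
    line as a fusion call).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in records:
        writer.writerow([record[col] for col in CFF_COLUMNS])


# Hypothetical record: all values are made up for illustration only.
example = {
    "chr1": "chr1", "pos1": 1000, "strand1": "+",
    "chr2": "chr2", "pos2": 2000, "strand2": "-",
    "library": "LIB_PLACEHOLDER", "sample_name": "SAMPLE_A",
    "sample_type": "TYPE_PLACEHOLDER", "disease": "NA", "tool": "TOOL_X",
    "split_cnt": 5, "span_cnt": 3,
    "t_gene1": "GENE_A", "t_area1": "AREA_PLACEHOLDER",
    "t_gene2": "GENE_B", "t_area2": "AREA_PLACEHOLDER",
}

buffer = io.StringIO()
write_cff([example], buffer)
print(buffer.getvalue(), end="")
```

Raising a `KeyError` on a missing column is deliberate here: a record that lacks one of the 17 fields fails loudly instead of producing a shifted row.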

However, when I run MetaFusion.sh, the "reformat" step turns my "strand1" and "strand2" columns into NA columns, and when that output is passed to the "renamed" step, I get EMPTY files. I also get a whole bunch of errors:

    except: raise ValueError("CFF Column pos1 value " + tmp[1] + " is not a valid integer\nInvalid entry: " + cff_line)
ValueError: CFF Column pos1 value pos1 is not a valid integer
Invalid entry: chr1	pos1	NA	chr2	pos2	NA	library	sample_name	sample_type	disease	tool	split_cnt	span_cnt	t_gene1	t_area1	t_gene2	t_area2

Annotate cff, extract sequence surrounding breakpoint
2345953 annotations from /juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/reference_files/ens_known_genes.renamed.ENSG.bed loaded.
29.4318819046 sec. elapsed.
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('CKS1B', 'chr1', 'f'), ('CKS1B', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('MIR4461', 'chr5', 'f'), ('MIR4461', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('C2orf27A', 'chr2', 'f'), ('C2orf27A', 'chr2', 'r')])
[.....x500]
MetaFusion.sh: line 116: [: -eq: unary operator expected
MetaFusion.sh: line 121: [: -eq: unary operator expected
MetaFusion.sh: line 127: [: -eq: unary operator expected
Merge cff by genes and breakpoints
Traceback (most recent call last):
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 41, in <module>
    df = intersect_fusions_by_breakpoints()
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 20, in intersect_fusions_by_breakpoints
    fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range
Error in read.table(fid_intersection_file, header = TRUE, stringsAsFactors = F) : 
  no lines available in input
Execution halted
Traceback (most recent call last):
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/generate_cluster_file.py", line 93, in <module>
    fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range

After the "reann" step, my CFF file is completely empty, and MetaFusion runs on all the empty files. I have successfully run your test CFF files through MetaFusion, but I cannot get a real example working.

Would it be possible to update the wiki to explain the exact CFF format: whether NAs are allowed, the data type of each column (int, string, etc.), and whether "disease" is important for the analysis? At the moment we are putting NAs in the disease slot.

I'm assuming that I am NOT supposed to have a header in a CFF file and that the columns MUST be in the order I specified above?
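Until the format is documented, a pre-flight check along these lines could catch the symptoms seen in the log before the pipeline runs. This is only a sketch under assumptions: the field positions come from the column order listed above, the integer check on pos1 and pos2 mirrors the ValueError in the log, and flagging NA strands reflects the symptom reported here, not a confirmed rule. `check_cff_line` is a hypothetical helper and not part of MetaFusion.

```python
def check_cff_line(line, n_columns=17):
    """Return a list of problems found in one tab-separated CFF line."""
    fields = line.rstrip("\n").split("\t")
    problems = []
    if len(fields) < n_columns:
        problems.append(
            f"expected at least {n_columns} fields, found {len(fields)}"
        )
        return problems  # positions below would be unreliable
    if fields[1] == "pos1":
        problems.append("looks like a header row")
    # pos1 and pos2 sit at indexes 1 and 4 in the assumed column order.
    for index, name in ((1, "pos1"), (4, "pos2")):
        try:
            int(fields[index])
        except ValueError:
            problems.append(f"{name} value {fields[index]!r} is not an integer")
    # strand1 and strand2 sit at indexes 2 and 5 (assumption: NA is a problem).
    for index, name in ((2, "strand1"), (5, "strand2")):
        if fields[index] == "NA":
            problems.append(f"{name} is NA")
    return problems


# The invalid entry reported in the error message above.
header_like = (
    "chr1\tpos1\tNA\tchr2\tpos2\tNA\tlibrary\tsample_name\tsample_type\t"
    "disease\ttool\tsplit_cnt\tspan_cnt\tt_gene1\tt_area1\tt_gene2\tt_area2"
)
for problem in check_cff_line(header_like):
    print(problem)
```

Running this over each line of a CFF file before calling MetaFusion.sh would show whether a header row or NA strands are present in the input itself or only appear after the "reformat" step.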

Labels: bug, documentation