Frameshift correction #227

@hoelzer

Frameshifts (FSs) are introduced into consensus sequences quite frequently. In almost all cases these are errors.

Suggestion:

We could integrate a new tool, proovframe, to correct FSs by aligning reference protein sequences to the consensus sequences.

So far I have tried this with only a single example sequence, so it needs proper benchmarking:

Top: original sequence w/ FS from poreCov
Middle: sequence after proovframe correction w/ all SC2 proteins as reference. However, this introduces another error in ORF1a, likely due to the polyprotein structure of ORF1ab!
Bottom: I therefore removed the protein sequence of the polyprotein from the reference FASTA, and this seems to work: the sequence is fixed.

image

Reference protein FASTA used w/o the ORF1ab polyprotein:
GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa.zip
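A filtered reference like this one could be rebuilt from the full protein FASTA with a small shell helper. This is only a minimal sketch: the function name `drop_fasta_records` is hypothetical, and the header wording `ORF1ab polyprotein` and the input file name in the usage comment are assumptions about the NCBI protein FASTA, not taken from the attached file.

```shell
# Drop every FASTA record whose header line matches PATTERN (case-insensitive).
# Hypothetical helper, not part of proovframe or poreCov.
# Usage: drop_fasta_records PATTERN INPUT.faa > OUTPUT.faa
drop_fasta_records() {
    awk -v pat="$1" '
        # On each header line, decide whether the following record is kept.
        /^>/ { keep = (tolower($0) !~ tolower(pat)) }
        # Print header and sequence lines only while the current record is kept.
        keep
    ' "$2"
}

# Example call (input file name and header wording are assumptions):
# drop_fasta_records 'ORF1ab polyprotein' GCF_009858895.2_ASM985889v3_protein.faa \
#     > GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa
```

It is worth checking the headers of the result (e.g. with `grep '>'`) to confirm that only the intended record was removed.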

Commands:

# map proteins to the consensus sequences (used here in place of reads)
proovframe/bin/proovframe map -a GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa -o raw-seqs.tsv sample.consensus.fasta

# fix frameshifts in the consensus sequences
proovframe/bin/proovframe fix -o corrected.fasta sample.consensus.fasta raw-seqs.tsv

However, I would suggest providing these FS-corrected consensus sequences in addition to the default consensus sequences. Proper benchmarking is needed to check that the corrections do not introduce other errors into SARS-CoV-2 sequences.
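As a very rough first check before any real benchmarking, the original and corrected consensus files could be compared by sequence length. The sketch below is an assumption-laden illustration: the helper names `fasta_lengths` and `compare_lengths` are hypothetical, and it assumes both files list the same records in the same order. A length change only shows how many bases were added or removed in total; it does not show whether a correction is right.

```shell
# Print "<id><TAB><length>" for every record in a FASTA file.
# Hypothetical helper for a quick sanity check.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # flush the previous record
            id = substr($1, 2)                # first word of header, without ">"
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Compare two FASTA files record by record (assumes identical record order).
# Prints: id, original length, corrected length, difference (corrected - original).
compare_lengths() {
    a=$(mktemp)
    b=$(mktemp)
    fasta_lengths "$1" > "$a"
    fasta_lengths "$2" > "$b"
    paste "$a" "$b" | awk -F '\t' -v OFS='\t' '{ print $1, $2, $4, $4 - $2 }'
    rm -f "$a" "$b"
}

# Example call with the file names from the commands above:
# compare_lengths sample.consensus.fasta corrected.fasta
```

Records with a difference of zero were either left untouched or had insertions and deletions that cancel out, so this check complements rather than replaces a proper comparison of the affected ORFs.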

Labels: enhancement (New feature or request)