NAME
gtf2gff3
VERSION
This document describes version 0.1
SYNOPSIS
gtf2gff3 --cfg gtf2gff3_MY_CONFIG.cfg gtf_file > gff3_file
DESCRIPTION
This script converts GTF formatted files to valid GFF3 formatted
files. It maps the value in column 3 (the "type" column) to a valid
SO (Sequence Ontology) term. Because many nonstandard terms may appear
in that column in GTF files, you may edit the config file to provide
your own GTF feature to SO mapping. The script also builds gene models
from exons, CDSs and other features given in the GTF file. It is
currently tested on Ensembl and Twinscan GTF, and it should work on any
other files that follow those same specifications. It does not work on
GTF from the UCSC table browser because those files use the same ID for
gene and transcript, so it is impossible to group multiple transcripts
into a gene.
OPTIONS:
--cfg Provide the filename for a config file. See the configuration file
provided with this script for format details. Use this configuration
file to modify the behavior of the script. If no config file is given
it looks for ./gtf2gff3.cfg, ~/gtf2gff3.cfg or /etc/gtf2gff3.cfg in
that order.
--help Provide a detailed man page style help message and then exit.
INSTALLATION
This script requires the following Perl packages, which are available
from CPAN (www.cpan.org): Getopt::Long and Config::Std. If these are
not already installed, try:
perl -MCPAN -e shell
install Getopt::Long
install Config::Std
quit
After that the script is ready to run.
DESCRIPTION OF THE ALGORITHM
This script was designed to convert GTF formatted files to GFF3
format. It reads input from a GTF file and prints its GFF3 output to
STDOUT. It was written for, and has been tested on, GTF files from
Ensembl and Twinscan, and it should work on similarly formatted GTF
files. To the extent possible it was also written to be robust to
missing features, and it will try to infer those features where
appropriate.
The first step is to parse the incoming GTF file. The script requires
the standard 9-column format, but two configuration variables,
ATTRB_DELIMITER and ATTRB_REGEX, allow flexibility in the 9th
(attributes) column. ATTRB_DELIMITER determines the delimiter between
attributes, and ATTRB_REGEX determines the regular expression that
splits each key/value pair. Both variables take any valid Perl regular
expression.
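The parsing step can be illustrated with a short Python sketch. This is
not the script's own code (the script is written in Perl). The function
name parse_gtf_line and the two default patterns are assumptions chosen
to match the common GTF attribute style (key "value"; key "value";),
and Python's re module stands in for Perl regular expressions.

```python
import re

# Hypothetical stand-ins for the ATTRB_DELIMITER and ATTRB_REGEX settings,
# chosen here for the common GTF style: key "value"; key "value";
ATTRB_DELIMITER = r"\s*;\s*"
ATTRB_REGEX = r'^\s*(\S+)\s+"?([^"]*)"?\s*$'


def parse_gtf_line(line, delimiter=ATTRB_DELIMITER, pair_regex=ATTRB_REGEX):
    """Split one 9-column GTF line into a dict of fields and attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = columns

    attributes = {}
    for chunk in re.split(delimiter, attr_text):
        if not chunk.strip():
            continue  # skip the empty piece left by a trailing delimiter
        match = re.match(pair_regex, chunk)
        if match is None:
            raise ValueError(f"cannot parse attribute: {chunk!r}")
        attributes[match.group(1)] = match.group(2)

    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "phase": phase, "attributes": attributes,
    }


record = parse_gtf_line(
    'chr1\tprotein_coding\texon\t100\t200\t.\t+\t.\t'
    'gene_id "g1"; transcript_id "t1";'
)
```

Passing a different delimiter and pair_regex is the analogue of editing
the two settings in the configuration file.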
The features present in a GTF file can vary quite a bit. Some files
have exons, start codons, CDSs, stop codons and UTRs. Others have some
subset of those. This script takes those features and tries to build a
valid gene model, inferring any missing features where appropriate.
You must have at least exons or CDSs for the script to infer a gene
model. For example, the script would throw an error if it encountered
an orphaned stop codon. Gene features in a GTF file have a gene_id and
a transcript_id in the attributes. The key terms that identify those
IDs can be set in the configuration file. However, if those IDs are
not present then that feature cannot be associated with any gene or
transcript. The script does not try to do any unflattening of gene
features based on coordinates. For example, if you have features that
have transcript IDs but no gene IDs, then no attempt will be made to
cluster those transcripts into genes, and in fact no gene models will
be built.
With regard to gene models, the script limits itself to the features
exon, CDS, start codon, stop codon, 5' UTR and 3' UTR. Those features
may be named anything you want in your input GTF file as long as the
appropriate mappings are set up in the config file.
As a first step in constructing gene models, the script checks for
start and stop codons. If they don't exist, it tries to infer them
from the exons, CDSs and/or UTRs. It currently will assume a start or
stop codon from the appropriate end of a terminal CDS if an exon or
UTR is peripheral to that CDS. It does not check coordinates to see if
those features are contiguous.
GFF3 requires start and stop codons to be part of the CDS. Many GTF
files do not include one or both (often the stop is excluded) within
the CDS. Two configuration variables, START_IN_CDS and STOP_IN_CDS,
tell the script whether or not the respective terminal codon is
included in the CDS within your GTF file. A value of 1 indicates that
your GTF file includes the codon within the coordinates of the
annotated CDSs. A value of 0 indicates that it does not. The defaults
assume that, within your GTF file, start codons are part of the CDS
and stop codons are not.
The next step is to infer CDSs and UTRs from any appropriate
combination of exons, start codon, stop codon, CDSs and/or UTRs. If
exons are unavailable the script will try to infer them from any
combination of CDSs, start codon, stop codon and/or UTRs. In both
cases coordinates are consulted to be sure we're "doing the right
thing".
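As an illustration of coordinate-based inference, the following Python
sketch derives UTR segments from exons and CDSs. It is only a sketch
under stated assumptions, not the script's actual logic: the function
name infer_utrs is hypothetical, coordinates are 1-based and inclusive,
and every exon portion outside the overall CDS span is treated as UTR.

```python
def infer_utrs(exons, cds_segments, strand):
    """Return (five_prime_utrs, three_prime_utrs) as lists of (start, end).

    exons and cds_segments are lists of 1-based inclusive (start, end)
    tuples belonging to a single transcript.
    """
    cds_min = min(start for start, _ in cds_segments)
    cds_max = max(end for _, end in cds_segments)

    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_min:
            # Exon portion lying to the left of the coding region
            left.append((start, min(end, cds_min - 1)))
        if end > cds_max:
            # Exon portion lying to the right of the coding region
            right.append((max(start, cds_max + 1), end))

    # On the minus strand the right-hand side is the 5' end.
    if strand == "+":
        return left, right
    return right, left


# Made-up coordinates, used only to exercise the function.
exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]
five_prime, three_prime = infer_utrs(exons, cds, "+")
```

With these made-up coordinates, the plus-strand call yields one 5' UTR
segment (100, 149) and one 3' UTR segment (551, 600).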
If CDS phase is annotated it is not validated. However, if CDSs are
inferred and a start codon is annotated or inferred, then CDS phase is
set.
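Phase assignment can be sketched in Python as follows. This follows the
GFF3 definition of phase (the number of bases to skip at the start of a
CDS segment to reach the next codon) and assumes the first CDS segment
begins at the start codon. The function name cds_phases is
hypothetical, and the code is not taken from the script.

```python
def cds_phases(cds_segments, strand="+"):
    """Return [(start, end, phase), ...] in transcription order.

    cds_segments holds 1-based inclusive (start, end) tuples. The first
    segment is assumed to begin at the start codon, so its phase is 0.
    """
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    result = []
    coding_bases = 0  # coding bases seen in the preceding segments
    for start, end in ordered:
        # Bases still needed to complete the codon left open so far
        phase = (3 - coding_bases % 3) % 3
        result.append((start, end, phase))
        coding_bases += end - start + 1
    return result


# CDS coordinates from the EXAMPLE USAGE section of this README.
example_cds = [
    (28163331, 28164986), (28165075, 28165231), (28173088, 28173224),
    (28176514, 28176665), (28176847, 28176950), (28181630, 28181713),
    (28187071, 28187381),
]
phases = [phase for _, _, phase in cds_phases(example_cds, "+")]
```

For the CDS coordinates in the EXAMPLE USAGE section this gives the
phases 0, 0, 2, 0, 1, 2, 2, which agree with column 8 of that example.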
As a final step in building a gene model, the script checks all
features within each transcript and gene to be sure that each field
is filled and is consistent with other features associated with the
same transcript and gene. Since genes and transcripts are not
annotated in GTF, these features are constructed for the GFF3 output.
Gene and transcript boundaries are simply assumed to be the minimum
and maximum coordinates of all contained features.
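The boundary rule is simple enough to show in a few lines of Python.
The helper name span is hypothetical, and the coordinates are the exons
from the EXAMPLE USAGE section.

```python
def span(features):
    """Return (min start, max end) over 1-based inclusive (start, end) pairs."""
    starts, ends = zip(*features)
    return min(starts), max(ends)


# Exon coordinates from the EXAMPLE USAGE section of this README.
exons = [
    (28163331, 28164986), (28165075, 28165231), (28173088, 28173224),
    (28176514, 28176665), (28176847, 28176950), (28181630, 28181713),
    (28187071, 28187711),
]
transcript_bounds = span(exons)
# A gene spans all of its transcripts; here there is only one.
gene_bounds = span([transcript_bounds])
```

Both results are (28163331, 28187711), matching the gene and mRNA lines
of the example output.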
EXAMPLE USAGE
Consider the following GTF:
chr1 protein_coding exon 28163331 28164986 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28163331 28164986 . + 0 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28165075 28165231 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28165075 28165231 . + 0 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28173088 28173224 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28173088 28173224 . + 2 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28176514 28176665 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28176514 28176665 . + 0 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28176847 28176950 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28176847 28176950 . + 1 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28181630 28181713 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28181630 28181713 . + 2 gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding exon 28187071 28187711 . + . gene="gene_2" | mRNA="trnsc_5"
chr1 protein_coding CDS 28187071 28187381 . + 2 gene="gene_2" | mRNA="trnsc_5"
Let's assume that the start codon coordinates are included within the
CDS, but that the stop codon is not. We make the following settings
in the configuration file to account for that:
START_IN_CDS = 1
STOP_IN_CDS = 0
We see that the gene ID is annotated as gene="gene_2" and the
transcript ID is annotated as mRNA="trnsc_5" and that attributes are
separated by a vertical bar "|". We adjust the configuration file as follows:
ATTRB_DELIMITER = \s*\|\s*
ATTRB_REGEX = ^\s*(\S+)=(\"[^\"]+\")\s*$
[GTF_ATTRB_MAP]
#Code Tag #GTF Tag
gene_id = gene
trnsc_id = mRNA
The above input would provide the following GFF3 output:
chr1 protein_coding gene 28163331 28187711 . + . ID=gene_2;
chr1 protein_coding mRNA 28163331 28187711 . + . ID=trnsc_5; PARENT=gene_2;
chr1 protein_coding exon 28163331 28164986 . + . ID=exon:trnsc_5:1; PARENT=trnsc_5;
chr1 protein_coding exon 28165075 28165231 . + . ID=exon:trnsc_5:2; PARENT=trnsc_5;
chr1 protein_coding exon 28173088 28173224 . + . ID=exon:trnsc_5:3; PARENT=trnsc_5;
chr1 protein_coding exon 28176514 28176665 . + . ID=exon:trnsc_5:4; PARENT=trnsc_5;
chr1 protein_coding exon 28176847 28176950 . + . ID=exon:trnsc_5:5; PARENT=trnsc_5;
chr1 protein_coding exon 28181630 28181713 . + . ID=exon:trnsc_5:6; PARENT=trnsc_5;
chr1 protein_coding exon 28187071 28187711 . + . ID=exon:trnsc_5:7; PARENT=trnsc_5;
chr1 protein_coding CDS 28163331 28164986 . + 0 ID=CDS:trnsc_5:1; PARENT=trnsc_5;
chr1 protein_coding CDS 28165075 28165231 . + 0 ID=CDS:trnsc_5:2; PARENT=trnsc_5;
chr1 protein_coding CDS 28173088 28173224 . + 2 ID=CDS:trnsc_5:3; PARENT=trnsc_5;
chr1 protein_coding CDS 28176514 28176665 . + 0 ID=CDS:trnsc_5:4; PARENT=trnsc_5;
chr1 protein_coding CDS 28176847 28176950 . + 1 ID=CDS:trnsc_5:5; PARENT=trnsc_5;
chr1 protein_coding CDS 28181630 28181713 . + 2 ID=CDS:trnsc_5:6; PARENT=trnsc_5;
chr1 protein_coding CDS 28187071 28187381 . + 2 ID=CDS:trnsc_5:7; PARENT=trnsc_5;
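The attribute settings used in this example can be tried outside the
script with a short Python sketch. Python's re module stands in for
Perl regular expressions here, the vertical bar in the delimiter is
escaped so that it is matched literally, and the function name
extract_ids is hypothetical. Stripping the quotes from the captured
values is an assumption made so that the results resemble the IDs in
the output above.

```python
import re

# Settings copied from the example; the vertical bar is escaped so that it
# is matched literally rather than treated as regex alternation.
ATTRB_DELIMITER = r"\s*\|\s*"
ATTRB_REGEX = r'^\s*(\S+)=(\"[^\"]+\")\s*$'
# Mirrors the [GTF_ATTRB_MAP] block: code tag -> GTF tag.
GTF_ATTRB_MAP = {"gene_id": "gene", "trnsc_id": "mRNA"}


def extract_ids(attribute_column):
    """Return {'gene_id': ..., 'trnsc_id': ...} for one attributes column."""
    pairs = {}
    for chunk in re.split(ATTRB_DELIMITER, attribute_column):
        match = re.match(ATTRB_REGEX, chunk)
        if match:
            # Dropping the surrounding quotes is an assumption made here so
            # that the values resemble the IDs in the GFF3 output.
            pairs[match.group(1)] = match.group(2).strip('"')
    return {code_tag: pairs.get(gtf_tag)
            for code_tag, gtf_tag in GTF_ATTRB_MAP.items()}


ids = extract_ids('gene="gene_2" | mRNA="trnsc_5"')
```

A tag that is absent from the attributes column maps to None in this
sketch.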
CONFIGURATION AND ENVIRONMENT
A configuration file is provided with this script. The script will
look for that configuration file in ./gtf2gff3.cfg, ~/gtf2gff3.cfg
or /etc/gtf2gff3.cfg in that order. If the configuration file is not
found in one of those locations and one is not provided via the --cfg
flag it will try to choose some sane defaults, but you really should
provide the configuration file. See the supplied configuration file
itself as well as the README that came with this package for format
and details about the configuration file.
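The lookup order described above can be sketched in Python as follows.
The function name find_config and its exists parameter are
hypothetical. The sketch assumes that a path given with --cfg takes
precedence over the search locations and does not reproduce how the
script itself falls back to defaults.

```python
import os


def find_config(cli_path=None, exists=os.path.isfile):
    """Return the config path to use, or None if no file is found.

    cli_path is the value given with --cfg, if any. The exists argument
    lets callers substitute their own file check.
    """
    if cli_path:
        return cli_path
    candidates = [
        "./gtf2gff3.cfg",
        os.path.expanduser("~/gtf2gff3.cfg"),
        "/etc/gtf2gff3.cfg",
    ]
    for path in candidates:
        if exists(path):
            return path
    return None  # the caller would fall back to built-in defaults
```

For example, find_config("gtf2gff3_MY_CONFIG.cfg") returns the given
path without looking anywhere else.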
DEPENDENCIES
This script requires the following Perl packages, which are available
from CPAN (www.cpan.org):
Getopt::Long
Config::Std
INCOMPATIBILITIES
None reported.
BUGS AND LIMITATIONS
No bugs have been reported.
Please report any bugs or feature requests to:
barry dot moore at genetics dot utah dot edu
AUTHOR
Barry Moore
barry dot moore at genetics dot utah dot edu
LICENCE AND COPYRIGHT
Copyright (c) 2007, University of Utah
This module is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
DISCLAIMER OF WARRANTY
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE
LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL,
OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.