Commit bcdd7fc

clean before vacation
1 parent 0d3474d commit bcdd7fc

14 files changed (+390, -487 lines)

README.md

Lines changed: 4 additions & 1 deletion
@@ -274,7 +274,7 @@ This hierarchical information is provided in the same manner in the aggregate fi
 | AggregateID | String | ID assigned after the feature under which the aggregation was done |
 | ParentType | String | Type of the parent of the feature under which the aggregation was done |
 | AggregateType | String | Type of the features that are aggregated |
-| AggregationMode | `all_isoforms`, `longest_isoform`, `chimaera` or `all-sites` | Way in which the aggregation was performed |
+| AggregationMode | `all_isoforms`, `longest_isoform`, `chimaera`, `feature` or `all-sites` | Way in which the aggregation was performed |
 | CoveredSites | Positive integer | Number of sites in the aggregated features that satisfy the minimum level of coverage |
 | GenomeBases | Comma-separated positive integers | Frequencies of the bases in the aggregated features in the reference genome (order: A, C, G, T) |
 | SiteBasePairings | Comma-separated positive integers | Number of sites in which each genome-variant base pairing is found in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
@@ -303,6 +303,9 @@ The existence of alternative transcripts of a same gene causes some complication
 
 3. **Chimaera** (*Chimaera* in the figure): Report the counts from the union of feature ranges over all the isoforms. Its ID is composed of the ID of the gene plus "-chimaera". The aggregation types of chimaeras are postfixed with "-chimaera" as well.
 
+4. **Feature**
+Standard mode for regular features. Aggregates data from sub-features (children) of a given feature. For example, for an exon or CDS, it aggregates the counts of all its constituent elements.
+
 In the example below, a gene has three transcripts. For the **longest isoform** aggregation, Transcript 1 would be selected, because it has the greatest sum of exon lengths (numbers under the exon boxes). For the **all isoforms** aggregation, all the transcripts (1, 2, and 3) would be used. For **chimaera** aggregation, the aggregation ranges are the union of the ranges of the exons of all the transcripts. Therefore, the total length of the chimaeric features is always equal to or greater than that of the longest transcript.
 
 ![alt text](doc/img/aggregation_modes.png)
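
To make the difference between the **longest isoform** and **chimaera** modes in this README hunk concrete, here is a minimal sketch. It is not taken from the repository: the transcript names, the coordinates, and the helpers `total_length` and `union_ranges` are hypothetical, and exon ranges are assumed to be half-open `[start, end)` intervals.

```python
# Hypothetical exon ranges (half-open [start, end)) for three transcripts of one gene.
transcripts = {
    "transcript1": [(0, 100), (200, 350)],
    "transcript2": [(0, 100), (400, 450)],
    "transcript3": [(50, 120)],
}


def total_length(ranges):
    """Sum of the lengths of a list of (start, end) ranges."""
    return sum(end - start for start, end in ranges)


def union_ranges(ranges):
    """Merge overlapping or touching ranges into a sorted list of disjoint ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Longest isoform: the transcript with the greatest sum of exon lengths.
longest = max(transcripts, key=lambda name: total_length(transcripts[name]))

# Chimaera: union of the exon ranges over all transcripts.
chimaera = union_ranges([r for ranges in transcripts.values() for r in ranges])

print(longest, total_length(transcripts[longest]))
print(chimaera, total_length(chimaera))
```

With these made-up coordinates the chimaeric length exceeds the longest transcript's summed exon length, which matches the README's statement that the union can never be shorter than any single isoform.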
Lines changed: 15 additions & 15 deletions
@@ -10,7 +10,7 @@ def print_help():
 DRIP - RNA Editing Analysis Tool
 
 DESCRIPTION:
-This script analyzes RNA editing from RAIN aggregate files. It calculates
+This script analyzes RNA editing from standardized pluviometer files. It calculates
 two key metrics for all 16 genome-variant base pair combinations across multiple
 samples and combines them into a unified matrix format.
 
@@ -35,7 +35,7 @@ def print_help():
 (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT)
 
 CALCULATED METRICS:
-For each aggregate feature, the script calculates metrics for all 16 base pair combinations:
+For each line, the script calculates metrics for all 16 base pair combinations:
 
 For each combination XY (where X = genome base, Y = read base):
 
@@ -61,10 +61,11 @@ def print_help():
 Metadata columns (first 6 columns):
 - SeqID: Sequence/chromosome identifier
 - ParentIDs: Parent feature identifiers
-- AggregateID: Unique aggregate identifier
-- ParentType: Type of parent feature
-- AggregateType: Type of aggregate feature
-- AggregationMode: Mode of aggregation used
+- ID: Unique identifier
+- Ptype: Type of Parent feature
+- Type: Type of feature
+- Ctype: Type of Children feature
+- Mode: Mode of aggregation used if any (e.g., 'all_sites', 'edited_sites', 'edited_reads')
 
 Metric columns (for each sample):
 - GROUP::SAMPLE::REPLICATE::espf: XY sites proportion in feature (XY sites / X bases)
@@ -94,7 +95,7 @@ def print_help():
 - results_TA.tsv, results_TC.tsv, results_TG.tsv, results_TT.tsv
 
 Each file has columns:
-SeqID, ParentIDs, AggregateID, ParentType, AggregateType, AggregationMode,
+SeqID, ParentIDs, ID, Ptype, Ctype, Mode,
 control::sample1::rep1::rain_sample1::espf, control::sample1::rep1::rain_sample1::espr,
 control::sample2::rep2::rain_sample2::espf, control::sample2::rep2::rain_sample2::espr,
 treated::sample1::rep1::rain_sample3::espf, treated::sample1::rep1::rain_sample3::espr
@@ -118,9 +119,8 @@ def parse_tsv_file(filepath, group_name, sample_name, replicate, file_id, includ
     """Parse a single TSV file and extract editing metrics for all base pair combinations."""
     df = pd.read_csv(filepath, sep='\t')
 
-    # DO NOT filter out rows where AggregateID is '.'
+    # DO NOT filter out rows where ID is '.'
     # These are special aggregate rows (e.g., all_sites) that should be kept
-
     # Base pair combinations in order
     base_pairs = ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT',
                   'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
@@ -129,7 +129,7 @@ def parse_tsv_file(filepath, group_name, sample_name, replicate, file_id, includ
     # Parse GenomeBases (order: A, C, G, T)
     for i, base in enumerate(bases):
         df[f'{base}_count'] = df['GenomeBases'].str.split(',').str[i].astype(int)
-
+
     # Parse SiteBasePairings (all 16 combinations)
     for i, bp in enumerate(base_pairs):
         df[f'{bp}_sites'] = df['SiteBasePairings'].str.split(',').str[i].astype(int)
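
For readers following the parsing step, the sketch below shows in plain Python (without pandas) how one row's comma-separated `GenomeBases` and `SiteBasePairings` fields map onto the `espf` metric that the help text defines as "XY sites / X bases". The function names `parse_counts` and `espf` and the example values are hypothetical, and returning `0.0` when the genome base count is zero is an assumption, since the hunks shown here do not say how the script handles that case.

```python
BASES = ["A", "C", "G", "T"]
# Genome base first, read base second: AA, AC, AG, AT, CA, ..., TT
BASE_PAIRS = [genome + read for genome in BASES for read in BASES]


def parse_counts(field):
    """Split a comma-separated field such as '10,20,30,40' into integers."""
    return [int(value) for value in field.split(",")]


def espf(genome_bases, site_base_pairings):
    """Return {pair: XY sites / X bases} for all 16 pairs of one row."""
    base_counts = dict(zip(BASES, parse_counts(genome_bases)))
    site_counts = dict(zip(BASE_PAIRS, parse_counts(site_base_pairings)))
    result = {}
    for pair, sites in site_counts.items():
        denominator = base_counts[pair[0]]  # count of the genome base X
        # Assumption: report 0.0 when the feature contains no X bases.
        result[pair] = sites / denominator if denominator else 0.0
    return result


# Made-up row: 10 A bases, of which 2 sites show an A-to-G pairing.
row = espf("10,20,30,40", "8,0,2,0,0,20,0,0,0,0,30,0,0,0,0,40")
print(row["AG"])
```

The pandas code in the diff does the same positional split column by column, so the order of `BASES` and `BASE_PAIRS` must match the order documented for the input files.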
@@ -139,7 +139,7 @@ def parse_tsv_file(filepath, group_name, sample_name, replicate, file_id, includ
         df[f'{bp}_reads'] = df['ReadBasePairings'].str.split(',').str[i].astype(int)
 
     # Calculate metrics for each base pair combination
-    metadata_cols = ['SeqID', 'ParentIDs', 'AggregateID', 'ParentType', 'AggregateType', 'AggregationMode']
+    metadata_cols = ['SeqID', 'ParentIDs', 'ID', 'Mtype', 'Ptype', 'Type', 'Ctype', 'Mode', 'Start', 'End', 'Strand']
     result_cols = metadata_cols.copy()
 
     # Create column prefix with group::sample::replicate::file_id or group::sample::replicate
@@ -210,18 +210,18 @@ def merge_samples(file_group_sample_replicate_dict, output_prefix, include_file_
         replicate_list.append(replicate)
 
     # Merge all samples based on metadata columns
-    metadata_cols = ['SeqID', 'ParentIDs', 'AggregateID', 'ParentType', 'AggregateType', 'AggregationMode']
+    metadata_cols = ['SeqID', 'ParentIDs', 'ID', 'Mtype', 'Ptype', 'Type', 'Ctype', 'Mode', 'Start', 'End', 'Strand']
     merged = all_data[0]
     for data in all_data[1:]:
         merged = merged.merge(data, on=metadata_cols, how='outer')
 
     # Fill NA values with 0 for metrics
     merged = merged.fillna(0)
 
-    # Sort by SeqID, then ParentIDs, then AggregationMode
-    merged = merged.sort_values(['SeqID', 'ParentIDs', 'AggregationMode'])
+    # Sort by SeqID, then ParentIDs, then Mode
+    merged = merged.sort_values(['SeqID', 'ParentIDs', 'Mode'])
 
-    metadata_cols = ['SeqID', 'ParentIDs', 'AggregateID', 'ParentType', 'AggregateType', 'AggregationMode']
+    metadata_cols = ['SeqID', 'ParentIDs', 'ID', 'Mtype', 'Ptype', 'Type', 'Ctype', 'Mode', 'Start', 'End', 'Strand']
 
     # Create one file per base pair combination
     output_files = []
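
The merge step in `merge_samples` is an outer join on the metadata columns, followed by filling missing metrics with 0 and sorting. The plain-Python sketch below illustrates that behaviour without pandas. It is an illustration only: `outer_merge`, the shortened `METADATA_COLS` list, and the sample rows are hypothetical, whereas the real script keys on the full eleven-column list shown in the diff.

```python
# Shortened, hypothetical subset of the metadata columns used as the join key.
METADATA_COLS = ["SeqID", "ParentIDs", "ID", "Mode"]


def outer_merge(samples):
    """Outer-join per-sample rows on METADATA_COLS.

    samples: list of (metric_column_name, rows), where each row is a dict
    holding the metadata columns plus that sample's metric column.
    """
    merged = {}
    metric_cols = [name for name, _ in samples]
    for name, rows in samples:
        for row in rows:
            key = tuple(row[col] for col in METADATA_COLS)
            merged.setdefault(key, {})[name] = row[name]

    table = []
    for key, metrics in merged.items():
        record = dict(zip(METADATA_COLS, key))
        for name in metric_cols:
            # A feature missing from one sample gets 0, like fillna(0).
            record[name] = metrics.get(name, 0)
        table.append(record)

    # Same sort order as the script: SeqID, then ParentIDs, then Mode.
    table.sort(key=lambda r: (r["SeqID"], r["ParentIDs"], r["Mode"]))
    return table


control = ("control::s1::rep1::espf", [
    {"SeqID": "chr1", "ParentIDs": "gene1", "ID": "mRNA1", "Mode": "feature",
     "control::s1::rep1::espf": 0.2},
    {"SeqID": "chr2", "ParentIDs": "gene9", "ID": ".", "Mode": "all_sites",
     "control::s1::rep1::espf": 0.1},
])
treated = ("treated::s1::rep1::espf", [
    {"SeqID": "chr1", "ParentIDs": "gene1", "ID": "mRNA1", "Mode": "feature",
     "treated::s1::rep1::espf": 0.5},
])

table = outer_merge([control, treated])
for record in table:
    print(record)
```

Because the join key is the whole metadata tuple, any change to the metadata column list (as in this commit) changes which rows are considered the same feature across samples, so all input files need to carry the same set of columns.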
