Commit a131585

Merge pull request #54 from Juke34/pluviomem
Finish pipe
2 parents 15c88f2 + 7263e8a commit a131585

38 files changed

Lines changed: 5462 additions & 875 deletions

README.md

Lines changed: 30 additions & 9 deletions
@@ -221,13 +221,14 @@ The two output formats are tables of comma-separated values with a header.
 | Start | Positive integer | Starting position of the feature (inclusive) |
 | End | Positive integer | Ending position of the feature (inclusive) |
 | Strand | `1` or `-1` | Whether the features is located on the positive (5'->3') or negative (3'->5') strand |
-| CoveredSites | Positive integer | Number of sites in the feature that satisfy the minimum level of coverage |
-| GenomeBases | Comma-separated positive integers | Frequencies of the bases in the feature in the reference genome (order: A, C, G, T) |
-| SiteBasePairings | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
-| ReadBasePairings | Comma-separated positive integers | Frequencies of genome-variant base pairings in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
+| TotalSites | Positive integer | Number of sites in the feature |
+| ObservedBases | Comma-separated positive integers | Counts of the observed bases in the feature, by reference-genome base (order: A, C, G, T). The sum of the 4 values is the total number of observed sites (as reported by the editing tool, e.g. Reditools3) |
+| QualifiedBases | Comma-separated positive integers | Counts of the bases in the feature, by reference-genome base (order: A, C, G, T), that satisfy the minimum level of coverage and editing. The sum of the 4 values is the total number of qualified sites (> cov) |
+| SiteBasePairingsQualified | Comma-separated positive integers | Number of sites in which each genome-variant base pairing is found at reference level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
+| ReadBasePairingsQualified | Comma-separated positive integers | Number of sites in which each genome-variant base pairing is found at reads level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
 
 > [!note]
-> The number of **CoveredSites** can be higher than the sum of **SiteBasePairings** because of the presence of ambiguous bases (e.g. N)
+> The sum of **QualifiedBases** can differ from the sum of the AA, CC, GG and TT counts in **SiteBasePairingsQualified**, because a site that is 100% edited does not fall into any of these four categories.
 
 An example of the feature output format is shown below, with some alterations to make the text line up in columns.
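The comma-separated count columns described in this hunk can be unpacked with a few lines of Python. The following is only a sketch: the helper name `parse_counts` and the example values are hypothetical, and the only things taken from the README are the base order (A, C, G, T) and the pairing order (AA, AC, ..., TT). It also illustrates the note above: a fully edited site contributes to `QualifiedBases` but not to the AA, CC, GG or TT counts.

```python
BASES = "ACGT"
# Pairing order given in the README: AA, AC, AG, AT, CA, ..., TT
PAIRINGS = [ref + var for ref in BASES for var in BASES]


def parse_counts(field, labels):
    """Split a comma-separated count field and label each value."""
    values = [int(v) for v in field.split(",")]
    if len(values) != len(labels):
        raise ValueError(f"expected {len(labels)} values, got {len(values)}")
    return dict(zip(labels, values))


# Hypothetical field values, for illustration only.
qualified_bases = parse_counts("3,2,0,1", BASES)
site_pairings = parse_counts("2,0,3,0,0,2,0,1,0,0,0,0,0,0,0,1", PAIRINGS)

total_qualified = sum(qualified_bases.values())
# Sites where the reference base itself is still observed (AA, CC, GG, TT).
diagonal = sum(site_pairings[b + b] for b in BASES)

# One A site is fully edited to G, so it is missing from the AA count.
print(total_qualified, diagonal)
```

Here the three qualified A sites all show an AG pairing, but only two of them still show AA, which is why the two totals differ by one.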

@@ -275,10 +276,11 @@ This hierarchical information is provided in the same manner in the aggregate fi
 | ParentType | String | Type of the parent of the feature under which the aggregation was done |
 | AggregateType | String | Type of the features that are aggregated |
 | AggregationMode | `all_isoforms`, `longest_isoform`, `chimaera`, `feature` or `all-sites` | Way in which the aggregation was performed |
-| CoveredSites | Positive integer | Number of sites in the aggregated features that satisfy the minimum level of coverage |
-| GenomeBases | Comma-separated positive integers | Frequencies of the bases in the aggregated features in the reference genome (order: A, C, G, T) |
-| SiteBasePairings | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
-| ReadBasePairings | Comma-separated positive integers | Frequencies of genome-variant base pairings in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
+| TotalSites | Positive integer | Number of sites in the aggregated features |
+| ObservedBases | Comma-separated positive integers | Counts of the observed bases in the aggregated features, by reference-genome base (order: A, C, G, T). The sum of the 4 values is the total number of observed sites (as reported by the editing tool, e.g. Reditools3) |
+| QualifiedBases | Comma-separated positive integers | Counts of the bases in the aggregated features, by reference-genome base (order: A, C, G, T), that satisfy the minimum level of coverage and editing. The sum of the 4 values is the total number of qualified sites (> cov) |
+| SiteBasePairingsQualifed | Comma-separated positive integers | Number of sites in which each genome-variant base pairing is found at reference level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) observed |
+| ReadBasePairingsQualifed | Comma-separated positive integers | Number of sites in which each genome-variant base pairing is found at reads level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
 
 In the output of Pluviometer, **aggregation** is the sum of counts from several features of the same type at some feature level. For instance, exons can be aggregated at transcript level, gene level, chromosome level, and genome level.
 
@@ -344,6 +346,21 @@ $$
 AG\ editing\ level = \sum_{i=0}^{n} \dfrac{AG_i}{AA_i + AC_i + AG_i + AT_i}
 $$
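Read literally, the formula sums the per-term ratio AG / (AA + AC + AG + AT). The hunk does not show what `i` indexes, so the sketch below simply takes one `(AA, AC, AG, AT)` tuple per term. The function name, the input layout and the choice to skip terms with a zero denominator are all assumptions made for illustration, not the project's implementation.

```python
def ag_editing_level(counts):
    """Sum AG_i / (AA_i + AC_i + AG_i + AT_i) over all terms i.

    `counts` is an iterable of (AA, AC, AG, AT) tuples (assumed layout).
    """
    total = 0.0
    for aa, ac, ag, at in counts:
        denom = aa + ac + ag + at
        if denom == 0:
            # Assumption: a term without any reads contributes nothing.
            continue
        total += ag / denom
    return total


# Hypothetical counts for two terms: ratios 2/10 and 5/10.
example = [(8, 0, 2, 0), (5, 0, 5, 0)]
level = ag_editing_level(example)
print(level)
```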
 
+
+## Drip
+
+### espf (edited sites proportion in feature):
+
+denom_espf = df[f'{genome_base}_count'] # X_QualifiedBases (e.g. C_count)
+df[espf_col] = df[f'{bp}_sites'] / denom_espf # XY_SiteBasePairingsQualified / X_QualifiedBases
+
+### espr (edited sites proportion in reads):
+
+df[total_reads_col] = XA_reads + XC_reads + XG_reads + XT_reads # all reads at X positions
+df[espr_col] = df[f'{bp}_reads'] / df[total_reads_col] # XY_reads / sum(X*_reads)
+
+Drip retains a line only if at least one metric value is neither NA nor zero (i.e., at least one edit has been detected somewhere). Lines containing only NA values, only 0.0 values, or a mix of both are removed by default.
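The two metrics and the retention rule above can be restated for a single row in plain Python. This is a sketch, not Drip's code: the function names are hypothetical, NA is represented by `math.nan`, and returning NA when a denominator is zero is an assumption.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN (standing in for NA) if it is undefined."""
    return numerator / denominator if denominator else math.nan


def espf(pair_sites, qualified_base_count):
    """XY_SiteBasePairingsQualified / X_QualifiedBases."""
    return safe_ratio(pair_sites, qualified_base_count)


def espr(pair_reads, reads_at_x):
    """XY_reads / (XA_reads + XC_reads + XG_reads + XT_reads)."""
    return safe_ratio(pair_reads, sum(reads_at_x))


def keep_line(metrics):
    """Keep a line only if at least one metric is neither NA nor zero."""
    return any(not math.isnan(m) and m != 0 for m in metrics)


# Hypothetical counts for the AG pairing of one feature.
ag_espf = espf(pair_sites=3, qualified_base_count=12)   # 3 / 12
ag_espr = espr(pair_reads=10, reads_at_x=[70, 0, 10, 0])  # 10 / 80

print(ag_espf, ag_espr, keep_line([ag_espf, ag_espr]))
print(keep_line([math.nan, 0.0]))  # only NA and zero: the line is dropped
```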
+
 </details>
 
 
@@ -355,3 +372,7 @@ Jacques Dainat (@Juke34)
 ## Contributing
 
 Contributions from the community are welcome ! See the [Contributing guidelines](https://github.com/Juke34/rain/blob/main/CONTRIBUTING.md)
+
+## TODO
+
+Update pluviometer to write NA instead of `.` for Start, End and Strand, so that these columns can be read with explicit types (Int64 for Start and End) in drip and barometer, e.g. dtype={"SeqID": str, "Start": "Int64", "End": "Int64", "Strand": str}

bin/README

Lines changed: 2 additions & 10 deletions
@@ -20,20 +20,12 @@ python -m pluviometer --sites SITES --gff GFF [OPTIONS]
 python pluviometer_wrapper.py --sites SITES --gff GFF [OPTIONS]
 ```
 
-### drip_features.py
+### drip.py
 Post-processing tool for pluviometer feature output. Analyzes RNA editing from feature TSV files, calculating editing metrics (espf and espr) for all 16 genome-variant base pair combinations across multiple samples. Combines data into unified matrix format.
 
 **Usage:**
 ```bash
-./drip_features.py OUTPUT_PREFIX FILE1:SAMPLE1 FILE2:SAMPLE2 [...]
-```
-
-### drip_aggregates.py
-Post-processing tool for pluviometer aggregate output. Similar to drip_features.py but operates on aggregate-level data, calculating editing metrics for aggregated genomic regions across samples.
-
-**Usage:**
-```bash
-./drip_aggregates.py OUTPUT_PREFIX FILE1:SAMPLE1 FILE2:SAMPLE2 [...]
+./drip.py OUTPUT_PREFIX FILE1:GROUP1:SAMPLE1:REPLICATE1 FILE2:GROUP1:SAMPLE2:REPLICATE1 [...]
 ```
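The new usage line packs four fields into each positional argument (`FILE:GROUP:SAMPLE:REPLICATE`). How `drip.py` actually parses these arguments is not shown in this diff. The sketch below is one plausible way to split them, with a hypothetical `SampleSpec` type and `parse_spec` helper. Splitting from the right is a design choice made here so that a file path containing a colon would still parse.

```python
from typing import NamedTuple


class SampleSpec(NamedTuple):
    """One FILE:GROUP:SAMPLE:REPLICATE argument (hypothetical container)."""
    path: str
    group: str
    sample: str
    replicate: str


def parse_spec(spec):
    """Split an argument into its four fields, taking the last three from the right."""
    parts = spec.rsplit(":", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected FILE:GROUP:SAMPLE:REPLICATE, got {spec!r}")
    return SampleSpec(*parts)


# Hypothetical file and label names.
spec = parse_spec("features.tsv:control:sample1:rep1")
print(spec.path, spec.group, spec.sample, spec.replicate)
```

An argument in the old two-field form, such as `features.tsv:sample1`, raises `ValueError` under this sketch.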
 
 ### restore_sequences.py
