Feature request - add a statistical cutoff for cluster merging? #38
Description
Hey folks, big fan of starcode here. I really like this tool (well designed, Levenshtein distance, readable code), and I'd like to use it for all my sequence clustering needs. I hope y'all are continuing active development towards a Starcode2?
I was reading the Bartender clustering paper ( doi.org/10.1093/bioinformatics/btx655 ), and the authors' benchmarks show Bartender performing better on low-abundance clusters. One main difference between the tools that might explain this appears to be the use of a "Z-statistic" for cluster merging. I believe other tools use similar statistical cutoffs (DADA2?), and it's my impression that starcode just uses an elegantly simple ratio:
Line 847 in c30de3d:

```c
if (maxcount < CLUSTER_RATIO * mincount) continue;
```
Do y'all think this difference in low-count (<50) performance could be explained by the use of a merging criterion that accounts for counting error?
Does such a "Z-statistic" make sense when using Levenshtein distances? For a single error mode (Hamming, as Bartender uses), you can easily calculate the mutational distance. For multiple error modes with different frequencies (base change, indel), you may need to (1) fit each error rate on fixed (constant) sequence from elsewhere in the reads, and (2) enumerate all possible combinations of errors that can lead to a given distance. Or, weight the graph distances by the likelihood of each error. Does this make sense to try? What about a half-way approximation?
Are you interested in implementing an option to use such a "Z-statistic"? I was thinking of just hard-coding some tests at the above-referenced line; does that sound like a reasonable way to test this idea out?
Please let me know if you folks have any ideas about this. And again, thanks for your work on this public resource.