Feature request - add a statistical cutoff for cluster merging? #38
Description
Hey folks, big fan of starcode here. I really like this tool (well designed, Levenshtein distance, readable code), and I'd like to use it for all my sequence clustering needs. I hope y'all are continuing active development towards a Starcode2?
I was reading the Bartender clustering paper ( doi.org/10.1093/bioinformatics/btx655 ), and the authors' benchmarks show Bartender performing better on low-abundance clusters. One main difference between the tools that might explain this appears to be the use of a "Z-statistic" for cluster merging. I believe other tools use similar statistical cutoffs (DADA2?), and it's my impression that starcode just uses an elegantly simple ratio:
Line 847 in c30de3d:

```c
if (maxcount < CLUSTER_RATIO * mincount) continue;
```
Do y'all think this difference in low-count (<50) performance could be explained by the use of a merging criterion that accounts for counting error?
Does such a "Z-statistic" make sense when using Levenshtein distances? For a single error mode (Hamming, as Bartender uses), you can easily calculate the mutational distance. For multiple error modes with different frequencies (base change, indel), you may need to (1) fit each error rate on fixed (constant) sequence from elsewhere in the reads, and (2) enumerate all possible combinations of errors that can lead to a given distance. Or, weight the graph distances by the likelihood of each error. Does this make sense to try? What about a half-way approximation?
Are you interested in implementing an option to use such a "Z-statistic"? I was thinking of just hard-coding some tests at the above-referenced line; does that sound like a reasonable way to test this idea out?
Please let me know if you folks have any ideas about this. And again, thanks for your work on this public resource.