Feature request - add a statistical cutoff for cluster merging? #38

@darachm

Hey folks, big fans of starcode. I really like this tool (well designed, Levenshtein distance, code is readable), and I'd like to use it for all my seq clustering needs. I hope y'all are continuing active development towards Starcode2 ... ?

I was reading the Bartender clustering paper (doi.org/10.1093/bioinformatics/btx655), and the authors' own benchmarks show Bartender working better for low-abundance clustering. One main difference between the tools that might explain this appears to be Bartender's use of a "Z-statistic" for cluster merging. I believe other tools use similar statistical cutoffs (DADA2?), and it's my impression that starcode just uses an elegantly simple ratio:

if (maxcount < CLUSTER_RATIO * mincount) continue;

I need my clustering to handle low-count clusters confidently, so I have a few questions:

  • Do y'all think the difference in low-count (<50) performance could be explained by Bartender using a metric that scales with counting error?

  • Does such a "Z-statistic" make sense when using Levenshtein distances? For a single error mode (Hamming, as Bartender uses), you can easily calculate the mutational distance. For multiple error modes with different frequencies (base change, indel), you may need to (1) fit each parameter on a fixed sequence from elsewhere in the reads and (2) enumerate all possible combinations of errors that can lead to a certain distance. Or, weight the graph distances by the likelihood of each error. Does this make sense to try? What about a half-way approximation?

  • Are you interested in implementing an option to use such a "Z-statistic"? I was thinking of just hard-coding some tests at the line quoted above. Does that sound like a reasonable way to test this idea out?
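
To make the comparison concrete, here is a minimal sketch of how a count-aware test could sit next to the ratio check quoted above. It is not Bartender's formula and not starcode code: it assumes a simple binomial model in which each read of the larger cluster turns into the smaller sequence with probability `p_err`, and it merges when the smaller cluster is no bigger than that model predicts. The function names, `p_err`, and `z_cutoff` are all hypothetical, and the example ratio of 5 is only an illustrative value.

```c
#include <stdbool.h>

/* The rule quoted in the issue, restated as a predicate: merging proceeds
 * only when the larger cluster is at least cluster_ratio times the smaller. */
static bool ratio_allows_merge(double maxcount, double mincount,
                               double cluster_ratio) {
    return maxcount >= cluster_ratio * mincount;
}

/* Hypothetical Z-style test (an assumption, not Bartender's or starcode's).
 * Model: error reads ~ Binomial(maxcount, p_err), so
 *   expected = maxcount * p_err
 *   variance = maxcount * p_err * (1 - p_err)
 *   Z        = (mincount - expected) / sqrt(variance)
 * Merge when Z <= z_cutoff, i.e. the small cluster is plausibly all errors. */
static bool z_allows_merge(double maxcount, double mincount,
                           double p_err, double z_cutoff) {
    /* Reject inputs for which the model is undefined. */
    if (maxcount <= 0.0 || p_err <= 0.0 || p_err >= 1.0 || z_cutoff < 0.0)
        return false;

    double excess = mincount - maxcount * p_err;
    double variance = maxcount * p_err * (1.0 - p_err);

    /* Fewer reads than expected from errors alone: always consistent. */
    if (excess <= 0.0)
        return true;

    /* Z <= z_cutoff is equivalent to excess^2 <= z_cutoff^2 * variance
     * when both sides are non-negative; this avoids calling sqrt. */
    return excess * excess <= z_cutoff * z_cutoff * variance;
}
```

With an assumed `p_err` of 0.01 and a cutoff of 3, a pair with counts 40 and 4 passes the ratio rule at ratio 5 (40 >= 20) but fails the Z-style test, because 4 reads is far more than the 0.4 error reads the model expects. That is the low-count regime the questions above are about. For Levenshtein distances, `p_err` would have to fold in the separate substitution and indel rates, which this sketch does not attempt.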

Please let me know if you folks have any ideas about this. And again, thanks for your work on this public resource.
