Subject: Clarification on “cluster_” identifiers and guidance on building a custom PICRUSt2 reference database



Dear PICRUSt2 team,

I am examining the 16S rRNA reference FASTA files used in PICRUSt2 and have noticed that some sequence records carry identifiers of the form “cluster_XXX”, while others do not.

Could you clarify what these “cluster_” identifiers represent in the context of the reference database? Specifically:

- Are these derived from OTU clustering (e.g., CD-HIT/VSEARCH)?
- Do they correspond to representative sequences from dereplicated groups?
- Or do they reflect some internal processing step specific to PICRUSt2?
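
For illustration, the two ID styles in a reference FASTA could be tallied with a minimal sketch like the one below. It uses only the Python standard library and is not part of PICRUSt2; the function names and the toy records are hypothetical, and the `cluster_` prefix is simply the pattern described above.

```python
from collections import Counter
from typing import Iterable, Iterator


def fasta_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


def tally_cluster_ids(lines: Iterable[str], prefix: str = "cluster_") -> Counter:
    """Count record IDs that do and do not start with the given prefix."""
    counts = Counter({"cluster": 0, "other": 0})
    for seq_id in fasta_ids(lines):
        counts["cluster" if seq_id.startswith(prefix) else "other"] += 1
    return counts


# Toy records (hypothetical); in practice, pass an open file handle
# for the reference FASTA instead of this list.
example = [">cluster_12 some description", "ACGT", ">genome_A", "ACGT", ">cluster_7", "AC"]
print(tally_cluster_ids(example))
```

Listing the non-prefixed IDs alongside these counts may help show whether the two styles come from different sources.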


In addition, I am attempting to construct a custom PICRUSt2 reference database by combining:

1. Older IMG-derived genome sequences (legacy version), and
2. The mouse gut iMGMC PICRUSt2-compatible reference dataset

My goal is to integrate both into a unified database for prediction.

Could you advise on the recommended workflow for this? In particular:

- How should sequence IDs be standardized (e.g., handling cluster vs non-cluster labels)?
- Are there specific requirements for:
  - 16S sequence FASTA formatting
  - Phylogenetic placement (EPA-NG / SEPP steps)
  - Hidden-state prediction inputs (trait tables, marker gene copy numbers)
- Is there an existing pipeline or script to rebuild a PICRUSt2-compatible database from combined sources?
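
To make the ID question concrete, here is a minimal sketch of a pre-merge consistency check. It assumes (and this is an assumption, not confirmed PICRUSt2 behaviour) that every sequence ID in the combined FASTA should appear exactly once and should have a matching row in each trait table. The function names, the table column names, and the toy IDs are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, Set, TextIO


def read_fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect record IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def read_table_ids(handle: TextIO) -> Set[str]:
    """Collect row IDs from the first column of a tab-separated table with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


def check_merge(ids_a: Set[str], ids_b: Set[str], table_ids: Set[str]) -> Dict[str, List[str]]:
    """Report IDs shared by both sources and IDs missing on either side."""
    combined = ids_a | ids_b
    return {
        "shared_between_sources": sorted(ids_a & ids_b),
        "missing_from_table": sorted(combined - table_ids),
        "missing_from_fasta": sorted(table_ids - combined),
    }


# Toy inputs (hypothetical IDs and column names).
source_a = read_fasta_ids([">cluster_1", "ACGT", ">genome_A", "ACGT"])
source_b = read_fasta_ids([">genome_A", "ACGA", ">genome_B", "ACGG"])
table = read_table_ids(
    io.StringIO("sequence_id\tcopy_number\ncluster_1\t2\ngenome_B\t1\ngenome_C\t3\n")
)
report = check_merge(source_a, source_b, table)
print(report)
```

Any ID reported under `shared_between_sources` would need renaming or deduplication before the two sources are combined.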

Finally, could you advise on the best point of contact for more detailed guidance on custom database construction? Should this be directed via GitHub issues, or is there a more appropriate channel?

Thank you for your time and for maintaining PICRUSt2.

Best regards,
Abel Tan


Subject: Clarification on “cluster_” identifiers and guidance on building a custom PICRUSt2 reference database #412
