Dear PICRUSt2 team,
I am currently examining the 16S rRNA reference FASTA files used in PICRUSt2 and noticed that some sequence records are labeled with identifiers such as “cluster_XXX”, while others are not.
Could you clarify what these “cluster_” identifiers represent in the context of the reference database? Specifically:
- Are these derived from OTU clustering (e.g., CD-HIT/VSEARCH)?
- Do they correspond to representative sequences from dereplicated groups?
- Or do they reflect some internal processing step specific to PICRUSt2?
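To make the observation concrete, here is a minimal sketch of how the two kinds of labels could be tallied in a reference FASTA. It assumes the label appears as a `cluster_` prefix on the first whitespace-delimited token of each header line. The function name and the example records are hypothetical and are not taken from the PICRUSt2 files.

```python
from collections import Counter
from typing import Iterable


def tally_id_labels(fasta_lines: Iterable[str], prefix: str = "cluster_") -> Counter:
    """Count FASTA headers whose ID starts with `prefix` versus all others."""
    counts: Counter = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # The sequence ID is assumed to be the first token after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            key = "cluster" if seq_id.startswith(prefix) else "other"
            counts[key] += 1
    return counts


# Hypothetical records, for illustration only.
example = [">cluster_12", "ACGT", ">2500000001", "ACGT-", ">cluster_7 extra", "GG"]
label_counts = tally_id_labels(example)
print(label_counts["cluster"], label_counts["other"])
```

For a real file, the same function can be applied to an open file handle, since it only iterates over lines.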
In addition, I am attempting to construct a custom PICRUSt2 reference database by combining:
- Older IMG-derived genome sequences (legacy version), and
- The mouse gut iMGMC PICRUSt2-compatible reference dataset
My goal is to integrate both into a unified database for prediction.
Could you advise on the recommended workflow for this? In particular:
- How should sequence IDs be standardized (e.g., handling cluster vs non-cluster labels)?
- Are there specific requirements for:
  - 16S sequence FASTA formatting
  - Phylogenetic placement (EPA-NG / SEPP steps)
  - Hidden-state prediction inputs (trait tables, marker gene copy numbers)
- Is there an existing pipeline or script to rebuild a PICRUSt2-compatible database from combined sources?
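As a starting point for the ID standardization question, the sketch below shows one way to check whether the two sources share any sequence IDs before merging. It is an illustration only, not an existing PICRUSt2 script. It assumes plain FASTA input, and all function names and example IDs are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (id, sequence) pairs; the ID is the first header token."""
    records: List[Tuple[str, str]] = []
    seq_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def find_id_collisions(first: List[Tuple[str, str]],
                       second: List[Tuple[str, str]]) -> List[str]:
    """Return the sorted IDs that occur in both record lists."""
    first_ids = {seq_id for seq_id, _ in first}
    return sorted({seq_id for seq_id, _ in second if seq_id in first_ids})


# Hypothetical records standing in for the two sources.
img_records = parse_fasta([">2500000001", "ACGT", ">cluster_1", "GG", "CC"])
imgmc_records = parse_fasta([">cluster_1", "TTAA", ">MAG_0001", "ACGT"])
collisions = find_id_collisions(img_records, imgmc_records)
print(collisions)
```

Any ID reported here would need renaming in the FASTA and, presumably, in every file keyed by the same IDs (tree tips, trait tables, copy-number tables), which is the part I would like guidance on.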
Finally, could you advise on the best point of contact for more detailed guidance on custom database construction? Should this be directed via GitHub issues, or is there a more appropriate channel?
Thank you for your time and for maintaining PICRUSt2.
Best regards,
Abel Tan