A flexible pipeline for clustering sequences based on phylogenetic distances, with optional taxonomy-based consistency checks.
PDMClust uses phylogenetic distances to cluster sequences at different thresholds. It can also incorporate taxonomy information to ensure that clusters are consistent at a specified taxonomic level (e.g., genus, family).
graph TD
A[Input Files] --> B[PhyloDM]
A --> |Optional| C[Taxonomy]
B --> D[Distance Matrix]
D --> E[MCL Clustering]
C --> F[Taxonomy-Aware Clustering]
E --> F
F --> G[Final Clusters]
G --> H[Statistics & Visualization]
- Clustering based on phylogenetic distances
- Multiple clustering thresholds (0.99, 0.95, 0.9, 0.8, 0.7 by default)
- Taxonomy-based consistency checks
- Automatic detection of input files
- Comprehensive output with statistics

- Python 3.10+
- Required Python packages:
  - pandas
  - numpy
  - dendropy
  - phylodm
  - mcl (Markov Clustering algorithm)
Pixi is a package manager that makes it easy to set up isolated environments with all the required dependencies.
# Install pixi if you don't have it
curl -fsSL https://pixi.sh/install.sh | bash
# Clone the repository
git clone https://github.com/NeLLi-team/pdmclust.git
cd pdmclust
# Install dependencies
pixi install
Apptainer (formerly Singularity) is ideal for HPC environments where Docker might not be available.
# Use the provided build script (handles both Apptainer and Singularity)
./build_container.sh
# Run the Apptainer container. Host directories are bound to fixed absolute
# paths inside the container; the quotes guard against spaces in the host path.
apptainer run \
--bind "$(pwd)/test:/input" \
--bind "$(pwd):/output" \
phylodm-clustering.sif \
--input_dir /input \
--output_dir /output/test_results
# You can also use relative paths in your command, which will be converted
# to absolute paths inside the container:
apptainer run \
--bind ./test:/input \
--bind .:/output \
phylodm-clustering.sif \
--input_dir /input \
--output_dir /output/test_results
For older Singularity installations:
# Use the provided build script
./build_container.sh
# Run with Singularity
singularity run \
--bind $(pwd)/test:/input \
--bind $(pwd):/output \
phylodm-clustering.sif \
--input_dir /input \
--output_dir /output/test_results
The build script tries several build methods in sequence until one succeeds:
- Build with fakeroot
- Build with fakeroot and ignore-fakeroot-command
- Build without fakeroot
- Build using the sandbox method
This ensures the container can be built in a variety of environments with different security configurations.
The container uses absolute paths internally for reliable operation, but displays relative paths in the output for platform independence:
- Inside the container: All paths are converted to absolute paths to ensure reliable operation
- In output files: Paths are displayed as relative paths (just filenames) for platform independence
- Bind mounts: Use --bind to map your local directories to paths inside the container
This approach ensures that the pipeline works reliably inside the container while maintaining platform independence for the output files.
python phylodm_clustering.py --input_dir test
This will:
- Look for tree, alignment, and taxonomy files in the input directory
- Perform clustering at thresholds 0.99, 0.95, 0.9, 0.8, 0.7
- Create a results directory with the clustering results
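The automatic detection step can be pictured with a short sketch. It is not the script's actual code: the function name `find_input_files` is hypothetical, the extension sets are copied from the input-file descriptions in this README, and picking the first match in sorted order is an assumption. Taxonomy detection is left out because this README does not state which extensions are searched for it.

```python
from pathlib import Path

# Extensions copied from the input-file descriptions in this README.
TREE_EXTENSIONS = {".tree", ".nwk", ".tre", ".treefile", ".contree"}
ALIGNMENT_EXTENSIONS = {".aln", ".mafft", ".mafft_t", ".faa", ".fa", ".fasta"}


def find_input_files(input_dir):
    """Return the first tree and alignment file found in input_dir (or None).

    Hypothetical helper: files are scanned in sorted order and the first
    match per category wins, which is an assumption of this sketch.
    """
    found = {"tree": None, "alignment": None}
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if found["tree"] is None and suffix in TREE_EXTENSIONS:
            found["tree"] = path
        elif found["alignment"] is None and suffix in ALIGNMENT_EXTENSIONS:
            found["alignment"] = path
    return found
```

If a directory holds several candidate files, passing --tree and --seqfile explicitly avoids relying on any detection order.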
We've created several predefined tasks in the pixi.toml file for common use cases:
# Run with test data (no arguments needed)
pixi run run-test
# Run with taxonomy-based clustering on test data
pixi run run-test-with-taxonomy
# Run with custom cutoffs (0.95, 0.9, 0.85) on test data
pixi run run-test-custom-cutoffs-95-90-85
# Run with custom cutoffs (0.99, 0.95, 0.9) on test data
pixi run run-test-custom-cutoffs-99-95-90
For more advanced usage, you can run the Python script directly with additional parameters:
python phylodm_clustering.py --input_dir /path/to/input --output_dir /path/to/output --tree /path/to/tree.nwk --seqfile /path/to/alignment.faa --taxonomy /path/to/taxonomy.tsv --tax_level 3 --cutoffs 0.99,0.95,0.9,0.85,0.8,0.75,0.7
You can extract cluster representatives (the first sequence from each cluster) at a specific threshold:
# Extract representatives at threshold 0.7
python phylodm_clustering.py --input_dir test --extract 0.7
# Run clustering and extract representatives in one command
python phylodm_clustering.py --input_dir test --cutoffs 0.99,0.95,0.9,0.8,0.7 --extract 0.7
# Extract-only mode (only extracts representatives without redoing clustering if output exists)
python phylodm_clustering.py --input_dir test --extract 0.7 --extract_only
# Extract representatives at the highest threshold where all clusters are consistent at a specific taxonomy level
python phylodm_clustering.py --input_dir test --extract_tax_level 2 # Genus level
python phylodm_clustering.py --input_dir test --extract_tax_level 3 # Family level
python phylodm_clustering.py --input_dir test --extract_tax_level 4 # Order level
python phylodm_clustering.py --input_dir test --extract_tax_level 5 # Class level
python phylodm_clustering.py --input_dir test --extract_tax_level 6 # Phylum level
# You can also use string names for taxonomy levels
python phylodm_clustering.py --input_dir test --extract_tax_level genus
python phylodm_clustering.py --input_dir test --extract_tax_level family
python phylodm_clustering.py --input_dir test --extract_tax_level order
python phylodm_clustering.py --input_dir test --extract_tax_level class
python phylodm_clustering.py --input_dir test --extract_tax_level phylum
# You can specify a particular sequence file to use
python phylodm_clustering.py --input_dir test --seqfile test/alignment.faa --extract_tax_level family
# You can specify all input files explicitly
python phylodm_clustering.py --input_dir test --tree test/tree.nwk --seqfile test/alignment.faa --taxonomy test/taxonomy.tsv --extract_tax_level family
Extract-only mode is useful when clustering has already been run and you only want the representatives. If the output directory does not exist yet, clustering runs only at the specified threshold.
The --extract_tax_level option automatically runs clustering at a range of thresholds (from 0.9 down to 0.01) and finds the highest threshold where all clusters are consistent at the specified taxonomy level. This is useful for finding the optimal threshold for a specific taxonomy level.
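A literal reading of that search can be sketched as follows. The function names, the in-memory data layout (a dict mapping each threshold to a list of clusters), and the handling of edge cases are all assumptions of this sketch; the script's own implementation may differ.

```python
def is_consistent(clusters, taxon_of):
    """True if every cluster contains sequences from a single taxon."""
    return all(
        len({taxon_of[seq] for seq in cluster}) <= 1 for cluster in clusters
    )


def highest_consistent_threshold(clusters_by_threshold, taxon_of):
    """Scan thresholds from high to low; return the first consistent one.

    clusters_by_threshold: {threshold: [[seq_id, ...], ...]} (hypothetical layout)
    taxon_of: {seq_id: taxon name at the chosen taxonomy level}
    Returns None if no threshold yields fully consistent clusters.
    """
    for threshold in sorted(clusters_by_threshold, reverse=True):
        if is_consistent(clusters_by_threshold[threshold], taxon_of):
            return threshold
    return None


# Toy data: s1 and s2 share a genus, s3 belongs to another one.
genus_of = {"s1": "Escherichia", "s2": "Escherichia", "s3": "Bacillus"}
toy_clusters = {
    0.5: [["s1", "s3"], ["s2"]],  # mixes two genera, so inconsistent
    0.3: [["s1", "s2"], ["s3"]],  # each cluster holds a single genus
}
best = highest_consistent_threshold(toy_clusters, genus_of)
```

In this sketch every sequence needs a taxonomy entry; how the pipeline treats sequences without one is not specified here.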
The output file will be named based on the input alignment file with the threshold appended (e.g., pimascovirales_faa--GVOG7-fasttree-perc3_pdm07.mafft_t). The threshold is formatted with a "pdm" prefix followed by the threshold value without the decimal point.
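The naming rule can be illustrated with a small helper. `representative_filename` is a hypothetical name, and the handling of trailing zeros (0.90 becoming "pdm09") is an assumption; only the "pdm" prefix and the removal of the decimal point come from the description above.

```python
from pathlib import Path


def representative_filename(alignment_name, threshold):
    """Build '<stem>_pdm<digits><suffix>' from an alignment file name.

    Sketch only: the threshold is rendered with str(), so 0.7 gives 'pdm07'
    and 0.95 gives 'pdm095'.
    """
    path = Path(alignment_name)
    tag = "pdm" + str(threshold).replace(".", "")
    return f"{path.stem}_{tag}{path.suffix}"


name = representative_filename(
    "pimascovirales_faa--GVOG7-fasttree-perc3.mafft_t", 0.7
)
# name == "pimascovirales_faa--GVOG7-fasttree-perc3_pdm07.mafft_t"
```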
If you need to add more custom tasks, you can edit the pixi.toml file to add your own tasks:
# Example of adding a custom task for a specific directory
run-dir-your-directory = "python phylodm_clustering.py --input_dir your_directory"
# Example of adding a custom task with specific parameters
run-custom-task = "python phylodm_clustering.py --input_dir your_directory --cutoffs 0.95,0.9 --tax_level 3"
# Example of adding a custom task for extraction
run-extract-custom = "python phylodm_clustering.py --input_dir your_directory --extract 0.7"
A phylogenetic tree in Newick format (*.tree, *.nwk, *.tre, *.treefile, *.contree).
An alignment file in FASTA format (*.aln, *.mafft, *.mafft_t, *.faa, *.fa, *.fasta).
A tab-separated file with two columns:
- Sequence ID
- Taxonomy string in the format "domain|phylum|class|order|family|genus|species"
Example:
seq1 Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia|Escherichia coli
seq2 Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|Bacillus subtilis
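As an illustration of this format, the sketch below reads such a file into a dictionary keyed by sequence ID. `parse_taxonomy` and `RANKS` are names chosen for this example, and the sketch assumes the file has no header row, which this README does not state either way.

```python
import csv
import io

# Rank order as given in the taxonomy string format above.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def parse_taxonomy(handle):
    """Map each sequence ID to a {rank: name} dict.

    Assumes two tab-separated columns and no header line (an assumption).
    Rows with fewer than two columns are skipped.
    """
    taxonomy = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        seq_id, lineage = row[0], row[1]
        taxonomy[seq_id] = dict(zip(RANKS, lineage.split("|")))
    return taxonomy


example = io.StringIO(
    "seq1\tBacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|"
    "Enterobacteriaceae|Escherichia|Escherichia coli\n"
    "seq2\tBacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
    "Bacillus subtilis\n"
)
taxonomy = parse_taxonomy(example)
```

With a real file, the same function can be called on `open("taxonomy.tsv", newline="")`.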
A tab-separated file with the first column containing sequence IDs and subsequent columns containing counts for each sample.
Example:
genome_id sample1 sample2 sample3
seq1 10 5 2
seq2 8 4 1
seq3 6 3 0
The pipeline creates a results directory with the following structure:
results_directory/
├── README.md               # Summary of the results
├── combined_stats.tsv      # Combined statistics for all thresholds
├── clusters/               # Final clusters for each threshold
│   ├── threshold_0.99.txt
│   ├── threshold_0.95.txt
│   ├── threshold_0.90.txt
│   ├── threshold_0.80.txt
│   └── threshold_0.70.txt
└── phylodm_out/            # Raw PhyloDM output files
    ├── threshold_0.99_0.99.3c
    ├── threshold_0.99_0.99.3c.mcl
    ├── threshold_0.95_0.95.3c
    └── ...
When using taxonomy-based clustering, you can specify the taxonomic level at which to enforce consistency:
- Level 1: Species
- Level 2: Genus (default)
- Level 3: Family
- Level 4: Order
- Level 5: Class
- Level 6: Phylum
- Level 7: Domain
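Because the levels count upward from species while the taxonomy string runs from domain to species, it helps to see how a level maps onto a position in that string. The sketch below is an illustration under that reading; `resolve_level` and `taxon_at_level` are hypothetical names, not functions from the script.

```python
# Level numbers as listed above; lineage order as in the taxonomy string.
RANK_BY_LEVEL = {
    1: "species", 2: "genus", 3: "family", 4: "order",
    5: "class", 6: "phylum", 7: "domain",
}
LEVEL_BY_NAME = {name: level for level, name in RANK_BY_LEVEL.items()}
LINEAGE_ORDER = ("domain", "phylum", "class", "order", "family", "genus", "species")


def resolve_level(level):
    """Accept 3, "3", or "family" and return the integer level."""
    text = str(level).strip().lower()
    number = int(text) if text.isdigit() else LEVEL_BY_NAME.get(text)
    if number not in RANK_BY_LEVEL:
        raise ValueError(f"unknown taxonomy level: {level!r}")
    return number


def taxon_at_level(lineage, level):
    """Return the name at the given level from a '|'-separated lineage."""
    rank = RANK_BY_LEVEL[resolve_level(level)]
    return lineage.split("|")[LINEAGE_ORDER.index(rank)]


lineage = (
    "Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|"
    "Enterobacteriaceae|Escherichia|Escherichia coli"
)
family = taxon_at_level(lineage, 3)       # "Enterobacteriaceae"
genus = taxon_at_level(lineage, "genus")  # "Escherichia"
```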
Here are the clustering statistics for the test dataset at different thresholds:
| Threshold | Number of Clusters | Number of Singletons | Average Cluster Size | Largest Cluster Size |
|---|---|---|---|---|
| 0.99 | 179 | 178 | 1.01 | 2 |
| 0.95 | 174 | 168 | 1.03 | 2 |
| 0.90 | 171 | 162 | 1.05 | 2 |
| 0.80 | 161 | 146 | 1.12 | 4 |
| 0.70 | 153 | 133 | 1.18 | 4 |
As expected, lowering the threshold results in fewer clusters and more sequences per cluster.
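The columns of the table can be derived from a list of clusters as in the sketch below. `cluster_stats` is a hypothetical helper, and rounding the average to two decimals is an assumption based on how the table is displayed. The toy input reproduces the composition implied by the 0.99 row (178 singletons plus one pair); the sequence IDs are placeholders.

```python
def cluster_stats(clusters):
    """Summarize a list of clusters, each a list of sequence IDs."""
    sizes = [len(cluster) for cluster in clusters]
    return {
        "n_clusters": len(sizes),
        "n_singletons": sum(1 for size in sizes if size == 1),
        "avg_size": round(sum(sizes) / len(sizes), 2) if sizes else 0.0,
        "largest": max(sizes, default=0),
    }


# 178 singletons and one cluster of two sequences, as implied by the 0.99 row.
toy = [[f"seq{i}"] for i in range(178)] + [["seq_a", "seq_b"]]
stats = cluster_stats(toy)
# stats == {"n_clusters": 179, "n_singletons": 178, "avg_size": 1.01, "largest": 2}
```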
This project is licensed under the MIT License.
- NeLLi Team
- Frederik Schulz (fschulz@lbl.gov)
For questions or support, please open an issue on GitHub.