A comprehensive genome assembly refinement pipeline that automatically merges and refines contigs from next-generation assemblers (Flye, Canu, etc.) using reference-guided approaches and ultra-long reads.
AssemblMore enhances genome assemblies through a multi-step pipeline that:
- Maps assembly contigs to reference genomes for optimal placement
- Uses spanning ultra-long reads to merge and extend contigs
- Fills gaps with high-quality sequence data
- Provides comprehensive quality assessment and comparative analysis
- In future iterations, will attempt to 'un-collapse' complex genomic regions such as rRNA arrays and telomeres (planned)
- Reference-guided contig placement and orientation
- Ultra-long read spanning for contig merging
- Intelligent gap filling with quality filtering
- Telomere extension and complex region handling
- Comprehensive assembly statistics (N50, auN, contig count)
- Visual comparative analysis with NX plots
- Coverage analysis and quality metrics
- Before/after improvement comparisons
- Automated dependency checking and installation
- Flexible parameter configuration
- Comprehensive error handling and logging
- Support for various input formats (FASTA/FASTQ)
- Issues metric calculation with cluster identification
- Length distribution analysis
- Coverage depth assessment
- R-based statistical visualization
- Clone the repository:
git clone https://github.com/TejusK123/assemblmore.git
cd assemblmore
- Install dependencies:
chmod +x install_dependencies.sh
./install_dependencies.sh
- Install paftools.js (Required):
# Clone minimap2 repository to get paftools.js
cd -
git clone https://github.com/lh3/minimap2.git
# Add paftools.js to your PATH
export PATH="$(pwd)/minimap2/misc:$PATH"
# Make permanent by adding to ~/.bashrc or ~/.zshrc
echo 'export PATH="'$(pwd)'/minimap2/misc:$PATH"' >> ~/.bashrc # or ~/.zshrc
- Add to PATH (optional):
# Add to your ~/.bashrc or ~/.zshrc
export PATH="$HOME/path/to/assemblmore/assemblmore/src:$PATH"
# Run the pipeline
assemblmore reference.fasta assembly.fasta reads.fastq
New to AssemblMore? Try our complete example first:
cd assemblmore/examples && ./run_example.sh
We provide a complete example dataset to test AssemblMore and demonstrate its capabilities:
# Navigate to examples directory
cd assemblmore/examples
# Install gdown for dataset download
pip install gdown
# Run the complete example
chmod +x run_example.sh
./run_example.sh
Dataset details:
- Organism: Caenorhabditis briggsae AF16 strain
- Sequencing: MinION R10.4 Nanopore (Rog Lab)
- Reference: C. briggsae QX1410 (PRJNA784955, WBPS19)
- Assembly: Flye-generated initial assembly
The example demonstrates typical improvements:
- Improved N50 and contiguity
- Corrected contig orientation
- Gap filling with spanning reads
- Comprehensive quality assessment
See assemblmore/examples/README.md for detailed documentation, troubleshooting, and adding custom examples.
The pipeline requires the following tools:
- minimap2 - For sequence alignment
- samtools - For SAM/BAM file manipulation
- paftools.js - For PAF format conversion
- seqkit - For FASTA/FASTQ manipulation
- python (3.6+) - For analysis scripts
- R (optional) - For enhanced statistics and visualization
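Before running the pipeline, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not the logic of install_dependencies.sh. The executable names `python3` and `Rscript` are assumptions about how Python and R are exposed on your system, and `find_missing` is a hypothetical helper.

```python
import shutil

# Tool names taken from the dependency list above.
# "python3" and "Rscript" are assumed executable names for Python and R.
REQUIRED_TOOLS = ["minimap2", "samtools", "paftools.js", "seqkit", "python3"]
OPTIONAL_TOOLS = ["Rscript"]


def find_missing(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing_required = find_missing(REQUIRED_TOOLS)
    missing_optional = find_missing(OPTIONAL_TOOLS)
    if missing_required:
        print("Missing required tools:", ", ".join(missing_required))
    else:
        print("All required tools found.")
    if missing_optional:
        print("Missing optional tools:", ", ".join(missing_optional))
```

The `which` argument is injectable only so the lookup can be tested without the tools installed.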
pip install numpy pandas biopython networkx more_itertools click
install.packages(c("ggplot2", "dplyr", "readr", "scales", "dbscan"))
Use the provided script to install all dependencies:
cd assemblmore
./install_dependencies.sh
paftools.js is a critical dependency that is not included in binary distributions of minimap2 from package managers (conda, brew, apt, etc.). You must install it separately:
# Clone the minimap2 repository to get paftools.js
git clone https://github.com/lh3/minimap2.git
# Add the misc directory to your PATH
export PATH="$(pwd)/minimap2/misc:$PATH"
# Make the PATH change permanent (choose your shell)
echo 'export PATH="'$(pwd)'/minimap2/misc:$PATH"' >> ~/.bashrc # For bash
echo 'export PATH="'$(pwd)'/minimap2/misc:$PATH"' >> ~/.zshrc # For zsh
Why this is necessary:
- Package managers (conda, brew, apt) only install the minimap2 binary
- paftools.js and other utility scripts are not included in these distributions
- The pipeline requires paftools.js for PAF format conversion
- Manual installation from source includes these scripts in the misc/ directory
Verification:
# Restart your terminal or source your shell config
source ~/.bashrc # or source ~/.zshrc
# Verify paftools.js is accessible
which paftools.js
paftools.js --help
Alternative locations (if you already have minimap2 source):
# If you previously compiled minimap2 from source
export PATH="/path/to/your/minimap2/misc:$PATH"
# If running from the src directory:
./assemblmore_pipeline.sh <reference_genome.fasta> <assembly.fasta> <reads.fastq> [OPTIONS]
# If you added src to PATH (recommended):
assemblmore <reference_genome.fasta> <assembly.fasta> <reads.fastq> [OPTIONS]
- reference_genome.fasta - Reference genome in FASTA format
- assembly.fasta - Initial assembly to improve
- reads.fastq - Raw sequencing reads (FASTQ/FASTA)
| Parameter | Default | Description |
|---|---|---|
| --expected_telomere_length | 8000 | Expected telomere length |
| --phred_threshold | 20 | Quality threshold for reads |
| --length_threshold | 0 | Minimum contig length |
| --output_dir | assemblmore_output | Output directory |
| --skip_bam | false | Skip BAM generation (PAF only) |
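To make the effect of `--phred_threshold` concrete, the sketch below filters FASTQ reads by quality. It rests on assumptions: Phred+33 encoding, and the arithmetic mean of per-base scores as the read-level quality. The table above does not say how AssemblMore aggregates quality, so treat `mean_phred`, `read_fastq`, and `filter_reads` as hypothetical helpers, not the pipeline's own code.

```python
from itertools import islice


def mean_phred(quality, offset=33):
    """Arithmetic mean of per-base Phred scores (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input or truncated record
            return
        yield record[0][1:], record[1], record[3]


def filter_reads(lines, phred_threshold=20):
    """Keep reads whose mean Phred score meets the threshold (assumed rule)."""
    return [
        (name, seq)
        for name, seq, qual in read_fastq(lines)
        if mean_phred(qual) >= phred_threshold
    ]


# 'I' encodes Q40 and '+' encodes Q10 under Phred+33.
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
           "@read2\n", "ACGT\n", "+\n", "++++\n"]
print(filter_reads(example))  # [('read1', 'ACGT')]
```

Averaging Phred scores directly is not the same as averaging error probabilities. The latter is stricter for reads with a few very poor bases.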
# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq
# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq
# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq \
--expected_telomere_length 10000 \
--phred_threshold 25 \
--output_dir my_results
# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq \
--expected_telomere_length 10000 \
--phred_threshold 25 \
  --output_dir my_results
# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq --skip_bam
# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq --skip_bam
# From src directory:
./assemblmore_pipeline.sh --output_dir results reference.fasta --phred_threshold 30 assembly.fasta reads.fastq
# If added to PATH:
assemblmore --output_dir results reference.fasta --phred_threshold 30 assembly.fasta reads.fastq
- Maps assembly contigs to reference genome using minimap2
- Determines optimal contig placement and orientation
- Filters contigs based on quality thresholds
- Analyzes mapping results to determine contig order
- Handles complex regions and overlapping mappings
- Creates filtered contig placement file
- Generates improved assembly with corrected contig order
- Applies reverse complementation where needed
- Creates intermediate refined assembly
- Maps reads to refined assembly
- Identifies and fills gaps with high-quality sequences
- Extends contigs using spanning reads
- Generates comprehensive assembly statistics
- Creates comparative analysis (before/after)
- Produces visualization plots and coverage analysis
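The placement, ordering, and orientation steps above can be illustrated with a small sketch that works on PAF records, the format minimap2 emits. This is not AssemblMore's implementation. It assumes the simplest possible rule: keep the alignment with the most matching bases for each contig, sort by reference position, and reverse-complement contigs on the minus strand. The names `Placement`, `best_placements`, and `order_and_orient` are hypothetical.

```python
from collections import namedtuple

# Only the PAF columns needed for this sketch:
# 1 query name, 5 strand, 6 target name, 8 target start, 10 matching bases.
Placement = namedtuple("Placement", "contig target start strand matches")

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_placements(paf_lines):
    """For each contig, keep the alignment with the most matching bases."""
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        hit = Placement(fields[0], fields[5], int(fields[7]),
                        fields[4], int(fields[9]))
        if hit.contig not in best or hit.matches > best[hit.contig].matches:
            best[hit.contig] = hit
    return best


def order_and_orient(paf_lines, contigs):
    """Return (contig, target, sequence) tuples in reference order."""
    placements = sorted(best_placements(paf_lines).values(),
                        key=lambda p: (p.target, p.start))
    ordered = []
    for p in placements:
        seq = contigs[p.contig]
        if p.strand == "-":
            seq = reverse_complement(seq)
        ordered.append((p.contig, p.target, seq))
    return ordered
```

A real pipeline also has to handle contigs that map to several places or overlap one another, as the steps above note. This sketch ignores those cases.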
The pipeline generates numerous output files in the specified output directory:
- assemblmore_final_assembly.fasta - Final improved assembly
- final_assembly_coverage.txt - Coverage analysis
- assembly_comparison.csv - Comparative statistics

- *_mapped_contigs.paf - Mapping alignments
- filtered_*_contigs.tsv - Filtered contig placements
- ordered_and_oriented_*.fasta - Intermediate assemblies
- *.sam/*.bam - Alignment files (if not skipped)

- *_nx_plot.png - NX curve comparisons
- *_length_distribution.png - Contig length distributions
- *_assembly_stats.csv - Detailed statistics
- Various R-generated plots and analysis files
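The N50, NX, and auN values behind the statistics and NX plots can be computed from contig lengths alone. The sketch below uses the standard definitions and is not taken from assembly_stats.R. The function names `nx` and `aun` are hypothetical.

```python
def nx(lengths, x=50):
    """NX: the length of the contig at which the running total of
    contigs, taken longest first, reaches x% of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # empty input


def aun(lengths):
    """auN: area under the NX curve, i.e. sum(L_i^2) / sum(L_i)."""
    total = sum(lengths)
    return sum(length * length for length in lengths) / total if total else 0.0


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(nx(contig_lengths, 50))  # 70
print(nx(contig_lengths, 90))  # 30
print(aun(contig_lengths))     # 56.0
```

Unlike N50, auN changes whenever any contig is merged or extended, so it is often the more sensitive measure for before/after comparisons.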
Explore AssemblMore capabilities using our provided datasets:
cd assemblmore/examples
./run_example.sh # C. briggsae AF16 example
The examples demonstrate advanced features like:
- Reference-guided assembly improvement
- Telomere extension with long reads
- Comprehensive quality assessment
- Issues metric calculation with coverage analysis
The pipeline automatically compares original and improved assemblies:
# Results include side-by-side statistics
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq
# Compare multiple assemblies
Rscript assembly_stats.R \
original.fasta:Original \
improved.fasta:Improved \
polished.fasta:Polished \
  output_directory
# Generate detailed statistics for any assembly
Rscript assembly_stats.R assembly.fasta:MyAssembly output_dir
- Large assemblies may require substantial memory
- Consider using --skip_bam for faster processing
- Monitor system resources during execution
- Runtime scales with assembly size and read count
- Typical runtime: 30 minutes to several hours
- Use multiple cores where possible (minimap2 threading)
- Best results with high-quality reference genomes
- Optimal for assemblies with N50 ≥ 45kb
- Long-read data (ONT/PacBio) recommended
- Missing dependencies:
  ./install_dependencies.sh
- Permission errors:
  chmod +x *.sh
- Path issues:
  # Use absolute paths for inputs
  ./assemblmore_pipeline.sh /full/path/to/reference.fasta ...
- Memory errors:
  - Use smaller datasets for testing
  - Enable --skip_bam to reduce memory usage
  - Monitor system resources
Enable verbose mode for detailed logging:
./assemblmore_pipeline.sh -v reference.fasta assembly.fasta reads.fastq
Keep intermediate files for debugging:
./assemblmore_pipeline.sh -k reference.fasta assembly.fasta reads.fastq
All testing was done with relatively good datasets (N50 >= 45kb) on hermaphroditic or inbred worms.
When adding the src directory to your PATH, do NOT write the path with ~/: the shell does not expand a tilde inside a quoted string, so the entry would not resolve. Use $HOME instead.
- Fork the repository
- Create a feature branch (
git checkout -b feature/improvement) - Commit your changes (
git commit -am 'Add new feature') - Push to the branch (
git push origin feature/improvement) - Create a Pull Request
This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details.
If you use AssemblMore in your research, please cite:
AssemblMore: A comprehensive genome assembly improvement pipeline
GitHub: https://github.com/TejusK123/assemblmore
- Issues: Report bugs and request features on GitHub Issues
- Documentation: Check the
src/README.mdfor detailed technical documentation - Examples: See
assemblmore/examples/README.mdfor complete example datasets and tutorials - Quick Start: Run
assemblmore/examples/run_example.shfor a hands-on demonstration