AssemblMore

A genome assembly refinement pipeline that automatically merges and refines contigs from long-read assemblers (Flye, Canu, etc.) using reference-guided placement and ultra-long reads.

Overview

AssemblMore enhances genome assemblies through a multi-step pipeline that:

Maps assembly contigs to reference genomes for optimal placement
Uses spanning ultra-long reads to merge and extend contigs
Fills gaps with high-quality sequence data
Provides comprehensive quality assessment and comparative analysis
Future versions will attempt to 'un-collapse' complex genomic regions such as rRNA arrays and telomeres.

Features

🧬 Assembly Improvement

Reference-guided contig placement and orientation
Ultra-long read spanning for contig merging
Intelligent gap filling with quality filtering
Telomere extension and complex region handling
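
As an illustration of the quality filtering listed above, here is a minimal sketch of a mean-quality check against a Phred threshold. It assumes Phred+33 FASTQ quality strings and averages per-base error probabilities; the function names are hypothetical and the pipeline's own filter may work differently.

```python
import math

def mean_phred(quality_string, offset=33):
    """Mean read quality, computed by averaging per-base error probabilities.

    Assumes Phred+33 encoding (hypothetical helper, not the pipeline's code).
    """
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)

def passes_quality(quality_string, phred_threshold=20):
    """True if the read's mean quality meets the threshold (default 20)."""
    return mean_phred(quality_string) >= phred_threshold
```

Averaging error probabilities rather than raw scores means a few very poor bases pull the mean down sharply, which is usually the desired behavior when selecting reads for gap filling.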

📊 Quality Assessment

Comprehensive assembly statistics (N50, auN, contig count)
Visual comparative analysis with NX plots
Coverage analysis and quality metrics
Before/after improvement comparisons

🔧 Robust Pipeline

Automated dependency checking and installation
Flexible parameter configuration
Comprehensive error handling and logging
Support for various input formats (FASTA/FASTQ)

📈 Advanced Analytics

Issues metric calculation with cluster identification
Length distribution analysis
Coverage depth assessment
R-based statistical visualization

Quick Start

Installation

Clone the repository:

git clone https://github.com/TejusK123/assemblmore.git
cd assemblmore

Install dependencies:

chmod +x install_dependencies.sh
./install_dependencies.sh

Install paftools.js (Required):

# Leave the assemblmore directory, then clone minimap2 to get paftools.js
cd ..
git clone https://github.com/lh3/minimap2.git
# Add paftools.js to your PATH
export PATH="$(pwd)/minimap2/misc:$PATH"
# Make permanent by adding to ~/.bashrc or ~/.zshrc
echo "export PATH=\"$(pwd)/minimap2/misc:\$PATH\"" >> ~/.bashrc  # or ~/.zshrc

Add to PATH (optional):

# Add to your ~/.bashrc or ~/.zshrc
export PATH="$HOME/path/to/assemblmore/assemblmore/src:$PATH"

Basic Usage

# Run the pipeline
assemblmore reference.fasta assembly.fasta reads.fastq

New to AssemblMore? Try our complete example first:

cd assemblmore/examples && ./run_example.sh

Example Dataset

We provide a complete example dataset to test AssemblMore and demonstrate its capabilities:

C. briggsae AF16 Example

# Navigate to examples directory
cd assemblmore/examples

# Install gdown for dataset download
pip install gdown

# Run the complete example
chmod +x run_example.sh
./run_example.sh

Dataset details:

Organism: Caenorhabditis briggsae AF16 strain
Sequencing: Nanopore MinION, R10.4 (Rog Lab)
Reference: C. briggsae QX1410 (PRJNA784955, WBPS19)
Assembly: Flye-generated initial assembly

The example demonstrates typical improvements:

Improved N50 and contiguity
Corrected contig orientation
Gap filling with spanning reads
Comprehensive quality assessment

See assemblmore/examples/README.md for detailed documentation, troubleshooting, and adding custom examples.

Installation

Prerequisites

The pipeline requires the following tools:

minimap2 - For sequence alignment
samtools - For SAM/BAM file manipulation
paftools.js - For PAF format conversion
seqkit - For FASTA/FASTQ manipulation
Python (3.6+) - For analysis scripts
R (optional) - For enhanced statistics and visualization
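
Before a long run, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch using only the Python standard library; the tool names come from the list above, `Rscript` is assumed to be the R entry point, and the helper name is hypothetical. It does not replace install_dependencies.sh.

```python
import shutil

# Executables named in the prerequisites list above.
REQUIRED_TOOLS = ["minimap2", "samtools", "paftools.js", "seqkit"]
OPTIONAL_TOOLS = ["Rscript"]  # assumption: R is invoked through Rscript

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing required tool: {tool}")
    for tool in missing_tools(OPTIONAL_TOOLS):
        print(f"missing optional tool: {tool}")
```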

Python Dependencies

pip install numpy pandas biopython networkx more_itertools click

R Dependencies (Optional)

install.packages(c("ggplot2", "dplyr", "readr", "scales", "dbscan"))

Automated Installation

Use the provided script to install all dependencies:

cd assemblmore
./install_dependencies.sh

Important: paftools.js Setup

paftools.js is a critical dependency that is not included in binary distributions of minimap2 from package managers (conda, brew, apt, etc.). You must install it separately:

# Clone the minimap2 repository to get paftools.js
git clone https://github.com/lh3/minimap2.git

# Add the misc directory to your PATH
export PATH="$(pwd)/minimap2/misc:$PATH"

# Make the PATH change permanent (run only the line for your shell)
echo "export PATH=\"$(pwd)/minimap2/misc:\$PATH\"" >> ~/.bashrc    # For bash
echo "export PATH=\"$(pwd)/minimap2/misc:\$PATH\"" >> ~/.zshrc     # For zsh

Why this is necessary:

Package managers (conda, brew, apt) only install the minimap2 binary
paftools.js and other utility scripts are not included in these distributions
The pipeline requires paftools.js for PAF format conversion
Manual installation from source includes these scripts in the misc/ directory

Verification:

# Restart your terminal or source your shell config
source ~/.bashrc  # or source ~/.zshrc

# Verify paftools.js is accessible
which paftools.js
paftools.js --help

Alternative locations (if you already have minimap2 source):

# If you previously compiled minimap2 from source
export PATH="/path/to/your/minimap2/misc:$PATH"

Usage

Command Line Interface

# If running from the src directory:
./assemblmore_pipeline.sh <reference_genome.fasta> <assembly.fasta> <reads.fastq> [OPTIONS]

# If you added src to PATH (recommended):
assemblmore <reference_genome.fasta> <assembly.fasta> <reads.fastq> [OPTIONS]

Required Arguments

reference_genome.fasta - Reference genome in FASTA format
assembly.fasta - Initial assembly to improve
reads.fastq - Raw sequencing reads (FASTQ/FASTA)
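
A quick sanity check on the three inputs can catch swapped or missing files early. This is a minimal sketch, assuming uncompressed text files; it relies only on the fact that FASTA records start with `>` and FASTQ records start with `@`. The function names are hypothetical and not part of AssemblMore.

```python
from pathlib import Path

def sniff_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text file."""
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break
    raise ValueError(f"{path} does not look like FASTA or FASTQ")

def check_inputs(reference, assembly, reads):
    """Verify the inputs exist and that reference and assembly are FASTA."""
    formats = {}
    for label, path in (("reference", reference),
                        ("assembly", assembly),
                        ("reads", reads)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"{label} file not found: {path}")
        formats[label] = sniff_format(path)
    for label in ("reference", "assembly"):
        if formats[label] != "fasta":
            raise ValueError(f"{label} must be FASTA, got {formats[label]}")
    return formats
```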

Optional Parameters

| Parameter | Default | Description |
|---|---|---|
| `--expected_telomere_length` | 8000 | Expected telomere length |
| `--phred_threshold` | 20 | Quality threshold for reads |
| `--length_threshold` | 0 | Minimum contig length |
| `--output_dir` | assemblmore_output | Output directory |
| `--skip_bam` | false | Skip BAM generation (PAF only) |
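
For batch runs, the options above can be assembled into a command line programmatically. The sketch below is a hypothetical wrapper, not part of AssemblMore: it uses only the flags documented in the table and assumes `assemblmore` is on your PATH (pass `executable="./assemblmore_pipeline.sh"` otherwise).

```python
def build_command(reference, assembly, reads, *,
                  expected_telomere_length=None, phred_threshold=None,
                  length_threshold=None, output_dir=None, skip_bam=False,
                  executable="assemblmore"):
    """Build an argument list for the pipeline from the documented options.

    Options left as None are omitted so the pipeline's defaults apply.
    """
    cmd = [executable, str(reference), str(assembly), str(reads)]
    valued_flags = {
        "--expected_telomere_length": expected_telomere_length,
        "--phred_threshold": phred_threshold,
        "--length_threshold": length_threshold,
        "--output_dir": output_dir,
    }
    for flag, value in valued_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    if skip_bam:
        cmd.append("--skip_bam")  # bare flag, no value
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.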

Usage Examples

Basic assembly improvement:

# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq

# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq

With custom parameters:

# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq \
    --expected_telomere_length 10000 \
    --phred_threshold 25 \
    --output_dir my_results

# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq \
    --expected_telomere_length 10000 \
    --phred_threshold 25 \
    --output_dir my_results

Fast processing (skip BAM generation):

# From src directory:
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq --skip_bam

# If added to PATH:
assemblmore reference.fasta assembly.fasta reads.fastq --skip_bam

Arguments in any order:

# From src directory:
./assemblmore_pipeline.sh --output_dir results reference.fasta --phred_threshold 30 assembly.fasta reads.fastq

# If added to PATH:
assemblmore --output_dir results reference.fasta --phred_threshold 30 assembly.fasta reads.fastq

Pipeline Steps

1. Reference Mapping

Maps assembly contigs to reference genome using minimap2
Determines optimal contig placement and orientation
Filters contigs based on quality thresholds

2. Contig Placement

Analyzes mapping results to determine contig order
Handles complex regions and overlapping mappings
Creates filtered contig placement file
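
To make the mapping and placement steps concrete, here is a minimal sketch of deriving contig order and strand from minimap2 PAF output. It assumes a simple rule (each contig goes to the reference sequence and strand with the most matching bases, then contigs are sorted by reference start); the pipeline's actual placement logic may differ, and `place_contigs` is a hypothetical name.

```python
from collections import defaultdict

def place_contigs(paf_lines):
    """Return (target, target_start, contig, strand) tuples in reference order.

    PAF columns used: 1 query name, 5 strand, 6 target name,
    8 target start, 10 number of matching bases.
    """
    # support[contig][(target, strand)] = (total matches, leftmost target start)
    support = defaultdict(dict)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        contig, strand, target = fields[0], fields[4], fields[5]
        target_start, n_match = int(fields[7]), int(fields[9])
        total, first = support[contig].get((target, strand), (0, target_start))
        support[contig][(target, strand)] = (total + n_match,
                                             min(first, target_start))
    placements = []
    for contig, options in support.items():
        # Keep the (target, strand) pair with the most matching bases.
        (target, strand), (_, first) = max(options.items(),
                                           key=lambda item: item[1][0])
        placements.append((target, first, contig, strand))
    return sorted(placements)
```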

3. Assembly Refinement

Generates improved assembly with corrected contig order
Applies reverse complementation where needed
Creates intermediate refined assembly
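
The reordering and reverse-complementation described above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (contigs held in memory as strings, placements given as name and strand pairs); the helper names are hypothetical.

```python
# Translation table for DNA bases; N is kept as N, case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

def order_and_orient(contigs, placements):
    """Yield (name, sequence) in placement order, flipping '-' strand contigs.

    contigs: dict mapping contig name to sequence.
    placements: iterable of (name, strand) pairs in reference order.
    """
    for name, strand in placements:
        sequence = contigs[name]
        if strand == "-":
            sequence = reverse_complement(sequence)
        yield name, sequence
```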

4. Gap Filling

Maps reads to refined assembly
Identifies and fills gaps with high-quality sequences
Extends contigs using spanning reads
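
One way to picture a spanning read is a read that aligns near the end of one contig and near the start of the next. The sketch below finds such reads in read-to-assembly PAF records. It assumes the contigs are already ordered and oriented, and the `max_end_distance` cutoff and function name are hypothetical, not pipeline parameters.

```python
def spanning_reads(paf_lines, left, right, max_end_distance=1000):
    """Return reads that align near the end of `left` and the start of `right`.

    PAF columns used: 1 read name, 6 target name, 7 target length,
    8 target start, 9 target end.
    """
    touches_left_end = set()
    touches_right_start = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        read, target = fields[0], fields[5]
        target_len = int(fields[6])
        target_start, target_end = int(fields[7]), int(fields[8])
        if target == left and target_len - target_end <= max_end_distance:
            touches_left_end.add(read)
        elif target == right and target_start <= max_end_distance:
            touches_right_start.add(read)
    # A spanning read must touch both contig ends.
    return sorted(touches_left_end & touches_right_start)
```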

5. Quality Assessment

Generates comprehensive assembly statistics
Creates comparative analysis (before/after)
Produces visualization plots and coverage analysis
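
For reference, the contiguity statistics reported in this step follow standard definitions: Nx is the length of the shortest contig in the smallest set of longest contigs covering at least x% of the assembly, and auN is the sum of squared contig lengths divided by the total length. The sketch below computes both from a list of contig lengths; it is independent of the pipeline's R scripts, and the function names are hypothetical.

```python
def nx(lengths, x=50):
    """Nx statistic (N50 when x=50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0  # empty input

def aun(lengths):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i)."""
    total = sum(lengths)
    return sum(length * length for length in lengths) / total if total else 0.0

if __name__ == "__main__":
    before = [100, 80, 60, 40, 20]
    after = [180, 60, 40, 20]  # the two longest contigs merged
    print("N50 before:", nx(before), "after:", nx(after))
    print("auN before:", round(aun(before), 1), "after:", round(aun(after), 1))
```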

Output Files

The pipeline generates numerous output files in the specified output directory:

Primary Outputs

assemblmore_final_assembly.fasta - Final improved assembly
final_assembly_coverage.txt - Coverage analysis
assembly_comparison.csv - Comparative statistics

Intermediate Files

*_mapped_contigs.paf - Mapping alignments
filtered_*_contigs.tsv - Filtered contig placements
ordered_and_oriented_*.fasta - Intermediate assemblies
*.sam/*.bam - Alignment files (if not skipped)

Analysis and Visualization

*_nx_plot.png - NX curve comparisons
*_length_distribution.png - Contig length distributions
*_assembly_stats.csv - Detailed statistics
Various R-generated plots and analysis files
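
To locate these files after a run, the name patterns above can be matched with `pathlib`. This is a minimal sketch that assumes all outputs sit directly in the output directory (default `assemblmore_output`); the grouping and function name are hypothetical.

```python
from pathlib import Path

# File name patterns taken from the lists above.
OUTPUT_PATTERNS = {
    "primary": ["assemblmore_final_assembly.fasta",
                "final_assembly_coverage.txt",
                "assembly_comparison.csv"],
    "intermediate": ["*_mapped_contigs.paf", "filtered_*_contigs.tsv",
                     "ordered_and_oriented_*.fasta", "*.sam", "*.bam"],
    "analysis": ["*_nx_plot.png", "*_length_distribution.png",
                 "*_assembly_stats.csv"],
}

def collect_outputs(output_dir="assemblmore_output"):
    """Return a dict mapping each group to the sorted file names found."""
    root = Path(output_dir)
    return {
        group: sorted(path.name
                      for pattern in patterns
                      for path in root.glob(pattern))
        for group, patterns in OUTPUT_PATTERNS.items()
    }
```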

Advanced Features

Working with Examples

Explore AssemblMore capabilities using our provided datasets:

cd assemblmore/examples
./run_example.sh  # C. briggsae AF16 example

The examples demonstrate advanced features like:

Reference-guided assembly improvement
Telomere extension with long reads
Comprehensive quality assessment
Issues metric calculation with coverage analysis

Comparative Analysis

The pipeline automatically compares original and improved assemblies:

# Results include side-by-side statistics
./assemblmore_pipeline.sh reference.fasta assembly.fasta reads.fastq

Manual Comparative Analysis

# Compare multiple assemblies
Rscript assembly_stats.R \
    original.fasta:Original \
    improved.fasta:Improved \
    polished.fasta:Polished \
    output_directory

Custom Assembly Statistics

# Generate detailed statistics for any assembly
Rscript assembly_stats.R assembly.fasta:MyAssembly output_dir

Performance and Optimization

Memory Usage

Large assemblies may require substantial memory
Consider using --skip_bam for faster processing
Monitor system resources during execution

Processing Time

Runtime scales with assembly size and read count
Typical runtime: 30 minutes to several hours
Use multiple cores where possible (minimap2 threading)

Quality Considerations

Best results with high-quality reference genomes
Optimal for assemblies with N50 ≥ 45kb
Long-read data (ONT/PacBio) recommended

Troubleshooting

Common Issues

Missing dependencies:
```
./install_dependencies.sh
```
Permission errors:
```
chmod +x *.sh
```

Path issues:

# Use absolute paths for inputs
./assemblmore_pipeline.sh /full/path/to/reference.fasta ...

Memory errors:
- Use smaller datasets for testing
- Enable --skip_bam to reduce memory usage
- Monitor system resources

Debugging

Enable verbose mode for detailed logging:

./assemblmore_pipeline.sh -v reference.fasta assembly.fasta reads.fastq

Keep intermediate files for debugging:

./assemblmore_pipeline.sh -k reference.fasta assembly.fasta reads.fastq

NOTE

All testing was done on relatively contiguous datasets (N50 ≥ 45 kb) from hermaphroditic or inbred worms. Performance on more fragmented assemblies or on other organisms has not been evaluated.

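When adding the src directory to your PATH, do NOT write the path with ~/. The shell does not expand a tilde inside double quotes, so the entry would be stored literally and the command would not be found. Use $HOME instead.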
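
If `assemblmore` is not found after editing your shell configuration, a literal tilde in PATH is one thing to rule out. The following minimal sketch lists PATH entries that still begin with `~`; the function name is hypothetical.

```python
import os

def unexpanded_path_entries(path_value=None):
    """Return PATH entries that begin with a literal, unexpanded '~'."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(os.pathsep)
            if entry.startswith("~")]

if __name__ == "__main__":
    for entry in unexpanded_path_entries():
        print(f"PATH entry with unexpanded tilde: {entry}")
```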

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/improvement)
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/improvement)
Create a Pull Request

License

This project is licensed under the GNU General Public License, Version 3. See the LICENSE file for details.

Citation

If you use AssemblMore in your research, please cite:

AssemblMore: A comprehensive genome assembly improvement pipeline
GitHub: https://github.com/TejusK123/assemblmore

Support

Issues: Report bugs and request features on GitHub Issues
Documentation: Check the src/README.md for detailed technical documentation
Examples: See assemblmore/examples/README.md for complete example datasets and tutorials
Quick Start: Run assemblmore/examples/run_example.sh for a hands-on demonstration
