Pipeline for long-read assembly, genome annotation (repeats, TEs, telomeres, rRNA/mtDNA), quality control, and targeted tgrBC locus analysis in Dictyostelium genomes.


teralevin/dicty_genomes


Dictyostelium Genome Annotation, Assembly & tgr Locus Analysis

This repository contains a collection of scripts and workflows to:

  1. Annotate new Dictyostelium genome assemblies (tandem repeats, telomeres, transposable elements, rRNA, and mtDNA)
  2. Assemble and polish genomes from long reads
  3. Assess assembly quality through variant calling and structural-error detection
  4. Extract and analyze the tgrBC self/non-self recognition locus, including its intersections with transposable elements and structural errors

Assembled genomes and annotations have been deposited in NCBI under:



Installation

Requirements

To use this repository, ensure the following requirements are met:

  • Operating System: Linux
  • Python Version: Python 3.10+ (tested on 3.12)
  • Python Packages: pandas, numpy
  • External Tools: SAMtools, Bedtools, Minimap2, BWA, BLAST+, TRF, CRAQ, BCFtools
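Before running any step, it can help to confirm that the external tools are visible on your `PATH`. The following is a minimal sketch, not part of this repository. The executable names (for example `blastn` for BLAST+, `trf` for TRF, and `craq` for CRAQ) are assumptions and may differ depending on how each tool was installed.

```python
import shutil

# Assumed executable names for the tools listed above; adjust to your installation.
REQUIRED_TOOLS = [
    "samtools", "bedtools", "minimap2", "bwa",
    "blastn", "trf", "craq", "bcftools",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```

Run it inside the activated environment so that conda-installed tools are included in the search.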

Steps

  1. Clone this repository:

    git clone https://github.com/MayarMAhmed/dicty_genome.git
    cd dicty_genome
  2. Install Python dependencies (conda is recommended):

    conda create -n dicty_env python=3.10 
    conda activate dicty_env
    conda install numpy pandas
  3. Install external tools via conda or your package manager.


Repository Structure

├── annotation
│   ├── 01_run_trf.sh                # Run Tandem Repeats Finder on genome FASTA
│   ├── 02_extract_telomere_motif.sh # Search for putative telomere motifs 
│   ├── 03_filter_telomeres.py       # Filter and process telomere motif hits
│   ├── 04_find_TE.sh                # Identify transposable elements (RepeatMasker/BLAST)
│   ├── 05_generate_TE_bedgraph.py   # Generate bedGraph of TE density
│   ├── 06_run_rrna_blast.sh         # Extract rRNA transcripts via BLAST
│   ├── 07_map_rrna_summary.py       # Summarize rRNA BLAST results
│   ├── 08_search_mtdna_dna.sh       # Identify mtDNA‐containing contigs with BLASTn
│   ├── 09_search_mtdna_prot.sh      # Identify mtDNA proteins with tblastn
│   ├── 10_parse_mtdna_prot.py       # Parse mtDNA protein BLAST output
│   └── contig_mapping.txt           # Contig name mapping for AX4 reference and TNS-C-14 
├── assembly                         # De novo assembly & polishing workflows
├── quality-control
│   ├── 01_variant_calling.sh        # Call variants with bcftools/GATK
│   └── 02_run_craq.sh               # Detect structural errors using CRAQ
└── tgr_locus
    ├── 01_tgr_extraction.py         # Extract tgrBC syntenic locus from GFF & bedGraph
    ├── 02_tgr_TE_intersection.py    # Intersect TEs with tgr loci (optional gene annotation)
    ├── 03_tgr_CSE_intersection.py   # Intersect tgr loci with CRAQ-identified structural errors
    └── tgrBClocus_wacA2chdB.bedgraph.txt  
                                    # BedGraph of tgrBC locus coordinates across species

Each subdirectory contains a driver script and helper scripts. You can execute steps individually or chain them into a pipeline.
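One way to chain steps is to rely on the numeric prefixes in the script names. The sketch below is a hypothetical helper, not a script shipped in this repository. It lists the numbered `.sh` and `.py` files in a subdirectory in step order and builds a command line for each. The arguments each script expects are not documented in this README, so the sketch only prints the commands; `ordered_steps` and `build_command` are illustrative names.

```python
import re
import sys
from pathlib import Path

# Matches names such as "01_run_trf.sh" or "10_parse_mtdna_prot.py".
STEP_PATTERN = re.compile(r"^(\d+)_.+\.(sh|py)$")


def ordered_steps(directory):
    """Return numbered step scripts in `directory`, sorted by numeric prefix."""
    steps = []
    for path in Path(directory).iterdir():
        match = STEP_PATTERN.match(path.name)
        if path.is_file() and match:
            steps.append((int(match.group(1)), path))
    return [path for _, path in sorted(steps)]


def build_command(script, extra_args=()):
    """Choose an interpreter from the file extension and build the argv list."""
    interpreter = "bash" if script.suffix == ".sh" else sys.executable
    return [interpreter, str(script), *extra_args]


# Print the planned commands for the annotation steps (assumes the
# current working directory is the repository root).
if Path("annotation").is_dir():
    for script in ordered_steps("annotation"):
        print(" ".join(build_command(script)))
```

To execute the steps rather than print them, each command could be passed to `subprocess.run(..., check=True)` once the required arguments for that script are known. Files without a numeric prefix, such as `contig_mapping.txt`, are skipped.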


Citation

If you use these scripts or the assemblies/annotations generated with them, please cite:

Holland M, Ahmed M, Young JM, Drurey JR, McFadyen S, Ostrowski EA, Levin TC. (2025) Hypermutable hotspot enables the rapid evolution of self/non-self recognition genes in Dictyostelium. Proc Natl Acad Sci U S A 122(51): e2520843122. https://doi.org/10.1073/pnas.2520843122
