This repository contains a collection of scripts and workflows to:
- Annotate new Dictyostelium genome assemblies (tandem repeats, telomeres, transposable elements, rRNA, and mtDNA)
- Assemble and polish long‐read genome assemblies
- Quality-control variant calls and structural errors
- Extract and analyze the tgrBC self/non-self recognition locus, including its intersections with transposable elements and structural errors
Assembled genomes and annotations have been deposited in NCBI and Zenodo under:
- BioProject: PRJNA1300491
- Zenodo: 10.5281/zenodo.17955877
To use this repository, ensure the following requirements are met:
- Operating System: Linux
- Python Version: Python 3.10+ (tested on 3.12)
- Python Packages: pandas, numpy
- External Tools: SAMtools, Bedtools, Minimap2, BWA, BLAST+, TRF, CRAQ, BCFtools
- Clone this repository:
    git clone https://github.com/MayarMAhmed/dicty_genome.git
    cd dicty_genome
- Install Python dependencies (recommended using conda):
    conda create -n dicty_env python=3.10
    conda activate dicty_env
    conda install numpy pandas
- Install external tools via conda or your package manager.
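Before running any workflow, it can help to confirm that the external tools listed above are actually reachable on your PATH. The following is a minimal sketch and not part of the repository: the executable names in `REQUIRED_TOOLS` are assumptions (for example, TRF is sometimes installed under a versioned binary name), and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names for the tools listed in the requirements.
# Adjust these if your installation uses different binary names.
REQUIRED_TOOLS = {
    "SAMtools": "samtools",
    "Bedtools": "bedtools",
    "Minimap2": "minimap2",
    "BWA": "bwa",
    "BLAST+": "blastn",
    "TRF": "trf",
    "CRAQ": "craq",
    "BCFtools": "bcftools",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the display names of tools whose executable is not on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found on PATH.")
```

Run it inside the activated `dicty_env` environment so that conda-installed tools are visible.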
├── annotation
│ ├── 01_run_trf.sh # Run Tandem Repeats Finder on genome FASTA
│ ├── 02_extract_telomere_motif.sh # Search for putative telomere motifs
│ ├── 03_filter_telomeres.py # Filter and process telomere motif hits
│ ├── 04_find_TE.sh # Identify transposable elements (RepeatMasker/BLAST)
│ ├── 05_generate_TE_bedgraph.py # Generate bedGraph of TE density
│ ├── 06_run_rrna_blast.sh # Extract rRNA transcripts via BLAST
│ ├── 07_map_rrna_summary.py # Summarize rRNA BLAST results
│ ├── 08_search_mtdna_dna.sh # Identify mtDNA‐containing contigs with BLASTn
│ ├── 09_search_mtdna_prot.sh # Identify mtDNA proteins with tblastn
│ ├── 10_parse_mtdna_prot.py # Parse mtDNA protein BLAST output
│ └── contig_mapping.txt # Contig name mapping for AX4 reference and TNS-C-14
├── assembly # De novo assembly & polishing workflows
├── quality-control
│ ├── 01_variant_calling.sh # Call variants with bcftools/GATK
│ └── 02_run_craq.sh # Detect structural errors using CRAQ
└── tgr_locus
  ├── 01_tgr_extraction.py # Extract tgrBC syntenic locus from GFF & bedGraph
  ├── 02_tgr_TE_intersection.py # Intersect TEs with tgr loci (optional gene annotation)
  ├── 03_tgr_CSE_intersection.py # Intersect tgr loci with CRAQ-identified structural errors
  └── tgrBClocus_wacA2chdB.bedgraph.txt # BedGraph of tgrBC locus coordinates across species
Each subdirectory contains a driver script and helper scripts. You can execute steps individually or chain them into a pipeline.
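One way to chain the steps is to order the scripts by their numeric prefix and run them in sequence. The sketch below only prints the commands (a dry run) and makes several assumptions: that the numeric prefixes reflect the intended execution order, that `.sh` scripts run under `bash` and `.py` scripts under `python`, and that the scripts need no arguments. The real scripts may require input paths, so check each one before executing. `ordered_steps` and `build_commands` are hypothetical helpers, not part of the repository.

```python
import re
from pathlib import Path

# Assumed interpreter for each script type.
INTERPRETERS = {".sh": "bash", ".py": "python"}


def ordered_steps(filenames):
    """Keep numbered .sh/.py scripts and sort them by numeric prefix."""
    steps = []
    for name in filenames:
        match = re.match(r"(\d+)_", name)
        if match and Path(name).suffix in INTERPRETERS:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def build_commands(directory, filenames):
    """Build one [interpreter, script_path] command per ordered step."""
    return [
        [INTERPRETERS[Path(name).suffix], str(Path(directory) / name)]
        for name in ordered_steps(filenames)
    ]


if __name__ == "__main__":
    names = ["02_run_craq.sh", "01_variant_calling.sh"]
    for cmd in build_commands("quality-control", names):
        # Dry run: print each command instead of executing it.
        print(" ".join(cmd))
```

To execute rather than print, each command list could be passed to `subprocess.run(cmd, check=True)` once the required arguments for that script are known.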
If you use these scripts or the assemblies/annotations generated with them, please cite:
Holland M, Ahmed M, Young JM, Drurey JR, McFadyen S, Ostrowski EA, Levin TC. (2025) Hypermutable hotspot enables the rapid evolution of self/non-self recognition genes in Dictyostelium. Proc Natl Acad Sci U S A 122(51): e2520843122. https://doi.org/10.1073/pnas.2520843122