subMG aids in the submission of metagenomic study data to the European Nucleotide Archive (ENA). The tool can be used to submit various combinations of samples, reads, (co-)assemblies, bins, and MAGs.
Files and metadata are entered in a single location, with no redundant data entry. subMG derives additional information where required, creates samplesheets and manifests, and uploads everything to the ENA.
A command line interface (CLI) is available as well as a graphical user interface (GUI), with both offering the same functionality.
You can use subMG to continue a submission process if some levels of data already exist in ENA (e.g. submitting samples and reads through the ENA web interface, then using subMG to submit the assembly and bins).
subMG was developed within the German NFDI4Microbiota consortium with the objective of simplifying FAIR data sharing in metagenomics and assisting microbiologists with the submission process.
Please Note
- Only for metagenomic datasets.
- Submit bins before MAGs (MAG ≠ metagenomic bin).
- Contact the ENA helpdesk before submitting third-party data.
- Report issues & suggest improvements on GitHub (see Support). We are happy to help!
- Installation
- Usage
- Taxonomy Assignment
- Quality Data
- MAG Submission
- Preventing Process Interruption
- Edge Cases
- Support
- Citation
There are three options for using subMG:
- Download the docker container (for CLI use only)
- Install from source (for CLI or GUI use, see below for instructions)
- Download the GUI executable (for Windows and most Linux distributions, see below for hints)
You will need the following software installed
- Python 3.8 or higher
- Java 17 or higher
- Python-Tk (optional, only needed for GUI)
- wheel for Python
To install subMG
- Download the software
  - either download a release and untar/unzip it
  - or clone the repository via:
    git clone https://github.com/metagenomics/submg/
- Navigate into the directory you just created
- Run
  python -m pip install .
- Run
  submg-cli download-webin
You can now use submg-cli to run the command-line interface or submg-gui to run the graphical user interface.
subMG requires Java 17 or higher for data submission. On Windows, you can use free options like Eclipse/Adoptium or Amazon/Corretto. On Debian-family Linux, you can install via sudo apt install openjdk-21-jre-headless.
Please reach out if you have trouble running the executable on your operating system.
subMG is intended to submit data related to a single (co-)assembly. All samples, sequencing runs, bins and MAGs specified in the config file will be associated with this assembly. If you want to submit data from multiple assemblies, you need to run subMG once for each assembly.
If you'd like to use the subMG CLI, consider following the tutorial.
ENA provides a development service to trial your submission before uploading your data to the production server. We strongly suggest submitting to the production server only after a test submission with identical parameters was successful. Otherwise, you might end up with incomplete or incorrect submissions, with no ability to correct or remove them. Unless --development-service 0 is specified, subMG will always submit to the test server.
Study is used synonymously with project here. Before you can submit data using subMG, you need to have a Study object (= a project) in your ENA account. If you intend to submit annotation data, you will also need a locus tag prefix. You can create both through the ENA webin portal on the production server or the development server. Be aware that if you create the Study object on the production server, it can take up to 24 hours until it is available on the development server. This can cause test submissions to fail.
A lot of (meta)data is required for a submission. To use subMG, you need to provide metadata and the locations of your files in a YAML document. Which information is required depends on the type of your submission. You can use the submg-cli makecfg command to create a template for your config file. It will contain only the fields necessary for your specific submission, along with explanations and examples. Additionally, the examples directory contains examples of config files and the associated data. If you are unsure of how to fill out certain fields, please feel free to ask on the GitHub discussions page of this project.
Not all combinations of items can be submitted to ENA. For example, it is not possible to submit only samples and a co-assembly without also submitting the corresponding reads. The figure below illustrates all possible combinations:
This list details the kinds of files subMG ingests (depending on what items are being submitted).
- Sample submission
  - no files
- Read submission
  - fastq or gzipped fastq files
- Assembly submission
  - fasta or gzipped fasta files
  - (optional) BAM files for coverage calculation
- Binned contig submission
  - directory with fasta or gzipped fasta files, one for each bin
  - (optional) BAM files for coverage calculation
  - taxonomic information: e.g. gtdb_to_ncbi_majority_vote.py output, see section Taxonomy Assignment
  - quality information: e.g. CheckM output, see section Quality Data
- MAG submission (see MAG Submission section for details)
  - fasta or gzipped fasta or EMBL-Flatfile
  - tab-separated table with MAG details
Assemblies and bins need a valid NCBI taxonomy (scientific name and taxonomic identifier) for submission. For metagenome submissions, environmental organism-level taxonomy is required.
For example, consider a bin that was classified only on the class level and was determined to belong to class Clostridia. The taxonomy id of the class Clostridia is 186801. However, the correct environmental organism-level taxonomy for the bin is uncultured Clostridia bacterium with the taxid 244328.
subMG will automatically derive environmental organism-level taxonomy when users provide regular NCBI taxonomic annotation.
If you performed taxonomic annotation of bins based on GTDB, you can use the gtdb_to_ncbi_majority_vote.py script of the GTDB-Toolkit to translate your results to NCBI taxonomy. The output of gtdb_to_ncbi_majority_vote.py can directly be parsed by subMG.
These tables can be provided under the NCBI_TAXONOMY_FILES keyword in the configuration form.
You can provide one or more tables with NCBI taxonomy information for your bins (you can find an example at ./examples/data/taxonomy/bacteria_taxonomy.tsv). subMG will use ENA's suggest-for-submission endpoint to derive taxids that follow the rules for bin taxonomy.
| Bin_id | NCBI_taxonomy |
|---|---|
| bin1 | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__;g__;s__ |
| bin2 | d__Archaea;p__;c__;o__;f__;g__;s__ |
| bin3 | d__Eukaryota;p__;c__Dothideomycetes;o__Pleosporales;f__Massarinaceae;g__Helminthosporium;s__ |
The values in the Bin_id column have to match the basenames of the bin fasta files (e.g. there needs to be a bin1.fasta, bin1.fna or similar in your BINS_DIRECTORY).
The string in the 'NCBI_taxonomy' column has to adhere to the format shown in the above example: Taxonomic ranks are separated by semicolons. On each rank, a letter indicating the rank is followed by two underscores and the classification at that rank. The ranks have to be in the order 'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species'. If a classification at a certain rank is unavailable, the rank itself still needs to be present in the string (e.g. "s__" where the species taxonomy is unknown).
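The format rules above can be checked before a submission is started. The following is a minimal sketch, not subMG's own parser: the function name `check_taxonomy_string` is hypothetical, and it only tests the number of ranks, their order, and the double-underscore prefix.

```python
# Rank prefixes in the required order: domain, phylum, class, order,
# family, genus, species.
RANK_PREFIXES = ["d", "p", "c", "o", "f", "g", "s"]


def check_taxonomy_string(taxonomy):
    """Return a list of problems found in an NCBI_taxonomy string.

    An empty list means the string follows the described format.
    """
    problems = []
    fields = taxonomy.strip().split(";")
    if len(fields) != len(RANK_PREFIXES):
        problems.append(
            f"expected {len(RANK_PREFIXES)} ranks, found {len(fields)}"
        )
        return problems
    for expected, field in zip(RANK_PREFIXES, fields):
        # Every rank must be present, even if its classification is empty.
        if not field.startswith(expected + "__"):
            problems.append(f"rank '{field}' should start with '{expected}__'")
    return problems


print(check_taxonomy_string("d__Archaea;p__;c__;o__;f__;g__;s__"))
```

A string with missing ranks, or with ranks in the wrong order, yields one message per problem, which can be reported together with the offending Bin_id.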
In some cases, subMG will be unable to assign a valid taxonomy to a bin. The submission will be aborted and you will be informed which bins are causing problems. In such cases, you must determine the correct scientific name and taxid for the bin and specify it in a manual taxonomy file. This file then has to be referenced in the MANUAL_TAXONOMY field of your config. Remember to use environmental organism-level taxonomies when manually defining the taxonomy for your bin.
A possible reason for a failed taxonomic assignment is that no proper environmental organism-level taxonomies exist yet (this happens more often than one might expect). You can create a taxon request in the ENA Webin Portal to register the taxon.
In cases where subMG is unable to assign a valid taxonomy based on the NCBI taxonomy file, you can provide taxonomies for some of your bins as a tab-separated table. The table has to be referenced under the keyword MANUAL_TAXONOMY in the configuration form. The table matches each Bin_id to a Scientific_name and a Tax_id. If a bin is present in this document, the taxonomic data from other sources will be ignored. Each Bin_id has to match the basename of a binned contigs fasta file.
| Bin_id | Scientific_name | Tax_id |
|---|---|---|
| bin3 | uncultured Paracoccus sp. | 189685 |
An example of such a document can be found in ./examples/data/taxonomy/manual_taxonomy_3bins.tsv. Again, values in the Bin_id column have to match the basenames of the bin fasta files (e.g. there needs to be a bin1.fasta, bin1.fna or similar matching bin1 of your table).
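Mismatches between Bin_id values and file basenames are easy to overlook, so a quick check may help. The sketch below is an illustration only: the helper names and the list of accepted suffixes (`.fasta`, `.fa`, `.fna`, optionally gzipped) are assumptions, and subMG may resolve basenames differently.

```python
from pathlib import Path

# Assumed set of fasta-like suffixes; adjust to match your files.
FASTA_SUFFIXES = (".fasta", ".fa", ".fna")


def bin_basename(filename):
    """Strip an optional .gz and one fasta-like suffix from a file name."""
    name = Path(filename).name
    if name.endswith(".gz"):
        name = name[:-3]
    for suffix in FASTA_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def unmatched_bin_ids(bin_ids, fasta_files):
    """Return the Bin_ids that have no fasta file with a matching basename."""
    available = {bin_basename(f) for f in fasta_files}
    return sorted(b for b in bin_ids if b not in available)


# bin3 is listed in the table but has no corresponding file.
print(unmatched_bin_ids(["bin1", "bin3"], ["bins/bin1.fasta", "bins/bin2.fna.gz"]))
```

In practice, `fasta_files` could be collected with `Path(bins_directory).iterdir()`, where `bins_directory` is the path you set as BINS_DIRECTORY.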
ENA provides a guideline for choosing taxonomy. You can query ENA's suggest-for-submission endpoint to find the correct taxid for a bin programmatically or directly through the browser (e.g. by navigating to https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/escherichia).
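A programmatic query could look roughly like the sketch below, which uses only the Python standard library. The base URL is the one given above. The function names are hypothetical, the response is assumed to be JSON, and no particular response fields are assumed, so the result is simply printed for inspection. Error handling (for example, for names without any suggestion) is omitted.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/"


def suggestion_url(name):
    """Build the query URL for a (partial) scientific name."""
    # Spaces and other special characters must be percent-encoded.
    return BASE_URL + urllib.parse.quote(name)


def fetch_suggestions(name, timeout=30):
    """Return the decoded JSON response for a name (needs network access)."""
    request = urllib.request.Request(
        suggestion_url(name), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(json.dumps(fetch_suggestions("uncultured Clostridia"), indent=2))
```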
Binned contigs and MAGs need to be submitted with quality data (contamination and completeness of each bin/MAG). In the QUALITY_FILE field of the configuration form you need to provide either the output of CheckM / CheckM2 or a table with the columns Bin_id, Completeness, Contamination. Completeness and Contamination values represent percentages (e.g. 93.2 means 93.2%). Each Bin_id has to correspond to the basename of a MAG or binned contigs fasta file. Fields have to be tab-separated. An example of a valid table is seen below.
| Bin_id | some_other_column | Contamination | Completeness |
|---|---|---|---|
| bin15 | other data | 3.14 | 92.91 |
| bin17 | other data | 14.29 | 44.32 |
Please consult the Edge Cases section for bins with more than 100% contamination.
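If you write the quality table yourself, it can be worth confirming that it parses as expected. The following sketch reads a tab-separated table with the three required columns and ignores any others. It is not how subMG reads the file, and `read_quality_table` is a hypothetical helper name.

```python
import csv
import io

REQUIRED_COLUMNS = ("Bin_id", "Completeness", "Contamination")


def read_quality_table(handle):
    """Parse a tab-separated table into {Bin_id: (completeness, contamination)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    quality = {}
    for row in reader:
        # Values are percentages, e.g. 93.2 means 93.2%.
        quality[row["Bin_id"]] = (
            float(row["Completeness"]),
            float(row["Contamination"]),
        )
    return quality


# The example table from above, with tabs as separators.
example = io.StringIO(
    "Bin_id\tsome_other_column\tContamination\tCompleteness\n"
    "bin15\tother data\t3.14\t92.91\n"
    "bin17\tother data\t14.29\t44.32\n"
)
table = read_quality_table(example)
print(table)
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object, e.g. `open(path, newline="")`.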
If you have assembled high-quality bins from your metagenome, you can submit them as MAGs (after submitting them as bins). Some additional metadata is needed for a MAG submission.
A MAG assembly can be submitted either as a 'Contig Assembly' or a 'Chromosome Assembly'. Please consult the ENA documentation for further information. You will need to provide additional data for a Chromosome Assembly submission (see below).
If you are submitting MAGs, you need to provide a .tsv file and specify it in the MAGS_METADATA_FILE field of your config file. The file needs to have the columns Bin_id, Sample_id, Quality_category, Flatfile_path and Unlocalised_path. An example of a MAGS_METADATA_FILE can be found in ./examples/data/mags/mags_metadata.tsv.
Depending on your submission, not all columns have to be filled out.
- Bin_id: Identifier of the bin. Has to be identical to the identifier used in the name of the fasta file, the taxonomy .tsv files etc.
- Quality_category: 'finished', 'high' or 'medium' as defined by ENA here (note the requirements regarding RNA sequences for the 'high' and 'finished' categories).
- Flatfile_path: For chromosome assemblies only. Either a .fasta file or an EMBL-Flatfile can be used for MAG submission. If you leave the field empty, the .fasta file of the corresponding bin will be used. If you want to provide annotation data, you need to provide a path to a flatfile. EMBLmyGFF3 provides a convenient way to create flatfiles based on your annotation data.
- Unlocalised_path: For chromosome assemblies only. Optional. Path to a .txt file containing the unlocalised contigs of the bin.
Using the table below, MAG m1 will be submitted as a medium quality contig assembly without annotation. m2 will be submitted as a high quality contig assembly and include annotation. MAG m3 will be submitted as a finished chromosome assembly, including annotation.
| Bin_id | Quality_category | Flatfile_path | Chromosomes_path | Unlocalised_path |
|---|---|---|---|---|
| m1 | medium | | | |
| m2 | high | /path/to/m2_flatfile.tsv | | |
| m3 | finished | /path/to/m3_flatfile.tsv | /path/to/m3_chromosome.txt | /path/to/m3_unlocalised.txt |
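A typo in the Quality_category column is an easy mistake to make in a hand-written metadata file. The sketch below lists rows whose category is not one of the three values named above. The helper name is hypothetical, and treating the values as case-sensitive is an assumption, not documented subMG behaviour.

```python
import csv
import io

ALLOWED_CATEGORIES = {"finished", "high", "medium"}


def invalid_quality_rows(handle):
    """Return (Bin_id, Quality_category) pairs whose category is not allowed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["Bin_id"], row["Quality_category"])
        for row in reader
        if row["Quality_category"] not in ALLOWED_CATEGORIES
    ]


# Small tab-separated example; "High" is flagged because of its capital letter.
metadata = io.StringIO(
    "Bin_id\tQuality_category\tFlatfile_path\n"
    "m1\tmedium\t\n"
    "m2\tHigh\t/path/to/m2_flatfile.tsv\n"
)
print(invalid_quality_rows(metadata))
```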
A submission can take several hours to complete. We recommend using nohup, tmux or similar to prevent the submission process from being interrupted.
If your bins are the result of dereplicating data from a single assembly, you can use subMG as described above. If your bins are the result of dereplicating data from multiple different assemblies, you need to split them based on which assembly they belong to. You then run subMG separately for each assembly (together with the corresponding set of bins).
When calculating completeness and contamination of a bin with tools like CheckM, contamination values above 100% can occur. Usually, this is not an error. However, the ENA API will refuse to accept bins with contamination values above 100%. subMG will automatically exclude bins with contamination values above 100% from the submission. If you absolutely need to submit such (presumably low-quality) bins, you need to manually set their contamination value to 100 in the 'QUALITY_FILE' you provide under the bins section.
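To see in advance which bins would be affected, the rule can be reproduced in a few lines. This is an illustration of the behaviour described above, not subMG's code. Both function names are hypothetical, and a value of exactly 100 is treated as acceptable because only values above 100% are described as rejected.

```python
def split_by_contamination(contamination, limit=100.0):
    """Split {Bin_id: contamination} into (submittable, excluded) dicts."""
    submittable = {b: c for b, c in contamination.items() if c <= limit}
    excluded = {b: c for b, c in contamination.items() if c > limit}
    return submittable, excluded


def cap_contamination(contamination, limit=100.0):
    """Return a copy with values above the limit set to the limit."""
    return {b: min(c, limit) for b, c in contamination.items()}


values = {"bin1": 3.14, "bin2": 120.5}
kept, dropped = split_by_contamination(values)
print("excluded by default:", sorted(dropped))
print("after manual capping:", cap_contamination(values))
```

Capping mirrors the manual edit of the QUALITY_FILE described above and should be a deliberate choice, since it changes the reported quality of the bin.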
subMG is being actively developed. Please use the GitHub issue tracker to report problems. A discussions page is available for questions, comments and suggestions.
When using subMG in your work, please cite https://doi.org/10.1186/s13040-025-00453-wd


