diff --git a/images/AToL-architecture.png b/images/AToL-architecture.png deleted file mode 100644 index bd0ccd5..0000000 Binary files a/images/AToL-architecture.png and /dev/null differ diff --git a/images/atol_and_insdc.png b/images/atol_and_insdc.png new file mode 100644 index 0000000..7bf2db4 Binary files /dev/null and b/images/atol_and_insdc.png differ diff --git a/index.md b/index.md index ad67a4e..950d57f 100644 --- a/index.md +++ b/index.md @@ -4,68 +4,124 @@ description: Documentation for the Australian Tree of Life (AToL) toc: false --- -## About the Australian Tree of Life +The Australian Tree of Life Bioinformatics team (AToL Bioinformatics) is +developing infrastructure for the rapid generation and publication of genome +assemblies and annotations. The current focus of AToL Bioinformatics is +optimising and automating these processes in the Genome Engine. -The Australian Tree of Life (AToL) project is developing infrastructure for the rapid generation and publication of genome assemblies and annotations. The current focus of the AToL project is optimising and automating these processes in the Genome Engine. +Our Genome Engine was inspired by the Wellcome Sanger Institute's [Genome +Engine](https://www.sanger.ac.uk/tool/genome-engine/), and uses some of their +pipelines under the hood. -You can learn more about the initiative here: [Australian Tree of Life](https://www.biocommons.org.au/atol) - -The AToL Genome Engine was inspired by the Wellcome Sanger Institute's [Genome Engine](https://www.sanger.ac.uk/tool/genome-engine/), and uses some of their pipelines under the hood. +You can learn more about the Australian Tree of Life activity +[here](https://www.biocommons.org.au/atol). ## What does the Genome Engine do? 
-The AToL Genome Engine is an automated workflow for assembling and annotating genome sequences from raw sequence data, brokering data to International Nucleotide Sequence Database Collaboration (INSDC) repositories, and drafting short Genome Notes. +The Genome Engine is a semi-automated workflow for assembling and annotating +genome sequences from raw sequence data, brokering data to International +Nucleotide Sequence Database Collaboration (INSDC) repositories, and drafting +short Genome Notes. This involves: - - Ingesting raw sequence data from the [Bioplatforms Australia Data Portal](https://data.bioplatforms.com/) + - Ingesting raw sequence data from the [Bioplatforms Australia Data + Portal](https://data.bioplatforms.com/) - Processing sampling and sequencing metadata - Assembling genome sequences from sequence read data - Annotating assembled genomes - - Brokering sample metadata, sequence reads, and genome assemblies to the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home) (ENA) - - Generating an automatic Genome Note providing details and metrics about sampling, sequencing and assembly + - Brokering sample metadata, sequence reads, and genome assemblies to the + [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home) (ENA) + - Generating an automatic Genome Note providing details and metrics about + sampling, sequencing and assembly -At present, the Genome Engine is configured to ingest data generated as part of [Bioplatforms Australia’s](https://bioplatforms.com/) Framework Initiatives which are available from the Bioplatforms Australia Data Portal. In future, we intend to make the Genome Engine available to any Australian researcher for use with their own sequencing data. +At present, the Genome Engine is configured to ingest data generated as part of +[Bioplatforms Australia’s](https://bioplatforms.com/) Framework Initiatives, +which are available from the Bioplatforms Australia Data Portal. 
In future, we +intend to make the Genome Engine available to any Australian researcher for use +with their own sequencing data. ## How does the Genome Engine work? ### Data retrieval and processing -The Genome Engine accesses sequence data and metadata in bulk from the [Bioplatforms Australia Data Portal](https://data.bioplatforms.com/) API. The metadata are provided by the collecting researcher and sample preparation and sequencing facilities. +The Genome Engine accesses sequence data and metadata in bulk from the +[Bioplatforms Australia Data Portal](https://data.bioplatforms.com/) API. The +metadata are provided by the collecting researcher and sample preparation and +sequencing facilities. -Packages are filtered to select those relevant to genome assembly and annotation, and metadata are validated and mapped to an intermediary, INSDC-compliant schema. +Packages are filtered to select those relevant to genome assembly and +annotation, and metadata are validated and mapped to an intermediary, +INSDC-compliant schema. -Taxon and sample identifiers are extracted to determine which packages can be combined in the assembly process and to retrieve species information from the [NCBI Taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy). +Taxon and sample identifiers are extracted to determine which packages can be +combined in the assembly process and to retrieve species information from the +[NCBI Taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy). ### Genome assembly and annotation -Sequence read data are processed and assembled on High-Performance Computing (HPC) facilities at the [Pawsey Supercomputing Research Centre](https://pawsey.org.au/), -provided by the [Australian BioCommons Leadership Share](https://www.biocommons.org.au/ables) (ABLeS) program. 
+Sequence read data are processed and assembled on High-Performance Computing +(HPC) facilities at the [Pawsey Supercomputing Research +Centre](https://pawsey.org.au/), provided by the [Australian BioCommons +Leadership Share](https://www.biocommons.org.au/ables) (ABLeS) program. -The assembly pipeline used is an adaptation of the [Sanger Tree of Life (ToL) assembly pipeline](https://pipelines.tol.sanger.ac.uk/genomeassembly), which includes the following steps: +The assembly pipeline used is an adaptation of the [Sanger Tree of Life (ToL) +assembly pipeline](https://pipelines.tol.sanger.ac.uk/genomeassembly), which +includes the following steps: - assembly using [hifiasm](https://github.com/chhylp123/hifiasm) - - redundant contig removal with [purge_dups](https://github.com/dfguan/purge_dups) - - optional haplotype resolution with hifiasm and scaffolding with [YaHS](https://github.com/c-zhou/yahs) if Hi-C data is available + - redundant contig removal with + [purge_dups](https://github.com/dfguan/purge_dups) + - optional haplotype resolution with hifiasm and scaffolding with + [YaHS](https://github.com/c-zhou/yahs) if Hi-C data is available -Quality assessment and annotation of assembled genomes are currently in development. +Oxford Nanopore-based contig building and scaffolding are currently in +development, along with quality assessment and annotation of assembled genomes. ### Data brokering -The data broker component of the Genome Engine uses sample, sequencing, and assembly metadata to submit files automatically to the ENA. BioSample information is submitted using the [ToL sample checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000053), a minimum standard for sample metadata devised by the [Darwin Tree of Life project](https://www.darwintreeoflife.org/) to facilitate data contextualisation and interoperability. Experiment, read, and assembly data are submitted according to ENA’s standards and schemas. 
In order to comply with these standards, certain metadata fields in the original Bioplatforms metadata must be filled and vocabulary terms used (see the [FAQ](https://australianbiocommons.github.io/atol/faq) for more information about metadata requirements). AToL’s metadata mapping processes allow for these metadata to be formatted in XML files for programmatic submission to the ENA. - -The submitted XML files include the data release date, which is determined according to the embargo release date specified in the Bioplatforms data portal. Once records are made public on their release date, they are exchanged with and made available from other INSDC databases at the US National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) and the DNA Data Bank of Japan ([DDBJ](https://www.ddbj.nig.ac.jp/index-e.html)). +The data broker component of the Genome Engine uses sample, sequencing, and +assembly metadata to submit files automatically to the ENA. BioSample +information is submitted using the [ToL sample +checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000053), a minimum +standard for sample metadata devised by the [Darwin Tree of Life +project](https://www.darwintreeoflife.org/) to facilitate data +contextualisation and interoperability. Experiment, read, and assembly data are +submitted according to ENA’s standards and schemas. In order to comply with +these standards, certain metadata fields in the original Bioplatforms metadata +must be filled and vocabulary terms used (see the +[FAQ](https://australianbiocommons.github.io/atol/faq) for more information +about metadata requirements). The Genome Engine's metadata mapping processes +allow for these metadata to be formatted in XML files for programmatic +submission to the ENA. + +The submitted XML files include the data release date, which is determined +according to the embargo release date specified in the Bioplatforms data +portal. 
Once records are made public on their release date, they are exchanged +with and made available from other INSDC databases at the US National Center +for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) and the +DNA Data Bank of Japan ([DDBJ](https://www.ddbj.nig.ac.jp/index-e.html)). ### Genome Note generation -Once a genome has been assembled, a Genome Note document is generated, outlining key metadata and assembly metrics. The Genome Note pipeline populates a template document with metadata values relating to taxonomy, specimen collection, nucleic acid extraction, sequencing, and assembly, and key metrics calculated in the assembly pipeline. The Genome Note also contains the accession numbers generated during brokering to the ENA. The project lead and project collaborators (as they are listed in the Bioplatforms metadata) are named as first and second authors. +Once a genome has been assembled, a Genome Note document is generated, +outlining key metadata and assembly metrics. The Genome Note pipeline populates +a template document with metadata values relating to taxonomy, specimen +collection, nucleic acid extraction, sequencing, and assembly, and key metrics +calculated in the assembly pipeline. The Genome Note also contains the +accession numbers generated during brokering to the ENA. The project lead and +project collaborators (as they are listed in the Bioplatforms metadata) are +named as first and second authors. -Genome Notes will be made available to researchers prior to release to provide an opportunity to manually edit and add content. +Genome Notes will be made available to researchers prior to release to provide +an opportunity to manually edit and add content. -![Diagram of overall AToL architecture](images/AToL-architecture.png) -*Australian Tree of Life architecture overview. 
Note: the interactive AToL web application is currently in development.* +![Diagram of genome engine data flow](./images/atol_and_insdc.png) *Genome +Engine data flow.* ## Partners -The Australian Tree of Life is a collaborative initiative. It is co-funded by Bioplatforms Australia and the Minderoo Foundation, and supported by project partners at the University of Melbourne and QCIF. The AGRF are hosting a PhD student intern. +AToL Bioinformatics is co-funded by Bioplatforms Australia and the Minderoo +Foundation, and supported by project partners at the University of Melbourne +and QCIF. The AGRF are hosting a PhD student intern. Bioplatforms Australia is enabled by NCRIS. @@ -83,6 +139,8 @@ Bioplatforms Australia is enabled by NCRIS. ## Acknowledgements -This documentation page makes use of the [ELIXIR toolkit theme](https://github.com/ELIXIR-Belgium/elixir-toolkit-theme). +This documentation page makes use of the [ELIXIR toolkit +theme](https://github.com/ELIXIR-Belgium/elixir-toolkit-theme). -{% include image.html file="elixir-toolkit-theme_logo.svg" alt="Elixir Toolkit Theme logo" max-width="15em" %} +{% include image.html file="elixir-toolkit-theme_logo.svg" alt="Elixir Toolkit +Theme logo" max-width="15em" %} diff --git a/pages/faq.md b/pages/faq.md index 9aa2880..479192d 100644 --- a/pages/faq.md +++ b/pages/faq.md @@ -1,47 +1,99 @@ --- title: Frequently Asked Questions -description: Frequently Asked Questions relating to the AToL Genome Engine. +description: Frequently Asked Questions relating to the Genome Engine. toc: false --- ### What metadata does the Genome Engine require? -The list of Bioplatforms metadata fields used in the Genome Engine is available [here](https://docs.google.com/spreadsheets/d/1qtVF_owSLjjkDxfCqMiG7rV2omp-F70-3HgHaXIeqOI/edit?usp=sharing). The list contains fields which are collected or defined during sampling, sequencing, or internal Bioplatforms processing steps. 
Fields with controlled vocabularies or other value or format constraints are designated. These fields have been selected to comply with ENA and broader Tree of Life standards to sufficiently document provenance and facilitate interoperability and reusability. +The list of Bioplatforms metadata fields used in the Genome Engine is available +[here](https://docs.google.com/spreadsheets/d/1qtVF_owSLjjkDxfCqMiG7rV2omp-F70-3HgHaXIeqOI/edit?usp=sharing). +The list contains fields which are collected or defined during sampling, +sequencing, or internal Bioplatforms processing steps. Fields with controlled +vocabularies or other value or format constraints are designated. These fields +have been selected to comply with ENA and broader Tree of Life standards to +sufficiently document provenance and facilitate interoperability and +reusability. -Controlled vocabulary terms are listed [here](https://docs.google.com/spreadsheets/d/1qtVF_owSLjjkDxfCqMiG7rV2omp-F70-3HgHaXIeqOI/edit?gid=1263334219#gid=1263334219). Use of controlled vocabulary terms is necessary for accurate data filtering and for compliance with ENA and Tree of Life standards. +Controlled vocabulary terms are listed +[here](https://docs.google.com/spreadsheets/d/1qtVF_owSLjjkDxfCqMiG7rV2omp-F70-3HgHaXIeqOI/edit?gid=1263334219#gid=1263334219). +Use of controlled vocabulary terms is necessary for accurate data filtering and +for compliance with ENA and Tree of Life standards. ### What do I do if my organism doesn’t have a taxon ID? All data ingested by the Genome Engine must have a valid NCBI taxon ID. -For taxa not yet incorporated into the NCBI taxonomy, you will need to make a submission for a new/temporary taxon. Instructions for requesting a new taxon are available in [ENA’s documentation](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html). +For taxa not yet incorporated into the NCBI taxonomy, you will need to make a +submission for a new/temporary taxon. 
Instructions for requesting a new taxon +are available in [ENA’s +documentation](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html). -If the taxon has not yet been formally described, you can request a new taxon ID using an informal name. Once a formal classification has been published, the naming information for that taxon ID can be updated. +If the taxon has not yet been formally described, you can request a new taxon +ID using an informal name. Once a formal classification has been published, the +naming information for that taxon ID can be updated. ### What attribution or authorship information will be brokered to ENA/INSDC? -The project lead listed in the Bioplatforms metadata will be entered in the ‘center name’ field for each record. This field is used by broker accounts to designate the identity of the party on whose behalf the data are being deposited. The ‘center name’ will be visible on the record pages in ENA, and will be included in downloaded metadata. +The project lead listed in the Bioplatforms metadata will be entered in the +‘center name’ field for each record. This field is used by broker accounts to +designate the identity of the party on whose behalf the data are being +deposited. The ‘center name’ will be visible on the record pages in ENA, and +will be included in downloaded metadata. ### What quality assembly can I expect? Will it meet the Earth BioGenome standards? -We aim to generate assemblies which meet the [Earth BioGenome Project (EBP) assembly standards](https://www.earthbiogenome.org/report-on-assembly-standards) where possible. These standards stipulate multiple criteria, but they are often represented in shorthand as 6.C.Q40, equating to a contig N50 of 1Mb or above, a scaffold N50 at chromosomal scale, and an error rate of below 1 in 10,000. However, the AToL Genome Engine does not mandate manual curation processes, which are required to verify chromosome-level assembly. 
Additionally, the Genome Engine will generate contig-level assemblies if no scaffolding data (e.g. Hi-C) are available. Hence, assemblies will be assessed according to the applicable metrics (contig N50 of at least 1Mb, QV of at least 40, less than 5% false duplications, greater than 90% kmer completeness, and greater than 90% single copy conserved genes (according to BUSCO)). Only assemblies which meet these criteria will be automatically deposited into ENA. - -The genome sequence generated by the Genome Engine will be made available, along with quality metrics, to researchers prior to brokering, providing the opportunity for manual curation. Researchers can choose to further assess and curate genomes to meet the full EBP standards. +We aim to generate assemblies which meet the [Earth BioGenome Project (EBP) +assembly +standards](https://www.earthbiogenome.org/report-on-assembly-standards) where +possible. These standards stipulate multiple criteria, but they are often +represented in shorthand as 6.C.Q40, equating to a contig N50 of 1 Mb or above, +a scaffold N50 at chromosomal scale, and an error rate below 1 in 10,000. +However, the Genome Engine does not mandate manual curation processes, which +are required to verify chromosome-level assembly. Additionally, the Genome +Engine will generate contig-level assemblies if no scaffolding data (e.g. Hi-C) +are available. Hence, assemblies will be assessed according to the applicable +metrics (contig N50 of at least 1 Mb, QV of at least 40, less than 5% false +duplications, greater than 90% k-mer completeness, and greater than 90% +single-copy conserved genes (according to BUSCO)). Only assemblies which meet +these criteria will be automatically deposited into ENA. + +The genome sequence generated by the Genome Engine will be made available, +along with quality metrics, to researchers prior to brokering, providing the +opportunity for manual curation. 
Researchers can choose to further assess and +curate genomes to meet the full EBP standards. ### What happens if my assembly "fails"/is rejected? -If the assembly fails to pass the applicable EBP metrics, it will not be brokered automatically, however, it will be provided to you for manual curation. You will have the opportunity to make adjustments to the assembly to improve quality metrics. If the metrics meet the minimum values specified above, the Genome Engine can proceed with brokering. +If the assembly fails to pass the applicable EBP metrics, it will not be +brokered automatically; however, it will be provided to you for manual +curation. You will have the opportunity to make adjustments to the assembly to +improve quality metrics. If the metrics meet the minimum values specified +above, the Genome Engine can proceed with brokering. -We understand that generating an assembly meeting the specified quality criteria may not be feasible for all taxa, for reasons such as small organism size or scarcity of sample material due to a threatened species status. If you believe that a genome is of sufficient quality as can be expected for a taxon, even if it does not meet the applicable EBP minimum metrics, you can request that AToL override its typical quality requirements and proceed with brokering the assembly to ENA. +We understand that generating an assembly meeting the specified quality +criteria may not be feasible for all taxa, for reasons such as small organism +size or scarcity of sample material due to a threatened species status. If you +believe that a genome is of sufficient quality as can be expected for a taxon, +even if it does not meet the applicable EBP minimum metrics, we can override +our typical quality requirements and proceed with brokering the assembly to +ENA. ### Can I check the assembly before it is released? 
-Genome assemblies will be made available to researchers to provide the opportunity for testing, manual curation and quality control prior to publishing. +Genome assemblies will be made available to researchers to provide the +opportunity for testing, manual curation and quality control prior to +publishing. ### What type of input data does the Genome Engine use? -Our core assembly type is PacBio HiFi reads with optional Hi-C reads for scaffolding and haplotype resolution. We can also use Oxford Nanopore R10+ reads for primary assembly and Ultralong reads for scaffolding in combination with HiFi reads. +Our core assembly type is PacBio HiFi reads with optional Hi-C reads for +scaffolding and haplotype resolution. We can also use Oxford Nanopore R10+ +reads for primary assembly and Ultralong reads for scaffolding in combination +with HiFi reads. ### Can I use the Genome Engine if I'm not a member of a Framework project? -Yes. You will need to submit your raw data to an INSDC database and follow our metadata guidelines. From there, the Genome Engine can ingest your data and do the assembly. Please contact us first if you want to do this. +Yes. You will need to submit your raw data to an INSDC database and follow our +metadata guidelines. From there, the Genome Engine can ingest your data and do +the assembly. Please contact us first if you want to do this. diff --git a/pages/team.md b/pages/team.md index ac95140..fccbf5f 100644 --- a/pages/team.md +++ b/pages/team.md @@ -1,6 +1,6 @@ --- -title: Australian Tree of Life Development Team -description: The team working on the Australian Tree of Life project. +title: Australian Tree of Life Bioinformatics Team +description: The team working on the Genome Engine. toc: false ---
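Note on the FAQ changes above: the automatic-brokering criteria listed there (contig N50 of at least 1 Mb, QV of at least 40, less than 5% false duplications, greater than 90% k-mer completeness, and greater than 90% single-copy conserved genes) could be expressed as a simple gate. The following is only a minimal sketch under stated assumptions: the class, function, and field names are hypothetical and are not taken from the Genome Engine code base, and the `override` flag merely illustrates the manual override described in the FAQ.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds as quoted in the FAQ text. All names here are hypothetical
# illustrations, not identifiers from the Genome Engine itself.
MIN_CONTIG_N50_BP = 1_000_000        # contig N50 of at least 1 Mb
MIN_QV = 40.0                        # QV of at least 40
MAX_FALSE_DUPLICATION_PCT = 5.0      # less than 5% false duplications
MIN_KMER_COMPLETENESS_PCT = 90.0     # greater than 90% k-mer completeness
MIN_BUSCO_SINGLE_COPY_PCT = 90.0     # greater than 90% single-copy genes


@dataclass(frozen=True)
class AssemblyMetrics:
    """Assumed container for metrics reported by an assembly pipeline."""
    contig_n50_bp: int
    qv: float
    false_duplication_pct: float
    kmer_completeness_pct: float
    busco_single_copy_pct: float


def failed_criteria(m: AssemblyMetrics) -> list[str]:
    """Return the names of the criteria that the assembly does not meet."""
    checks = {
        "contig_n50": m.contig_n50_bp >= MIN_CONTIG_N50_BP,
        "qv": m.qv >= MIN_QV,
        "false_duplications": m.false_duplication_pct < MAX_FALSE_DUPLICATION_PCT,
        "kmer_completeness": m.kmer_completeness_pct > MIN_KMER_COMPLETENESS_PCT,
        "busco_single_copy": m.busco_single_copy_pct > MIN_BUSCO_SINGLE_COPY_PCT,
    }
    return [name for name, passed in checks.items() if not passed]


def can_broker_automatically(m: AssemblyMetrics, override: bool = False) -> bool:
    """True if every criterion passes, or if a manual override was granted."""
    return override or not failed_criteria(m)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example = AssemblyMetrics(
        contig_n50_bp=800_000,
        qv=38.0,
        false_duplication_pct=1.3,
        kmer_completeness_pct=96.0,
        busco_single_copy_pct=94.5,
    )
    print(failed_criteria(example))
    print(can_broker_automatically(example))
```

Returning the list of failed criteria, rather than a bare boolean, would let a report tell the researcher which metrics need attention during manual curation.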