From 7d3988df2786d20fe608c22a8b8071ad01dbc1fa Mon Sep 17 00:00:00 2001 From: April Shen Date: Tue, 23 Dec 2025 22:26:52 +0000 Subject: [PATCH 1/5] WIP - input files overview --- docs/input_file_overview.md | 106 ++++++++++++++++++++++++++++++------ 1 file changed, 89 insertions(+), 17 deletions(-) diff --git a/docs/input_file_overview.md b/docs/input_file_overview.md index f58eac8..998c263 100644 --- a/docs/input_file_overview.md +++ b/docs/input_file_overview.md @@ -3,20 +3,32 @@ The eva-sub-cli tool requires the following inputs: - One or several valid VCF files +- Reference genome in FASTA format - Completed metadata spreadsheet -- Reference genome in fasta format VCF files can be either uncompressed or compressed using bgzip. Other types of compression are not allowed and will result in errors during validation. FASTA files must be uncompressed. -The VCF file must adhere to official VCF specifications, and the metadata spreadsheet provides contextual information about the dataset. In the following sections, we will examine each of these inputs in detail. +Only the metadata file is passed directly to the CLI tool. The VCF and FASTA files should be referenced from the +metadata file. -# VCF File +The VCF file must adhere to official VCF specifications, and the metadata spreadsheet provides contextual information +about the dataset. In the following sections, we will examine each of these inputs in detail. -A VCF (Variant Call Format) file is a type of file used in bioinformatics to store information about genetic variants. It includes data about the differences (or variants) between a sample's DNA and a reference genome. Typically, generating a VCF file involves several steps: preparing your sample, sequencing the DNA, aligning it to a reference genome, identifying variants, and finally, formatting this information into a VCF file. The overall goal is to systematically capture and record genetic differences in a standardised format. 
A VCF file consists of two main parts: the header and the body. +## VCF File -Header: The header contains metadata about the file, such as the format version, reference genome information, and descriptions of the data fields. Each line in the header starts with a double ##, except for the last header line which starts with a single #. +A VCF (Variant Call Format) file is a type of file used in bioinformatics to store information about genetic variants. +It includes data about the differences (or variants) between a sample's DNA and a reference genome. Typically, +generating a VCF file involves several steps: preparing your sample, sequencing the DNA, aligning it to a +reference genome, identifying variants, and finally, formatting this information into a VCF file. The overall goal is +to systematically capture and record genetic differences in a standardised format. + +A VCF file consists of two main parts: the header and the body. + +Header: The header contains metadata about the file, such as the format version, reference genome information, and +descriptions of the data fields. Each line in the header starts with a double ##, except for the last header line which +starts with a single #. ``` ##fileformat=VCFv4.2 @@ -25,21 +37,81 @@ Header: The header contains metadata about the file, such as the format version, ##FORMAT= ``` -Body: The body of the VCF file contains the actual variant data, with each row representing a single variant. The columns in the body are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Sample Columns +Body: The body of the VCF file contains the actual variant data, with each row representing a single variant. The +columns in the body are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Sample Columns ``` #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [SampleIDs...] -``` +``` + +## FASTA File + +This is the reference genome that your variants were called against, in uncompressed FASTA format. 
The EVA uses this to +check that the reference alleles are set correctly in the VCF files. + +The reference genome must be INSDC-registered, which means it is publicly available in one of the archives within the +INSDC consortium: [Genbank](https://www.ncbi.nlm.nih.gov/genbank/), [ENA](https://www.ebi.ac.uk/ena/browser/home), or +[DDBJ](https://www.ddbj.nig.ac.jp/index-e.html). This ensures the long-term availability of the genome and the +reusability of your data. + +## Metadata Spreadsheet + +The metadata spreadsheet provides comprehensive contextual information about the dataset, ensuring that each submission +is accompanied by detailed descriptions that facilitate proper understanding and use of the data. Key elements included +in the metadata spreadsheet are analysis and project information, sample information, sequencing methodologies, and +experimental details. While not all fields are required, users are strongly encouraged to provide as much information +as possible to enhance the completeness and usefulness of the metadata. + +The spreadsheet is organized into editable tabs, designed for metadata entry, and non-editable helper tabs, which offer +detailed explanations and guidance for each column. Users are required to complete all relevant sections within the +editable tabs. Mandatory fields in each section are indicated in bold to highlight essential information that must be +provided for a valid submission. Fields highlighted in green indicate an either/or choice. This means you should fill +in the choice most relevant for your submission, but not both. + +Below we go through some important details for each of the tabs. + +### Submitter Details + +This sheet captures basic information about the person or team submitting the data. This includes the lab name and +center, which is the name of the submitting institution or organization and is the name that will be visible once the +project is live on the EVA website. 
+ +### Project + +The objective of this sheet is to gather general information about the Project. If you are submitting to an existing +project, you can skip the other details and just provide the project accession and analyses will be linked to that +project. In case of a new project, please provide the relevant details including submitter, submitting center, +collaborators, project title, description and publications. + +One important column to note is the Hold Date, which is the date until which the data should be kept private. EVA will +release the data automatically after this date. If it is missing, the default value is three days after the date of +submission. + +### Analysis + +For EVA, an analysis is a grouping of samples and data files. This sheet allows EVA to link VCF files to a project and +to other EVA analyses. Additionally, this worksheet contains experimental metadata detailing the methodology of each +analysis. This includes a local path to your reference FASTA file. + +One project can have multiple associated analyses. EVA links analyses to samples and VCF files through the analysis +alias, which is a shortened identifier you must provide for each analysis. + +### Sample + +This is where you describe the biological samples used for your analyses. Each row describes one sample and must include +Analysis Alias to indicate which analysis it belongs to, and Sample Name in VCF which is the exact name of the sample +as it appears in the VCF file. + +We accept preregistered samples, which should be provided using BioSamples sample or sampleset accessions. Please +ensure these are publicly accessible, as otherwise EVA will not be able to validate them. -# Metadata Spreadsheet +If your samples are not yet accessioned, and are therefore novel, please use the "Novel sample(s)" sections of the +Sample worksheet to have them registered at BioSamples. 
Make sure to fill in the required fields in bold, including the +taxonomy ID, geographic location (chosen from the controlled vocabulary in the drop-down menu), and collection date +(in YYYY-MM-DD format). -The spreadsheet provides comprehensive contextual information about the dataset, ensuring that each submission is accompanied by detailed descriptions that facilitate proper understanding and use of the data. Key elements included in the metadata spreadsheet are analysis and project information, sample information, sequencing methodologies, experimental details. -The spreadsheet is organized into editable tabs, designed for metadata entry, and non-editable helper tabs, which offer detailed explanations and guidance for each column. Users are required to complete all relevant sections within the editable tabs. Mandatory fields in each section are indicated in bold to highlight essential information that must be provided for a valid submission. However, users are strongly encouraged to provide as much additional information as possible to enhance the completeness and usefulness of the metadata. +### Files -| WORKSHEET | EXPLANATION | -|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Submitter Details | This sheet captures the details of the submitter | -| Project | The objective of this sheet is to gather general information about the Project. If you are submitting to an existing project, you can skip the other details and just provide the project accession and analyses will be linked to that project. 
In case of a new project, please provide the relevant details including submitter, submitting centre, collaborators, project title, description and publications. | -| Sample | Projects consist of analyses that are run on samples. We accept sample information in the form of BioSample, ENA or EGA accession(s). We also accept BioSamples sampleset accessions. If your samples are not yet accessioned, and are therefore novel, please use the "Novel sample(s)" sections of the Sample(s) worksheet to have them registered at BioSample | -| Analysis | For EVA, each analysis is one vcf file, plus an unlimited number of ancillary files. This sheet allows EVA to link vcf files to a project and to other EVA analyses. Additionally, this worksheet contains experimental meta-data detailing the methodology of each analysis. Important to note; one project can have multiple associated analyses | -| Files | Filenames and associated checking data associated with this EVA submission should be entered into this worksheet. Each file should be linked to exactly one analysis. | +This sheet lists the VCF files in your submission. These should be provided as local paths within your system, so that +the CLI tool can locate them, along with the Analysis Alias which should be associated with each file. Each file should +be linked to exactly one analysis. From 0aa5e40e3584949449d42f76a47c8e68a0612a34 Mon Sep 17 00:00:00 2001 From: April Shen Date: Wed, 24 Dec 2025 20:45:19 +0000 Subject: [PATCH 2/5] update input file overview --- docs/input_file_overview.md | 43 ++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 22 deletions(-) diff --git a/docs/input_file_overview.md b/docs/input_file_overview.md index 998c263..240ece9 100644 --- a/docs/input_file_overview.md +++ b/docs/input_file_overview.md @@ -1,4 +1,6 @@ -# Overview of Input Files +# Overview of Input Files + +View our [video tutorial on input files](link to final tutorial). 
The eva-sub-cli tool requires the following inputs: @@ -6,44 +8,41 @@ The eva-sub-cli tool requires the following inputs: - Reference genome in FASTA format - Completed metadata spreadsheet -VCF files can be either uncompressed or compressed using bgzip. -Other types of compression are not allowed and will result in errors during validation. -FASTA files must be uncompressed. - Only the metadata file is passed directly to the CLI tool. The VCF and FASTA files should be referenced from the metadata file. -The VCF file must adhere to official VCF specifications, and the metadata spreadsheet provides contextual information -about the dataset. In the following sections, we will examine each of these inputs in detail. +In the following sections, we will examine each of these inputs in detail. ## VCF File A VCF (Variant Call Format) file is a type of file used in bioinformatics to store information about genetic variants. -It includes data about the differences (or variants) between a sample's DNA and a reference genome. Typically, -generating a VCF file involves several steps: preparing your sample, sequencing the DNA, aligning it to a -reference genome, identifying variants, and finally, formatting this information into a VCF file. The overall goal is -to systematically capture and record genetic differences in a standardised format. - -A VCF file consists of two main parts: the header and the body. +The EVA requires data files to conform to the official VCF specifications, so that the data can be interpreted +consistently by other databases and researchers looking to reuse the data. Many tools can be used to generate VCF files, +including [BCFtools](https://samtools.github.io/bcftools/), [PLINK](https://www.cog-genomics.org/plink/2.0/), and +[GATK](https://gatk.broadinstitute.org). -Header: The header contains metadata about the file, such as the format version, reference genome information, and -descriptions of the data fields. 
Each line in the header starts with a double ##, except for the last header line which -starts with a single #. +Besides being compliant with the VCF specification, we also require that each VCF file contains the necessary +evidence linking the variant to its source. This evidence can be in the form of sample genotypes or allele frequencies. +Here is an example of how sample genotypes might look like in a VCF file: ``` ##fileformat=VCFv4.2 -##INFO= -##FILTER= ##FORMAT= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 +1 10583 rs58108140 G A 100 PASS . GT 0/1 1/1 ``` -Body: The body of the VCF file contains the actual variant data, with each row representing a single variant. The -columns in the body are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Sample Columns - +And here is how allele frequencies would look: ``` -#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [SampleIDs...] +##fileformat=VCFv4.2 +##INFO= +#CHROM POS ID REF ALT QUAL FILTER INFO +1 10583 rs58108140 G A 100 PASS AF=0.25 ``` +When using the eva-sub-cli tool, VCF files can be either uncompressed or compressed using bgzip. Other types of +compression are not allowed and will result in errors during validation. + ## FASTA File This is the reference genome that your variants were called against, in uncompressed FASTA format. The EVA uses this to From e8b6712045bf0af528b3991e892ae6012d52bb96 Mon Sep 17 00:00:00 2001 From: April Shen Date: Wed, 24 Dec 2025 21:23:55 +0000 Subject: [PATCH 3/5] minor edits --- docs/input_file_overview.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/input_file_overview.md b/docs/input_file_overview.md index 240ece9..bc5f1b7 100644 --- a/docs/input_file_overview.md +++ b/docs/input_file_overview.md @@ -78,9 +78,9 @@ project is live on the EVA website. ### Project The objective of this sheet is to gather general information about the Project. 
If you are submitting to an existing -project, you can skip the other details and just provide the project accession and analyses will be linked to that -project. In case of a new project, please provide the relevant details including submitter, submitting center, -collaborators, project title, description and publications. +project, you can skip the other details and just provide the project accession; the data in this submission will be +added as analyses linked to that project. In case of a new project, please provide the relevant details including +submitter, submitting center, collaborators, project title, description and publications. One important column to note is the Hold Date, which is the date until which the data should be kept private. EVA will release the data automatically after this date. If it is missing, the default value is three days after the date of @@ -88,9 +88,9 @@ submission. ### Analysis -For EVA, an analysis is a grouping of samples and data files. This sheet allows EVA to link VCF files to a project and +For EVA, an analysis is a grouping of samples and data files. This sheet allows us to link VCF files to a project and to other EVA analyses. Additionally, this worksheet contains experimental metadata detailing the methodology of each -analysis. This includes a local path to your reference FASTA file. +analysis. This includes a local path to your reference FASTA file, as described [above](#FASTA-file). One project can have multiple associated analyses. EVA links analyses to samples and VCF files through the analysis alias, which is a shortened identifier you must provide for each analysis. @@ -98,15 +98,15 @@ alias, which is a shortened identifier you must provide for each analysis. ### Sample This is where you describe the biological samples used for your analyses. 
Each row describes one sample and must include -Analysis Alias to indicate which analysis it belongs to, and Sample Name in VCF which is the exact name of the sample -as it appears in the VCF file. +the Analysis Alias to indicate which analysis it belongs to, and "Sample Name in VCF" which is the exact name of the +sample as it appears in the VCF file. We accept preregistered samples, which should be provided using BioSamples sample or sampleset accessions. Please ensure these are publicly accessible, as otherwise EVA will not be able to validate them. If your samples are not yet accessioned, and are therefore novel, please use the "Novel sample(s)" sections of the Sample worksheet to have them registered at BioSamples. Make sure to fill in the required fields in bold, including the -taxonomy ID, geographic location (chosen from the controlled vocabulary in the drop-down menu), and collection date +taxonomy ID, geographic location (chosen from the controlled vocabulary in the drop-down menu), and collection date (in YYYY-MM-DD format). ### Files From 1ccc42bd773035b3f9130ce9920d74c497a64392 Mon Sep 17 00:00:00 2001 From: April Shen Date: Wed, 7 Jan 2026 11:25:21 +0000 Subject: [PATCH 4/5] Apply suggestions from code review Co-authored-by: Timothee Cezard --- docs/input_file_overview.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/input_file_overview.md b/docs/input_file_overview.md index bc5f1b7..51f754e 100644 --- a/docs/input_file_overview.md +++ b/docs/input_file_overview.md @@ -22,7 +22,7 @@ including [BCFtools](https://samtools.github.io/bcftools/), [PLINK](https://www. [GATK](https://gatk.broadinstitute.org). Besides being compliant with the VCF specification, we also require that each VCF file contains the necessary -evidence linking the variant to its source. This evidence can be in the form of sample genotypes or allele frequencies. +evidence linking the variant to its biological source. 
This evidence can be in the form of sample genotypes or allele frequencies. Here is an example of how sample genotypes might look like in a VCF file: ``` @@ -48,10 +48,10 @@ compression are not allowed and will result in errors during validation. This is the reference genome that your variants were called against, in uncompressed FASTA format. The EVA uses this to check that the reference alleles are set correctly in the VCF files. -The reference genome must be INSDC-registered, which means it is publicly available in one of the archives within the +The sequences of the reference genome must be INSDC-registered, which means they are publicly available in one of the archives within the INSDC consortium: [Genbank](https://www.ncbi.nlm.nih.gov/genbank/), [ENA](https://www.ebi.ac.uk/ena/browser/home), or [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html). This ensures the long-term availability of the genome and the -reusability of your data. +reusability of your variation data. ## Metadata Spreadsheet From 783c1e4b88974416fda05c0dd9620333eb0b93bd Mon Sep 17 00:00:00 2001 From: April Shen Date: Wed, 7 Jan 2026 12:50:59 +0000 Subject: [PATCH 5/5] address review comments --- docs/input_file_overview.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/input_file_overview.md b/docs/input_file_overview.md index 51f754e..a52399a 100644 --- a/docs/input_file_overview.md +++ b/docs/input_file_overview.md @@ -1,6 +1,6 @@ # Overview of Input Files -View our [video tutorial on input files](link to final tutorial). +View our [video tutorial on input files](https://embl-ebi.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=9284869a-9f57-43c6-8222-b3aa0158941e). The eva-sub-cli tool requires the following inputs: @@ -40,8 +40,7 @@ And here is how allele frequencies would look: 1 10583 rs58108140 G A 100 PASS AF=0.25 ``` -When using the eva-sub-cli tool, VCF files can be either uncompressed or compressed using bgzip. 
Other types of -compression are not allowed and will result in errors during validation. +When using the eva-sub-cli tool, VCF files can be either uncompressed or compressed using bgzip or gzip. ## FASTA File @@ -106,8 +105,8 @@ ensure these are publicly accessible, as otherwise EVA will not be able to valid If your samples are not yet accessioned, and are therefore novel, please use the "Novel sample(s)" sections of the Sample worksheet to have them registered at BioSamples. Make sure to fill in the required fields in bold, including the -taxonomy ID, geographic location (chosen from the controlled vocabulary in the drop-down menu), and collection date -(in YYYY-MM-DD format). +BioSample name, title, taxonomy ID, geographic location (chosen from the controlled vocabulary in the drop-down menu), +and collection date (in YYYY-MM-DD format). ### Files
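
The Files sheet rule stated in this series (each file should be linked to exactly one analysis) can be checked before submission. The following is a minimal sketch, not part of eva-sub-cli: it assumes the Files sheet has already been read into `(file_path, analysis_alias)` pairs, and the function name `files_with_ambiguous_analysis` is hypothetical.

```python
from collections import defaultdict


def files_with_ambiguous_analysis(rows):
    """Return {file_path: sorted aliases} for files NOT linked to exactly one analysis.

    `rows` is an iterable of (file_path, analysis_alias) pairs; this layout is an
    assumption for illustration, not the spreadsheet's actual schema.
    """
    aliases_by_file = defaultdict(set)
    for file_path, analysis_alias in rows:
        alias = (analysis_alias or "").strip()
        aliases_by_file[file_path]  # register the file even if its alias is blank
        if alias:
            aliases_by_file[file_path].add(alias)
    # A file is fine only when exactly one distinct alias is attached to it.
    return {
        path: sorted(aliases)
        for path, aliases in aliases_by_file.items()
        if len(aliases) != 1
    }


if __name__ == "__main__":
    example_rows = [
        ("data/a.vcf", "A1"),
        ("data/b.vcf.gz", "A1"),
        ("data/b.vcf.gz", "A2"),  # linked to two analyses
        ("data/c.vcf", ""),       # missing alias
    ]
    for path, aliases in files_with_ambiguous_analysis(example_rows).items():
        print(path, aliases)
```

An empty result means every listed file maps to a single Analysis Alias. Files with a blank alias and files listed under several aliases are both reported.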