From b5e555a7e29a40ad0b47d3009ad87d4dd83c4afb Mon Sep 17 00:00:00 2001 From: April Shen Date: Thu, 13 Nov 2025 13:08:45 +0000 Subject: [PATCH 1/3] update to validation_overview doc --- docs/how_to_submit.md | 9 +++--- docs/validation_overview.md | 55 +++++++++++++++++++++++++------------ 2 files changed, 42 insertions(+), 22 deletions(-) diff --git a/docs/how_to_submit.md b/docs/how_to_submit.md index de9e531..a37062d 100644 --- a/docs/how_to_submit.md +++ b/docs/how_to_submit.md @@ -70,10 +70,11 @@ To run only a subset of the validation steps, you can use the `--validation_task This can be useful if you want to avoid re-running long-running validations that have already passed. Note that all validation tasks must pass in order to submit. The report will aggregate the results of previous runs and new ones. The possible tasks are: -* `vcf_check` - includes syntax validation and other checks on VCF files -* `assembly_check` - includes all checks involving the FASTA file -* `metadata_check` - includes syntactic and semantic checks on metadata -* `sample_check` - includes sample coherence checks between VCF files and metadata +* [`vcf_check`](validation_overview.md#vcf-checks) - includes syntax validation and other checks on VCF files +* [`assembly_check`](validation_overview.md#assembly-checks) - includes all checks involving the FASTA file +* [`metadata_check`](validation_overview.md#metadata-check) - includes syntactic and semantic checks on metadata +* [`sample_check`](validation_overview.md#sample-name-concordance-check) - includes sample coherence checks between VCF files and metadata For example, to run only `vcf_check` and `sample_check`: ```shell diff --git a/docs/validation_overview.md b/docs/validation_overview.md index 66c0db5..ba54f79 100644 --- a/docs/validation_overview.md +++ b/docs/validation_overview.md @@ -7,11 +7,13 @@ The CLI tool performs 
the following validation checks and generates correspondin - Assembly check to ensure that the genome and the VCF match - Sample name check to ensure that the samples in the metadata can be associated with the sample in the VCF -In the following sections, we will examine each of these checks in detail, starting with the Metadata check. +In the following sections, we will examine each of these checks in detail. -## Metadata check +## Metadata Check -Once the user passes the metadata spreadsheet for validation checks, the eva-sub-cli tool verifies that all mandatory columns, marked in bold in the spreadsheet, are filled in. This data is crucial for further validation processes, such as retrieving the INDSC accession of the reference genome used to call the variants, and for sample and project metadata. If any mandatory columns or sheets are missing, the CLI tool will raise errors. +Once the user passes the metadata spreadsheet for validation checks, the eva-sub-cli tool verifies that all mandatory columns, marked in bold in the spreadsheet, are filled in. +This data is crucial for further validation processes, such as retrieving the INDSC accession of the reference genome used to call the variants, and for sample and project metadata. +If any mandatory columns or sheets are missing, the CLI tool will raise errors. Key points to note before validating your metadata spreadsheet with the eva-sub-cli tool: @@ -21,51 +23,66 @@ Key points to note before validating your metadata spreadsheet with the eva-sub- - Sample names in the spreadsheet must match those in the VCF file. - Analysis aliases must match across the sheets (Analysis, Sample, and File sheets). -Common Errors Seen with Metadata Checks: +Common errors seen with metadata checks: -- Analysis alias is not filled in for the respective samples in the Sample’s tab . +- Analysis alias is not filled in for the respective samples in the Sample tab. - Reference field is not filled with an INSDC accession. 
Submitters can sometimes use a non-GCA accession or generic assembly name as their reference genome. - Tax ID and the scientific name of the organism do not match. - Collection data and geographic location of the samples are not filled if the samples being submitted are novel. +Most issues around metadata will be reported in the "Metadata validation results" section of the report. +However, note that other validation failures may also require you to modify your metadata file. + ## VCF Checks -Ensuring data consistency upon submission is crucial for interoperability and supporting cross-study comparative genomics. Before accepting a VCF submission, the cli tool verifies that the submitted information adheres to the official VCF specifications. Additionally, submitted variants must be supported by either experimentally determined sample genotypes or population allele frequencies. +Ensuring data consistency upon submission is crucial for interoperability and supporting cross-study comparative genomics. +Before accepting a VCF submission, the CLI tool verifies that the submitted information adheres to the official VCF specifications. +Additionally, submitted variants must be supported by either experimentally determined sample genotypes or population allele frequencies. Key points to note before validating your VCF file with the eva-sub-cli tool: -- File Format Version: Always start the header with the version number (versions 4.1, 4.2, and 4.3 are accepted). +- File Format Version: Always start the header with the version number (versions 4.1-4 are accepted). - Header Metadata: Should include the reference genome, information fields (INFO), filters (FILTER), AF and genotype metadata - Variant Information: VCF files must provide either sample genotypes and/or aggregated sample summary-level allele frequencies. - Unique Variants: Variant lines should be unique and not specify duplicate loci. 
-- Reference Genome: All variants must be submitted with positions on a reference genome accessionned by a member of the INSDC consortium [Genbank](https://www.ncbi.nlm.nih.gov/genbank/), [ENA](https://www.ebi.ac.uk/ena/browser/home), or [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html). +- Reference Genome: All variants must be submitted with positions on a reference genome accessioned by a member of the INSDC consortium: [Genbank](https://www.ncbi.nlm.nih.gov/genbank/), [ENA](https://www.ebi.ac.uk/ena/browser/home), or [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html). -Common Errors Seen with VCF Checks: +Common errors seen with VCF checks: -- The VCF version is not one of 4.1, 4.2, or 4.3. +- The VCF version is not one of 4.1, 4.2, 4.3, or 4.4. - The VCF file contains extra spaces, blanks, or extra quotations causing validation to fail. Tools like bcftools can help verify the header before validating the file. - GT and AF fields are not defined in the header section. - VCF uses non-GCA contig alias -- The fields used do not conform to the official VCF specifications +- The fields used do not conform to the official VCF specifications + +Issues in VCF format validation will be reported in the "VCF validation results" section of the report. +Issues in determining evidence type (genotypes or allele frequencies) are reported per analysis at the top of the report. -## Assembly Check +## Assembly Checks -The EVA requires that all variants be submitted with an asserted position on an INSDC sequence. This means that the reference allele for every variant must match a position in a sequence that has been accessioned in either the GenBank or ENA database. Aligning all submitted data with INSDC sequences enables integration with other EMBL-EBI resources, including Ensembl, and is crucial for maintaining standardisation at the EVA. Therefore, all sequence identifiers in your VCF must match those in the reference FASTA file. 
+The EVA requires that all variants be submitted with an asserted position on an INSDC sequence. +This means that the reference allele for every variant must match a position in a sequence that has been accessioned in the GenBank, ENA, or DDBJ databases. +Aligning all submitted data with INSDC sequences enables integration with other EMBL-EBI resources, including Ensembl, and is crucial for maintaining standardisation at the EVA. +This also means that all sequence identifiers in your VCF must match those in the reference FASTA file. -Key points to note before validating your data with the eva-sub-cli Tool: +Key points to note before validating your data with the eva-sub-cli tool: - Ensure that the reference sequences in the FASTA file used to call the variants are accessioned in INSDC. -- Verify that the VCF file does not use non-GCA contig aliases by cross-checking with the reference assembly report. +- Verify that the contig names in the VCF file match those in the FASTA file. Common errors seen with assembly checks: -- VCF file uses a non-GCA contig alias causing the assembly check to fail -- Contigs used do not exist in the assembly report of the reference genome +- VCF file uses contig name not found in the FASTA file, causing the assembly check to fail. - Major Allele Used as REF Allele: This typically occurs when a specific version of Plink or Tassel is used to create VCF files, causing the tool to use the major allele as the reference allele. In such cases, submitters should use the GCA FASTA sequence to create corrected files. +Issues around reference allele matching in VCF files will be reported in the "VCF validation results" section of the report, +while issues around INSDC accessioning of the assembly will be reported in the "Reference genome INSDC check" section. + ## Sample Name Concordance Check -The sample name concordance check ensures that the sample names in the metadata spreadsheet match those in the VCF file. 
This is achieved by cross-checking the 'Sample name in VCF' column in the spreadsheet with the sample names registered in the VCF file. Any discrepancies must be addressed by the submitter when the CLI tool generates a report of the mismatches found. +The sample name concordance check ensures that the sample names in the metadata spreadsheet match those in the VCF file. +This is achieved by cross-checking the 'Sample name in VCF' column in the spreadsheet with the sample names registered in the VCF file. +Any discrepancies must be addressed by the submitter when the CLI tool generates a report of the mismatches found. Key points to note before validating your data with the eva-sub-cli tool: @@ -77,3 +94,5 @@ Common errors seen with sample concordance checks: - Link between “Sample” and “File” provided via the Analysis alias is not correctly defined in the metadata which causes the sample name concordance check to fail. - Extra white spaces in the sample names can lead to mismatches. - Case sensitivity issues between the sample names in the VCF file and the metadata spreadsheet. + +Issues in sample name concordance will be reported in the "Sample name concordance check" section of the report. 
\ No newline at end of file From 3d5a8b996fff5b08a98f2c66ab51b84e841b1e04 Mon Sep 17 00:00:00 2001 From: April Shen Date: Thu, 13 Nov 2025 13:28:15 +0000 Subject: [PATCH 2/3] minor edits --- docs/validation_overview.md | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/docs/validation_overview.md b/docs/validation_overview.md index ba54f79..0d187d4 100644 --- a/docs/validation_overview.md +++ b/docs/validation_overview.md @@ -5,7 +5,7 @@ The CLI tool performs the following validation checks and generates correspondin - Metadata check to ensure that the metadata fields have been correctly filled in - VCF check to ensure that the VCF file follows the VCF format specification - Assembly check to ensure that the genome and the VCF match -- Sample name check to ensure that the samples in the metadata can be associated with the sample in the VCF +- Sample name check to ensure that the samples in the metadata can be associated with the samples in the VCF In the following sections, we will examine each of these checks in detail. @@ -17,20 +17,20 @@ If any mandatory columns or sheets are missing, the CLI tool will raise errors. Key points to note before validating your metadata spreadsheet with the eva-sub-cli tool: -- Please do not change the existing structure of the spreadsheet +- Please do not change the existing structure of the spreadsheet. - Ensure all mandatory columns (marked in bold) are filled. -- Pre-registered project and samples must be released and not kept in private status +- Any pre-registered projects and samples must be released and not kept in private status. - Sample names in the spreadsheet must match those in the VCF file. - Analysis aliases must match across the sheets (Analysis, Sample, and File sheets). Common errors seen with metadata checks: - Analysis alias is not filled in for the respective samples in the Sample tab. -- Reference field is not filled with an INSDC accession. 
Submitters can sometimes use a non-GCA accession or generic assembly name as their reference genome. -- Tax ID and the scientific name of the organism do not match. -- Collection data and geographic location of the samples are not filled if the samples being submitted are novel. +- Reference field in the Analysis tab is not filled with an INSDC accession. Submitters should not use a non-GCA accession or generic assembly name as their reference genome. +- Taxonomy ID and the scientific name of the organism do not match for novel samples. +- Collection data and geographic location of the samples are not filled in for novel samples. -Most issues around metadata will be reported in the "Metadata validation results" section of the report. +Most issues around metadata will be reported in the "Metadata validation results" section of the validation report. However, note that other validation failures may also require you to modify your metadata file. ## VCF Checks @@ -41,7 +41,7 @@ Additionally, submitted variants must be supported by either experimentally dete Key points to note before validating your VCF file with the eva-sub-cli tool: -- File Format Version: Always start the header with the version number (versions 4.1-4 are accepted). +- File Format Version: Always start the header with the VCF version number (versions 4.1-4 are accepted). - Header Metadata: Should include the reference genome, information fields (INFO), filters (FILTER), AF and genotype metadata - Variant Information: VCF files must provide either sample genotypes and/or aggregated sample summary-level allele frequencies. - Unique Variants: Variant lines should be unique and not specify duplicate loci. @@ -52,10 +52,9 @@ Common errors seen with VCF checks: - The VCF version is not one of 4.1, 4.2, 4.3, or 4.4. - The VCF file contains extra spaces, blanks, or extra quotations causing validation to fail. Tools like bcftools can help verify the header before validating the file. 
- GT and AF fields are not defined in the header section. -- VCF uses non-GCA contig alias -- The fields used do not conform to the official VCF specifications +- The fields used do not conform to the official VCF specifications. -Issues in VCF format validation will be reported in the "VCF validation results" section of the report. +Issues in VCF format validation will be reported in the "VCF validation results" section of the validation report. Issues in determining evidence type (genotypes or allele frequencies) are reported per analysis at the top of the report. ## Assembly Checks @@ -75,13 +74,13 @@ Key points to note before validating your data with the eva-sub-cli tool: - VCF file uses contig name not found in the FASTA file, causing the assembly check to fail. - Major Allele Used as REF Allele: This typically occurs when a specific version of Plink or Tassel is used to create VCF files, causing the tool to use the major allele as the reference allele. In such cases, submitters should use the GCA FASTA sequence to create corrected files. -Issues around reference allele matching in VCF files will be reported in the "VCF validation results" section of the report, +Issues around reference allele matching in VCF files will be reported in the "VCF validation results" section of the validation report, while issues around INSDC accessioning of the assembly will be reported in the "Reference genome INSDC check" section. ## Sample Name Concordance Check The sample name concordance check ensures that the sample names in the metadata spreadsheet match those in the VCF file. -This is achieved by cross-checking the 'Sample name in VCF' column in the spreadsheet with the sample names registered in the VCF file. +This is achieved by cross-checking the "Sample name in VCF" column in the spreadsheet with the sample names registered in the VCF file. Any discrepancies must be addressed by the submitter when the CLI tool generates a report of the mismatches found. 
Key points to note before validating your data with the eva-sub-cli tool: @@ -91,8 +90,8 @@ Key points to note before validating your data with the eva-sub-cli tool: Common errors seen with sample concordance checks: -- Link between “Sample” and “File” provided via the Analysis alias is not correctly defined in the metadata which causes the sample name concordance check to fail. +- Link between "Sample" and "File" provided via the Analysis alias is not correctly defined in the metadata which causes the sample name concordance check to fail. - Extra white spaces in the sample names can lead to mismatches. - Case sensitivity issues between the sample names in the VCF file and the metadata spreadsheet. -Issues in sample name concordance will be reported in the "Sample name concordance check" section of the report. \ No newline at end of file +Issues in sample name concordance will be reported in the "Sample name concordance check" section of the validation report. \ No newline at end of file From e42d717fc715cfb6a581fe3ae8e4b014b532b971 Mon Sep 17 00:00:00 2001 From: April Shen Date: Thu, 20 Nov 2025 10:18:33 +0000 Subject: [PATCH 3/3] address review comments --- docs/validation_overview.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/validation_overview.md b/docs/validation_overview.md index 0d187d4..8eaa09a 100644 --- a/docs/validation_overview.md +++ b/docs/validation_overview.md @@ -13,22 +13,28 @@ In the following sections, we will examine each of these checks in detail. Once the user passes the metadata spreadsheet for validation checks, the eva-sub-cli tool verifies that all mandatory columns, marked in bold in the spreadsheet, are filled in. This data is crucial for further validation processes, such as retrieving the INDSC accession of the reference genome used to call the variants, and for sample and project metadata. 
+ +Each tab in the spreadsheet has an associated help tab, which provides detailed instructions on how to fill in your metadata correctly. If any mandatory columns or sheets are missing, the CLI tool will raise errors. Key points to note before validating your metadata spreadsheet with the eva-sub-cli tool: - Please do not change the existing structure of the spreadsheet. - Ensure all mandatory columns (marked in bold) are filled. +- Columns marked in green indicate an either/or requirement, so only one of the sections should be filled in. - Any pre-registered projects and samples must be released and not kept in private status. - Sample names in the spreadsheet must match those in the VCF file. - Analysis aliases must match across the sheets (Analysis, Sample, and File sheets). +- Use the Hold Date field in the Project sheet for data that needs to be kept under embargo. Common errors seen with metadata checks: - Analysis alias is not filled in for the respective samples in the Sample tab. - Reference field in the Analysis tab is not filled with an INSDC accession. Submitters should not use a non-GCA accession or generic assembly name as their reference genome. - Taxonomy ID and the scientific name of the organism do not match for novel samples. -- Collection data and geographic location of the samples are not filled in for novel samples. +- Collection date and geographic location of the samples are not filled in for novel samples. +- Date fields do not follow the YYYY-MM-DD format. +- Custom values are used for controlled vocabulary fields (indicated with a drop-down menu). Submitters should select from the values provided or contact us if the required value is not present. Most issues around metadata will be reported in the "Metadata validation results" section of the validation report. However, note that other validation failures may also require you to modify your metadata file.