MikhailAf commented on Apr 3, 2025
| Before importing a new reference genome, users are encouraged to **check which reference genomes are already available** in the system. This helps avoid duplication and ensures consistency across datasets. Users can: | ||
| * Browse existing reference genomes in the **File Browser** (under the Reference Genomes category), or |
File Browser -> File Manager
| ### **Required File Format** | ||
| If the reference genome needed is not listed in the **File Browser** or returned by the `GET /api/v1/reference-genomes` endpoint, users can import a custom reference genome into ODM to support their dataset. |
File Browser -> File Manager
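A minimal sketch of the availability check described in the quoted line, using the `GET /api/v1/reference-genomes` endpoint it names. The host, the `Authorization` header name, and the assumption that the response is a JSON array of objects with an `assembly` key are all hypothetical and not confirmed by the docs under review.

```python
import json
import urllib.request


def build_list_request(base_url, token):
    """Build (but do not send) the request that lists reference genomes."""
    url = base_url.rstrip("/") + "/api/v1/reference-genomes"
    # Assumption: the header name is a placeholder; use whatever your ODM instance expects.
    return urllib.request.Request(url, headers={"Authorization": token}, method="GET")


def genome_is_listed(payload, assembly):
    """Return True if any record in the parsed response has this assembly."""
    # Assumption: the response body is a JSON array of objects with an "assembly" key.
    return any(record.get("assembly") == assembly for record in payload)


def fetch_genomes(base_url, token):
    """Send the request and parse the JSON body (requires network access)."""
    with urllib.request.urlopen(build_list_request(base_url, token)) as response:
        return json.load(response)
```

If the check returns `False`, that would be the point at which importing a custom reference genome makes sense.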
| This response confirms successful import and provides a unique **accession ID**. | ||
| The newly imported reference genome is now available in ODM and visible in the File Manager. |
I believe we should also mention that the imported reference genome becomes available only after successful initialisation; until then it cannot be used.
| ### **Preparing Metadata** | ||
| To upload VCF files, you must also provide a metadata file in TSV (tab-separated values) format. This file should include at least the following fields: |
> you must also provide a metadata file in TSV

Let's mention that the metadata file is needed in this particular case to specify the reference genome. I find the current wording confusing and would like to add more detail here, because in general the metadata file can be skipped.

> This file should include at least the following fields:

Why? To specify which reference genome should be used, the user only needs one of two attributes in the metadata file: either Genome Version or genestack.bio:organism.
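To make the point above concrete, here is a minimal sketch that writes a metadata TSV containing only the attribute(s) that select the reference genome. The function name and the sample values are hypothetical; only the two column names come from this thread.

```python
import csv
import io


def build_metadata_tsv(genome_version=None, organism=None):
    """Return TSV text with a header row and one data row."""
    columns = {}
    if genome_version is not None:
        columns["Genome Version"] = genome_version
    if organism is not None:
        columns["genestack.bio:organism"] = organism
    if not columns:
        raise ValueError("provide genome_version or organism")

    buffer = io.StringIO()
    # Tab-separated output with plain "\n" line endings.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(list(columns.keys()))
    writer.writerow(list(columns.values()))
    return buffer.getvalue()
```

For example, `build_metadata_tsv(genome_version="GRCh38")` yields a two-line file with a single `Genome Version` column.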
| To upload VCF files, you must also provide a metadata file in TSV (tab-separated values) format. This file should include at least the following fields: | ||
| * **Genome Version**: The exact name of the reference genome as it appears in ODM |
Genome Version may contain one of two values:
* assembly: if there are multiple releases (for example, 100 and 109), a link to the latest release (109) will be created; or
* name: a link to the exact release will be created.
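The rule above can be sketched as a small lookup. This only illustrates the described behaviour and is not ODM's implementation: the record keys `name`, `assembly`, and `release`, the sample names, and the choice to try an exact name match before an assembly match are all assumptions.

```python
def resolve_genome(genome_version, genomes):
    """Return the genome record a Genome Version value would link to, or None."""
    # Exact name match: link to that specific release.
    for genome in genomes:
        if genome["name"] == genome_version:
            return genome
    # Assembly match: link to the latest release of that assembly.
    candidates = [g for g in genomes if g["assembly"] == genome_version]
    if candidates:
        return max(candidates, key=lambda g: int(g["release"]))
    return None


# Hypothetical records: one assembly with two releases.
genomes = [
    {"name": "GRCh38.100", "assembly": "GRCh38", "release": "100"},
    {"name": "GRCh38.109", "assembly": "GRCh38", "release": "109"},
]
```

With these records, passing the assembly selects release 109, while passing `GRCh38.100` selects release 100.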
| Additional optional fields, such as **Version**, **Accession**, or **User**, may also be included and will not interfere with the upload. The system is flexible and accepts metadata files with varying numbers of columns. | ||
| !!! note "Metadata file examples" |
I'm confused by these examples and suggest using the information from this old article https://genestack.atlassian.net/wiki/spaces/~940367389/pages/3417047043/Working+with+Reference+genomes+version+ODM+1.53#Examples to show which options the user has and how they can be used. The number of other columns is unnecessary information when we are talking about reference genomes.
| As with other data types, the request should include: | ||
| * A **metadata file** with information about the reference genome and organism |
> and organism

This can be removed; it's not necessary to provide information about the organism.
| * A **metadata file** with information about the reference genome and organism | ||
| * A **VCF file** compressed **.vcf.gz** or plain **.vcf** (See example of a [VCF file](https://s3.amazonaws.com/bio-test-data/gVCF_Mm_Demo.vcf)) | ||
| * A **link structure** connecting the data to samples, libraries, or preparations |
Why? We don't provide this information in the body of the job endpoint request when importing VCF data.
| * A **link structure** connecting the data to samples, libraries, or preparations | ||
| !!! note "Important" | ||
| Unlike transcriptomics or flow cytometry data, **a reference genome must be specified** when importing VCF files. If no metadata is provided, the system defaults to using the **human reference genome (GRCh38)**. To use a different genome, you must include a metadata file where the **Genome Version** matches the name of a **previously imported custom reference genome** in your ODM instance. |
"must be specified" -> "can be specified"? Maybe it is okay for our client to use the default reference genome.
| * **organism**, **assembly**, **release**: Core genome attributes | ||
| * **annotationUrl**: Link to the annotation file used (e.g., GTF from Ensembl) | ||
| * **genestack:accession**: ODM accession for the reference genome | ||
| * **initializationStatus**: Should be COMPLETE if the genome is ready for use |
> Should be COMPLETE if the genome is ready for use

If I remember correctly, the user won't be able to import a VCF file if the metadata file references a reference genome that was not successfully initialised.
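A minimal sketch of a pre-upload check along these lines, assuming genome records are dictionaries carrying the `initializationStatus` and `assembly` fields listed in the diff. The helper name and the non-`COMPLETE` status value in the sample data are hypothetical.

```python
def ready_genomes(records):
    """Keep only genomes whose initialization finished successfully."""
    return [r for r in records if r.get("initializationStatus") == "COMPLETE"]


# Hypothetical records; "FAILED" is an illustrative non-COMPLETE value.
records = [
    {"assembly": "GRCh38", "initializationStatus": "COMPLETE"},
    {"assembly": "CustomAsm", "initializationStatus": "FAILED"},
    {"assembly": "NoStatus"},
]

# Only these assemblies would be safe to reference in a VCF metadata file.
usable = [r["assembly"] for r in ready_genomes(records)]
```

Running a check like this before building the metadata file would surface an unusable genome earlier than a rejected VCF import.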