53 changes: 52 additions & 1 deletion docs/tools/odm-sdk/terminal/study/uploading-study.md
Original file line number Diff line number Diff line change
@@ -34,6 +34,7 @@ odm-import-data -h
- `-sm,--samples`: URL of the samples file or accession of existing samples file to be linked
- `-lb, --libraries`: URL of the libraries file or accession of existing libraries file to be linked
- `-pr, --preparations`: URL of hosted preparations file or accession of existing preparations file to be linked
- `-c, --cell`: URL of hosted cell metadata file or accession of existing cell file to be linked
- `-e,--expression`: URL of any tabular data file (not only expression data) except Gene Variant or Flow Cytometry
- `-em,--expression-metadata`: URL of any tabular metadata file (not only expression data) except Gene Variant or Flow Cytometry
- `-v, --variant`: URL of the variants data file
@@ -73,7 +74,7 @@ Additional optional parameters:

## Data model

The script supports 2 data models:
The script supports several data models:
![Data Model](uploading-study/data-model.png)

- Study - Samples - Omics data:
@@ -82,6 +83,9 @@ The script supports 2 data models:
- the script uses this data model if parameters for libraries or preparations loading are specified;
- omics data can be linked only to libraries or preparations;
- only expression data (the parameters --expression and --expression-metadata) is supported.
- Study - Samples - (optional: Libraries/Preparations) - Cell metadata - Omics data:
- the script uses this data model if the parameter for cell metadata loading is specified;
- expression data can be linked to cell metadata.

The script works sequentially, linking the object with the previous one according to the data model. Below you can find
examples to demonstrate different combinations:
@@ -141,6 +145,25 @@ odm-import-data --token [token] -H [HOST] \
- `preparations_1` will be linked to `samples_2`
- `expression_1` will be linked to `preparations_1`

### _Example 4_

```shell
odm-import-data --token [token] -H [HOST] \
--study http://data_source/study.csv \
--samples http://data_source/samples_1.csv \
--samples http://data_source/samples_2.csv \
--libraries http://data_source/libraries_1.csv \
--cell http://data_source/cell_1.csv \
--expression http://data_source/expression_1.gct \
--expression-metadata http://data_source/expression_metadata_1.gct.tsv
```

- `samples_1` will be linked to `study`
- `samples_2` will be linked to `study`
- `libraries_1` will be linked to `samples_2`
- `cell_1` will be linked to `libraries_1`
- `expression_1` will be linked to `cell_1`

## Link all to all

The `-lata` parameter lets you bypass the restriction of sequential linking of objects. The behaviour of the script
@@ -355,3 +378,31 @@ odm-import-data --token [token] -H [HOST] \
--samples http://data_source/arabidopsis_sample_metadata_uncurated.tsv \
--expression http://data_source/arabidopsis.gct
```

### Study with single cell data

To work with Cell metadata and Cell expression, use the following example files:

- [Study_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/study_metadata.tsv), a tab-delimited file of the study attributes
- [Samples_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/samples.tsv), a tab-delimited file of sample attributes
- [Cell_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv), a tab-delimited file of cell attributes
- [Cell_expression](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv), a tab-delimited file of cell expression data

Run the script with the files above by typing the following (replacing `<HOST>` and `<TOKEN>` with
your own values; note that you may need to escape or quote strings depending on
your specific command line interface):

Script example (Study → Samples → Cells → Expression):

```default
odm-import-data \
--host <HOST> \
--token <TOKEN> \
--study 's3://bio-test-data/User_guide_test_data/Single_cell_data/study_metadata.tsv' \
--samples 's3://bio-test-data/User_guide_test_data/Single_cell_data/samples.tsv' \
--cell 's3://bio-test-data/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv' \
--expression 's3://bio-test-data/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv' \
--data-class 'Single-cell transcriptomics' \
--number-of-feature-attributes 1 \
--allow-duplicates
```
Binary file not shown.
Binary file not shown.
85 changes: 78 additions & 7 deletions docs/user-guide/doc-odm-user-guide/import-data-using-api.md
Original file line number Diff line number Diff line change
@@ -19,11 +19,12 @@ You can import studies, samples, and any data in the tabular format:

- **Study**: the context of an experiment, such as the aim and statistical design.
- **Sample**: the biological attributes of a sample, such as tissue, disease, and treatment.
- **Data**: Includes transcriptomics, proteomics, gene variant, flow cytometry data, and more. You can import the metadata (e.g. genome version, normalization
method, and the locations of raw/processed data in your storage) together with the processed data (e.g. expression counts, genotypes).
- **Cross-reference mapping**: a list of transcript and gene ids and how they map to each other.
- **Libraries metadata**: TSV file describing sequencing libraries or other indexable data types. It includes information on library preparation, type (e.g., single-end or paired-end), protocol, barcodes, and platform.
- **Preparations metadata**: metadata describing how samples were prepared prior to data generation, applicable to proteomics, transcriptomics, and other data types.
- **Cell metadata**: the information stored per cell (per barcode) that describes the cell and its context, separate from the actual molecular measurements (such as the gene expression counts matrix, which should be uploaded to ODM as expression data).
- **Data**: Includes transcriptomics, proteomics, gene variant, flow cytometry data, cell expression, and more. You can import the metadata (e.g. genome version, normalization
method, and the locations of raw/processed data in your storage) together with the processed data (e.g. expression counts, genotypes).
- **Cross-reference mapping**: a list of transcript and gene ids and how they map to each other.
- **Attached Files**: Supplement your study by attaching related research materials like PDF, XLSX, DOCX, PPTX files, images, and more. Please note, contents of these attached files won't be indexed or made searchable.

Once imported, studies, samples, and data metadata will be queryable and editable from both the User Interface and APIs, whilst the signal data will only be queryable via APIs.
@@ -35,17 +36,18 @@ Importing data has two stages. First, you import studies, samples, and data sepa

The **Sample Source ID** is used as the default linking key. You can choose another attribute from the template for linking data to samples. The data model and how it looks in the User Interface is shown below.

In addition to core data types, **Libraries** and **Preparations** require special handling. These files must include the **Sample Source ID**, which is used to link them to the appropriate samples.
In addition to core data types, **Libraries**, **Preparations**, and **Cell metadata** require special handling. These files must include the **Sample Source ID**, which is used to link them to the appropriate samples.

The correct order of linking follows the system logic and available endpoints:

- **Samples** are linked to a **Study**
- **Libraries** and **Preparations** are linked to **Samples**
- **Omics data** (e.g. transcriptomics, proteomics) are linked to **Samples**, or to **Libraries/Preparations** depending on the data type
- **Cell metadata** is linked to **Samples**, **Libraries**, or **Preparations**
- **Omics data** (e.g. transcriptomics, proteomics, cell expression) are linked to **Samples**, to **Libraries/Preparations**, or to **Cell metadata**, depending on the data type
- **Attached files** are linked directly to a **Study**


![image](doc-odm-user-guide/images/data-model+metainfo-editor.png)
![image](doc-odm-user-guide/images/data-model.png)
## Data Loading via APIs
To load data via APIs, each entity is created via a separate endpoint specific to
its data type. The entities are then sequentially linked in the Integration layer.
@@ -271,6 +273,51 @@ As soon as the import process will be completed, you will be able to get the pre
}
}
```

### Import Cell metadata

To work with Cell metadata and Cell expression, use the following example files:

- [Study_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/study_metadata.tsv), a tab-delimited file of the study attributes
- [Samples_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/samples.tsv), a tab-delimited file of sample attributes
- [Cell_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv), a tab-delimited file of cell attributes
- [Cell_expression](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv), a tab-delimited file of cell expression data

To import Cell metadata, use the `POST /api/v1/jobs/import/cells` endpoint:

```default
curl -X 'POST' \
'https://<HOST>/api/v1/jobs/import/cells?allow_dups=false' \
-H 'accept: application/json' \
-H 'Genestack-API-Token: <TOKEN>' \
-H 'Content-Type: application/json' \
-d '{
"dataLink": "https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv"
}'
```

Similar to the previous step, you should see the **jobExecId** in the response:

```json
{
"jobExecId": 24,
"startedBy": "job@genestack.com",
"jobName": "IMPORT_CELLS",
"status": "COMPLETED",
"createTime": "2026-02-05 11:35:36",
"endTime": "2026-02-05 11:35:38"
}
```
As soon as the import process is completed, you can get the Cell metadata **groupAccession** by passing the **jobExecId** to the `GET /api/v1/jobs/{jobExecId}/output` endpoint:

```json
{
"status": "COMPLETED",
"result": {
"groupAccession": "GSF016786"
}
}
```
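The lookup above can be scripted. This is a minimal sketch, not a definitive client: `job_output_url` is a hypothetical helper name, `odm.example.com` is a placeholder host, and the commented `curl` call assumes the same `Genestack-API-Token` header used in the import request.

```shell
#!/bin/sh
# Hypothetical helper: build the job-output URL for a host and a jobExecId.
# $1 = host, $2 = jobExecId
job_output_url() {
  printf 'https://%s/api/v1/jobs/%s/output\n' "$1" "$2"
}

# Example with the jobExecId from the response above (placeholder host).
job_output_url 'odm.example.com' 24

# Against a live server (replace <HOST> and <TOKEN>):
# curl -s -X GET "$(job_output_url '<HOST>' 24)" \
#   -H 'accept: application/json' \
#   -H 'Genestack-API-Token: <TOKEN>'
```

Keeping the URL construction in a small function makes it easy to reuse the same `jobExecId` with the other job endpoints.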

### Linking entities

#### Samples to Study
@@ -328,6 +375,31 @@ If successful you will see a preparation tab appear in the Metadata Editor:

![image](doc-odm-user-guide/images/preparation-added.png)

#### Cell metadata to Samples/Libraries/Preparations

You can link the **Cell metadata group** to a **Sample, Library, or Preparation group** using the following endpoints:

* Link to Samples

**Path:** POST `/api/v1/as-curator/integration/link/cell/group/{sourceId}/to/sample/group/{targetId}`

* Link to Libraries

**Path:** POST `/api/v1/as-curator/integration/link/cells/group/{sourceId}/to/library/group/{targetId}`

* Link to Preparations

**Path:** POST `/api/v1/as-curator/integration/link/cells/group/{sourceId}/to/preparation/group/{targetId}`

For the `sourceId` field, provide the accession of your Cell metadata group.

For the `targetId` field, provide the accession of the Sample, Library, or Preparation group to which the Cell metadata should be linked.
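
As an illustration of how `sourceId` and `targetId` fit into the path, here is a minimal sketch that builds the link URL. It covers only the Library and Preparation endpoints, whose paths share the same shape. `cell_link_url` is a hypothetical helper name, `odm.example.com` is a placeholder host, `GSF000001` is a made-up Library group accession, and the headers in the commented `curl` call are assumed to match the import request.

```shell
#!/bin/sh
# Hypothetical helper: build the URL that links a Cell metadata group to a
# Library or Preparation group, using the paths listed above.
# $1 = host, $2 = sourceId, $3 = library | preparation, $4 = targetId
cell_link_url() {
  case "$3" in
    library|preparation) ;;
    *) echo "unsupported target type: $3" >&2; return 1 ;;
  esac
  printf 'https://%s/api/v1/as-curator/integration/link/cells/group/%s/to/%s/group/%s\n' \
    "$1" "$2" "$3" "$4"
}

# GSF016786 is the Cell metadata groupAccession from the job output example;
# GSF000001 is a made-up Library group accession.
cell_link_url 'odm.example.com' 'GSF016786' 'library' 'GSF000001'

# Against a live server (replace <HOST> and <TOKEN>):
# curl -X POST "$(cell_link_url '<HOST>' 'GSF016786' 'library' 'GSF000001')" \
#   -H 'accept: application/json' \
#   -H 'Genestack-API-Token: <TOKEN>'
```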

Cell metadata is linked where `batch` values in the Cell metadata match the `Sample Source ID` for Samples,
the `Library ID` for Libraries, or the `Preparation ID` for Preparations.
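
Before linking to Samples, it can help to check locally that the `batch` values have counterparts. The sketch below is one possible pre-check, not part of ODM: it assumes exact string matching, tab-delimited files with a header row, and uses hypothetical helper names (`column_values`, `unmatched_batches`) and made-up demo columns (`Cell ID`, `Tissue`).

```shell
#!/bin/sh
# Print the sorted unique values of a named column from a tab-delimited file.
# $1 = column name, $2 = file
column_values() {
  awk -F '\t' -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col }
  ' "$2" | sort -u
}

# Print batch values from the cell file that have no matching Sample Source ID.
# $1 = cell metadata TSV, $2 = samples TSV
unmatched_batches() {
  cells_tmp=$(mktemp) || return 1
  samples_tmp=$(mktemp) || return 1
  column_values 'batch' "$1" > "$cells_tmp"
  column_values 'Sample Source ID' "$2" > "$samples_tmp"
  comm -23 "$cells_tmp" "$samples_tmp"   # lines only in the cell file
  rm -f "$cells_tmp" "$samples_tmp"
}

# Tiny made-up demo files.
demo_dir=$(mktemp -d)
printf 'Sample Source ID\tTissue\nS1\tliver\nS2\tlung\n' > "$demo_dir/samples.tsv"
printf 'Cell ID\tbatch\nc1\tS1\nc2\tS3\n' > "$demo_dir/cells.tsv"

unmatched_batches "$demo_dir/cells.tsv" "$demo_dir/samples.tsv"   # prints S3
```

An empty result means every `batch` value has a matching `Sample Source ID` in the samples file.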

If the linking is successful, the cells are returned by the `GET /api/v1/as-curator/omics/cells` endpoint when the Study accession is provided in the `studyQuery` parameter.

### Working with the jobExecId
The following endpoints allow you to manage and inspect jobs using the jobExecId, which is returned after initiating an asynchronous import task.

@@ -937,4 +1009,3 @@ Example response:
]
}
```

Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ to be able to import and edit data in ODM.

Read the full list of requirements [here](../../../tools/odm-sdk/terminal/study/uploading-study/#requirements)

## Optional experimental (signal) data files
## Optional files

You can optionally also provide:

@@ -17,7 +17,8 @@ You can optionally also provide:
- The server address if you want to apply the script to a different ODM server.
Use `--host <HOST>` to specify.
- Any data in the Tabular format (Data Frame) as a TSV, hosted at an HTTPS web address
- Gene expression data in [GCT](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GCT:_Gene_Cluster_Text_file_format_.28.2A.gct.29) format, hosted at an HTTPS web address
- Gene expression data in [GCT](https://docs.gsea-msigdb.org/#GSEA/Data_Formats/#gct-gene-cluster-text-file-format-gct) format, hosted at an HTTPS web address
- Gene expression or Cell expression data in TSV format, hosted at an HTTPS web address
- Gene expression metadata in TSV format, hosted at an HTTPS web address
- Gene variant data in [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) format, hosted at an HTTPS web address
- Gene variant metadata in TSV format, hosted at an HTTPS web address
@@ -29,9 +30,10 @@ You can optionally also provide:
- A libraries file in TSV format, hosted at an HTTPS web address, or the
accession of an existing library file
- A preparations file in TSV format, hosted at an HTTPS web address, or the
accession of an existing preparations file.
accession of an existing preparations file
- A Cell metadata file in TSV format, hosted at an HTTPS web address

Once imported, studies, samples, and signal metadata will be queryable and
Once imported, studies, samples, libraries, preparations, cell metadata, and signal metadata will be queryable and
editable from both the User Interface and APIs, whilst the signal data will
only be queryable via APIs.

@@ -87,6 +89,11 @@ Optionally include data files by appending any or all of the following to the ab
```default
--preparations [URL]
```

```default
--cell [URL]
```

## Importing Multiple Tabular Files

- [Test_basic_generic_expression.tsv](https://bio-test-data.s3.us-east-1.amazonaws.com/odm/user-guide/Test_basic_generic_expression.tsv), a tab-separated file containing tabular expression data with two text features and two numeric features, followed by expression values for four samples.
@@ -156,16 +163,38 @@ accessions must be supplied. See the example below:
The following are some example files to illustrate file formats:

- [Test_1000g.study.tsv](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.study.tsv), a tab-delimited file of the study attributes
- [Test_1000g.samples.tsv](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.samples.tsv), a tab-delimited file of sample attributes.
- [Test_1000g.gct](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct), a [GCT](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GCT:_Gene_Cluster_Text_file_format_.28.2A.gct.29) file of expression data from multiple sequencing runs
- [Test_1000g.samples.tsv](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.samples.tsv), a tab-delimited file of sample attributes
- [Test_1000g.gct](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct), a [GCT](https://docs.gsea-msigdb.org/#GSEA/Data_Formats/#gct-gene-cluster-text-file-format-gct) file of expression data from multiple sequencing runs
- [Test_1000g.gct.tsv](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct.tsv), a tab-separated file that describes the expression data
- [Test_1000g.vcf](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf), a [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) file of variant data from multiple sequencing runs
- [Test_1000g.vcf.tsv](https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf.tsv), a tab-separated file that describes the variant data

To work with Cell metadata and Cell expression, use the following example files:

- [Study_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/study_metadata.tsv), a tab-delimited file of the study attributes
- [Samples_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/samples.tsv), a tab-delimited file of sample attributes
- [Cell_metadata](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv), a tab-delimited file of cell attributes
- [Cell_expression](https://bio-test-data.s3.us-east-1.amazonaws.com/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv), a tab-delimited file of cell expression data

Run the script with the Test_1000g files above by typing the following (inserting your token
instead of [token]; note that you may need to escape or quote strings depending on
your specific command line interface):

```default
odm-import-data --token [token] --host [HOST] --study https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.study.tsv --samples https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.samples.tsv --expression https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct --expression_metadata https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct.tsv --variant https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf --variant_metadata https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf.tsv
```

Script example for the single-cell files (Study → Samples → Cells → Expression):

```default
odm-import-data \
--host <HOST> \
--token <TOKEN> \
--study 's3://bio-test-data/User_guide_test_data/Single_cell_data/study_metadata.tsv' \
--samples 's3://bio-test-data/User_guide_test_data/Single_cell_data/samples.tsv' \
--cell 's3://bio-test-data/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv' \
--expression 's3://bio-test-data/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv' \
--data-class 'Single-cell transcriptomics' \
--number-of-feature-attributes 1 \
--allow-duplicates
```