feat!(backend): refactor multi-segment submission (2/n) #5398
anna-parker merged 18 commits into `edit-page-anya`
@codex review
Nitpick: IMO the commit message shouldn't be "refactor" but instead describe that
Could you please explain why we need the I think initially we agreed to still stick to the pattern
I think the
    val metadataFastaIds = uploadDatabaseService.getFastaIdsForMetadata(uploadId)
    val metadataFastaIdsSet = metadataFastaIds.flatten().toSet()
    if (metadataFastaIdsSet.size < metadataFastaIds.flatten().size) {
        throw UnprocessableEntityException("Metadata file contains duplicate fastaIds.")
    }
Put into validate function similar to below for submission ids
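One way the suggested extraction could look, as a minimal sketch only. The helper names `findDuplicateFastaIds` and `validateNoDuplicateFastaIds` are hypothetical, the exception class is a local stand-in for the backend's `UnprocessableEntityException`, and the input is assumed to be one list of fastaIds per metadata entry (the shape implied by the `flatten()` call above). Unlike the reviewed snippet, the sketch also names the offending ids in the error message.

```kotlin
// Local stand-in for the backend's exception type (assumption, for self-containment).
class UnprocessableEntityException(message: String) : RuntimeException(message)

// Returns every fastaId that occurs more than once, in order of first occurrence.
fun findDuplicateFastaIds(metadataFastaIds: List<List<String>>): List<String> =
    metadataFastaIds.flatten()
        .groupingBy { it }
        .eachCount()
        .filterValues { it > 1 }
        .keys
        .toList()

// Hypothetical validate function: throws if any fastaId is listed more than once.
fun validateNoDuplicateFastaIds(metadataFastaIds: List<List<String>>) {
    val duplicates = findDuplicateFastaIds(metadataFastaIds)
    if (duplicates.isNotEmpty()) {
        throw UnprocessableEntityException(
            "Metadata file contains duplicate fastaIds: ${duplicates.joinToString(", ")}",
        )
    }
}
```

Counting occurrences once avoids flattening the list twice and makes the duplicate ids available for the error message.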
resolves #4847

### Screenshot

Improves #4821, comes after #5398

You can use pathoplexus/dev_example_data#2 for testing.

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype: entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

## Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged config.
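To illustrate the fasta-header fallback for organisms without a nextclade dataset, here is a rough Kotlin sketch. It is not the preprocessing pipeline's code: the function names are hypothetical, and it assumes the segment name is everything after the last underscore (so a `submissionId` may itself contain underscores, but a segment name may not). The resulting map has the same direction as `sequenceNameToFastaHeaderMap`, i.e. segment to fastaHeader.

```kotlin
// Extracts the segment name from a header of the form <submissionId>_<segmentName>.
// Assumption: the segment name is the part after the LAST underscore.
// Returns null when there is no underscore or nothing follows it.
fun segmentFromHeader(fastaHeader: String): String? =
    fastaHeader.substringAfterLast('_', missingDelimiterValue = "").ifEmpty { null }

// Builds a segment -> fastaHeader map; headers without a segment suffix are skipped.
// If two headers share a segment, the later one wins (a real pipeline would likely report an error).
fun segmentToHeaderMap(fastaHeaders: List<String>): Map<String, String> =
    fastaHeaders
        .mapNotNull { header -> segmentFromHeader(header)?.let { segment -> segment to header } }
        .toMap()
```

For example, the headers `sample1_L` and `sample1_M` would be mapped to the segments `L` and `M` respectively.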
### PR Checklist

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the test folder were submitted with the same fastaHeader as the submissionId -> this succeeded; additionally, the submission of CCHF with a fastaID column in the metadata was tested (also in the folder above), and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org
resolves #4708, #4734

partially resolves #5392, #5185 (comment)

Builds on #5382

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaId` column with a space- or comma-separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied, the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype: entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

### Testing

You can use pathoplexus/dev_example_data#2 for testing.
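The `fastaId` rule from the breaking-changes section can be sketched roughly as follows. This is illustrative only, not the backend implementation: `MetadataRow`, `parseFastaIds` and `fastaIdsFor` are hypothetical names, and treating a blank `fastaId` cell the same as a missing column is an assumption of the sketch, not something the description states.

```kotlin
// Hypothetical row type: one metadata entry with an optional raw `fastaId` cell.
// `fastaIdCell` is null when the metadata file has no `fastaId` column.
data class MetadataRow(val submissionId: String, val fastaIdCell: String?)

// Splits a `fastaId` cell on whitespace and/or commas, dropping empty tokens.
fun parseFastaIds(cell: String): List<String> =
    cell.split(Regex("[\\s,]+")).filter { it.isNotEmpty() }

// Falls back to the submissionId (one-to-one mapping) when no fastaId value is given.
// Assumption: a blank cell is handled like a missing column.
fun fastaIdsFor(row: MetadataRow): List<String> {
    val parsed = row.fastaIdCell?.let(::parseFastaIds).orEmpty()
    return parsed.ifEmpty { listOf(row.submissionId) }
}
```

With this sketch, a row whose `fastaId` cell is `s1_L, s1_M s1_S` groups three sequences under one metadata entry, while a row without the column maps to a single sequence whose fasta header equals its `submissionId`.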
### Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged config.
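For anyone migrating an existing config, the dictionary-to-list change can be pictured with the small Kotlin sketch below. It only models the shape of the data: `NucleotideSequenceConfig`, `toNucleotideSequences` and `resolveServer` are hypothetical names, only a subset of the fields is modelled, and the actual config is YAML consumed by the preprocessing pipeline rather than Kotlin objects.

```kotlin
// Hypothetical model of one entry in the new `nucleotideSequences` list (subset of fields).
data class NucleotideSequenceConfig(
    val name: String,
    val nextcladeDatasetName: String,
    val nextcladeDatasetServer: String? = null, // optional per-sequence override
)

// Converts the old per-segment dictionary into the new list form,
// keeping the iteration order of the dictionary entries.
fun toNucleotideSequences(datasetNames: Map<String, String>): List<NucleotideSequenceConfig> =
    datasetNames.map { (segment, dataset) ->
        NucleotideSequenceConfig(name = segment, nextcladeDatasetName = dataset)
    }

// Resolves the server for one entry: the per-sequence override wins,
// otherwise the shared top-level `nextclade_dataset_server` is used.
fun resolveServer(entry: NucleotideSequenceConfig, defaultServer: String): String =
    entry.nextcladeDatasetServer ?: defaultServer
```

Keeping the server as an optional per-entry field mirrors the `<optional overwrites nextclade_dataset_server for this seq>` note in the config above.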
### PR Checklist

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the test folder were submitted with the same fastaHeader as the submissionId -> this succeeded; additionally, the submission of CCHF with a fastaID column in the metadata was tested (also in the folder above), and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. 
Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. 
- [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. 
Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. 
- [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. 
Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. 
- [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. 
Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. 
- [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734 partially resolves #5392, #5185 (comment) Builds on #5382 When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaId` column with a space (or comma) -separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`. This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings) Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. You can use pathoplexus/dev_example_data#2 for testing. 
Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. 
- [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://multi-segment-submission.loculus.org --------- Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
resolves #4708, #4734
partially resolves #5392, #5185 (comment)
Builds on #5382

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaId` column with a space- (or comma-) separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied, the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype; entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

You can use pathoplexus/dev_example_data#2 for testing.

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged config.

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the test folder were submitted with the same fastaHeader as the submissionId -> this succeeded; additionally, the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping
- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)

resolves #4999, #4708, #4734, #5511
partially resolves #5392, #5185 (comment)
includes work done in #5398 and #5402

This PR additionally fixes submission, subtype assignment and search for EVs and other multi-pathogen organisms.

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaIds` column with a space-separated list of the `fastaId`s (fasta header IDs) of the respective sequences. If no `fastaIds` column is supplied, the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).

Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments):

```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype; entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaId in the processedData. The map is called `sequenceNameToFastaId`. This allows us to surface the segment assignment on the edit page.

### Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a dictionary where each item includes all information required to run nextclade. I.e. we change from:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

to:

```
nextclade_sequence_and_datasets:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```

### Ingest Pipeline Config changes

`minimizer_index` is changed to `minimizer_url` for consistency (it can be used in both ingest and preprocessing, and the value should be the same in both).

### Optional additional Config changes

Limit the number of sequences the backend will accept per submission by using `maxSequencesPerEntry` (this should be added for multi-segmented organisms):

```
submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
```

### Testing

You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing.

### PR Checklist

- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in templates): #5561
- [x] Fix how genes are returned (will cause a config update): #5563
- [x] Improve prepro code (less duplication and more tests): #5554
- [x] ingest EVs as single segmented to ensure search works: #5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the test folder were submitted with the same fastaHeader as the submissionId -> this succeeded; additionally, the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) -> decided against
- [x] update PPX docs with new multi-segment submission format -> test PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo

🚀 Preview: https://edit-page-anya.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
resolves #4708, #4734
partially resolves #5392, #5185 (comment)
Builds on #5382
### BREAKING CHANGES
When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaId` column with a space- (or comma-) separated list of the `fastaIds` (fasta header IDs) of the respective sequences. If no `fastaId` column is supplied, the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).
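
The `fastaId` rule described under BREAKING CHANGES can be sketched as follows. This is only an illustration in Python, not the backend's Kotlin implementation: the function name `resolve_fasta_ids`, the row-as-dict representation, and the per-row fallback when a `fastaId` value is empty are all assumptions. The duplicate check mirrors the "Metadata file contains duplicate fastaIds." validation discussed in the review.

```python
import re


def resolve_fasta_ids(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each submissionId to the fastaIds grouped under it (hypothetical helper)."""
    mapping: dict[str, list[str]] = {}
    seen: set[str] = set()
    for row in rows:
        submission_id = row["submissionId"]
        raw = (row.get("fastaId") or "").strip()
        if raw:
            # Split on runs of whitespace and/or commas
            fasta_ids = [part for part in re.split(r"[\s,]+", raw) if part]
        else:
            # Assumption: with no fastaId given, fall back to a one-to-one mapping
            fasta_ids = [submission_id]
        for fasta_id in fasta_ids:
            if fasta_id in seen:
                raise ValueError(f"Metadata file contains duplicate fastaId: {fasta_id}")
            seen.add(fasta_id)
        mapping[submission_id] = fasta_ids
    return mapping
```

For example, a row with `submissionId` `s1` and `fastaId` `a b,c` would be grouped with the three sequences `a`, `b` and `c`, while a row without a `fastaId` value is matched to the sequence whose header equals its `submissionId`.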
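Nextclade sort will be used to assign segments/subtypes for all aligned sequences; the minimizer index it uses is configured via `minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>`.

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype; entries must have the format `<submissionId>_<segmentName>` (as in the current setup).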
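
A minimal sketch of the header-based fallback, assuming headers follow `<submissionId>_<segmentName>` and that the list of valid segment names is known from the organism config. The function `split_fasta_header` is hypothetical and written in Python for illustration; it is not taken from the preprocessing code. Matching against known segment names rather than splitting on the last underscore is a design assumption, so that a `submissionId` containing underscores still works.

```python
def split_fasta_header(header: str, segment_names: list[str]) -> tuple[str, str]:
    """Split '<submissionId>_<segmentName>' into (submissionId, segmentName)."""
    # Try longer segment names first so that a name which is a suffix of
    # another name does not match prematurely.
    for segment in sorted(segment_names, key=len, reverse=True):
        suffix = f"_{segment}"
        if header.endswith(suffix) and len(header) > len(suffix):
            return header[: -len(suffix)], segment
    raise ValueError(f"Header '{header}' does not end with a known segment name")
```

With segment names `L`, `M` and `S`, the header `sample_1_L` would be assigned to segment `L` of submission `sample_1`, and a header without a recognised suffix is rejected.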
As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.
### Testing
You can use pathoplexus/dev_example_data#2 for testing.
### Prepro config changes
Instead of having a dictionary for the nextclade datasets and servers, we make `nucleotideSequences` a list of sequences, where each entry carries its own nextclade dataset settings.

Note that the templates now also generate the genes list from the merged config.
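
To illustrate the shape change, here is a hedged Python sketch that converts the dictionary-style layout into the list-style layout quoted in the commit message for this change. The helper `migrate_nextclade_config` is hypothetical and not part of the repository. It copies only `name` and `nextclade_dataset_name` per sequence, keeps the organism-level `nextclade_dataset_server` as the default, and drops the top-level `genes` list on the assumption that the templates now generate it.

```python
def migrate_nextclade_config(old: dict) -> dict:
    """Convert the old dictionary-style config into the list-style layout."""
    new: dict = {
        "nucleotideSequences": [
            {"name": segment, "nextclade_dataset_name": dataset}
            for segment, dataset in old["nextclade_dataset_name"].items()
        ]
    }
    # The organism-level server stays at the top level as the default;
    # individual sequences may override it with their own value.
    if "nextclade_dataset_server" in old:
        new["nextclade_dataset_server"] = old["nextclade_dataset_server"]
    # Assumption: 'genes' is intentionally not copied, since the description
    # says the templates generate the genes list from the merged config.
    return new
```

Optional per-sequence keys such as `nextclade_dataset_tag`, `accepted_sort_matches` or `gene_prefix` would still have to be added by hand where needed.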
### PR Checklist
### Future Work
🚀 Preview: https://multi-segment-submission.loculus.org