113 commits
39c2a37
Update group-label semantics
stschiff Dec 17, 2024
083534a
added gzip to the schema
stschiff Jan 16, 2025
d88b213
Updated PDF via Quarto
TCLamnidis Jan 21, 2025
f1c89c0
capitalised MUST for gzip file ending
stschiff Jan 31, 2025
0e9dda4
added VCF
stschiff Feb 5, 2025
4c2697f
made Collection_ID a list column
stschiff Feb 5, 2025
30c1f52
Update poseidon_package_specification.pdf
TCLamnidis Feb 10, 2025
e8f7dda
Merge pull request #91 from poseidon-framework/collection_id_list
nevrome Feb 23, 2025
97ca9cc
Merge branch 'master' into dev
nevrome Feb 23, 2025
e3f5448
started the next changelog entry
nevrome Feb 23, 2025
23e0172
Merge pull request #86 from poseidon-framework/group-semantics
nevrome Feb 23, 2025
b026503
changelog
nevrome Feb 23, 2025
2acc9d3
added missing pipe to table
stschiff Feb 24, 2025
d4e9a4b
added Custodian_Institution column
nevrome Feb 24, 2025
bb95059
added two columns for chromosomal anomalies
nevrome Feb 24, 2025
d98daa1
Update janno_columns.tsv
stschiff Mar 7, 2025
04c1d36
Update janno_columns.tsv
stschiff Mar 7, 2025
8864aaa
Merge pull request #95 from poseidon-framework/chromosomalAnomaly
stschiff Mar 7, 2025
09a1366
Merge pull request #94 from poseidon-framework/curatingInstitution
stschiff Mar 7, 2025
de5c143
proposed two new .janno columns for cultural affiliation
nevrome Mar 18, 2025
fbf8f54
transformed % values to 0-1 fractions
nevrome Mar 18, 2025
d47f501
replaced Source_Tissue with Source_Material and Source_Material_Note
nevrome Mar 18, 2025
cb4a993
removed Capture_Type ReferenceGenome
nevrome Mar 18, 2025
e1de8ea
updated the changelog
nevrome Mar 18, 2025
b864bf0
Update janno_columns.tsv
nevrome Mar 27, 2025
8f7acb8
Update janno_columns.tsv
nevrome Mar 27, 2025
9d8521a
update of changelog
nevrome Mar 27, 2025
fd687b9
Merge branch 'dev' into zeroone
nevrome Mar 27, 2025
5b40fd6
Merge pull request #99 from poseidon-framework/zeroone
nevrome Mar 27, 2025
e84f601
Merge branch 'dev' into noRefGenome
nevrome Mar 31, 2025
bba6ed9
changelog
nevrome Mar 31, 2025
f6cb438
Merge pull request #101 from poseidon-framework/noRefGenome
nevrome Mar 31, 2025
f84f08c
Merge branch 'dev' into archContext
nevrome Mar 31, 2025
e1161e8
proposal for _URL columns for cultural eras and archaeological cultures
nevrome Mar 31, 2025
d6a5f77
Merge branch 'dev' into sourceMaterial
nevrome Mar 31, 2025
c5a524b
changelog
nevrome Mar 31, 2025
02a1388
Merge pull request #100 from poseidon-framework/sourceMaterial
nevrome Mar 31, 2025
4081709
proposal for the damage column
nevrome Mar 31, 2025
70149b2
Merge pull request #102 from poseidon-framework/multiDamage
nevrome Apr 9, 2025
6453f11
made some changes to make the format more species-agnostic
stschiff Apr 14, 2025
30748f7
Merge branch 'dev' into species_agnostic
stschiff Apr 14, 2025
c9fbf1e
added species as Janno field
stschiff Apr 14, 2025
56c88a5
added reference genome columns
stschiff Apr 14, 2025
dba7e5f
m
stschiff Apr 14, 2025
dfff67e
Merge branch 'dev' into archContext
nevrome Apr 15, 2025
ce8e534
changelog
nevrome Apr 15, 2025
413ac3a
Merge pull request #98 from poseidon-framework/archContext
nevrome Apr 15, 2025
b4f2662
Merge branch 'dev' into species_agnostic
nevrome Apr 15, 2025
ea68165
changelog
nevrome Apr 15, 2025
5121656
Add submitted_md5 column
TCLamnidis Apr 15, 2025
23caced
Update changelog.md
TCLamnidis Apr 17, 2025
06d877a
Fix section header
TCLamnidis Apr 17, 2025
4cc2908
Merge pull request #104 from poseidon-framework/add_ssf_column
TCLamnidis Apr 17, 2025
3a174ec
added some more info for the genotype file formats
stschiff May 12, 2025
878d612
moved Species and Reference Genome information to yaml
stschiff May 12, 2025
ff4f52d
updated changelog
stschiff May 13, 2025
802545d
Merge pull request #89 from poseidon-framework/gzip-support
stschiff May 13, 2025
cc88d7b
moved Species back to the Janno
stschiff May 13, 2025
7850463
updated Changelog
stschiff May 13, 2025
97dd0d9
Merge branch 'dev' into species_agnostic
stschiff May 13, 2025
327c001
Merge pull request #103 from poseidon-framework/species_agnostic
stschiff May 13, 2025
43a6137
another update of the changelog
nevrome Jun 9, 2025
8bd393a
added Source_Material category hair as requested by @Mattists
nevrome Jul 2, 2025
a06abe8
started to think about the allowed characters for Poseidon_ID and Gro…
nevrome Jul 2, 2025
c43576e
simpler, clearer definition of the ASCII limitation, now also in the …
nevrome Jul 6, 2025
6e9850a
recommended the use of LF over CRLF
nevrome Jul 6, 2025
d33e6cf
Merge pull request #106 from poseidon-framework/CRLF2
nevrome Jul 16, 2025
f9d66d7
Merge branch 'dev' into ASCII
nevrome Jul 16, 2025
6736eaf
Merge pull request #105 from poseidon-framework/ASCII
nevrome Jul 16, 2025
59e4a21
changelog update
nevrome Jul 16, 2025
f51879b
update of README title
nevrome Jul 16, 2025
31900ef
note fields should not be list-columns
nevrome Aug 28, 2025
a256a62
moved species column below the mandatory columns
nevrome Aug 28, 2025
844d057
added information on group names and genetic sex headers in VCF
stschiff Sep 5, 2025
b0f3ffa
add Individual_ID as mandatory column
stschiff Sep 5, 2025
2fd2807
made Individual_ID optional again in README
stschiff Sep 9, 2025
ecfe343
Update Individual_ID description in janno_columns.tsv to make optional
stschiff Sep 9, 2025
4453b46
removed _Note columns from the .jano column specification file
nevrome Sep 10, 2025
8dc337d
specification of the now general concept of _Note columns in the README
nevrome Sep 10, 2025
48af7fe
update of changelog
nevrome Sep 10, 2025
88ad934
updated field descriptions for Individual_ID
stschiff Sep 12, 2025
3eddcb8
further update to Related_To
stschiff Sep 12, 2025
b61c9a6
removed reference to trident
stschiff Sep 12, 2025
1c55ac4
Added mention of VCF group_names and genetic_sex headers
stschiff Sep 12, 2025
57fcdab
Update README.md
nevrome Sep 13, 2025
39da7fb
Merge pull request #111 from poseidon-framework/noExpNoteFields
nevrome Sep 13, 2025
2588df0
draft of a better conceptual specification of the Poseidon_ID and the…
nevrome Sep 24, 2025
688f3fb
Merge pull request #112 from poseidon-framework/PoseidonIDDefinition
nevrome Sep 30, 2025
8ab500c
update of changelog
nevrome Oct 2, 2025
af32d80
adjusted the names of the new POSEIDON.yml fields referenceGenomeAsse…
nevrome Oct 2, 2025
a26108b
proposal how to implement the contextualisation of Alternative_IDs
nevrome Oct 2, 2025
4865fe2
added another capture type and some documentation on the individual p…
nevrome Dec 2, 2025
25af2cc
first draft of a data licencing specification
nevrome Dec 2, 2025
740b0e0
british english
nevrome Dec 2, 2025
85489fe
update of changelog
nevrome Dec 8, 2025
ad53e34
Merge pull request #113 from poseidon-framework/alternativeIDContext
nevrome Dec 8, 2025
0182dab
reverted wording in Alternative_ID about sample
stschiff Dec 9, 2025
8434d87
typo fix
stschiff Dec 9, 2025
e45e3c2
Merge branch 'dev' into add_individual_id
nevrome Dec 10, 2025
eb47355
small change in wording, update of changelog
nevrome Dec 10, 2025
9136f4d
Merge pull request #109 from poseidon-framework/add_individual_id
nevrome Dec 10, 2025
dc2beda
implemented the suggested changes
nevrome Dec 10, 2025
766a1af
update of changelog
nevrome Dec 10, 2025
46b3a94
Merge pull request #115 from poseidon-framework/licensing
nevrome Dec 10, 2025
50d2c82
renamed Carpenter2013 to WISC2013
nevrome Dec 11, 2025
b04420f
Merge pull request #114 from poseidon-framework/moreCaptureOptions
nevrome Dec 11, 2025
8e7b3f6
Mark 'name' field as required in POSEIDON_yml_fields
stschiff Jan 6, 2026
b2dc9c6
changed license spelling to US English
stschiff Jan 6, 2026
22377f5
added URL for the license
stschiff Jan 6, 2026
3fc74a8
Individual_ID is not necessarily unique within one package
nevrome Jan 8, 2026
1be349d
fixed license spelling throughout
stschiff Jan 16, 2026
6c0baa5
added url to license example
stschiff Jan 16, 2026
0237628
Merge pull request #116 from poseidon-framework/license-name-mandatory
stschiff Jan 16, 2026
14 changes: 10 additions & 4 deletions POSEIDON_yml_fields.tsv
@@ -8,11 +8,17 @@ email 1 contributor email of one contributor String Email TRUE
orcid 1 contributor orcid of one contributor String ORCID FALSE
packageVersion 0 package version (should be changed/incremented when the package is changed) String X.Y.Z TRUE
lastModified 0 date of last modification of the package (should be updated when the package is changed) Date YYYY-MM-DD FALSE
license 0 data license section FALSE
name 1 license short name of data license that applies for this package, usually a Creative Commons license String TRUE
url 1 license URL to the license String Path TRUE
file 1 license relative path to a license file (usually not necessary, the name is sufficient for standard licenses) String Path FALSE
genotypeData 0 genotype data section TRUE
format 1 genotypeData genotype data file format String EIGENSTRAT;PLINK TRUE
genoFile 1 genotypeData relative path to the geno file String Path TRUE
genoFileChkSum 1 genotypeData md5 checksum of the geno file String md5 hash FALSE
snpFile 1 genotypeData relative path to the snp file String Path TRUE
referenceGenomeAssembly 1 genotypeData reference genome name of the reference genome used, e.g. GRCh37 String FALSE
referenceGenomeAssemblyURL 1 genotypeData reference assembly accession URL from a public database, such as NCBI or Ensembl String URL FALSE
format 1 genotypeData genotype data file format String EIGENSTRAT;PLINK;VCF TRUE
genoFile 1 genotypeData relative path to the genotype file. If gzipped, MUST end with *.gz String Path TRUE
genoFileChkSum 1 genotypeData md5 checksum of the genotype file String md5 hash FALSE
snpFile 1 genotypeData relative path to the snp file. If gzipped, MUST end with *.gz String Path TRUE
snpFileChkSum 1 genotypeData md5 checksum of the snp file String md5 hash FALSE
indFile 1 genotypeData relative path to the ind file String Path TRUE
indFileChkSum 1 genotypeData md5 checksum of the ind file String md5 hash FALSE
80 changes: 66 additions & 14 deletions README.md
@@ -1,6 +1,8 @@
## The Poseidon Standard v2.7.1
## The Poseidon Standard v3.0.0

Poseidon is a solution for archaeogenetic genotype data organisation. This standard defines the core components of the Poseidon package.
Poseidon is a solution for archaeogenetic genotype data organisation. It is geared towards human data, but is to a large extent species-agnostic and can be used to track archaeogenetic data also of non-human species.

This standard defines a data structure: the **Poseidon package**. A Poseidon package stores genotype data with meta- and context information.

A .pdf version of the latest instance of this document can be downloaded [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/poseidon_package_specification.pdf).

@@ -10,14 +12,26 @@ A changelog documents the changes across different schema versions [here](https:

The key words *MUST*, *MUST NOT*, *REQUIRED*, *SHALL*, *SHALL NOT*, *SHOULD*, *SHOULD NOT*, *RECOMMENDED*, *MAY*, and *OPTIONAL* in this document are to be interpreted as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).

### Primary entities of a Poseidon package

The main operational entities in a Poseidon package are discrete sets of genotype data attributed to a single human or non-human individual, scientifically generated for archaeogenetic research questions. Within a Poseidon package each of these sets is assigned a unique identifier: the `Poseidon_ID`.

Generally, archaeogenetics operates on depositional contexts, e.g. graves, with one or multiple (ancient) human or non-human individuals. Usually, it is possible to attribute the (skeletal) remains within these contexts to individuals based on archaeological evidence and physical-anthropological analysis. Each individual can get sampled one or multiple times, either by directly probing their preserved tissue, or by sampling any reagent that contains their DNA (through whatever pathway or taphonomic process). From one such sample one or multiple extracts can be derived, which can be transformed into one or multiple libraries, which may or may not be subjected to a DNA capture protocol and then sequenced one or multiple times. The raw sequencing data can undergo various different forms of computational processing and eventually genotyping to produce the data relevant for most derived analyses and thus stored in a Poseidon package.

While the wetlab-processes yield a relatively predictable tree of separate physical and digital products for any given sample, the computational data-processing breaks the conceptual tree-ness by allowing for arbitrary conflation of sequencing data obtained through potentially separate means: Data from different libraries, for example, may be merged if they are from the same individual, even if they are not from the same sample.

`Poseidon_ID`s therefore represent one consciously selected end-point in the complex data preparation graph laid out above. Typically this end-point corresponds to an optimal result for a given individual, research question and publication.

For the sake of convenience and despite the lack of conceptual clarity, below we sometimes use the term *sample* to denote `Poseidon_ID` entities. Data aggregation on the level of physical samples is often sensible, and the term is conventionally used for analysis endpoints in the community of practice.

### The Poseidon package structure

A Poseidon package stores genotype data with context information for DNA samples from (ancient) (human) individuals. Packages are defined by the POSEIDON.yml file, which holds relative paths to all other files in a package.
A Poseidon package is defined by the POSEIDON.yml file, which holds relative paths to all other files in the package.

A package therefore MUST contain:

- A `POSEIDON.yml` file to formally define the package
- Genotype data in PLINK or EIGENSTRAT format
- Genotype data in PLINK, EIGENSTRAT or VCF format

It SHOULD additionally contain:

@@ -44,7 +58,11 @@ Switzerland_LNBA_Roswita/README.md
Switzerland_LNBA_Roswita/CHANGELOG.md
```

All text files in the package MUST be UTF-8 encoded.
### Text encoding

All text files in the package MUST be UTF-8 encoded. They SHOULD use Unix-style line endings, i.e. a single Line Feed (LF, `\n`) character, not a Carriage Return and Line Feed (CRLF) pair (`\r\n`) as in MS-DOS and Windows.

`Poseidon_ID`s and `Group_Name`s, i.e. the primary sample and group identifiers across `.janno`, `.ssf`, and genotype data files, MUST contain only characters from a subset of the 7-bit ASCII character set: the alphanumeric characters `A-Z`, `a-z`, `0-9`, and the symbols `_` (underscore), `-` (hyphen-minus), and `.` (period, dot or full stop).
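
The two rules above can be checked mechanically. The following Python sketch is only an illustration and not part of the standard: the function names are hypothetical, and rejecting empty identifiers is an assumption that the text above does not state explicitly.

```python
import re

# Our reading of the allowed character set: A-Z, a-z, 0-9, underscore,
# hyphen-minus and period. Requiring at least one character is an assumption.
ALLOWED_IDENTIFIER = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_identifier(value: str) -> bool:
    """Return True if value is non-empty and uses only the allowed ASCII characters."""
    return ALLOWED_IDENTIFIER.fullmatch(value) is not None


def has_crlf(raw: bytes) -> bool:
    """Return True if the raw file content contains a CRLF line ending."""
    return b"\r\n" in raw


print(is_valid_identifier("Switzerland_LNBA.SG-1"))  # allowed characters only
print(is_valid_identifier("Zürich_LNBA"))            # contains a non-ASCII character
print(has_crlf(b"Poseidon_ID\tGroup_Name\r\n"))      # Windows-style line ending
```

Reading files in binary mode before calling `has_crlf` avoids the newline translation that text mode applies on some platforms.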

### The `POSEIDON.yml` file

@@ -67,6 +85,10 @@ contributor:
email: paul.panther@example.edu
packageVersion: 1.1.2
lastModified: 2021-01-28
license:
name: CC BY 4.0
url: https://creativecommons.org/licenses/by/4.0/
file: license.md
genotypeData:
format: PLINK
genoFile: Switzerland_LNBA_Roswita.bed
@@ -117,27 +139,44 @@ When the `packageVersion` is changed, then the `lastModified` date MUST be updat

Packages SHOULD start at `packageVersion` `0.1.0`.

### Data licensing and the license.md file

Data licenses are a common way to grant the public permission to use a dataset under copyright law.

Poseidon packages MAY specify a license, and if so, SHOULD use [Creative Commons licenses](https://creativecommons.org/share-your-work/cclicenses).

Licenses are documented in the `POSEIDON.yml` file in the `license` section, either with just the `name`, or with a license `file`, or with both the `name` and a `file`. `name` SHOULD include a short string with name and version of the license, e.g. `CC BY 4.0`. The `file`, typically named `license.md`, MAY include the full text of a license, or a short notifier further contextualizing the entry in the `name` field. For example:

```default
The Poseidon package Switzerland_LNBA_Roswita © 2021 by Roswita Malone is licensed under Creative Commons Attribution 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
```

### Genotype data

Genotype data in Poseidon packages is stored either in (binary) PLINK or EIGENSTRAT format.
Genotype data in Poseidon packages is stored in (binary) PLINK, EIGENSTRAT or Variant Call Format (VCF).

| | PLINK (binary) | EIGENSTRAT | VCF |
|---|---|---|---|
| genotype file | [`.bed` (binary biallelic genotype table) or `.bed.gz`](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file) or `.geno.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | [`.vcf` or `.vcf.gz`](https://samtools.github.io/hts-specs/VCFv4.2.pdf) |
| SNP file | [`.bim` (extended MAP file) or `.bim.gz`](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file) or `.snp.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | |
| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | |

| | PLINK (binary) | EIGENSTRAT |
|---|---|---|
| genotype file | [`.bed` (binary biallelic genotype table)](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67)
| SNP file | [`.bim` (extended MAP file)](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
Both PLINK and EIGENSTRAT formats require three files to be specified. In PLINK, the genotype file is binary (with 2 bits per genotype), while in EIGENSTRAT, the genotype file is text-based (with 8 bits per genotype). The SNP and individual files are text-based for both formats (see links behind the file endings in the table above). The EIGENSTRAT format specifically is common within archaeogenetics and compatible with many important tools, e.g. [EIGENSOFT](https://github.com/DReichLab/EIG) and [ADMIXTOOLS](https://github.com/DReichLab/AdmixTools). Finally, the VCF format is the most formally specified format, with properly versioned specifications being released regularly. VCF is well established in the wider genetics community and the de facto standard for storing variants in the field of medical genetics.

In addition to these files (and optionally their checksums), the POSEIDON.yml file SHOULD also provide a `snpSet` entry which determines the shape of the genotype file.
VCF files, as well as genotype and SNP files in PLINK and EIGENSTRAT format, can be stored in gzipped form, signified by an additional file ending (`*.gz`).

To make VCF files fully convertible to PLINK and EIGENSTRAT, they MUST be biallelic and contain only genotypes coded as `0/0`, `0/1`, `1/1`, `./.`. Furthermore, they MAY encode group names and genetic sex for all samples through the special header fields `##group_names=name1,name2,...` and `##genetic_sex=F,U,M,...`, respectively. If these fields are not present, then group names are assumed to be "unknown" and genetic sex "U" (unknown) for all samples.
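
As an illustration of the VCF rules above, the following Python sketch reads the two special header fields and checks genotype codes. It is a simplified assumption-laden example, not a reference implementation: the function names and the sample header values are hypothetical, and the genotype check expects GT values that have already been extracted from the sample columns.

```python
# The four genotype codes named in the paragraph above.
ALLOWED_GENOTYPES = {"0/0", "0/1", "1/1", "./."}


def parse_sample_headers(header_lines, n_samples):
    """Read group names and genetic sex from the special VCF header fields.

    Falls back to "unknown" and "U" for all samples if a field is absent.
    """
    group_names = ["unknown"] * n_samples
    genetic_sex = ["U"] * n_samples
    for line in header_lines:
        line = line.strip()
        if line.startswith("##group_names="):
            group_names = line.split("=", 1)[1].split(",")
        elif line.startswith("##genetic_sex="):
            genetic_sex = line.split("=", 1)[1].split(",")
    return group_names, genetic_sex


def genotypes_are_convertible(gt_values):
    """Return True if every GT value is one of the four allowed codes."""
    return all(gt in ALLOWED_GENOTYPES for gt in gt_values)


# Hypothetical header of a two-sample VCF file.
header = [
    "##fileformat=VCFv4.2",
    "##group_names=Switzerland_LNBA,Switzerland_LNBA",
    "##genetic_sex=F,M",
]
print(parse_sample_headers(header, 2))
print(genotypes_are_convertible(["0/0", "0/1", "./."]))
print(genotypes_are_convertible(["0/0", "1/2"]))  # 1/2 is not biallelic
```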

### The `.janno` file

The `.janno` file is a tab-separated text file with a header line. It holds context information (variables/columns) for each sample (objects/rows) in a package.

- A set of strictly defined core variables (defined by column name) and their possible content are documented here: [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv)
- A `.janno` file MAY have all of these core variables, or only a subset of them.
- Only three columns MUST be present to make the file valid: **Poseidon_ID**, **Group_Name** and **Genetic_Sex**
- Only three columns MUST be present to make the file valid: **Poseidon_ID**, **Group_Name** and **Genetic_Sex**.
- Arbitrary columns not defined here MAY be added as long as their column names do not clash with the defined ones.
- The column order is irrelevant.
- Arbitrary, additional free-text information directly related to a column **<Column_Name>** from the set of specified core variables in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv) SHOULD be added in a column whose name has the form **<Column_Name>_Note**. Example: `Contamination_Note`.
- The column order is not fixed, but MAY follow the order in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv). **<Column_Name>_Note** columns SHOULD be placed directly after the respective column they are referring to.
- If information is unknown or a variable does not apply for a certain sample, then the respective cell(s) MAY be filled with `n/a` or simply an empty string.
- The order of the samples (rows) in the `.janno` file MUST be equal to the order in the genetic data files (`.ind`, `.fam`) in the package.
- The values in the columns **Poseidon_ID**, **Group_Name** and **Genetic_Sex** MUST be equal to the terms used in the genetic data files (`.ind`, `.fam`).
@@ -206,3 +245,16 @@ The `.ssf` file is another tab-separated text file with a header line. It stores
- If information is unknown or a variable does not apply, then the respective cell(s) MAY be filled with `n/a` or simply an empty string.
- Multiple predefined columns of the `.ssf` file are list columns that can hold multiple values (either strings or numerics) separated by `;`.
- The decimal separator for all floating point numbers MUST be `.`.

### Details

#### The `Capture_Type` .janno column

The following protocols are specified:

- `Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.).
- `1240K`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array, see [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152).
- `ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities).
- `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122).
- `WISC2013`: Whole genome capture as described by [@Carpenter2013](https://doi.org/10.1016/j.ajhg.2013.10.002).
- `OtherCapture`: Target enrichment with hybridization capture for any other set of sequences.
59 changes: 59 additions & 0 deletions changelog.md
@@ -1,5 +1,64 @@
# Changelog

### 2.7.1 -> 3.0.0 [breaking]

#### General changes

- Introduced a specific, limited character set for `Poseidon_ID`s and `Group_Name`s (in the .janno file, the .ssf file, and the genotype data): The ASCII characters `A-Za-z0-9_-.`.
- Allowed another genotype data format next to (binary) PLINK and EIGENSTRAT: the Variant Call Format (VCF).
- Specified a mechanism to store genotype data in a more space-efficient gzipped form.

#### Clarifications

- Clarified the exact meaning of a `Poseidon_ID` and the entity in genotype and context data it represents.
- Clarified the suitability of the Poseidon standard for non-human data: `[Poseidon] is geared towards human data, but is to a large extent species-agnostic and can be used to track archaeogenetic data also of non-human species.`
- Clarified that text files in Poseidon packages should use Unix-style line endings.

#### Changes to the `POSEIDON.yml` file

- Added the optional section `license` with the fields `name`, `url` and `file` to specify a data license for a package.
- Added two optional fields within the `genotypeData` structure:
- `referenceGenomeAssembly`, the reference genome name of the reference genome used, e.g. GRCh37
- `referenceGenomeAssemblyURL`, the reference assembly accession URL from a public database, such as NCBI or Ensembl
- Modified the definition of the `genoFile` and `snpFile` fields to cover the case of gzipped data, for which the respective file names must end with `*.gz`.

#### Changes to the `.janno` file

##### Replaced columns

- Replaced `Source_Tissue` with `Source_Material` and `Source_Material_Note`.

##### Added columns

- Added a column `Individual_ID` as an identifier on the level of (human/animal) individuals.
- Added a column for the sampled `Species`, to make the schema more explicitly species-agnostic.
- Added a column `Alternative_IDs_Context` to document what exactly the "foreign keys" in `Alternative_IDs` are referring to. This is a list column with the same number and order of entries as `Alternative_IDs`.
- Added a `Custodian_Institution` column that documents the institution that curated the sampled remains at the time of sampling, with name, city and country.
- Added four list columns to describe the cultural eras and archaeological cultures a sample is associated with: `Cultural_Era` + `Cultural_Era_URL` and `Archaeological_Culture` + `Archaeological_Culture_URL`.
- Added the columns `Chromosomal_Anomalies` and `Chromosomal_Anomalies_Note` for genetic anomalies on the chromosome level detected for the sample. This includes extra, missing or irregular portions of chromosomal DNA like in gonosomal and autosomal aneuploidies. `Chromosomal_Anomalies` is not limited to a specific set of options, but a common notation is recommended (e.g. `XXY`, `XYY`, `XXX`, `X0`, `Trisomy21`, `Trisomy18`).
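
The pairing of `Alternative_IDs` and `Alternative_IDs_Context` described in the list above (same number and order of entries) could be checked along these lines. This Python sketch is an illustration only: the function name and the example cell values are hypothetical, and treating empty and `n/a` cells as empty lists is an assumption.

```python
def pair_alternative_ids(ids_cell: str, context_cell: str) -> list:
    """Pair each entry of Alternative_IDs with its Alternative_IDs_Context entry.

    Both cells are list columns with ';' as separator. Raises ValueError
    if the two cells do not hold the same number of entries.
    """
    def split(cell: str) -> list:
        # Assumption: empty or n/a cells count as lists without entries.
        return [] if cell in ("", "n/a") else cell.split(";")

    ids, contexts = split(ids_cell), split(context_cell)
    if len(ids) != len(contexts):
        raise ValueError(
            f"{len(ids)} Alternative_IDs but {len(contexts)} context entries"
        )
    return list(zip(ids, contexts))


# Hypothetical cell values, for illustration only.
print(pair_alternative_ids("I1234;ABC001", "contextA;contextB"))
```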

##### Changed columns

- Introduced a specific, limited character set for the `Poseidon_ID` and `Group_Name` column: The ASCII characters `A-Za-z0-9_-.`.
- Adjusted the definition of the `Group_Name` column. The role of population labels as general analysis labels was emphasised, and the original recommendation for the geographic-temporal nomenclature proposed by Eisenmann et al. 2018 toned down.
- Changed the definition of the `Relation_` columns (`Relation_To`, `Relation_Degree`, `Relation_Type`) to operate on the level of individuals, not samples (`Individual_ID`, instead of `Poseidon_ID`).
- Made the `Collection_ID` column a list column that allows multiple entries separated by `;`.
- Removed `ReferenceGenome` as an option for the `Capture_Type` column and further clarified its definition.
- Changed the scaling of the columns `Endogenous` and `Damage` from percent (0-100) to fractions (0-1).
- Allowed multiple values in the `Damage` column for estimates per library.
- Slightly adjusted the definitions of `MT_Haplogroup` and `Y_Haplogroup` to better account for non-human data.
- Added the option `WISC2013` to `Capture_Type`.
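
For existing packages, the rescaling of `Endogenous` and `Damage` from percent to fractions amounts to a division by 100 per value. The following Python sketch shows one way to convert a single `.janno` cell. It is a hypothetical helper, not part of any Poseidon tool, and it assumes that `n/a` and empty cells should pass through unchanged and that list cells use `;` as separator.

```python
from decimal import Decimal


def percent_to_fraction(cell: str) -> str:
    """Convert a cell holding percent values (0-100) to fractions (0-1)."""
    if cell in ("", "n/a"):
        return cell
    # Decimal avoids binary floating point artefacts in the written values;
    # the "f" format keeps plain decimal notation with "." as separator.
    return ";".join(
        format(Decimal(value) / Decimal(100), "f") for value in cell.split(";")
    )


print(percent_to_fraction("12.5"))
print(percent_to_fraction("3;14.3"))  # a list cell, e.g. Damage per library
print(percent_to_fraction("n/a"))
```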

##### Removed columns

- Removed all explicitly defined `_Note` columns. The schema has allowed arbitrary additional columns since v2.2.0; a specification of free-text fields is not necessary.

#### Changes to the `.ssf` file

##### Added columns

- Added a `submitted_md5` column, which records the md5sum of the file in the `submitted_ftp` column.

### 2.7.0 -> 2.7.1 [not breaking]

Only changes to the definition of the Sequencing Source File (`.ssf`):