MHCgnomes is a parsing library for multi-species MHC nomenclature which aims to correctly parse every name in IEDB, IMGT/HLA, IPD/MHC, and the allele lists for both NetMHCpan and NetMHCIIpan predictors. This allows for standardization between immune databases and tools, which often use different naming conventions.
In [1]: mhcgnomes.parse("HLA-A0201")
Out[1]: Allele(
    gene=Gene(
        species=Species(name="Homo sapiens", mhc_prefix="HLA"),
        name="A"),
    allele_fields=("02", "01"),
    annotations=(),
    mutations=())
In [2]: mhcgnomes.parse("HLA-A0201").to_string()
Out[2]: 'HLA-A*02:01'
In [3]: mhcgnomes.parse("HLA-A0201").compact_string()
Out[3]: 'A0201'

After installation, a mhcgnomes CLI is available:
mhcgnomes "HLA-A*02:01" "DQ2.5"
# or:
python -m mhcgnomes "HLA-A*02:01" "DQ2.5"

This prints a table with:
- input string
- parsed result type
- normalized and compact forms
- species/gene/MHC class
- parsed properties from to_record()
You can also use machine-friendly output:
mhcgnomes --format tsv "HLA-A*02:01" "HLA-A2"
mhcgnomes --format json "HLA-A*02:01" "not a real allele"

By default, unparseable values are shown as ParseError rows.
Use strict mode to fail fast:
mhcgnomes --strict "not a real allele"

Despite the valiant efforts of groups such as the Comparative MHC Nomenclature Committee, the names of MHC alleles you might encounter in different datasets (or accepted by immunoinformatics tools) are frustratingly ill specified. It's not uncommon to see dozens of different forms for the same allele.
For example, these all refer to the same MHC protein sequence:
- "HLA-A*02:01"
- "HLA-A02:01"
- "HLA-A:02:01"
- "HLA-A0201"
Additionally, for human alleles, the species prefix is often omitted:
- "A*02:01"
- "A*0201"
- "A02:01"
- "A:02:01"
- "A0201"
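The equivalences listed above can be illustrated with a deliberately narrow sketch. This is not how MHCgnomes parses names: the regular expression and the function name normalize_simple_hla are hypothetical, and the pattern only covers human class I genes A, B, and C written with two two-digit fields.

```python
import re

# Hypothetical toy pattern (not part of MHCgnomes): an optional "HLA-" prefix,
# a single-letter class I gene, an optional "*" or ":" separator, and two
# two-digit fields that may or may not be separated by ":".
_SIMPLE_HLA = re.compile(r"^(?:HLA-)?([A-C])[*:]?(\d{2}):?(\d{2})$", re.IGNORECASE)


def normalize_simple_hla(name: str) -> str:
    """Return the canonical "HLA-A*02:01" style form, or raise ValueError."""
    match = _SIMPLE_HLA.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    gene, field1, field2 = match.groups()
    return f"HLA-{gene.upper()}*{field1}:{field2}"


variants = [
    "HLA-A*02:01", "HLA-A02:01", "HLA-A:02:01", "HLA-A0201",
    "A*02:01", "A*0201", "A02:01", "A:02:01", "A0201",
]
# Every spelling from the lists above collapses to one canonical string.
print({normalize_simple_hla(v) for v in variants})  # {'HLA-A*02:01'}
```

A single pattern like this breaks down as soon as other species, three-digit fields, or suffixes appear, which is why a dedicated parser is useful.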
Sometimes, alleles are bundled with modifier suffixes which specify the functionality or abundance of the MHC. Here's an example with an allele which is secreted instead of membrane-bound:
- "HLA-A*02:01:01S"
These are collected in the annotations field of an Allele result.
Multi-letter annotations are also used in some non-human systems. In particular,
Ps (pseudogene) and Sp (splice variant) appear as suffixes on allele fields,
e.g. Mamu-B*074:03Sp or Caja-B5*01:01Ps, and are parsed into the
annotations field as Sp or Ps respectively.
Note that Ps can also appear as part of a gene name (prefix or suffix) in
non-human primates, such as Caja-G2Ps*01. In those cases Ps is treated as
part of the gene name, not an allele annotation.
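The distinction between a suffix on the allele fields and Ps inside a gene name can be sketched as follows. This is an illustration only, not the MHCgnomes implementation: the function split_annotation is hypothetical, it requires a "*" separator, and the single-letter suffixes N, L, S, C, A, and Q are taken from the HLA expression-suffix convention rather than from this document.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: numeric fields separated by ":", optionally followed
# by one annotation suffix. "Sp" and "Ps" are tried before single letters.
_FIELDS = re.compile(r"([\d:]+?)(Sp|Ps|[NLSCAQ])?")


def split_annotation(name: str) -> Tuple[str, Optional[str]]:
    """Split 'GENE*fields[suffix]' into (name without suffix, suffix or None)."""
    gene_part, star, field_part = name.partition("*")
    if not star:
        raise ValueError(f"expected a '*' separator in {name!r}")
    match = _FIELDS.fullmatch(field_part)
    if match is None:
        raise ValueError(f"unrecognized allele fields in {name!r}")
    # Only text after "*" is inspected, so "Ps" inside a gene name is untouched.
    return f"{gene_part}*{match.group(1)}", match.group(2)


print(split_annotation("HLA-A*02:01:01S"))  # ('HLA-A*02:01:01', 'S')
print(split_annotation("Mamu-B*074:03Sp"))  # ('Mamu-B*074:03', 'Sp')
print(split_annotation("Caja-G2Ps*01"))     # ('Caja-G2Ps*01', None)
```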
MHC proteins are sometimes described in terms of mutations to a known allele.
- "HLA-B*08:01 N80I mutant"
These mutations are collected in the mutations field of an
Allele result.
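As a rough illustration of what such a mutation string encodes, the sketch below pulls point mutations like N80I (original residue, position, replacement residue) out of free text. The PointMutation type and find_point_mutations function are hypothetical and are not the classes MHCgnomes uses; the pattern only recognizes single substitutions among the 20 standard amino acids.

```python
import re
from typing import List, NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class PointMutation(NamedTuple):
    """Hypothetical container for one substitution such as N80I."""
    original: str
    position: int
    replacement: str


# One amino acid letter, a run of digits, one amino acid letter, delimited by
# word boundaries so that fragments of longer tokens are not picked up.
_MUTATION = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


def find_point_mutations(text: str) -> List[PointMutation]:
    return [
        PointMutation(original, int(position), replacement)
        for original, position, replacement in _MUTATION.findall(text)
    ]


print(find_point_mutations("HLA-B*08:01 N80I mutant"))
# [PointMutation(original='N', position=80, replacement='I')]
```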
To make things worse, several model organisms (like mice and rats) use archaic naming systems, where there is no notion of allele groups or four/six/eight digit alleles but every allele is simply given a name, such as:
- "H2-Kk"
- "RT1-9.5f"
In the examples above, "H2" and "RT1" correspond to species, "K" and "9.5" are the gene names, and "k" and "f" are the allele names.
To make matters even worse, the name of a species is subject to variation (e.g. "H2" vs. "H-2") as well as drift over time (e.g. ChLA -> MhcPatr -> Patr).
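Because these names carry no separator between gene and allele, splitting them requires knowing the gene names in advance. The sketch below shows that idea with a tiny hand-written table. Everything in it is an assumption for illustration (the PREFIX_ALIASES and GENES tables, the gene subsets, and the split_named_allele function); it is case-sensitive and does not reflect the curated ontology that MHCgnomes ships.

```python
from typing import Tuple

# Illustrative subsets only: spelling variants of a species prefix map to one
# canonical prefix, and each prefix lists a few known gene names.
PREFIX_ALIASES = {"H2": "H2", "H-2": "H2", "RT1": "RT1"}
GENES = {"H2": ["K", "D", "L"], "RT1": ["9.5", "A"]}


def split_named_allele(name: str) -> Tuple[str, str, str]:
    """Split e.g. 'H2-Kk' into (species prefix, gene name, allele name)."""
    # Try longer aliases first so "H-2-" is matched as a whole prefix.
    for alias in sorted(PREFIX_ALIASES, key=len, reverse=True):
        if not name.startswith(alias + "-"):
            continue
        prefix = PREFIX_ALIASES[alias]
        rest = name[len(alias) + 1:]
        # Try longer gene names first and require a non-empty allele name.
        for gene in sorted(GENES[prefix], key=len, reverse=True):
            if rest.startswith(gene) and len(rest) > len(gene):
                return prefix, gene, rest[len(gene):]
    raise ValueError(f"cannot split {name!r} with the known prefixes and genes")


print(split_named_allele("H2-Kk"))     # ('H2', 'K', 'k')
print(split_named_allele("H-2-Kk"))    # ('H2', 'K', 'k')
print(split_named_allele("RT1-9.5f"))  # ('RT1', '9.5', 'f')
```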
Besides alleles, there are other named MHC-related entities you'll encounter in immunological data. Closely related to alleles are serotypes, which effectively denote a grouping of alleles that are all recognized by the same antibody:
- "HLA-A2"
- "A2"
Supertypes are functional groupings based on shared peptide-binding specificity rather than serological reactivity (Sidney et al. 2008). These are parsed when the "supertype" keyword is present:
- "A2 supertype"
- "HLA-B44 supertype"
Class II heterodimers can be specified using dot notation, which is common in celiac disease literature:
- "DQ2.5" (equivalent to DQA1*05:01/DQB1*02:01)
- "DQ8.5"
In many datasets the exact allele is not known but an experiment might note the genetic background of a model animal, resulting in loose haplotype restrictions such as:
- "H2-k class I"
Yes, good luck disambiguating "H2-k" the haplotype from "H2-K" the gene, especially since capitalization is not stable enough to be relied on for parsing.
In some cases immunological data comes only with a denoted species (e.g. "mouse"), a gene (e.g. "HLA-A"), or an MHC class ("human class I"). MHCgnomes has a structured representation for all of these cases and more.
It is a fool's errand to curate all possible MHC allele names, since that list grows daily as the MHC loci of more people (and non-human animals) are sequenced. Instead, MHCgnomes contains an ontology of curated species and genes and then attempts to parse any given string into multiple candidates of different result types.
The set of candidate interpretations for each string is then ranked according to heuristic rules. For example, a string will be preferentially interpreted as an Allele rather than a Serotype or Haplotype.
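The ranking step can be pictured with a minimal sketch. The Candidate class, the _RANK table, and best_candidate are hypothetical and do not mirror the library's internals; the only rule taken from the text is that Allele outranks Serotype and Haplotype, and treating the latter two as tied is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one possible interpretation of a string."""
    kind: str  # e.g. "Allele", "Serotype", "Haplotype"
    text: str


# Lower rank wins. Serotype and Haplotype share a rank (an assumption), so
# between them the candidate that appears first in the input is kept.
_RANK = {"Allele": 0, "Serotype": 1, "Haplotype": 1}


def best_candidate(candidates: Sequence[Candidate]) -> Candidate:
    if not candidates:
        raise ValueError("no candidate interpretations")
    # Unknown kinds sort after every known kind.
    return min(candidates, key=lambda c: _RANK.get(c.kind, len(_RANK)))


options = [Candidate("Serotype", "HLA-A2"), Candidate("Allele", "HLA-A*02")]
print(best_candidate(options).kind)  # Allele
```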
Originally alleles for many genes were numbered with two digits:
- "HLA-MICB*01"
But as the number of identified alleles increased, the number of fields specifying a distinct protein increased to two. This became conventionally called a "four digit" format, since each field has two digits. Yet, as the number of identified alleles continued to grow, the number of digits per field has often increased from two to three:
- "MICB*002:01"
- "HLA-A00201"
- "A:002:01"
- "A*00201"
Currently, these are not always treated as equivalent to allele strings with two digits in their first field, but that feature is in the works.
However, if databases such as IPD-MHC or IMGT-HLA recorded an older form of an allele, then MHCgnomes can optionally map it onto the modern version (including capturing differences in numbers of digits per field).
- IPD-MHC: nomenclature requirements for the non-human major histocompatibility complex in the next-generation sequencing era
- Comparative MHC nomenclature: report from the ISAG/IUIS-VIC committee 2018
- ISAG/IUIS-VIC Comparative MHC Nomenclature Committee report, 2005
- Marsupial MHC Class II β Genes Are Not Orthologous to the Eutherian β Gene Families
- Nomenclature for factors of the SLA system, update 2008
