UniProtXML2PEFF is a command-line tool designed to convert UniProt XML files into PEFF (PSI extended FASTA format), enabling better compatibility with proteomics tools such as Comet.
This tool processes sequence variants and modifications in UniProt XML files, accurately encoding them into PEFF format with \VariantSimple, \VariantComplex, and \ModResPsi annotations. The result is optimized for use in proteomics database searching.
This project is not well tested and should be considered experimental until further notice.
PEFF files retrieved directly from UniProt via their API (e.g., curl -s "https://www.ebi.ac.uk/proteins/api/variation/P09032?format=peff") may not always be suitable for Comet or other proteomics tools. Additionally, the modifications and variants encoded in UniProt XML files can differ from those returned via the UniProt API.
UniProtXML2PEFF provides an alternative mechanism for generating PEFF files by:
- Extracting specific sequence variant annotations directly from UniProt XML.
- Mapping modifications (e.g., phosphorylations, methylations) to PSI-MOD identifiers for encoding in \ModResPsi.
- Skipping or reporting entries that lack sufficient information for proper PEFF annotations.
- Converts UniProt XML files to PEFF.
- Processes:
  - Post-translational modifications (PTMs) such as phosphorylations and acetylations into \ModResPsi with PSI-MOD identifiers. Enabled by default.
  - Simple substitutions (e.g., A → V) into \VariantSimple. Disabled by default; use --variant-simple to enable.
  - Complex variants such as deletions, insertions, and multi-residue substitutions into \VariantComplex. Disabled by default; use --variant-complex to enable.
- Logs skipped or unsupported variants and modifications for audit and troubleshooting.
- Provides flexible command-line options to control which features are processed.
To build this tool, you'll need a C++ compiler that supports C++11 and the TinyXML-2 library.
- Clone the repository:
  git clone https://github.com/your-username/UniProtXML2PEFF.git
  cd UniProtXML2PEFF
- Build the executable:
  make
  This will generate the UniProtXML2PEFF.exe executable.
The tool requires an input UniProt XML file and an output PEFF file path.
./UniProtXML2PEFF.exe <input.xml> <output.peff> [options]
- <input.xml>: The path to the input UniProt XML file.
- <output.peff>: The desired path for the PEFF output file.
- --strict (optional): Enforces strict handling of annotations. If unsupported variants or modifications are encountered, the program will exit with an error.
- --no-ptms (optional): Disables PTM (post-translational modification) processing. By default, PTMs are enabled and encoded in \ModResPsi.
- --variant-simple (optional): Enables VariantSimple processing. By default, VariantSimple is disabled. When enabled, simple single-residue substitutions are encoded in \VariantSimple.
- --variant-complex (optional): Enables VariantComplex processing. By default, VariantComplex is disabled. When enabled, complex variants such as deletions, insertions, and multi-residue substitutions are encoded in \VariantComplex.
By default, only PTMs are processed. If you want to include variant information, you must explicitly enable it using the appropriate command-line options.
- PTMs (ModResPsi): Enabled by default. Use --no-ptms to disable.
- VariantSimple: Disabled by default. Use --variant-simple to enable.
- VariantComplex: Disabled by default. Use --variant-complex to enable.
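The defaults above can be summarized in a short Python sketch. This is not the tool's C++ option parser; the function name `enabled_features` is hypothetical, and the code only mirrors the four documented flags to show which annotation types a given command line would turn on.

```python
import argparse

def enabled_features(argv):
    """Return the annotation types an option list would enable.

    Illustrative only: mirrors the documented defaults, not the tool's code.
    """
    parser = argparse.ArgumentParser(prog="UniProtXML2PEFF")
    parser.add_argument("input_xml")
    parser.add_argument("output_peff")
    parser.add_argument("--strict", action="store_true")
    parser.add_argument("--no-ptms", action="store_true")
    parser.add_argument("--variant-simple", action="store_true")
    parser.add_argument("--variant-complex", action="store_true")
    args = parser.parse_args(argv)
    return {
        "ModResPsi": not args.no_ptms,        # PTMs are on unless --no-ptms is given
        "VariantSimple": args.variant_simple,  # off unless requested
        "VariantComplex": args.variant_complex,
        "strict": args.strict,
    }

if __name__ == "__main__":
    print(enabled_features(["test.xml", "test.peff"]))
    print(enabled_features(["test.xml", "test.peff", "--no-ptms", "--variant-simple"]))
```

With no options, only ModResPsi is enabled; variants appear in the output only when their flags are passed explicitly.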
./UniProtXML2PEFF.exe test.xml test.peff
./UniProtXML2PEFF.exe test.xml test.peff --variant-simple --variant-complex
./UniProtXML2PEFF.exe test.xml test.peff --no-ptms --variant-simple --variant-complex
./UniProtXML2PEFF.exe test.xml test.peff --variant-simple
./UniProtXML2PEFF.exe test.xml test.peff --strict --variant-simple --variant-complex

The input file should adhere to the standard UniProt XML format. The tool expects <feature> elements under <entry> with type values such as:
- sequence variant
- mutagenesis site
- modified residue
Each <feature> element can include:
- Substitutions:
  <feature type="sequence variant" description="Substitution A to V.">
    <original>A</original>
    <variation>V</variation>
    <location>
      <position position="15"/>
    </location>
  </feature>
- Deletions:
  <feature type="sequence variant" description="Deletion of residues.">
    <location>
      <begin position="20"/>
      <end position="22"/>
    </location>
  </feature>
- Modifications: Marked as modified residue in UniProt XML, these modifications are mapped to PSI-MOD identifiers in the PEFF output. Example XML:
  <feature type="modified residue" description="Phosphothreonine.">
    <location>
      <position position="10"/>
    </location>
  </feature>
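To make the three feature shapes concrete, here is a minimal Python sketch that reads them with the standard library's `xml.etree.ElementTree`. It is not the tool's TinyXML-2 code. The helper names (`feature_span`, `summarize`) are hypothetical, and the sample omits the XML namespace that real UniProt files declare, so real input would need namespace-aware tag names.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample built from the three examples above (an assumption;
# real UniProt XML declares a default namespace on the root element).
SAMPLE = """
<entry>
  <feature type="sequence variant" description="Substitution A to V.">
    <original>A</original>
    <variation>V</variation>
    <location><position position="15"/></location>
  </feature>
  <feature type="sequence variant" description="Deletion of residues.">
    <location><begin position="20"/><end position="22"/></location>
  </feature>
  <feature type="modified residue" description="Phosphothreonine.">
    <location><position position="10"/></location>
  </feature>
</entry>
"""

def _pos(elem):
    """Read the integer 'position' attribute, or None if it is absent."""
    value = elem.get("position") if elem is not None else None
    return int(value) if value is not None else None

def feature_span(feature):
    """Return (begin, end) for a feature, or None if location data is missing."""
    loc = feature.find("location")
    if loc is None:
        return None
    single = _pos(loc.find("position"))
    if single is not None:
        return (single, single)
    begin, end = _pos(loc.find("begin")), _pos(loc.find("end"))
    if begin is None or end is None:
        return None
    return (begin, end)

def summarize(xml_text):
    """List (type, span, original, variation) for every <feature>."""
    rows = []
    for feature in ET.fromstring(xml_text).iter("feature"):
        rows.append((feature.get("type"), feature_span(feature),
                     feature.findtext("original"), feature.findtext("variation")))
    return rows

if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

A feature whose span comes back as `None` corresponds to the "missing location data" case that the tool reports as skipped.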
The generated PEFF files include headers that indicate which features are enabled:
- # VariantSimple=true|false: Indicates whether VariantSimple processing is enabled.
- # VariantComplex=true|false: Indicates whether VariantComplex processing is enabled.
- # ModResPsi=true|false: Indicates whether PTM processing is enabled.
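A downstream script can read these header lines to check how a PEFF file was generated. The sketch below is a hypothetical helper (`read_feature_flags` is not part of the tool) and assumes the header lines look exactly as described above: comment lines before the first entry, each in `# Name=true|false` form.

```python
def read_feature_flags(peff_text):
    """Collect '# Name=true|false' header lines into a dict of booleans."""
    flags = {}
    for line in peff_text.splitlines():
        if not line.startswith("#"):
            break  # header comments are assumed to precede the first entry
        key, sep, value = line[1:].strip().partition("=")
        value = value.strip()
        if sep and value in ("true", "false"):
            flags[key.strip()] = (value == "true")
    return flags

if __name__ == "__main__":
    example = (
        "# PEFF 1.0 generated by https://github.com/UWPR/UniProtXML2PEFF\n"
        "# VariantSimple=false\n"
        "# VariantComplex=false\n"
        "# ModResPsi=true\n"
        ">tr|TEST123| \\ModResPsi=(10|MOD:00047)\n"
        "AAAAAGGGGT\n"
    )
    print(read_feature_flags(example))
```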
The PEFF entries include the following annotations (when enabled):
- VariantSimple: Encodes single-residue substitutions. Example (residue 15 is substituted with V):
  \VariantSimple=(15|V)
- VariantComplex: Encodes multi-residue changes, deletions, or insertions. Examples (residues 20 to 22 are deleted; residue 6 is replaced by GP):
  \VariantComplex=(20|22|)
  \VariantComplex=(6|6|GP)
- ModResPsi: Encodes PTMs with PSI-MOD identifiers.
  - The tool maps modification descriptions (e.g., Phosphothreonine) to PSI-MOD identifiers (e.g., MOD:00047) using a predefined mapping. Example:
    \ModResPsi=(10|MOD:00047)(20|MOD:00046)
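The three annotation types share one textual shape: a backslash-prefixed key, an equals sign, and one or more parenthesized groups of pipe-separated fields. As a minimal sketch of that shape (the function `format_annotation` is hypothetical and is not taken from the tool's source):

```python
def format_annotation(key, items):
    """Render one annotation, e.g. key='ModResPsi', items=[(10, 'MOD:00047')].

    Each item becomes a '(field|field|...)' group; groups are concatenated.
    """
    groups = "".join(
        "(" + "|".join(str(field) for field in item) + ")" for item in items
    )
    return "\\" + key + "=" + groups

if __name__ == "__main__":
    print(format_annotation("VariantSimple", [(15, "V")]))
    print(format_annotation("VariantComplex", [(20, 22, "")]))   # deletion: empty replacement
    print(format_annotation("VariantComplex", [(6, 6, "GP")]))
    print(format_annotation("ModResPsi", [(10, "MOD:00047"), (20, "MOD:00046")]))
```

A deletion is written with an empty third field, which is why its group ends in a bare pipe.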
Example output with default options (PTMs only):
# PEFF 1.0 generated by https://github.com/UWPR/UniProtXML2PEFF
# VariantSimple=false
# VariantComplex=false
# ModResPsi=true
>tr|TEST123| \ModResPsi=(10|MOD:00047)
AAAAAGGGGT
Example output with --variant-simple and --variant-complex enabled:
# PEFF 1.0 generated by https://github.com/UWPR/UniProtXML2PEFF
# VariantSimple=true
# VariantComplex=true
# ModResPsi=true
>tr|TEST123| \VariantSimple=(4|V) \VariantComplex=(6|8|) \ModResPsi=(10|MOD:00047)
AAAAAGGGGT
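To see what the annotations in an entry like the one above mean for the sequence, the following Python sketch parses a header line and applies one variant at a time. The function names are hypothetical. Positions are treated as 1-based and inclusive, which matches the examples in this document, and how a search engine such as Comet actually expands variants is outside the scope of this sketch.

```python
import re

def parse_annotations(header):
    """Map each annotation key to a list of field tuples (all strings)."""
    result = {}
    for key, groups in re.findall(r"\\(\w+)=((?:\([^()]*\))+)", header):
        result[key] = [tuple(g.split("|")) for g in re.findall(r"\(([^()]*)\)", groups)]
    return result

def apply_simple(sequence, position, residue):
    """Substitute the residue at a 1-based position."""
    i = int(position) - 1
    return sequence[:i] + residue + sequence[i + 1:]

def apply_complex(sequence, begin, end, replacement):
    """Replace the 1-based inclusive range begin..end (empty string = deletion)."""
    return sequence[:int(begin) - 1] + replacement + sequence[int(end):]

if __name__ == "__main__":
    header = r">tr|TEST123| \VariantSimple=(4|V) \VariantComplex=(6|8|) \ModResPsi=(10|MOD:00047)"
    seq = "AAAAAGGGGT"
    ann = parse_annotations(header)
    print(apply_simple(seq, *ann["VariantSimple"][0]))    # AAAVAGGGGT
    print(apply_complex(seq, *ann["VariantComplex"][0]))  # AAAAAGT
```

Each variant is applied to the unmodified reference sequence; combining several variants on one sequence would require tracking how earlier edits shift later positions.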
The tool uses a predefined map of modification descriptions to PSI-MOD identifiers. For example:
"Phosphothreonine" → MOD:00047
"Phosphoserine" → MOD:00046
"Acetylation" → MOD:00394
Modifications in the XML that do not match this map are skipped. With the --strict option, the program exits with an error instead.
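The lookup and its strict-mode behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's code: the three-entry table stands in for the real predefined map, the names `PSI_MOD`, `UnmappedModification`, and `lookup_psi_mod` are hypothetical, and the normalization step (dropping a trailing period and any text after a semicolon) is a guess at what is needed to match descriptions such as "Phosphothreonine.".

```python
# Stand-in for the tool's predefined map. In PSI-MOD, MOD:00046 is
# O-phospho-L-serine and MOD:00047 is O-phospho-L-threonine.
PSI_MOD = {
    "Phosphothreonine": "MOD:00047",
    "Phosphoserine": "MOD:00046",
    "Acetylation": "MOD:00394",
}

class UnmappedModification(Exception):
    """Raised in strict mode when a description has no PSI-MOD mapping."""

def lookup_psi_mod(description, strict=False):
    """Return the PSI-MOD ID for a description, or None if it is unmapped.

    Normalization here is an assumption: keep the text before any ';'
    and drop a trailing period.
    """
    name = description.split(";")[0].strip().rstrip(".")
    mod_id = PSI_MOD.get(name)
    if mod_id is None and strict:
        raise UnmappedModification(description)
    return mod_id

if __name__ == "__main__":
    print(lookup_psi_mod("Phosphothreonine."))  # MOD:00047
    print(lookup_psi_mod("Sulfotyrosine"))      # None (skipped in non-strict mode)
```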
The tool generates two audit files to track skipped or processed annotations:
- variant_skipped.csv: Logs the reasons for skipped variants and modifications, including:
  - Unsupported feature types.
  - Modifications or variants with missing location data.
- variant_complex.csv: Summarizes the types and counts of \VariantComplex entries generated.
- Missing Modifications in Output:
  - Check the variant_skipped.csv file to confirm whether modifications were skipped due to missing PSI-MOD mappings.
- Strict Mode Fails:
  - Use the --strict option for debugging. If the tool fails, inspect the logs for skipped features.
- SGRP in Retrieved PEFF Files:
  - If you retrieve PEFF files directly from UniProt and encounter entries like \VariantSimple=(8|L|SGRP), use this tool to regenerate the PEFF with specific substitutions or variants.
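To check whether a PEFF file retrieved from UniProt contains such tagged entries before regenerating it, a quick scan along these lines may help. The helper `tagged_simple_variants` is hypothetical; it simply reports every \VariantSimple group that carries a non-empty third field, such as the SGRP tag in the example above.

```python
import re

def tagged_simple_variants(peff_text):
    """Return (line_number, position, residue, tag) for each VariantSimple
    group that has a non-empty third field."""
    hits = []
    for number, line in enumerate(peff_text.splitlines(), start=1):
        if not line.startswith(">"):
            continue  # only entry header lines carry annotations
        for block in re.findall(r"\\VariantSimple=((?:\([^()]*\))+)", line):
            for group in re.findall(r"\(([^()]*)\)", block):
                fields = group.split("|")
                if len(fields) == 3 and fields[2]:
                    hits.append((number, int(fields[0]), fields[1], fields[2]))
    return hits

if __name__ == "__main__":
    sample = ">sp|P09032| \\VariantSimple=(8|L|SGRP)(15|V)\nMAAAAAAAAAAAAAAA\n"
    print(tagged_simple_variants(sample))  # [(1, 8, 'L', 'SGRP')]
```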
Run with standard error redirection to debug processing:
./UniProtXML2PEFF.exe input.xml output.peff 2> debug.log

This project is open-source and distributed under the MIT License.