The AMRrulevalidator package provides tools for validating AMRrules files according to the current specification (v0.6).
As part of the install, AMRrulevalidator will download the current CARD ontology and AMRFinderPlus resources to validate against.
The validate subcommand prints a summary of the completed checks to stdout, reporting whether each passed or failed, and writes out a version of the rules file in which cells containing values that need to be checked are flagged.
The clean subcommand will write out a cleaned version of a rules file after all values have been checked, which will be ready for integration into the AMRrules interpretation engine.
AMRrulevalidator is compatible with Python >= 3.8. The only dependency is obonet v1.1.1.
The easiest installation method is via pip from GitHub, which will install all required dependencies for you.
# Optional: create a conda environment to install package into
conda create -n amrrulevalidator
conda activate amrrulevalidator
conda install pip
# Install with pip from GitHub
pip install git+https://github.com/AMRverse/AMRrulevalidator.git
# Download CARD and AMRFinderPlus ontology files
amrrule update-resources
For development or local installation, install from a clone of the repository. This will:
- Install the package in editable mode
- Download and set up all required resource files
# Clone the repository
git clone https://github.com/AMRverse/AMRrulevalidator.git
cd AMRrulevalidator
# Install in development mode
make dev
The validator relies on several external files, including the CARD ontology and files from the AMRFinderPlus database. To update these files:
amrrule update-resources
This will download the latest versions of:
- CARD ontology (ARO) (v4.0.1)
- CARD drug names and drug classes (v4.0.1)
- NCBI taxonomy data (from CARD (v4.0.1))
- AMRFinderPlus Reference Gene Hierarchy, Reference Gene Accessions and HMM accessions (using the latest version)
To validate a draft AMRrules file:
amrrule validate --input path/to/draft_rules.tsv --output path/to/validated_rules.tsv
This will:
- Check the input file against the current AMRrules specification
- Generate a validated output file with annotations for any errors
- Print a summary of validation results to the console
During validation, the script annotates problematic values in the output file to help identify and fix issues:
- ENTRY MISSING: Indicates that a required value is missing in a field where a value is expected.
- CHECK VALUE: [value]: Indicates that the existing value doesn't match the expected format or isn't in the list of allowed values.
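To review the annotations without opening the file in a spreadsheet, the flagged cells can be listed with a few lines of Python. This is a minimal sketch, not part of AMRrulevalidator: the helper name `find_flagged_cells` is hypothetical, the example rows are invented, and it assumes that flagged cells start with the literal strings `ENTRY MISSING` or `CHECK VALUE:` as described above.

```python
import csv
import io

# Annotation prefixes described above; assumed to appear at the start of a cell.
FLAG_PREFIXES = ("ENTRY MISSING", "CHECK VALUE:")


def find_flagged_cells(tsv_handle):
    """Return a list of (row_number, column, value) for every flagged cell."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    flagged = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if isinstance(value, str) and value.startswith(FLAG_PREFIXES):
                flagged.append((row_number, column, value))
    return flagged


# Invented example content, for illustration only.
example = (
    "ruleID\tgene\tdrug\n"
    "ABC0001\tblaTEM\tENTRY MISSING\n"
    "ABC0002\tCHECK VALUE: ???\tampicillin\n"
)

for row_number, column, value in find_flagged_cells(io.StringIO(example)):
    print(f"row {row_number}, {column}: {value}")
```

For a real file, pass an open file handle, for example `open("path/to/validated_rules.tsv", newline="")`, in place of the `io.StringIO` object.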
The rules files must contain the following columns:
- ruleID
- txid
- organism
- gene
- nodeID
- protein accession
- HMM accession
- nucleotide accession
- ARO accession
- mutation
- variation type
- gene context
- drug
- drug class
- phenotype
- clinical category
- breakpoint
- breakpoint standard
- breakpoint condition
- PMID
- evidence code
- evidence grade
- evidence limitations
- rule curation note
Any columns that do not exist in the file will be added, with all values set to ENTRY MISSING.
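The column-filling behaviour can be illustrated as follows. This is a sketch under stated assumptions rather than the package's own code: `add_missing_columns` is a hypothetical helper and rows are represented as plain dictionaries. Only the column names and the `ENTRY MISSING` marker come from this README.

```python
# Required columns, in the order listed above.
REQUIRED_COLUMNS = [
    "ruleID", "txid", "organism", "gene", "nodeID", "protein accession",
    "HMM accession", "nucleotide accession", "ARO accession", "mutation",
    "variation type", "gene context", "drug", "drug class", "phenotype",
    "clinical category", "breakpoint", "breakpoint standard",
    "breakpoint condition", "PMID", "evidence code", "evidence grade",
    "evidence limitations", "rule curation note",
]


def add_missing_columns(rows, present_columns):
    """Return (completed_rows, missing_columns).

    Hypothetical helper: every required column absent from the file is
    added to each row with the value "ENTRY MISSING".
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in present_columns]
    completed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for col in missing:
            new_row[col] = "ENTRY MISSING"
        completed.append(new_row)
    return completed, missing


# Invented example row with only two of the required columns.
rows = [{"ruleID": "ABC0001", "gene": "blaTEM"}]
completed, missing = add_missing_columns(rows, ["ruleID", "gene"])
print(f"{len(missing)} columns added")
print(completed[0]["txid"])
```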
The validator performs a series of checks, with each focusing on specific columns:
- ruleID: Checks that rule IDs are unique and have a consistent prefix.
- txid: Validates that taxonomic IDs exist in the NCBI taxonomy database.
- organism: Confirms that organism names are valid NCBI taxonomy names and follow the format s__[organism name].
- txid-organism pairs: Ensures that each txid is correctly paired with its corresponding organism name.
- gene: Checks that the gene column is not empty. For combination rules, it verifies that referenced rule IDs exist.
- Accession checks:
  - nodeID: Verifies node IDs against the AMRFinderPlus Reference Gene Hierarchy.
  - protein accession: Verifies accessions against the AMRFinderPlus Reference Gene Catalog.
  - nucleotide accession: Verifies accessions against the AMRFinderPlus Reference Gene Catalog.
  - HMM accession: Verifies accessions against the AMRFinderPlus HMM accessions list.
  - At least one of these accessions must be present unless the variation type is "Combination".
  Note: Accessions are checked only against the listed reference files. If an accession has come from elsewhere, it may be flagged as CHECK VALUE:.
- ARO accession: Validates ARO accessions against the CARD ontology.
- variation type: Confirms values match one of the allowed variation types, as per the spec. A value must be supplied; the cell cannot be empty.
- mutation and variation type compatibility: Ensures the mutation format is compatible with the specified variation type.
- gene context: Validates against allowed values (core or acquired). A value must be supplied; the cell cannot be empty.
- drug and drug class: Confirms values exist in the CARD drug and drug class lists. A value must be supplied in at least one of these columns.
- phenotype: Validates against allowed values (wildtype or nonwildtype). A value must be supplied; the cell cannot be empty.
- clinical category: Confirms values match the allowed clinical categories (S, I, or R). A value must be supplied; the cell cannot be empty.
- breakpoint: Checks whether the breakpoint value is consistent with the clinical category and, if not, flags it for checking. A value must be supplied; if no breakpoint is required, then not applicable is a valid entry.
- breakpoint standard: Checks whether the given breakpoint standard source includes version information, or the month/year when the standard was set. Flags the value for checking if it doesn't match.
- breakpoint condition: If provided, confirms values match the allowed breakpoint conditions.
- PMID: Checks only that there is an entry in this column, as most rules should have a paper associated with them.
- evidence code: Checks that the codes provided start with the ECO: prefix. Checks whether they are listed as one of the suggested Evidence Code Ontology codes in the spec and, if not, flags them as codes to check manually. Checks that multiple codes are separated by a comma and not some other delimiter.
- evidence grade and limitations: Validates evidence grades against allowed values. For evidence limitations, checks that multiple limitations are separated by a comma and not a different delimiter, and that each limitation is one of the allowed values. Checks that an evidence limitation is provided if the evidence grade is not high.
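Two of the checks above, on ruleID and evidence code, can be sketched in Python to show the kind of logic involved. This is illustrative only and does not reproduce the validator's implementation. The function names are hypothetical, the example IDs and codes are placeholders, and the definition of a rule ID "prefix" as everything before the trailing digits is an assumption, as is the set of delimiters treated as incorrect.

```python
import re


def check_rule_ids(rule_ids):
    """Return (duplicates, prefixes) for a list of rule IDs.

    ASSUMPTION: the prefix is whatever precedes the trailing digits.
    More than one prefix in the result means the prefix is inconsistent.
    """
    seen, duplicates, prefixes = set(), set(), set()
    for rule_id in rule_ids:
        if rule_id in seen:
            duplicates.add(rule_id)
        seen.add(rule_id)
        prefixes.add(re.sub(r"\d+$", "", rule_id))
    return sorted(duplicates), sorted(prefixes)


def check_evidence_codes(cell):
    """Return a list of problems found in one evidence code cell."""
    problems = []
    # ASSUMPTION: semicolons and pipes are the "other delimiters" to catch.
    if re.search(r"[;|]", cell):
        problems.append("use a comma to separate multiple codes")
    for code in re.split(r"[,;|]", cell):
        code = code.strip()
        if code and not code.startswith("ECO:"):
            problems.append(f"missing ECO: prefix: {code}")
    return problems


# Placeholder values, for illustration only.
print(check_rule_ids(["ABC0001", "ABC0002", "ABC0001", "XYZ0003"]))
print(check_evidence_codes("ECO:0000001; 0000002"))
```

A file passes the ruleID check in this sketch when the first list is empty and the second contains exactly one prefix.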
Allowable values for some columns are specified within constants.py, and should match the current version of the AMRrules spec.
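As an illustration of how a lookup of allowed values can drive the annotations, the sketch below covers three columns whose allowed values are stated in this README. The dictionary layout and the `annotate_cell` helper are hypothetical and are not taken from constants.py; the actual file may be organised differently.

```python
# Allowed values as stated in the checks above.
# ASSUMPTION: this layout is illustrative, not the contents of constants.py.
ALLOWED_VALUES = {
    "gene context": {"core", "acquired"},
    "phenotype": {"wildtype", "nonwildtype"},
    "clinical category": {"S", "I", "R"},
}


def annotate_cell(column, value):
    """Return the value unchanged if it is valid, otherwise an annotation.

    Only columns listed in ALLOWED_VALUES are checked; all three require
    a value, so an empty cell is reported as ENTRY MISSING.
    """
    if column not in ALLOWED_VALUES:
        return value
    cleaned = (value or "").strip()
    if not cleaned:
        return "ENTRY MISSING"
    if cleaned not in ALLOWED_VALUES[column]:
        return f"CHECK VALUE: {cleaned}"
    return cleaned


print(annotate_cell("phenotype", ""))
print(annotate_cell("gene context", "plasmid"))
print(annotate_cell("clinical category", "R"))
```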
After validation, you can clean a rules file to prepare it for import into the interpretation engine:
amrrule clean --input path/to/validated_rules.tsv --output path/to/clean_rules.tsv
The convert-to-latest-spec subcommand takes already validated v0.5 rules files and programmatically updates them to v0.6.
amrrule convert-to-latest-spec --input path/to/old_rules.tsv --output path/to/updated_rules.tsv
The conversion:
- Adds new columns as specified in the spec.
- For txid, uses the value in organism to look up the relevant txid in the NCBI taxonomy. If no match is found, it enters UNKNOWN.
- For RefSeq and GenBank accessions, looks up the accessions in the AMRFinderPlus reference database to try to determine whether they are protein or nucleotide accessions, and updates the columns accordingly. If the type is unclear, it adds the accession to one of the columns along with CHECK ACCESSION TYPE for manual fixing.
- Adds the breakpoint standard column but fills it with ADD VALUE. The correct value will likely be - in most cases.
- Updates the evidence grade column: strong becomes high, and weak is flagged with UPDATE TO low or very low so the user can select the most appropriate option.
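The evidence grade update described above amounts to a small mapping. The sketch below is an assumption-laden illustration, not the converter's code: `update_evidence_grade` is a hypothetical name, the flag text is taken literally from the description above, and leaving any other grade unchanged is an assumption.

```python
def update_evidence_grade(old_grade):
    """Map a v0.5 evidence grade to its v0.6 equivalent or a flag."""
    grade = old_grade.strip()
    if grade == "strong":
        return "high"
    if grade == "weak":
        # The user must choose between the two new options manually.
        return "UPDATE TO low or very low"
    # ASSUMPTION: any other grade is carried over unchanged.
    return grade


for old in ["strong", "weak", "moderate"]:
    print(f"{old} -> {update_evidence_grade(old)}")
```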
This project is licensed under the GNU General Public License v3.0.
Code was developed by Jane Hawkey, with input from Kat Holt and Natacha Couto.
For issues or questions, please use the GitHub issue tracker.