This project takes a MySQL Unified Medical Language System (UMLS) database and converts the ontologies to RDF using OWL and SKOS as the main schemas.
Virtual Appliance users can review the documentation in the OntoPortal Administration Guide.
Recommended workflow:
- Install Python dependencies with pip install -r requirements.txt
- Configure conf.py
- Specify the SAB ontologies to export in umls.conf
- Run the full resumable import/export pipeline with python run_umls_pipeline.py
Generated TTL files are written under a versioned output directory based on
OUTPUT_FOLDER from conf.py. A common pattern is
OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper(), which writes to
output/2025AB.
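As a minimal sketch of that pattern, the fragment below shows how the folder name is derived. The version value is only an example; use the release you actually configured.

```python
# Illustrative conf.py fragment; the version string is an example value.
UMLS_VERSION = "2025ab"

# Upper-casing keeps the folder name the same however the version is typed.
OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper()

print(OUTPUT_FOLDER)  # output/2025AB
```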
The umls.conf configuration file must contain one ontology per line. Each line is a comma-separated tuple with the following elements (this list needs updating):
- (0) SAB
- (1) BioPortal Virtual ID. This is optional; any value works.
- (2) Output file name
- (3) Conversion strategy. Accepted values: load_on_codes, load_on_cuis.
Note that 'CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM' have no codes and should not be loaded with load_on_codes.
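The line format can be checked before a long run with a small parser. This is a sketch under assumptions, not the parsing code used by umls2rdf.py: the function name, the OntologyEntry type, and the sample SAB and file names are hypothetical.

```python
from typing import List, NamedTuple

# The two conversion strategies named in this README.
VALID_STRATEGIES = {"load_on_codes", "load_on_cuis"}


class OntologyEntry(NamedTuple):
    sab: str
    virtual_id: str
    output_file: str
    strategy: str


def parse_umls_conf(text: str) -> List[OntologyEntry]:
    """Parse umls.conf-style text: one comma-separated 4-tuple per line."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            raise ValueError("line %d: expected 4 fields, got %d" % (lineno, len(fields)))
        entry = OntologyEntry(*fields)
        if entry.strategy not in VALID_STRATEGIES:
            raise ValueError("line %d: unknown strategy %r" % (lineno, entry.strategy))
        entries.append(entry)
    return entries


# Hypothetical sample content; real SABs and file names will differ.
sample = "EXAMPLE_SAB,0,EXAMPLE.ttl,load_on_codes\nOTHER_SAB,0,OTHER.ttl,load_on_cuis\n"
entries = parse_umls_conf(sample)
for entry in entries:
    print(entry.sab, entry.output_file, entry.strategy)
```

Failing early on a misspelled strategy is cheaper than discovering it partway through a multi-hour export.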
umls2rdf.py is designed to be an offline, run-once process. It is memory-intensive and takes about 3h 30min to export all of the default ontologies in umls.conf. The ontologies listed in umls.conf are the UMLS ontologies accessible in BioPortal.
To download the full UMLS release archive outside the full pipeline, run:
python download_umls.py
The downloader returns the local path to the downloaded archive. This step only
fetches and extracts the pre-built UMLS release; you still need to load the
UMLS tables into MySQL before running umls2rdf.py. The script uses
UMLS_VERSION and UMLS_API_KEY from conf.py.
If UMLS_DOWNLOAD_DIR is set, the zip archive is stored under that
directory. If it is not set, the library default ~/.data/bio/umls
is used. By default, the archive is extracted into an
extracted subdirectory next to the downloaded zip. You can override
that location with UMLS_EXTRACT_DIR.
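The directory rules above can be summarized in a few lines. This is a sketch of the described behavior, not the code in download_umls.py; the function name and the archive file name are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

# Library default mentioned above, used when UMLS_DOWNLOAD_DIR is not set.
DEFAULT_DOWNLOAD_DIR = Path.home() / ".data" / "bio" / "umls"


def resolve_umls_paths(
    archive_name: str,
    download_dir: Optional[str] = None,
    extract_dir: Optional[str] = None,
) -> Tuple[Path, Path]:
    """Return (archive_path, extract_path) following the rules described above."""
    base = Path(download_dir).expanduser() if download_dir else DEFAULT_DOWNLOAD_DIR
    archive_path = base / archive_name
    if extract_dir:
        # UMLS_EXTRACT_DIR overrides the default location.
        extract_path = Path(extract_dir).expanduser()
    else:
        # Default: an "extracted" subdirectory next to the downloaded zip.
        extract_path = archive_path.parent / "extracted"
    return archive_path, extract_path


# "umls-full.zip" is a placeholder name, not the real archive name.
print(resolve_umls_paths("umls-full.zip", download_dir="/data/umls"))
```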
To create the target MySQL database with explicit UTF-8 settings outside the full pipeline, run:
python create_mysql_db.py
The script creates or updates DB_NAME from conf.py
with the utf8mb4 character set and the
utf8mb4_unicode_ci collation.
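For reference, the effect described here corresponds to standard MySQL statements like the ones built below. Whether create_mysql_db.py issues exactly these statements is an assumption; the helper name is hypothetical and the database name is an example.

```python
import re
from typing import List

DB_OPTIONS = "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"


def build_create_db_statements(db_name: str) -> List[str]:
    """Build SQL that creates the database, or updates its defaults if it exists."""
    # Identifiers cannot be bound as query parameters, so restrict them instead.
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError("unexpected database name: %r" % db_name)
    return [
        "CREATE DATABASE IF NOT EXISTS `%s` %s" % (db_name, DB_OPTIONS),
        "ALTER DATABASE `%s` %s" % (db_name, DB_OPTIONS),
    ]


for statement in build_create_db_statements("umls2025ab"):
    print(statement + ";")
```

The ALTER statement only changes the database defaults; tables that already exist keep their own character set.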
To run the full UMLS pipeline end-to-end, use:
python run_umls_pipeline.py
The pipeline performs these stages:
- Download the configured UMLS full release archive
- Extract the release only when the extracted META directory is not already present
- Recreate the configured DB_NAME and load it with the extracted META/populate_mysql_db.sh script
- Run umls2rdf.py
The pipeline patches loader settings from conf.py into a generated
copy of populate_mysql_db.sh, and it patches
META/mysql_tables.sql in place to replace
@LINE_TERMINATION@. Pipeline state is stored under
PIPELINE_WORK_DIR (default:
data/pipeline/<UMLS_VERSION>) and reruns skip completed steps
after validating the extracted files, MySQL tables, and RDF output. Add
MYSQL_HOME to conf.py; if your MySQL client is at
/usr/bin/mysql, set MYSQL_HOME = "/usr". Pipeline
stdout and stderr are appended to PIPELINE_LOG_FILE when set, or
to data/pipeline/<UMLS_VERSION>/pipeline.log by default.
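The skip-on-rerun behavior can be pictured with a small state file. The sketch below illustrates the general technique only; it is not the code in run_umls_pipeline.py, and the state file name, the step tuple layout, and the helper name are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# (step name, action to run, validator for the step's output)
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_steps(steps: List[Step], state_file: Path) -> List[str]:
    """Run steps in order, skipping those recorded as done whose output still validates."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    executed = []
    for name, action, is_valid in steps:
        if state.get(name) == "done" and is_valid():
            continue  # completed earlier and the result is still present
        action()
        state[name] = "done"
        # Persist after every step so an interrupted run can resume.
        state_file.parent.mkdir(parents=True, exist_ok=True)
        state_file.write_text(json.dumps(state, indent=2))
        executed.append(name)
    return executed


# Demo: a single "extract" step that creates a META directory.
work_dir = Path(tempfile.mkdtemp())
meta_dir = work_dir / "META"
state_file = work_dir / "state.json"
steps = [("extract", meta_dir.mkdir, meta_dir.is_dir)]

first = run_steps(steps, state_file)   # runs "extract"
second = run_steps(steps, state_file)  # skipped: recorded and still valid
print(first, second)
```

Validating the output rather than trusting the state file alone means a deleted directory or dropped table triggers that step again.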
If PROCESS_ONLY_CURRENT_UMLS_VERSION is set to True,
the exporter only processes ontologies whose MRSAB.IMETA exactly
matches UMLS_VERSION. Ontologies with a different value are skipped
and logged.
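A sketch of that filter, assuming the MRSAB.IMETA values have already been read into a dictionary keyed by SAB. The function name, the dictionary, and the sample SABs are hypothetical; the exporter's own code may be organized differently.

```python
import logging
from typing import Dict, Iterable, List, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("umls_filter_sketch")


def select_current_sabs(
    imeta_by_sab: Dict[str, str],
    requested: Iterable[str],
    umls_version: str,
    only_current: bool,
) -> Tuple[List[str], List[str]]:
    """Split requested SABs into (selected, skipped) by exact IMETA match."""
    selected, skipped = [], []
    for sab in requested:
        imeta = imeta_by_sab.get(sab)
        # Exact, case-sensitive comparison: "2025ab" does not match "2025AB".
        if only_current and imeta != umls_version:
            log.info("Skipping %s: IMETA %r != %r", sab, imeta, umls_version)
            skipped.append(sab)
        else:
            selected.append(sab)
    return selected, skipped


# Hypothetical values for illustration only.
imeta = {"SAB_A": "2025AB", "SAB_B": "2024AA"}
selected, skipped = select_current_sabs(imeta, ["SAB_A", "SAB_B"], "2025AB", True)
print(selected, skipped)
```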