ncbo/umls2rdf

This project takes a MySQL Unified Medical Language System (UMLS) database and converts the ontologies to RDF using OWL and SKOS as the main schemas.

Virtual Appliance users can review the documentation in the OntoPortal Administration Guide.

Recommended workflow:

  • Install Python dependencies with pip install -r requirements.txt
  • Configure conf.py
  • Specify the SAB ontologies to export in umls.conf
  • Run the full resumable import/export pipeline with python run_umls_pipeline.py

Generated TTL files are written under a versioned output directory based on OUTPUT_FOLDER from conf.py. A common pattern is OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper(), which writes to output/2025AB.
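As a minimal sketch of that pattern, the conf.py fragment below shows how the versioned path is derived. The UMLS_VERSION value is illustrative, and storing it in lowercase is an assumption made only to show the effect of upper().

```python
# conf.py fragment (illustrative values, not the project's defaults)
UMLS_VERSION = "2025ab"  # assumed to be stored in lowercase here

# Derive a versioned output directory from the release name.
OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper()

print(OUTPUT_FOLDER)  # output/2025AB
```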

The umls.conf configuration file must contain one ontology per line. Each line is a comma-separated tuple with the following elements (note that this list needs updating):

(0) SAB
(1) BioPortal Virtual ID. This is optional, any value works.
(2) Output file name
(3) Conversion strategy. Accepted values (load_on_codes, load_on_cuis).

Note that the sources 'CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM' have no codes and should not be exported with the load_on_codes strategy.
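The line format and the no-code restriction can be checked before a long export starts. The following is a rough sketch rather than the parser umls2rdf.py itself uses: the function name parse_umls_conf, the sample lines, the virtual IDs, and the handling of "#" comment lines are all assumptions made for illustration.

```python
import csv
import io

# SABs listed in this README as having no codes.
NO_CODE_SABS = set(
    "CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX "
    "MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM".split()
)
STRATEGIES = {"load_on_codes", "load_on_cuis"}


def parse_umls_conf(text):
    """Return a list of (sab, virtual_id, output_file, strategy) tuples."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        fields = [field.strip() for field in row]
        # Skip blank lines and (assumed) comment lines.
        if not fields or not fields[0] or fields[0].startswith("#"):
            continue
        if len(fields) != 4:
            raise ValueError("expected 4 fields, got %r" % (row,))
        sab, virtual_id, output_file, strategy = fields
        if strategy not in STRATEGIES:
            raise ValueError("unknown strategy %r for %s" % (strategy, sab))
        if strategy == "load_on_codes" and sab in NO_CODE_SABS:
            raise ValueError("%s has no codes; use load_on_cuis" % sab)
        entries.append((sab, virtual_id, output_file, strategy))
    return entries


# Hypothetical sample content; the IDs and file names are placeholders.
SAMPLE = """\
SNOMEDCT_US,0,SNOMEDCT.ttl,load_on_codes
MTH,0,MTH.ttl,load_on_cuis
"""
entries = parse_umls_conf(SAMPLE)
```

Running a check like this first surfaces a bad strategy or a malformed line in seconds instead of partway through a multi-hour export.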

umls2rdf.py is designed to be an offline, run-once process. It is memory intensive and takes about 3h 30min to export all of the default ontologies in umls.conf. The ontologies listed in umls.conf are the UMLS ontologies accessible in BioPortal.

To download the full UMLS release archive outside the full pipeline, run:

python download_umls.py

The downloader returns the local path to the downloaded archive. This step only fetches and extracts the pre-built UMLS release; you still need to load the UMLS tables into MySQL before running umls2rdf.py. The script uses UMLS_VERSION and UMLS_API_KEY from conf.py. If UMLS_DOWNLOAD_DIR is set, the zip archive is stored under that directory. If it is not set, the library default ~/.data/bio/umls is used. By default, the archive is extracted into an extracted subdirectory next to the downloaded zip. You can override that location with UMLS_EXTRACT_DIR.
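The directory rules above can be summarized in a few lines. This is a sketch only: resolve_umls_paths is a hypothetical helper, the archive file name is a placeholder, and the real downloader may add further subdirectories (for example, one per version) beneath the download directory.

```python
import os


def resolve_umls_paths(archive_name, download_dir=None, extract_dir=None):
    """Return (zip_path, extract_path) following the rules described above.

    download_dir and extract_dir stand in for UMLS_DOWNLOAD_DIR and
    UMLS_EXTRACT_DIR from conf.py; None means "not set".
    """
    # Fall back to the library default when no download directory is set.
    default_dir = os.path.expanduser(os.path.join("~", ".data", "bio", "umls"))
    base = download_dir or default_dir
    zip_path = os.path.join(base, archive_name)
    # By default, extract into "extracted" next to the downloaded zip.
    default_extract = os.path.join(os.path.dirname(zip_path), "extracted")
    return zip_path, extract_dir or default_extract
```

For example, resolve_umls_paths("umls.zip", download_dir="/data/umls") yields a zip path under /data/umls and an extraction path of /data/umls/extracted.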

To create the target MySQL database with explicit UTF-8 settings outside the full pipeline, run:

python create_mysql_db.py

The script creates or updates DB_NAME from conf.py with utf8mb4 character set and utf8mb4_unicode_ci collation.
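For reference, "creates or updates" with these settings corresponds to standard MySQL statements like the ones built below. This is a sketch of equivalent SQL, not necessarily what create_mysql_db.py executes; build_create_db_statements is a hypothetical helper and the database name in the usage line is a placeholder.

```python
def build_create_db_statements(db_name):
    """Return SQL that creates the database or updates its defaults."""
    # Refuse names that would break out of the backtick-quoted identifier.
    if not db_name or "`" in db_name:
        raise ValueError("invalid database name: %r" % (db_name,))
    ident = "`%s`" % db_name
    settings = "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
    return [
        # Create the database when it does not exist yet.
        "CREATE DATABASE IF NOT EXISTS %s %s" % (ident, settings),
        # Update the defaults when the database already existed.
        "ALTER DATABASE %s %s" % (ident, settings),
    ]


for statement in build_create_db_statements("umls2025ab"):
    print(statement + ";")
```

Note that ALTER DATABASE changes only the defaults for tables created afterwards; existing tables keep their character set.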

To run the full UMLS pipeline end-to-end, use:

python run_umls_pipeline.py

The pipeline performs these stages:

  • Download the configured UMLS full release archive
  • Extract the release only when the extracted META directory is not already present
  • Recreate the configured DB_NAME and load it with the extracted META/populate_mysql_db.sh script
  • Run umls2rdf.py
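The conditional extraction in the second stage can be sketched with the standard library. The helper name extract_if_needed is hypothetical, and the sketch assumes META sits directly under the extraction directory, which may not match the layout of a real release archive.

```python
import os
import zipfile


def extract_if_needed(zip_path, extract_dir):
    """Extract zip_path unless extract_dir already contains a META directory.

    Returns True when extraction ran and False when it was skipped.
    """
    meta_dir = os.path.join(extract_dir, "META")
    if os.path.isdir(meta_dir):
        return False  # already extracted, nothing to do
    os.makedirs(extract_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    return True
```

Checking for the directory rather than unconditionally extracting keeps reruns cheap, since the full release archive is large.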

The pipeline patches loader settings from conf.py into a generated copy of populate_mysql_db.sh, and it patches META/mysql_tables.sql in place to replace @LINE_TERMINATION@.

Pipeline state is stored under PIPELINE_WORK_DIR (default: data/pipeline/<UMLS_VERSION>), and reruns skip completed steps after validating the extracted files, MySQL tables, and RDF output.

Add MYSQL_HOME to conf.py; if your MySQL client is at /usr/bin/mysql, set MYSQL_HOME = "/usr".

Pipeline stdout and stderr are appended to PIPELINE_LOG_FILE when set, or to data/pipeline/<UMLS_VERSION>/pipeline.log by default.
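The skip-on-rerun behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the code in run_umls_pipeline.py: the JSON state file, the run_step function, and the "done" marker are all hypothetical.

```python
import json
import os


def run_step(state_path, name, action, is_valid):
    """Run action() unless the step is recorded as done and still validates.

    state_path: JSON file holding per-step status (hypothetical format).
    is_valid:   callable that re-checks the step's output on disk or in MySQL.
    Returns "skipped" or "ran".
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)
    # A recorded step is skipped only when its output still validates.
    if state.get(name) == "done" and is_valid():
        return "skipped"
    action()
    state[name] = "done"
    os.makedirs(os.path.dirname(state_path) or ".", exist_ok=True)
    with open(state_path, "w") as handle:
        json.dump(state, handle, indent=2)
    return "ran"
```

Re-validating on every rerun, instead of trusting the recorded status alone, means a deleted output file or a dropped table causes the step to run again.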

If PROCESS_ONLY_CURRENT_UMLS_VERSION is set to True, the exporter only processes ontologies whose MRSAB.IMETA exactly matches UMLS_VERSION. Ontologies with a different value are skipped and logged.
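The version filter amounts to an exact string comparison per source. A minimal sketch, assuming the MRSAB rows have already been fetched as (SAB, IMETA) pairs; select_sabs is a hypothetical name and the sample rows are made up for illustration.

```python
def select_sabs(mrsab_rows, umls_version, only_current=True):
    """Split (sab, imeta) rows into SABs to export and rows to skip."""
    kept, skipped = [], []
    for sab, imeta in mrsab_rows:
        # Exact match only: no case folding or prefix matching.
        if only_current and imeta != umls_version:
            skipped.append((sab, imeta))
        else:
            kept.append(sab)
    return kept, skipped


# Illustrative rows, not real MRSAB content.
rows = [("SAB_A", "2025AB"), ("SAB_B", "2024AA")]
kept, skipped = select_sabs(rows, "2025AB")
for sab, imeta in skipped:
    print("skipping %s: IMETA %s does not match 2025AB" % (sab, imeta))
```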

About

These Python scripts connect to the Unified Medical Language System (UMLS) database and translate the ontologies into RDF/OWL files. This is part of the BioPortal project.
