Feature Request: Add getCodelistFromConceptSet() function for direct database query of concept sets #248

@merqurio

Currently, CodelistGenerator provides codesFromConceptSet() and codesFromCohort() functions that extract codelists from JSON files containing concept set expressions or cohort definitions.

We propose extending this support to concept sets that live in the database itself. In some OMOP CDM setups (such as those managed by IOMED), concept sets are stored directly in database tables (concept_set, concept_set_item, etc.) within the same database instance as the analysis data.

This feature request proposes adding a new function, getCodelistFromConceptSet(), that queries these database tables directly to build formal codelist objects, similar to how other functions in the package query vocabulary tables directly (e.g., getDrugIngredientCodes(), getICD10StandardCodes()).

Rationale

• Cleaner workflow: Eliminates the need to export/import JSON files when concept sets are already stored natively in the database.
• Consistency: Aligns with the package's philosophy of direct database queries for vocabulary-based codelists.
• Tested workflow: At IOMED, we maintain concept sets in dedicated database tables within the OMOP instance, allowing for streamlined querying without intermediate file handling.
• Efficiency: Reduces overhead of JSON parsing and file I/O when database access is already available.

Proposed Database Schema

The function would work with the standard OMOP CDM vocabulary tables plus a small extension consisting of two tables, concept_set and concept_set_item:

erDiagram
    concept_set ||--o{ concept_set_item : "has items"
    concept ||--o{ concept_set_item : "is included in"
    concept_set {
        int concept_set_id PK
        text concept_set_name
    }
    concept {
        int concept_id PK
        varchar concept_name
        varchar domain_id
        varchar vocabulary_id
        varchar concept_class_id
        varchar standard_concept
        varchar concept_code
        date valid_start_date
        date valid_end_date
        varchar invalid_reason
    }
    concept_set_item {
        int concept_set_id PK,FK
        int concept_id PK,FK
    }
    concept_class ||--o{ concept : "classifies"
    domain ||--o{ concept : "belongs to"
    vocabulary ||--o{ concept : "from"
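As a rough illustration of the two extension tables in the diagram, the sketch below creates them in an in-memory SQLite database. The choice of SQLite (via the DBI and RSQLite packages) and the exact column constraints are assumptions made for the example; a production OMOP instance would likely use a different backend and its own DDL.

```r
# Minimal sketch, assuming DBI and RSQLite are installed.
# SQLite is used only as a convenient stand-in for the real CDM database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# One row per concept set
DBI::dbExecute(con, "
  CREATE TABLE concept_set (
    concept_set_id   INTEGER PRIMARY KEY,
    concept_set_name TEXT
  )")

# One row per (concept set, concept) pair; the composite primary key
# prevents the same concept from being listed twice in one set.
# In a full CDM, concept_id would also reference concept (concept_id).
DBI::dbExecute(con, "
  CREATE TABLE concept_set_item (
    concept_set_id INTEGER NOT NULL REFERENCES concept_set (concept_set_id),
    concept_id     INTEGER NOT NULL,
    PRIMARY KEY (concept_set_id, concept_id)
  )")

created_tables <- DBI::dbListTables(con)
DBI::dbDisconnect(con)
```
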

Proposed Function Signature and Implementation

See OmopHelpers for the full implementation.

getCodelistFromConceptSet <- function(conceptSetId, con, cdmSchema) {
  # Lazy references to the required tables; no rows are fetched yet
  concept_set_tbl <- dplyr::tbl(con, dbplyr::in_schema(cdmSchema, "concept_set"))
  concept_set_item_tbl <- dplyr::tbl(con, dbplyr::in_schema(cdmSchema, "concept_set_item"))

  # Retrieve the name of the concept set to use as the codelist name.
  # `!!` injects the local value of conceptSetId into the translated SQL,
  # and distinct() deduplicates in the database rather than in R.
  codelistName <- concept_set_tbl |>
    dplyr::filter(.data$concept_set_id == !!conceptSetId) |>
    dplyr::select("concept_set_name") |>
    dplyr::distinct() |>
    dplyr::pull("concept_set_name")

  # Error handling: check if the concept set ID was found
  if (length(codelistName) == 0) {
    stop(glue::glue("No concept set found for concept_set_id: {conceptSetId}"))
  }
  # Warning if multiple names exist for the same ID
  if (length(codelistName) > 1) {
    warning(glue::glue("Multiple names found for concept_set_id: {conceptSetId}. Using the first one: '{codelistName[1]}'"))
    codelistName <- codelistName[1]
  }

  # clean_name() is a helper assumed to exist; it is not defined in this snippet
  codelistName <- clean_name(codelistName)

  # Retrieve all unique concept IDs associated with the concept set ID
  concept_ids <- concept_set_item_tbl |>
    dplyr::filter(.data$concept_set_id == !!conceptSetId) |>
    dplyr::select("concept_id") |>
    dplyr::distinct() |>
    dplyr::pull("concept_id")

  # Create a named list structure required by newCodelist
  codelist <- list(concept_ids) |>
    magrittr::set_names(codelistName)

  # Return the formal, validated codelist object
  return(omopgenerics::newCodelist(codelist))
}

Implementation Details

The function would:

  1. Query concept_set table: Retrieve the concept_set_name for the given conceptSetId to use as the codelist name.
  2. Query concept_set_item table: Get all associated concept_ids for the concept set.
  3. Name cleaning: Apply name standardization (e.g., via a clean_name() helper function).
  4. Codelist creation: Build a named list and return an omopgenerics::newCodelist object.
  5. Error handling: Validate that the concept set exists and handle edge cases like multiple names.

Dependencies

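• Requires the omopgenerics package for newCodelist()
• Uses dplyr and dbplyr for database operations (dbplyr::in_schema() for schema-qualified table references)
• Uses glue for error and warning messages, and magrittr for set_names()
• Assumes a clean_name() helper function, which is not defined in the snippet above (it could be added, or an existing package utility could be used instead)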
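Since clean_name() is not defined in this issue, the following is only one possible shape for it: a hypothetical base-R helper that lower-cases the name and collapses anything other than ASCII letters and digits into single underscores. The actual helper in OmopHelpers may behave differently, and if the package already provides a suitable utility, that would be preferable.

```r
# Hypothetical sketch of clean_name(); not the implementation referenced above.
clean_name <- function(x) {
  x <- tolower(trimws(x))            # normalise case, drop outer whitespace
  x <- gsub("[^a-z0-9]+", "_", x)    # runs of other characters become "_"
  x <- gsub("^_+|_+$", "", x)        # strip leading/trailing underscores
  x
}

cleaned <- clean_name("Type 2 Diabetes (broad)")
print(cleaned)  # "type_2_diabetes_broad"
```

Note that this version also replaces accented or other non-ASCII letters with underscores, which may matter for concept set names written in languages other than English.
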

Related Functions

• codesFromConceptSet(): Current JSON-based approach
• getDrugIngredientCodes(): Similar direct database querying pattern
• getICD10StandardCodes(): Another vocabulary table query function

Testing Considerations

• Unit tests with mock database containing concept_set tables
• Integration tests with real OMOP CDM databases
• Edge case testing (missing concept sets, empty results, etc.)
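One way the mock database could be set up is sketched below, using an in-memory SQLite connection as an assumption (DBI, RSQLite, dplyr and dbplyr installed; "main" is SQLite's default schema). The concept set names and concept IDs are made-up fixture values, not real OMOP concepts. The sketch runs the same two lookups the proposed function performs and checks the normal, empty, and missing cases.

```r
# Minimal test fixture; all values are illustrative only.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

DBI::dbWriteTable(con, "concept_set", data.frame(
  concept_set_id   = c(1L, 2L),
  concept_set_name = c("Mock set A", "Mock empty set")
))
DBI::dbWriteTable(con, "concept_set_item", data.frame(
  concept_set_id = c(1L, 1L),
  concept_id     = c(101L, 102L)
))

set_tbl  <- dplyr::tbl(con, dbplyr::in_schema("main", "concept_set"))
item_tbl <- dplyr::tbl(con, dbplyr::in_schema("main", "concept_set_item"))

# Same item lookup as in the proposed function
ids_for <- function(id) {
  item_tbl |>
    dplyr::filter(concept_set_id == !!id) |>
    dplyr::pull("concept_id")
}

ids_set1  <- ids_for(1L)   # populated concept set
ids_empty <- ids_for(2L)   # set exists but has no items

# Name lookup for an ID that is not in concept_set at all
name_missing <- set_tbl |>
  dplyr::filter(concept_set_id == 99L) |>
  dplyr::pull("concept_set_name")

stopifnot(
  setequal(ids_set1, c(101L, 102L)),
  length(ids_empty) == 0,
  length(name_missing) == 0
)

DBI::dbDisconnect(con)
```

With such a fixture in place (and before disconnecting), a call like getCodelistFromConceptSet(1, con, "main") could be exercised directly, and the missing-ID case could be checked for the expected error.
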
