From 0bd7547ec5a07437c297ba4264c043e72b3e2211 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Tue, 5 Mar 2024 12:13:29 -0800 Subject: [PATCH 01/19] Create README.md for MeSH import --- .../Medical_Subject_Headings/README.md | 132 ++++++++++++++++++ 1 file changed, 132 insertions(+) create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md new file mode 100644 index 0000000000..0db458bb0a --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -0,0 +1,132 @@ +# Importing Medical Subject Headings (MeSH) data from NCBI + +## Table of Contents + +1. [About the Dataset](#about-the-dataset) + 1. [Download Data](#download-data) + 2. [Overview](#overview) + 3. [Notes and Caveats](#notes-and-caveats) + 4. [dcid Generation](#dcid-generation) + 5. [License](#license) + 6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links) +2. [About the import](#about-the-import) + 1. [Artifacts](#artifacts) + 1. [Scripts](#scripts) + 2. [tMCF Files](#tmcf-files) + 2. [Import Procdeure](#import-procedure) + 3. [Tests](#tests) + +## About the Dataset + +“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Record and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). +Pubchem is one of the largest reservoirs of chemical compound information. It is mapped to many other medical ontologies, including +MeSH. 
More information about compound IDs and other properties can be found on their official [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds). + +### Download Data + +All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). The current xml files version can also be downloaded by running [`download.sh`](download.sh). For the purpose of mapping all mesh terms with each other, two xml files are used, namely: `desc2022.xml` and `supp2022.xml`. +The csv version of the file containing PubChem Compound ID and names can also be downloaded by running[`download.sh`](download.sh) + +### Overview + +This directory stores the scripts used to convert the xml obtained from the NCBI webpage into five different csv files, each describing the relation between supplementary records, concepts, terms, qualifiers and descriptors, and generating dcids for each. +The MeSH data stores the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts are used to map ther PubChem compound IDs to the MeSH descriptor and supplementary record IDs, joining on MeSH supplementary record name/PubChem compoundID. + +- For mapping the MeSH descriptor ID with the MeSH supplementary record ID, the [supplementary file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml) is used. +- For mapping the MeSH descriptor ID with each of the three other IDs: concept ID, term ID, qualifier ID, the [descriptor file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml) is used. +- For mapping the PubChem compound ID with the MeSH supplementary record and descriptor ID, the [pubchem file](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH) is used. + +### Notes and Caveats + +The main main file and the mesh supplementary file are both XML formatted. In addition, they're about 300-600 GB worth of storage. 
This is one the major contributors of extended run time for the scripts. Extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. + +To run the script [`format_mesh.py`](format_mesh.py), the user requires the `mesh-descriptor.xml` file, which outputs five different csv files, each relating to descriptor, concept, qualifier and term. + +To run the script [`format_mesh_record.py`](format_mesh_record.py), the user requires the `mesh_record.xml` file and the `mesh-pubchem.csv` file which maps the record to descriptor and to the pubchem compound ID, and outputs two csv files one for the supplementary records and one with the mappings between mesh and pubchem. Here, the goal is to map the mesh terms to the pubchem database of compounds. This is accomplished by mapping the mesh record name to the pubchem compound name. In addition, each mesh entity has a descriptor ID which is in turn linked to mesh suplementary record ID, and thus ultimately linked to pubchem compound ID. + + +### dcid Generation + +### License + +Any works found on National Library of Medicine (NLM) Web sites may be freely used or reproduced without permission in the U.S. More information about the license can be found [here](https://www.nlm.nih.gov/web_policies.html). + +### Dataset Documentation and Relevant Links + +## About the import + +### Artifacts + +#### Scripts + +##### Bash Scripts + +[`download.sh`](scripts/download.sh) downloads the desc, pa, qual, and supp xml files from MeSH as well as the CID-MeSH mapping file from pubchem. + +[`run.sh`](scripts/run.sh) converts raw data from MeSH into csv files formatted for import into the Data Commons knowledge graph. + +[`tests.sh`](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting. 
+ +##### Python Scripts + +[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into five formatted csv files, which each can be imported alongside it's matching tMCF. + +[`format_mesh_pan.py`](scripts/format_mesh_pa.py) converts the original csv file into one formatted csv file, which can be imported alongside it's matching tMCF. + +[`format_mesh_qual.py`](scripts/format_mesh_qual.py) converts the original xml into four formatted csv files, which each can be imported alongside it's matching tMCF. + +[`format_mesh_supp.py`](scripts/format_mesh_supp.py) converts the supplementary MeSH supplementary record file into a csv mapped to MeSH descriptor ID, +and it maps the MeSH supplementary records to pubchem compound IDs resulting in a second separate csv. + +#### tMCF Files + +The tMCF files that map each column in the corresponding CSV file to the appropriate property can be found [here](tmcf). They include: + +[`mesh_desc_concept.tmcf`](tMCFs/mesh_desc_concept.tmcf) contains the tmcf mapping to the csv of concept nodes generated from the mesh desc file. + +[`mesh_desc_descriptor.tmcf`](tMCFs/mesh_desc_descriptor.tmcf) contains the tmcf mapping to the csv of descriptor nodes generated from the mesh desc file. + +[`mesh_desc_qualifier.tmcf`](tMCFs/mesh_desc_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh desc file. + +[`mesh_desc_qualifier_mapping.tmcf`](tMCFs/mesh_desc_qualifier_mapping.tmcf) contains the tmcf mapping to the csv of descriptor qualifier mappings generated from the mesh desc file. + +[`mesh_desc_term.tmcf`](tMCFs/mesh_desc_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh desc file. + +[`mesh_pharmacological_action.tmcf`](tMCFs/mesh_pharmacological_action.tmcf) contains the tmcf mapping to the csv of pharmacological actions from the mesh pa file. 
+ +[`mesh_pubchem_mapping.tmcf`](tMCFs/mesh_pubchem_mapping.tmcf) contains the tmcf mapping to the csv of pubchem compound CIDs to MeSH Supplementary Records from the `CID-MESH.csv` and the mesh supp file. + +[`mesh_qual_concept.tmcf`](tMCFs/mesh_qual_concept.tmcf) contains the tmcf mapping to the csv of concept nodes generated from the mesh qual file. + +[`mesh_qual_concept_mapping.tmcf`](tMCFs/mesh_qual_concept_mapping.tmcf) contains the tmcf mapping to the csv of mappings of concept nodes to other mesh node types generated from the mesh qual file. + +[`mesh_qual_qualifier.tmcf`](tMCFs/mesh_qual_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh qual file. + +[`mesh_qual_term.tmcf`](tMCFs/mesh_qual_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh qual file. + +[`mesh_record.tmcf`](tMCFs/mesh_record.tmcf) ontains the tmcf mapping to the csv of supplementary record nodes generated from the mesh supp file. + +### Import Procedure + +Download the most recent versions of all mesh files (desc, pa, qual, and supp) and the pubchem file that maps CID to MeSH Supplementary Records: + +```bash +sh download.sh +``` + +Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file: + +```bash +sh run.sh +``` + +### Tests + +Run Data Commons's java -jar import tool to ensure that all schema used in the import is present in the graph, all referenced nodes are present in the graph, along with other warnings. Please note that empty tokens for some columns are expected as this reflects the original data. 
+To run tests: + +```bash +sh tests.sh +``` + +This will generate an output file for the results of the tests on each csv + tmcf pair From 491eeed2f7a9f27cf663133c105b109defa80767 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 13:53:44 -0700 Subject: [PATCH 02/19] Update README.md --- .../Medical_Subject_Headings/README.md | 25 ++++++++++--------- 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index 0db458bb0a..3af01254c9 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -18,18 +18,24 @@ ## About the Dataset -“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Record and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). -Pubchem is one of the largest reservoirs of chemical compound information. It is mapped to many other medical ontologies, including -MeSH. More information about compound IDs and other properties can be found on their official [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds). +“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. 
Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/).
+Pubchem is one of the largest reservoirs of chemical compound information. It is mapped to many other medical ontologies, including MeSH. More information about compound IDs and other properties can be found on their official [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds).
 
 ### Download Data
 
-All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). The current xml files version can also be downloaded by running [`download.sh`](download.sh). For the purpose of mapping all mesh terms with each other, two xml files are used, namely: `desc2022.xml` and `supp2022.xml`.
-The csv version of the file containing PubChem Compound ID and names can also be downloaded by running[`download.sh`](download.sh)
+All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). The current xml files version can also be downloaded by running [`download.sh`](download.sh). To represent the entirety of the MeSH ontology in Biomedical Data Commons we download all four xml files from MeSH: `desc.xml`, `pa.xml`, `qual.xml`, and `supp.xml`. We also download from pubchem the mapping file between pubchem compound ids (CIDs) and the unique ids of the corresponding MeSH Descriptors or Supplementary Concept Records (`CID-MeSH.csv`).
All files required for this import can be downloaded by running[`download.sh`](download.sh) ### Overview -This directory stores the scripts used to convert the xml obtained from the NCBI webpage into five different csv files, each describing the relation between supplementary records, concepts, terms, qualifiers and descriptors, and generating dcids for each. +In this import we use the four MeSH xml files to define MeSH Concept, Descriptor, Qualifiers, Supplementary Concept Records, and Terms as individual nodes as well as maintaining mappings to each other. We also maintain links between these data types to one other as indicated below. Furthermore, + +SCR point to descriptors via parent +terms point to concepts via parent +concepts point to qualifiers via hasMeSHQualifier +concepts point to other concepts via preferredConcept +descriptors point to descriptors via specializationOf +descriptors point to qualifiers via hasMeSHQualifier +concepts point to descriptors via parent The MeSH data stores the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts are used to map ther PubChem compound IDs to the MeSH descriptor and supplementary record IDs, joining on MeSH supplementary record name/PubChem compoundID. - For mapping the MeSH descriptor ID with the MeSH supplementary record ID, the [supplementary file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml) is used. @@ -38,12 +44,7 @@ The MeSH data stores the vocabulary thesaurus used for indexing articles for Pub ### Notes and Caveats -The main main file and the mesh supplementary file are both XML formatted. In addition, they're about 300-600 GB worth of storage. This is one the major contributors of extended run time for the scripts. Extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. 
- -To run the script [`format_mesh.py`](format_mesh.py), the user requires the `mesh-descriptor.xml` file, which outputs five different csv files, each relating to descriptor, concept, qualifier and term. - -To run the script [`format_mesh_record.py`](format_mesh_record.py), the user requires the `mesh_record.xml` file and the `mesh-pubchem.csv` file which maps the record to descriptor and to the pubchem compound ID, and outputs two csv files one for the supplementary records and one with the mappings between mesh and pubchem. Here, the goal is to map the mesh terms to the pubchem database of compounds. This is accomplished by mapping the mesh record name to the pubchem compound name. In addition, each mesh entity has a descriptor ID which is in turn linked to mesh suplementary record ID, and thus ultimately linked to pubchem compound ID. - +The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format and therefore use the python package `xml.etree.ElementTree` to read these files into pandas dataframes for further processing. Please, note that extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. Please note that special care needs to be given when traversing through the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type. As part of this process, we ended up making a seperate csv+tmcf pair for each node type from each file with an additional mapping csv+tmcf file pair to bring in mappings between node types as necessary. Finally, we also decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, and thus these properties did not contain any additional information. 
### dcid Generation From 5076edf83ae1bd1d77a5e932163e5b69da764f78 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 14:26:41 -0700 Subject: [PATCH 03/19] Update README.md --- .../Medical_Subject_Headings/README.md | 46 ++++++++++++------- 1 file changed, 30 insertions(+), 16 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index 3af01254c9..8243569bef 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -18,7 +18,7 @@ ## About the Dataset -“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). +“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). This dataset is updated on an annual basis on the first of January every year. Pubchem is one of the largest reservoirs of chemical compound information. 
It is mapped to many other medical ontologies, including MeSH. More information about compound IDs and other properties can be found on their official [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds). ### Download Data @@ -27,26 +27,39 @@ All the terminology referenced in the MeSH data can be downloaded in various for ### Overview -In this import we use the four MeSH xml files to define MeSH Concept, Descriptor, Qualifiers, Supplementary Concept Records, and Terms as individual nodes as well as maintaining mappings to each other. We also maintain links between these data types to one other as indicated below. Furthermore, - -SCR point to descriptors via parent -terms point to concepts via parent -concepts point to qualifiers via hasMeSHQualifier -concepts point to other concepts via preferredConcept -descriptors point to descriptors via specializationOf -descriptors point to qualifiers via hasMeSHQualifier -concepts point to descriptors via parent -The MeSH data stores the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts are used to map ther PubChem compound IDs to the MeSH descriptor and supplementary record IDs, joining on MeSH supplementary record name/PubChem compoundID. - -- For mapping the MeSH descriptor ID with the MeSH supplementary record ID, the [supplementary file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml) is used. -- For mapping the MeSH descriptor ID with each of the three other IDs: concept ID, term ID, qualifier ID, the [descriptor file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml) is used. -- For mapping the PubChem compound ID with the MeSH supplementary record and descriptor ID, the [pubchem file](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH) is used. +MeSH provides the vocabulary thesaurus used for indexing articles for PubMed. 
In addition, the scripts are used to map the PubChem compound IDs to the MeSH descriptors and supplementary concept records. In this import we use the four MeSH xml files to define MeSH Concepts, Descriptors, Qualifiers, Supplementary Concept Records, and Terms as individual nodes and to maintain the mappings between them. The links between these node types are indicated below. An overview of the MeSH Record Types can be found [here](https://www.nlm.nih.gov/mesh/intro_record_types.html).
+
+| Node Type | Property | Property Value Range (Out Link Node Type) |
+| --- | --- | --- |
+| MeSHConcept | preferredConcept | MeSHConcept |
+| MeSHConcept | parent | MeSHDescriptor |
+| MeSHConcept | hasMeSHQualifier | MeSHQualifier |
+| MeSHDescriptor | sameAs | ChemicalCompound |
+| MeSHDescriptor | mechanismOfAction | MeSHDescriptor |
+| MeSHDescriptor | specializationOf | MeSHDescriptor |
+| MeSHDescriptor | hasMeSHQualifier | MeSHQualifier |
+| MeSHSupplementaryConceptRecord | mechanismOfAction | MeSHDescriptor |
+| MeSHSupplementaryConceptRecord | parent | MeSHDescriptor |
+| MeSHTerm | parent | MeSHConcept |
 
 ### Notes and Caveats
 
 The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format and therefore use the python package `xml.etree.ElementTree` to read these files into pandas dataframes for further processing. Please, note that extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. Please note that special care needs to be given when traversing through the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type.
As part of this process, we ended up making a seperate csv+tmcf pair for each node type from each file with an additional mapping csv+tmcf file pair to bring in mappings between node types as necessary. Finally, we also decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, and thus these properties did not contain any additional information.
 
+The `pa.xml` file provided information on the pharmalogical action or mechanismOfAction of MeSHDescriptor and MeSHSupplementaryConceptRecord nodes. This provides pharmacological information about a subset of applicable MeSH records. Therefore, for MeSHDescriptor and MeSHSupplementaryConceptRecord nodes that were included in the `pa.xml` as having mechanismOfAction that are connected MeShDescriptor nodes, we noted that these nodes were of Drug node type as well.
+
 ### dcid Generation
+The dcids for all MeSHRecordType nodes (MeSHConcept, MeSHDescriptor, MeSHQualifier, MeSHSupplementaryConceptRecord, and MeSHTerm) are generated from the MeSH unique ids with the bio prefix: `bio/`. Each MeSH unique id consists of a letter followed by a string of numbers, and the starting letter indicates the MeSH record type. The mapping from the first letter of the unique id to the MeSH record type is indicated below. In addition to being used to generate the dcid, the unique id is recorded as the value of the `identifier` property for all MeSHRecordType nodes.
+
+| Node Type | Starting Letter for MeSH unique ID |
+| --- | --- |
+| MeSHConcept | M |
+| MeSHDescriptor | D |
+| MeSHQualifier | Q |
+| MeSHSupplementaryConceptRecord | C |
+| MeSHTerm | T |
+
+The dcids for ChemicalCompounds were generated using the PubChem compound ID with the chem prefix: `chem/CID`. The PubChem Compound ID provided by PubChem is a string of numbers; therefore, we added the specifier to the front of this id as part of the dcid to provide context. The PubChem Compound ID is also separately stored as a string value of the property `pubChemCompoundID`.
 
 ### License
 
@@ -123,7 +136,8 @@ sh run.sh
 
 ### Tests
 
-Run Data Commons's java -jar import tool to ensure that all schema used in the import is present in the graph, all referenced nodes are present in the graph, along with other warnings. Please note that empty tokens for some columns are expected as this reflects the original data.
+The first step of `tests.sh` is to download Data Commons's java -jar import tool, storing it in a `tmp` directory. This assumes that the user has Java Runtime Environment (JRE) installed. This tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/). The releases of the tool can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import is present in the graph and all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings upon failing. Please note that empty tokens for some columns are expected as this reflects the original data. All referenced nodes are created as part of the same csv+tmcf import pair; therefore, any Existence Missing Reference warnings can be ignored.
+ To run tests: ```bash From e7f0cd858534d4c1b8b625c08c613b8de79749e2 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 14:54:54 -0700 Subject: [PATCH 04/19] Update README.md update tmcf files in the import --- .../NIH_NLM/Medical_Subject_Headings/README.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index 8243569bef..0e6dbb17a3 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -44,7 +44,7 @@ MeSH provides the vocabulary thesaurus used for indexing articles for PubMed. In ### Notes and Caveats -The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format and therefore use the python package `xml.etree.ElementTree` to read these files into pandas dataframes for further processing. Please, note that extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. Please note that special care needs to be given when traversing through the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type. As part of this process, we ended up making a seperate csv+tmcf pair for each node type from each file with an additional mapping csv+tmcf file pair to bring in mappings between node types as necessary. Finally, we also decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, and thus these properties did not contain any additional information. +The total file size of all original downloaded files for this import is ~1.1 GB. 
The files from MeSH are in XML format and therefore use the python package `xml.etree.ElementTree` to read these files into pandas dataframes for further processing. Please, note that extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. Please note that special care needs to be given when traversing through the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type. As part of this process, we ended up making a seperate csv+tmcf pair for each node type from each file with an additional mapping csv+tmcf file pair to bring in mappings between node types as necessary. The total file size for all fourteen formatted csvs is ~135 MB. Finally, we also decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, and thus these properties did not contain any additional information. The `pa.xml` file provided information on the pharmalogical action or mechanismOfAction of MeSHDescriptor and MeSHSupplementaryConceptRecord nodes. This provides pharmacological information about a subset of applicable MeSH records. Therefore, for MeSHDescriptor and MeSHSupplementaryConceptRecord nodes that were included in the `pa.xml` as having mechanismOfAction that are connected MeShDescriptor nodes, we noted that these nodes were of Drug node type as well. @@ -83,9 +83,9 @@ Any works found on National Library of Medicine (NLM) Web sites may be freely us ##### Python Scripts -[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into five formatted csv files, which each can be imported alongside it's matching tMCF. 
+[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into six formatted csv files, which each can be imported alongside it's matching tMCF. -[`format_mesh_pan.py`](scripts/format_mesh_pa.py) converts the original csv file into one formatted csv file, which can be imported alongside it's matching tMCF. +[`format_mesh_pan.py`](scripts/format_mesh_pa.py) converts the original csv file into two formatted csv files, which can be imported alongside it's matching tMCF. [`format_mesh_qual.py`](scripts/format_mesh_qual.py) converts the original xml into four formatted csv files, which each can be imported alongside it's matching tMCF. @@ -100,13 +100,17 @@ The tMCF files that map each column in the corresponding CSV file to the appropr [`mesh_desc_descriptor.tmcf`](tMCFs/mesh_desc_descriptor.tmcf) contains the tmcf mapping to the csv of descriptor nodes generated from the mesh desc file. +[`mesh_desc_descriptor_mapping.tmcf`](tMCFs/mesh_desc_descriptor_mapping.tmcf) contains the tmcf mapping to the csv of mapping of descriptor nodes to parent (more general) descriptor nodes from the mesh desc file. + [`mesh_desc_qualifier.tmcf`](tMCFs/mesh_desc_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh desc file. [`mesh_desc_qualifier_mapping.tmcf`](tMCFs/mesh_desc_qualifier_mapping.tmcf) contains the tmcf mapping to the csv of descriptor qualifier mappings generated from the mesh desc file. [`mesh_desc_term.tmcf`](tMCFs/mesh_desc_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh desc file. -[`mesh_pharmacological_action.tmcf`](tMCFs/mesh_pharmacological_action.tmcf) contains the tmcf mapping to the csv of pharmacological actions from the mesh pa file. 
+[`mesh_pharmacological_action_descriptor.tmcf`](tMCFs/mesh_pharmacological_action.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh descriptors to the appropriate mesh descriptor nodes from the mesh pa file. + +[`mesh_pharmacological_action_record.tmcf`](tMCFs/mesh_pharmacological_action.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh supplementary concept records to the appropriate mesh descriptor nodes from the mesh pa file. [`mesh_pubchem_mapping.tmcf`](tMCFs/mesh_pubchem_mapping.tmcf) contains the tmcf mapping to the csv of pubchem compound CIDs to MeSH Supplementary Records from the `CID-MESH.csv` and the mesh supp file. From 7f2443936616b20efff552c9974f40163096bad3 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 15:44:30 -0700 Subject: [PATCH 05/19] Update README.md --- .../NIH_NLM/Medical_Subject_Headings/README.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index 0e6dbb17a3..efb6857c72 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -44,7 +44,7 @@ MeSH provides the vocabulary thesaurus used for indexing articles for PubMed. In ### Notes and Caveats -The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format and therefore use the python package `xml.etree.ElementTree` to read these files into pandas dataframes for further processing. Please, note that extracting the information from XML formatted tags and converting it into well-formatted csv involve a lot of computationally heavy steps, which depends on the RAM of the user's system. 
Please note that special care needs to be given when traversing through the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type. As part of this process, we ended up making a seperate csv+tmcf pair for each node type from each file with an additional mapping csv+tmcf file pair to bring in mappings between node types as necessary. The total file size for all fourteen formatted csvs is ~135 MB. Finally, we also decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, and thus these properties did not contain any additional information. +The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format, so the scripts use the Python package `xml.etree.ElementTree` to read them into pandas dataframes for further processing. Please note that extracting the information from XML tags and converting it into well-formatted csv involves several computationally heavy steps, so performance depends on the RAM of the user's system. Special care is also needed when traversing the XML tree to ensure that the properties at each level are associated with the appropriate MeSHTerm node type. As part of this process, we made a separate csv+tmcf pair for each node type from each file, plus additional mapping csv+tmcf pairs to bring in links between node types as necessary. The total file size for all sixteen formatted csvs is ~135 MB. Finally, we decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False` respectively, so these properties did not contain any additional information.
The `pa.xml` file provides information on the pharmacological action, or mechanismOfAction, of MeSHDescriptor and MeSHSupplementaryConceptRecord nodes. This covers a subset of applicable MeSH records. Therefore, MeSHDescriptor and MeSHSupplementaryConceptRecord nodes that `pa.xml` lists as having a mechanismOfAction linked to MeSHDescriptor nodes are also assigned the Drug node type. @@ -83,7 +83,7 @@ Any works found on National Library of Medicine (NLM) Web sites may be freely us ##### Python Scripts -[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into six formatted csv files, which each can be imported alongside it's matching tMCF. +[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into eight formatted csv files, each of which can be imported alongside its matching tMCF. [`format_mesh_pa.py`](scripts/format_mesh_pa.py) converts the original xml into two formatted csv files, each of which can be imported alongside its matching tMCF. @@ -98,16 +98,20 @@ The tMCF files that map each column in the corresponding CSV file to the appropr [`mesh_desc_concept.tmcf`](tMCFs/mesh_desc_concept.tmcf) contains the tmcf mapping to the csv of concept nodes generated from the mesh desc file. +[`mesh_desc_concept_mapping.tmcf`](tMCFs/mesh_desc_concept_mapping.tmcf) contains the tmcf mapping to the csv that links concept nodes to descriptor nodes, generated from the mesh desc file. + [`mesh_desc_descriptor.tmcf`](tMCFs/mesh_desc_descriptor.tmcf) contains the tmcf mapping to the csv of descriptor nodes generated from the mesh desc file. -[`mesh_desc_descriptor_mapping.tmcf`](tMCFs/mesh_desc_descriptor_mapping.tmcf) contains the tmcf mapping to the csv of mapping of descriptor nodes to parent (more general) descriptor nodes from the mesh desc file.
+[`mesh_desc_descriptor_mapping.tmcf`](tMCFs/mesh_desc_descriptor_mapping.tmcf) contains the tmcf mapping to the csv that links descriptor nodes to their parent (more general) descriptor nodes from the mesh desc file. [`mesh_desc_qualifier.tmcf`](tMCFs/mesh_desc_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh desc file. -[`mesh_desc_qualifier_mapping.tmcf`](tMCFs/mesh_desc_qualifier_mapping.tmcf) contains the tmcf mapping to the csv of descriptor qualifier mappings generated from the mesh desc file. +[`mesh_desc_qualifier_mapping.tmcf`](tMCFs/mesh_desc_qualifier_mapping.tmcf) contains the tmcf mapping to the csv that links descriptor nodes to qualifier nodes, generated from the mesh desc file. [`mesh_desc_term.tmcf`](tMCFs/mesh_desc_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh desc file. +[`mesh_desc_term_mapping.tmcf`](tMCFs/mesh_desc_term_mapping.tmcf) contains the tmcf mapping to the csv that links term nodes to concept nodes from the mesh desc file. + [`mesh_pharmacological_action_descriptor.tmcf`](tMCFs/mesh_pharmacological_action_descriptor.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh descriptors to the appropriate mesh descriptor nodes from the mesh pa file. [`mesh_pharmacological_action_record.tmcf`](tMCFs/mesh_pharmacological_action_record.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh supplementary concept records to the appropriate mesh descriptor nodes from the mesh pa file.
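Note (illustrative only, not part of the patch): the tMCF files described above bind CSV columns to node properties through `C:<table>-><column>` references. The following is a minimal Python sketch of that substitution for a single row. The `resolve_tmcf` helper and the sample row values are hypothetical and made up for illustration; this is a simplified stand-in, not the Data Commons import tool, which also handles entity references, typing, and validation.

```python
import re

# Hypothetical helper for illustration only; not the Data Commons importer.
# Matches column references of the form C:<table>-><column>.
COLUMN_REF = re.compile(r'C:(\w+)->(\w+)')


def resolve_tmcf(template, table, row):
    """Replace C:<table>-><column> references with values from one CSV row."""

    def substitute(match):
        ref_table, column = match.group(1), match.group(2)
        if ref_table != table:
            # leave references to other tables untouched
            return match.group(0)
        return str(row[column])

    return COLUMN_REF.sub(substitute, template)


# Template text mirrors the structure of mesh_desc_term.tmcf
template = ('Node: E:mesh_desc_term->E2\n'
            'typeOf: dcs:MeSHTerm\n'
            'dcid: C:mesh_desc_term->Term_dcid\n'
            'name: C:mesh_desc_term->TermName\n'
            'identifier: C:mesh_desc_term->TermID\n')

# Sample row with made-up values (assumption, not taken from the MeSH data)
row = {
    'Term_dcid': 'bio/T000001',
    'TermName': '"Example Term"',
    'TermID': 'T000001'
}

resolved = resolve_tmcf(template, 'mesh_desc_term', row)
print(resolved)
```

Entity references such as `E:mesh_desc_term->E2` are deliberately left alone by this sketch, since only column references carry per-row values.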
From b58e5a5c70e18ff413acfdbe020d4f3cfbaf4a3c Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 16:28:01 -0700 Subject: [PATCH 06/19] Add tmcf files --- .../tMCFs/mesh_desc_concept.tmcf | 7 +++++++ .../tMCFs/mesh_desc_concept_mapping.tmcf | 8 ++++++++ .../tMCFs/mesh_desc_descriptor.tmcf | 11 +++++++++++ .../tMCFs/mesh_desc_descriptor_mapping.tmcf | 8 ++++++++ .../tMCFs/mesh_desc_qualifier.tmcf | 6 ++++++ .../tMCFs/mesh_desc_qualifier_mapping.tmcf | 10 ++++++++++ .../tMCFs/mesh_desc_term.tmcf | 5 +++++ .../tMCFs/mesh_desc_term_mapping.tmcf | 9 +++++++++ .../mesh_pharmacological_action_descriptor.tmcf | 14 ++++++++++++++ .../tMCFs/mesh_pharmacological_action_record.tmcf | 12 ++++++++++++ .../tMCFs/mesh_pubchem_mapping.tmcf | 11 +++++++++++ .../tMCFs/mesh_qual_concept.tmcf | 13 +++++++++++++ .../tMCFs/mesh_qual_concept_mapping.tmcf | 8 ++++++++ .../tMCFs/mesh_qual_qualifier.tmcf | 11 +++++++++++ .../tMCFs/mesh_qual_term.tmcf | 15 +++++++++++++++ .../tMCFs/mesh_record.tmcf | 13 +++++++++++++ 16 files changed, 161 insertions(+) create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf create mode 100644 
scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf new file mode 100644 index 0000000000..10525e0b22 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf @@ -0,0 +1,7 @@ +Node: E:mesh_desc_concept->E1 +typeOf: dcs:MeSHConcept +dcid: C:mesh_desc_concept->Concept_dcid +name: C:mesh_desc_concept->ConceptName +description: C:mesh_desc_concept->ScopeNote +identifier: C:mesh_desc_concept->ConceptID + diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf new file mode 100644 index 0000000000..cd44c39840 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf @@ -0,0 +1,8 @@ +Node: E:mesh_desc_concept->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_desc_concept->Descriptor_dcid + +Node: E:mesh_desc_concept->E2 +typeOf: dcs:MeSHConcept +dcid: C:mesh_desc_concept->Concept_dcid +parent: E:mesh_desc_concept->E1 
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf new file mode 100644 index 0000000000..8283498f26 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf @@ -0,0 +1,11 @@ +Node: E:mesh_desc_descriptor->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_desc_descriptor->Descriptor_dcid +name: C:mesh_desc_descriptor->DescriptorName +dateCreated: C:mesh_desc_descriptor->DateCreated +dateRevised: C:mesh_desc_descriptor->DateRevised +dateEstablished: C:mesh_desc_descriptor->DateEstablished +description: C:mesh_desc_descriptor->ScopeNote +identifier: C:mesh_desc_descriptor->DescriptorID +medicalSubjectHeadingTreeNumber: C:mesh_desc_descriptor->TreeNumber +nationalLibraryOfMedicineClassificationNumber: C:mesh_desc_descriptor->NLMClassificationNumber diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf new file mode 100644 index 0000000000..06124cb138 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf @@ -0,0 +1,8 @@ +Node: E:mesh_desc_descriptor->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_desc_descriptor->DescriptorParentID + +Node: E:mesh_desc_descriptor->E2 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_desc_descriptor->Descriptor_dcid +specializationOf: E:mesh_desc_descriptor->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf new file mode 100644 index 0000000000..49bb698c79 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf @@ -0,0 +1,6 @@ +Node: E:mesh_desc_qualifier->E1 +typeOf: dcs:MeSHQualifier +dcid: 
C:mesh_desc_qualifier->Qualifier_dcid +name: C:mesh_desc_qualifier->QualifierName +identifier: C:mesh_desc_qualifier->QualifierID +abbreviation: C:mesh_desc_qualifier->QualifierAbbreviation diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf new file mode 100644 index 0000000000..96c863be34 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf @@ -0,0 +1,10 @@ +Node: E:mesh_desc_descriptor_qualifier_mapping->E1 +typeOf: dcs:MeSHQualifier +dcid: C:mesh_desc_descriptor_qualifier_mapping->Qualifier_dcid +identifier: C:mesh_desc_descriptor_qualifier_mapping->QualifierID + +Node: E:mesh_desc_descriptor_qualifier_mapping->E2 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_desc_descriptor_qualifier_mapping->Descriptor_dcid +identifier: C:mesh_desc_descriptor_qualifier_mapping->DescriptorID +hasMeSHQualifier: E:mesh_desc_descriptor_qualifier_mapping->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf new file mode 100644 index 0000000000..e1d1dbb173 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf @@ -0,0 +1,5 @@ +Node: E:mesh_desc_term->E2 +typeOf: dcs:MeSHTerm +dcid: C:mesh_desc_term->Term_dcid +name: C:mesh_desc_term->TermName +identifier: C:mesh_desc_term->TermID diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf new file mode 100644 index 0000000000..a9c4494b6d --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf @@ -0,0 +1,9 @@ +Node: E:mesh_desc_term->E1 +typeOf: dcs:MeSHConcept +dcid: C:mesh_desc_term->Concept_dcid 
+identifier: C:mesh_desc_term->ConceptID + +Node: E:mesh_desc_term->E2 +typeOf: dcs:MeSHTerm +dcid: C:mesh_desc_term->Term_dcid +parent: E:mesh_desc_term->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf new file mode 100644 index 0000000000..712948c4ee --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf @@ -0,0 +1,14 @@ +Node: E:mesh_pharmacological_action_descriptor->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_pharmacological_action_descriptor->Descriptor_dcid +name: C:mesh_pharmacological_action_descriptor->DescriptorName +identifier: C:mesh_pharmacological_action_descriptor->DescriptorUI + +Node: E:mesh_pharmacological_action_descriptor->E2 +typeOf: dcs:MeSHDescriptor +typeOf: schema:Drug +dcid: C:mesh_pharmacological_action_descriptor->dcid +name: C:mesh_pharmacological_action_descriptor->RecordName +identifier: C:mesh_pharmacological_action_descriptor->RecordUI +mechanismOfAction: E:mesh_pharmacological_action_descriptor->E1 + \ No newline at end of file diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf new file mode 100644 index 0000000000..f915fca71b --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf @@ -0,0 +1,12 @@ +Node: E:mesh_pharmacological_action_record->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_pharmacological_action_record->Descriptor_dcid +identifier: C:mesh_pharmacological_action_record->DescriptorUI + +Node: E:mesh_pharmacological_action_record->E2 +typeOf: dcs:MeSHSupplementaryConceptRecord +typeOf: schema:Drug +dcid: C:mesh_pharmacological_action_record->dcid +name: 
C:mesh_pharmacological_action_record->RecordName +identifier: C:mesh_pharmacological_action_record->RecordUI +mechanismOfAction: E:mesh_pharmacological_action_record->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf new file mode 100644 index 0000000000..1bd6d1a962 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf @@ -0,0 +1,11 @@ +Node: E:mesh_pubchem_mapping->E1 +typeOf: dcs:ChemicalCompound +dcid: C:mesh_pubchem_mapping->CID_dcid +pubChemCompoundID: C:mesh_pubchem_mapping->CID + +Node: E:mesh_pubchem_mapping->E2 +typeOf: dcs:MeSHSupplementaryRecord +dcid: C:mesh_pubchem_mapping->Record_dcid +name: C:mesh_pubchem_mapping->RecordName +identifier: C:mesh_pubchem_mapping->RecordID +sameAs: E:mesh_pubchem_mapping->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf new file mode 100644 index 0000000000..e46cff01b5 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf @@ -0,0 +1,13 @@ +Node: E:mesh_qual_concept->E1 +typeOf: dcs:MeSHQualifier +dcid: C:mesh_qual_concept->Qualifier_dcid +identifier: C:mesh_qual_concept->QualifierUI + +Node: E:mesh_qual_concept->E2 +typeOf: dcs:MeSHConcept +dcid: C:mesh_qual_concept->Concept_dcid +name: C:mesh_qual_concept->ConceptName +description: C:mesh_qual_concept->ScopeNote +identifier: C:mesh_qual_concept->ConceptUI +isPreferredConcept: C:mesh_qual_concept->IsPreferredConcept +hasMeSHQualifier: E:mesh_qual_concept->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf new file mode 100644 index 0000000000..9bf2fb73e3 --- /dev/null +++
b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf @@ -0,0 +1,8 @@ +Node: E:mesh_qual_concept_mapping->E1 +typeOf: dcs:MeSHConcept +dcid: C:mesh_qual_concept_mapping->Preferred_Concept_dcid + +Node: E:mesh_qual_concept_mapping->E2 +typeOf: dcs:MeSHConcept +dcid: C:mesh_qual_concept_mapping->Concept_dcid +preferredMeSHConcept: E:mesh_qual_concept_mapping->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf new file mode 100644 index 0000000000..cb2d81c7e8 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf @@ -0,0 +1,11 @@ +Node: E:mesh_qual_qualifier->E1 +typeOf: dcs:MeSHQualifier +dcid: C:mesh_qual_qualifier->Qualifier_dcid +name: C:mesh_qual_qualifier->QualifierName +dateCreated: C:mesh_qual_qualifier->DateCreated +dateRevised: C:mesh_qual_qualifier->DateRevised +dateEstablished: C:mesh_qual_qualifier->DateEstablished +description: C:mesh_qual_qualifier->Annotation +identifier: C:mesh_qual_qualifier->QualifierUI +note: C:mesh_qual_qualifier->HistoryNote +medicalSubjectHeadingTreeNumber: C:mesh_qual_qualifier->TreeNumber diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf new file mode 100644 index 0000000000..753cca26ba --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf @@ -0,0 +1,15 @@ +Node: E:mesh_qual_term->E1 +typeOf: dcs:MeSHConcept +identifier: C:mesh_qual_term->ConceptUI + +Node: E:mesh_qual_term->E2 +typeOf: dcs:MeSHTerm +dcid: C:mesh_qual_term->Concept_dcid +name: C:mesh_qual_term->TermName +abbreviation: C:mesh_qual_term->Abbreviation +abbreviation: C:mesh_qual_term->Display +dateCreated: C:mesh_qual_term->DateCreated +identifier: C:mesh_qual_term->TermUI 
+isConceptPreferedTerm: C:mesh_qual_term->is_concept_preferred_term +isRecordPreferedTerm: C:mesh_qual_term->is_record_preferred_term +parent: E:mesh_qual_term->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf new file mode 100644 index 0000000000..19970af422 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf @@ -0,0 +1,13 @@ +Node: E:mesh_record->E1 +typeOf: dcs:MeSHDescriptor +dcid: C:mesh_record->Descriptor_dcid +identifier: C:mesh_record->DescriptorID + +Node: E:mesh_record->E2 +typeOf: dcs:MeSHSupplementaryConcept +dcid: C:mesh_record->Record_dcid +identifier: C:mesh_record->RecordID +name: C:mesh_record->RecordName +dateCreated: C:mesh_record->DateCreated +dateRevised: C:mesh_record->DateRevised +parent: E:mesh_record->E1 From 893b2c2837eaad96868d910b1d927e97259e16dc Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 16:39:37 -0700 Subject: [PATCH 07/19] Add files via upload --- .../scripts/download.sh | 18 + .../scripts/format_mesh_desc.py | 410 ++++++++++++++++++ .../scripts/format_mesh_pa.py | 161 +++++++ .../scripts/format_mesh_qual.py | 314 ++++++++++++++ .../scripts/format_mesh_supp.py | 174 ++++++++ .../Medical_Subject_Headings/scripts/run.sh | 19 + .../Medical_Subject_Headings/scripts/tests.sh | 85 ++++ 7 files changed, 1181 insertions(+) create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py create mode 100644 
scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh create mode 100644 scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh new file mode 100644 index 0000000000..d4c0de277c --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +mkdir -p input; cd input + +# downloads the mesh xml file +curl -o mesh-desc.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2024.xml + +# downloads the mesh pharmacological action xml file +curl -o mesh-pa.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/pa2024.xml + +# downloads the mesh qualifier xml file +curl -o mesh-qual.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/qual2024.xml + +# downloads the mesh record xml file +curl -o mesh-supp.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2024.xml + +# downloads the pubchem compound ID and name csv file +curl -o mesh-pubchem.csv https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py new file mode 100644 index 0000000000..0827920488 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py @@ -0,0 +1,410 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Suhana Bedi +Date: 09/17/2021 +Name: format_mesh_desc.py +Edited By: Samantha Piekos +Last Modified: 03/11/24 +Description: converts nested .xml to .csv and further breaks down the csv +into eight different csvs: a node csv and a mapping csv for each of the +concept, descriptor, qualifier, and term node types (see the FILEPATH_* +constants below). +@file_input: input .xml downloaded from NCBI +@file_output: eight formatted csv files ready for import into data commons kg + with corresponding tmcf files +''' + + +# set up environment +import sys +import pandas as pd +import numpy as np +from xml.etree.ElementTree import parse + + +# declare universal variables +FILEPATH_MESH_CONCEPT = 'CSVs/mesh_desc_concept.csv' +FILEPATH_MESH_CONCEPT_MAPPING = 'CSVs/mesh_desc_concept_mapping.csv' +FILEPATH_MESH_DESCRIPTOR = 'CSVs/mesh_desc_descriptor.csv' +FILEPATH_MESH_DESCRIPTOR_MAPPING = 'CSVs/mesh_desc_descriptor_mapping.csv' +FILEPATH_MESH_QUALIFIER = 'CSVs/mesh_desc_qualifier.csv' +FILEPATH_MESH_QUALIFIER_MAPPING = 'CSVs/mesh_desc_qualifier_mapping.csv' +FILEPATH_MESH_TERM = 'CSVs/mesh_desc_term.csv' +FILEPATH_MESH_TERM_MAPPING = 'CSVs/mesh_desc_term_mapping.csv' + + +def format_mesh_xml(mesh_xml): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + mesh_xml = xml file to be parsed + Returns: + pandas df after parsing + """ + document = parse(mesh_xml) + d = [] + ## column names for parsed xml tags + dfcols = [ + 'DescriptorID', 'DescriptorName', 'DateCreated-Year', + 'DateCreated-Month', 'DateCreated-Day',
'DateRevised-Year', + 'DateRevised-Month', 'DateRevised-Day', 'DateEstablished-Year', + 'DateEstablished-Month', 'DateEstablished-Day', 'QualifierID', + 'QualifierName', 'QualifierAbbreviation', 'ConceptID', 'ConceptName', + 'ScopeNote', 'TermID', 'TermName', 'TreeNumber', 'NLMClassificationNumber' + ] + df = pd.DataFrame(columns=dfcols) + for item in document.iterfind('DescriptorRecord'): + ## parses the Descriptor ID + d1 = item.findtext('DescriptorUI') + ## parses the Descriptor Name + elem = item.find(".//DescriptorName") + d1_name = elem.findtext("String") + ## parses the Date of Creation + date_created = item.find(".//DateCreated") + if date_created is None: + d1_created_year = np.nan + d1_created_month = np.nan + d1_created_day = np.nan + else: + d1_created_year = date_created.findtext("Year") + d1_created_month = date_created.findtext("Month") + d1_created_day = date_created.findtext("Day") + ## parses the Date of Revision + date_revised = item.find(".//DateRevised") + if date_revised is None: + d1_revised_year = np.nan + d1_revised_month = np.nan + d1_revised_day = np.nan + else: + d1_revised_year = date_revised.findtext("Year") + d1_revised_month = date_revised.findtext("Month") + d1_revised_day = date_revised.findtext("Day") + ## parses the Date of Establishment + date_established = item.find(".//DateEstablished") + if date_established is None: + d1_established_year = np.nan + d1_established_month = np.nan + d1_established_day = np.nan + else: + d1_established_year = date_established.findtext("Year") + d1_established_month = date_established.findtext("Month") + d1_established_day = date_established.findtext("Day") + tree_list = item.find(".//TreeNumberList") + if tree_list is None: + tree_num = np.nan + else: + tree_num = [] + for tree in tree_list.findall("TreeNumber"): + ## parses each Tree Number (findtext would return only the first one) + tree_num.append(tree.text) + ## parses the NLM Classification Number + nlm_num = item.findtext("NLMClassificationNumber") + if nlm_num is None: +
nlm_num = np.nan + quantifier_list = item.find(".//AllowableQualifiersList") + qualID = [] + qual_name = [] + qual_abbr = [] + if quantifier_list is None: + qualID.append(np.nan) + qual_name.append(np.nan) + qual_abbr.append(np.nan) + else: + l1 = quantifier_list.findall(".//AllowableQualifier") + for i in range(len(l1)): + l2 = l1[i].find(".//QualifierReferredTo") + ## parses the Qualifier ID + qualID.append(l2.findtext("QualifierUI")) + ## parses the Qualifier Name + l3 = l2.find(".//QualifierName") + qual_name.append(l3.findtext("String")) + ## parses the Qualifier Abbreviation + qual_abbr.append(l1[i].findtext("Abbreviation")) + + concept_list = item.find(".//ConceptList") + if concept_list is None: + conceptID = np.nan + conceptName = np.nan + scopeNote = np.nan + termUI = np.nan + termName = np.nan + else: + c1 = concept_list.findall(".//Concept") + conceptID = [] + conceptName = [] + scopeNote = [] + termUI = [] + termName = [] + for i in range(len(c1)): + ## parses the Concept ID + conceptID.append(c1[i].findtext("ConceptUI")) + ## parses the Scope Note + scopeNote.append(c1[i].findtext("ScopeNote")) + ## parses the Concept Name + c2 = c1[i].find(".//ConceptName") + conceptName.append(c2.findtext("String")) + c3 = c1[i].find(".//TermList") + c4 = c3.findall(".//Term") + subtermUI = [] + subtermName = [] + for j in range(len(c4)): + ## parses the Term ID + subtermUI.append(c4[j].findtext("TermUI")) + subtermName.append(c4[j].findtext("String")) + termUI.append(subtermUI) + termName.append(subtermName) + d.append({'DescriptorID':d1, 'DescriptorName':d1_name, 'DateCreated-Year':d1_created_year, +'DateCreated-Month':d1_created_month, 'DateCreated-Day':d1_created_day, 'DateRevised-Year':d1_revised_year, +'DateRevised-Month':d1_revised_month, 'DateRevised-Day':d1_revised_day, 'DateEstablished-Year':d1_established_year, +'DateEstablished-Month':d1_established_month, 'DateEstablished-Day':d1_established_day, +'QualifierID':qualID, 'QualifierName':qual_name, 
'QualifierAbbreviation':qual_abbr, +'ConceptID':conceptID, 'ConceptName':conceptName, 'ScopeNote':scopeNote, 'TermID':termUI, +'TermName':termName, 'TreeNumber':tree_num, 'NLMClassificationNumber':nlm_num}) + + df = pd.DataFrame(d) + return df + + +def date_modify(df): + """ + Modifies the dates in a df, into an ISO format + Args: + df = df with date columns + Returns: + df with modified date columns + + """ + df['DateCreated'] = df['DateCreated-Year'].astype( + str) + "-" + df['DateCreated-Month'].astype( + str) + "-" + df['DateCreated-Day'].astype(str) + df['DateRevised'] = df['DateRevised-Year'].astype( + str) + "-" + df['DateRevised-Month'].astype( + str) + "-" + df['DateRevised-Day'].astype(str) + df['DateEstablished'] = df['DateEstablished-Year'].astype( + str) + "-" + df['DateEstablished-Month'].astype( + str) + "-" + df['DateEstablished-Day'].astype(str) + ## adds quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['DateCreated', 'DateRevised', 'DateEstablished'] + for col in col_names_quote: + df[col] = df[col].replace(["nan-nan-nan"],np.nan) + ## drop repetitive column values + df = df.drop(columns=[ + 'DateCreated-Year', 'DateCreated-Month', 'DateCreated-Day', + 'DateRevised-Year', 'DateRevised-Month', 'DateRevised-Day', + 'DateEstablished-Year', 'DateEstablished-Month', 'DateEstablished-Day' + ]) + return df + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + """ + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. 
+ """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def write_decriptor_df_to_csvs(df): + # write descriptor node info to a csv + df_descriptor = df.drop(columns=['DescriptorParentID']).drop_duplicates() + df_descriptor.to_csv(FILEPATH_MESH_DESCRIPTOR, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['Descriptor_dcid', 'DescriptorParentID']].dropna().drop_duplicates() + df_mapping.to_csv(FILEPATH_MESH_DESCRIPTOR_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_descriptor_df(df): + # prepares csv specific to descriptor nodes and their properties + # drop columns not required for the descriptor file + df = df.drop(columns=[ + 'QualifierID', 'QualifierName', 'QualifierAbbreviation', 'ConceptID', + 'ConceptName', 'TermID', 'TermName' + ]) + # retrieve first value from ScopeNote list + df['ScopeNote'] = df['ScopeNote'].str[0] + # explode the TreeNumber column + df = df.explode('TreeNumber') + # create descriptor dcid + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + # quote the text type columns exactly once: format_text_strings adds the + # surrounding quotes, so DescriptorName must not be pre-quoted here + col_names_quote = ['DescriptorName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['DescriptorName'] = df['DescriptorName'].fillna(df['DescriptorID']) + # retrieve the descriptor parent ID using tree number + df['DescriptorParentID'] = df['TreeNumber'].str[:-4] + map_dict = dict(zip(df['TreeNumber'], df['Descriptor_dcid'])) + df = df.replace({"DescriptorParentID": map_dict}) + df["DescriptorParentID"] = np.where(df['DescriptorParentID'].str[0] == "b", df["DescriptorParentID"], np.nan) + # write
descriptor data to csv files + write_decriptor_df_to_csvs(df) + return + + +def format_qualifier_df(df): + # prepares a csv specific to qualifier nodes and their properties + df = df.drop(columns=[ + 'DescriptorID', 'DescriptorName', 'ConceptID', 'ConceptName', + 'ScopeNote', 'TermID', 'TermName', 'TreeNumber', + 'NLMClassificationNumber', 'DateCreated', 'DateRevised', + 'DateEstablished' + ]) + # Explode the Qualifier columns + explode_cols = ['QualifierID', 'QualifierName', 'QualifierAbbreviation'] + df = df.explode(explode_cols) + # remove missing qualifier rows + df = df[df['QualifierID'].notna()] + # add quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['QualifierName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['QualifierName'] = df['QualifierName'].fillna(df['QualifierID']) + # create qualifier dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierID'].astype(str) + # drop duplicate rows + df = df.drop_duplicates() + # write df to csv file + df.to_csv(FILEPATH_MESH_QUALIFIER, doublequote=False, escapechar='\\') + return + + +def format_qualifier_mapping_df(df): + # processes a csv containing the mappings between descriptors and qualifiers + # drops columns not required for the qualifier file + df = df.drop(columns=[ + 'DescriptorName', 'ConceptID', 'ConceptName', 'ScopeNote', + 'TermID', 'TermName', 'TreeNumber', 'NLMClassificationNumber', + 'QualifierName', 'QualifierAbbreviation', 'DateCreated', + 'DateRevised', 'DateEstablished' + ]) + # Explode the Qualifier ID column + df = df.explode('QualifierID') + # drop duplicate rows and rows with missing values + df = df.dropna() + df = df.drop_duplicates() + # create qualifier and descriptor dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierID'].astype(str) + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + # write df to csv file + df.to_csv(FILEPATH_MESH_QUALIFIER_MAPPING, doublequote=False, escapechar='\\') + 
return + + +def write_concpet_df_to_csvs(df): + # write descriptor node info to a csv + df_concept = df.drop(columns=['DescriptorID']) + df_concept.to_csv(FILEPATH_MESH_CONCEPT, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['Concept_dcid', 'DescriptorID']].dropna().drop_duplicates() + df_mapping['Descriptor_dcid'] = 'bio/' + df_mapping['DescriptorID'].astype(str) # generate Descriptor dcid + df_mapping.to_csv(FILEPATH_MESH_CONCEPT_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_concept_df(df): + # writes df specific to concept nodes and properties + df = df.drop(columns=[ + 'DescriptorName', 'QualifierID', 'QualifierName', + 'QualifierAbbreviation', 'TermID', 'TermName', 'TreeNumber', + 'NLMClassificationNumber', 'DateCreated', 'DateRevised', + 'DateEstablished' + ]) + # explode on Concept columns + explode_cols = ['ConceptID', 'ConceptName', 'ScopeNote'] + df = df.explode(explode_cols) + # reformat missing values remove and trailing white space in ScopeNote + df['ScopeNote'] = df['ScopeNote'].replace('None', '') + # adds quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['ConceptName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['ConceptName'] = df['ConceptName'].fillna(df['ConceptID']) + # generates concept and descriptor dcids + df['Concept_dcid'] = 'bio/' + df['ConceptID'].astype(str) + # write df to csvs + write_concpet_df_to_csvs(df) + return + + +def write_term_df_to_csvs(df): + # write descriptor node info to a csv + df_term = df.drop(columns=['ConceptID']).drop_duplicates() + df_term.to_csv(FILEPATH_MESH_TERM, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['ConceptID', 'Term_dcid']].dropna().drop_duplicates() + df_mapping['Concept_dcid'] = 'bio/' + df_mapping['ConceptID'].astype(str) # generate Concept dcid + 
df_mapping.to_csv(FILEPATH_MESH_TERM_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_term_df(df): + # prepares csv specific to term nodes and their properties + df = df.drop(columns=[ + 'QualifierID', 'QualifierName', 'QualifierAbbreviation', 'ScopeNote', + 'DescriptorName', 'DescriptorID', 'TreeNumber', 'NLMClassificationNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'ConceptName' + ]) + # explode on concept and term and then again on term columns + explode_cols = ['ConceptID', 'TermID', 'TermName'] + df = df.explode(explode_cols) + explode_cols_2 = ['TermID', 'TermName'] + df = df.explode(explode_cols_2) + # add quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['TermName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['TermName'] = df['TermName'].fillna(df['TermID']) + # generate term dcids + df['Term_dcid'] = 'bio/' + df['TermID'].astype(str) + # write df to csvs + write_term_df_to_csvs(df) + return + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml to pandas df + df = format_mesh_xml(file_input) + df = date_modify(df) + # format csvs corresponding to different mesh node types + format_descriptor_df(df) + format_qualifier_df(df) + format_qualifier_mapping_df(df) + format_concept_df(df) + format_term_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py new file mode 100644 index 0000000000..64bf86bf46 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py @@ -0,0 +1,161 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Samantha Piekos +Date: 03/06/24 +Name: format_mesh_pa.py +Description: converts nested .xml to .csv and further breaks down the csv +into a csv about the pharmacological actions associated with drugs. +@file_input: input .xml downloaded from NCBI +@file_output: formatted csv files ready for import into data commons kg + with corresponding tmcf file +''' + +# set up environment +#from lxml import etree +import xml.etree.ElementTree as ET +import numpy as np +import pandas as pd +import sys + + +# declare universal variables +FILEPATH_OUTPUT_PREFIX = 'CSVs/mesh_pharmacological_action_' + + +def extract_data_from_xml(xml_filepath): + """ + extract data on descriptors and substances from the xml file + and store in list + """ + # read in xml data + with open(xml_filepath, 'r') as file: + xml_data = file.read() + root = ET.fromstring(xml_data) + data = [] # List to store extracted data + + for action in root.findall('PharmacologicalAction'): + descriptor_ui = action.find('DescriptorReferredTo/DescriptorUI').text + descriptor_name = action.find('DescriptorReferredTo/DescriptorName/String').text + + record_ui_data = [] + record_name_data = [] + for substance in action.find('PharmacologicalActionSubstanceList'): + record_ui = substance.find('RecordUI').text + record_name = substance.find('RecordName/String').text + record_ui_data.append(record_ui) + record_name_data.append(record_name) + + data.append({'DescriptorUI': descriptor_ui, 'DescriptorName': descriptor_name,\ + 'RecordUI': record_ui_data, 'RecordName': record_name_data}) + + return data + + 
+def format_mesh_xml(xml_data): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + xml_data = xml file to be parsed + Returns: + pandas df after parsing + """ + # parse xml file + data = extract_data_from_xml(xml_data) + # initiate pandas df + df = pd.DataFrame(data) + # Explode the 'Substances' column + df = df.explode(['RecordUI', 'RecordName']) + # Reset the index + df = df.reset_index(drop=True) + return df + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + """ + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. + """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def get_first_letter(data_type): + # returns the first letter in the mesh unique id based on data type + if data_type == 'descriptor': + return 'D' + if data_type == 'record': + return 'C' + print('Warning! 
Unexpected MeSH data type in RecordUI column!') + return + + +def generate_mesh_type_specific_csv(df, data_type): + # get expected first letter of RecordUI for mesh data type of interest + first_letter = get_first_letter(data_type) + # filter for rows containing RecordUIs that are the data type of interest + df = df[df['RecordUI'].str[0] == first_letter] + # save df to csv + filepath_output = FILEPATH_OUTPUT_PREFIX + data_type + '.csv' + df.to_csv(filepath_output, doublequote=False, escapechar='\\') + return + + +def format_pharmacological_action_df(df): + """ + Formats strings and dcids for import into the kg + """ + # adds quotes to text type columns and replaces empty strings with np.nan + col_names_quote = ['DescriptorName', 'RecordName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['DescriptorName'] = df['DescriptorName'].fillna(df['DescriptorUI']) + df['RecordName'] = df['RecordName'].fillna(df['RecordUI']) + # create descriptor dcids and dcids for corresponding descriptor or records + df['Descriptor_dcid'] = 'bio/' + df['DescriptorUI'].astype(str) + df['dcid'] = 'bio/' + df['RecordUI'].astype(str) + # drops the duplicate rows + df = df.drop_duplicates() + # create csvs mapping pharmacological actions to descriptors or supplementary records + generate_mesh_type_specific_csv(df, 'descriptor') + generate_mesh_type_specific_csv(df, 'record') + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml to pandas df + df = format_mesh_xml(file_input) + # format csvs for ingestion into biomedical data commons + format_pharmacological_action_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py new file mode 100644 index 0000000000..36fff9bc06 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py @@
-0,0 +1,314 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Samantha Piekos +Date: 03/07/24 +Name: format_mesh_qual.py +Description: converts nested .xml to .csv and further breaks down the csv +into four csvs containing information on qualifiers, concepts, terms or +concept mappings to other concepts. +@file_output: formatted csv files ready for import into data commons kg + with corresponding tmcf file +''' + +# set up environment +import numpy as np +import pandas as pd +import sys +import xml.etree.ElementTree as ET + +# declare universal variables +FILEPATH_MESH_CONCEPT = 'CSVs/mesh_qual_concept.csv' +FILEPATH_MESH_CONCEPT_MAPPING = 'CSVs/mesh_qual_concept_mapping.csv' +FILEPATH_MESH_QUALIFIER = 'CSVs/mesh_qual_qualifier.csv' +FILEPATH_MESH_TERM = 'CSVs/mesh_qual_term.csv' + + +def parse_date(date_element, col): + # extract date elements from xml and format + # return as string YYYY-MM-DD value + if date_element is not None: + year = date_element.find('Year').text + month = date_element.find('Month').text + day = date_element.find('Day').text + date = ('-').join([year, month, day]) + return date + return None + + +def parse_tree_list(tree): + # extract all tree numbers and store in list + tree_data = [] + if tree.find('TreeNumberList') is not None: + for number in tree.find('TreeNumberList'): + if number.text is not None: + tree_number = number.text + tree_data.append(tree_number) + else: +
tree_data.append('') + return tree_data + + +def parse_associated_concepts(concept): + # extract all concept relationship pairs storing pairs as lists + # return all pairs in list (nested list) + list_concepts = [] + if concept.find('ConceptRelationList') is not None: + for relation in concept.find('ConceptRelationList'): + concept1 = relation.find('Concept1UI').text + concept2 = relation.find('Concept2UI').text + list_concepts.append([concept1, concept2]) + return list_concepts + + +def parse_booleans(tree, query): + # extract boolean values and convert to boolean + data_str = tree.get(query) + data = data_str == 'Y' + return data + + +def handle_potentially_missing_col(data, col): + # extract data element that may be missing from xml + if data.find(col) is not None: + return data.find(col).text + return '' + + +def parse_terms(concept): + # store all terms data in dictionary with values for terms + # associated with a given concept stored in lists as values + terms = { + 'TermUI': [], 'TermName': [], 'Abbreviation': [],\ + 'Display': [], 'DateCreated': [],\ + 'is_concept_preferred_term': [], 'is_permuted_term': [],\ + 'is_record_preferred_term': [] + } + for term in concept.find('TermList'): + terms['TermUI'].append(term.find('TermUI').text) + terms['TermName'].append(term.find('String').text) + terms['DateCreated'].append(parse_date(term.find('DateCreated'), 'DateCreated')) + terms['Abbreviation'].append(handle_potentially_missing_col(term, 'Abbreviation')) + terms['Display'].append(handle_potentially_missing_col(term, 'EntryVersion')) + terms['is_concept_preferred_term'].append(parse_booleans(term, 'ConceptPreferredTermYN')) + terms['is_permuted_term'].append(parse_booleans(term, 'IsPermutedTermYN')) + terms['is_record_preferred_term'].append(parse_booleans(term, 'RecordPreferredTermYN')) + return terms + + +def format_mesh_xml(xml_filepath): + """ + extract data on qualifiers, concepts and terms from the xml file + and return it as a pandas df + """ + # read in xml data + with
open(xml_filepath, 'r') as file: + xml_data = file.read() + root = ET.fromstring(xml_data) + data = [] # List to store extracted data + + for action in root.findall('QualifierRecord'): + # parse qualifier data + qualifier_ui = action.find('QualifierUI').text + qualifier_name = action.find('QualifierName/String').text + annotation = action.find('Annotation').text + history_note = action.find('HistoryNote').text + tree_list = action.find('TreeNumberList') + tree_data = [number.text for number in tree_list.findall('TreeNumber')] + tree_data = ','.join(tree_data) + + # parse dates + date_created = parse_date(action.find('DateCreated'), 'DateCreated') + date_revised = parse_date(action.find('DateRevised'), 'DateRevised') + date_established = parse_date(action.find('DateEstablished'), 'DateEstablished') + + # parse concept info + concept_ui = [] + concept_name = [] + scope_note = [] + associated_concepts = [] + is_preferred_concept = [] + terms = [] + for concept in action.find('ConceptList'): + concept_ui.append(concept.find('ConceptUI').text) + concept_name.append(concept.find('ConceptName/String').text) + associated_concepts.append(parse_associated_concepts(concept)) + is_preferred_concept.append(parse_booleans(concept, 'PreferredConceptYN')) + terms.append(parse_terms(concept)) + scope_note.append(handle_potentially_missing_col(concept, 'ScopeNote')) + + data.append({ + 'QualifierUI': qualifier_ui, 'QualifierName': qualifier_name, + 'Annotation': annotation, 'HistoryNote': history_note, + 'TreeNumber': tree_data, 'DateCreated': date_created, + 'DateRevised': date_revised, 'DateEstablished': date_established, + 'ConceptUI': concept_ui, 'ConceptName': concept_name, + 'ScopeNote': scope_note, 'AssociatedConcepts': associated_concepts, + 'IsPreferredConcept': is_preferred_concept, 'Terms': terms + }) + + return pd.DataFrame(data) + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + 
""" + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. + """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def format_qualifier_df(df): + # create csv specific to qualifiers and their properties + # drop columns not required for the qualifier file + df = df.drop(columns=[ + 'ConceptUI', 'ConceptName', 'ScopeNote', 'AssociatedConcepts', + 'IsPreferredConcept', 'Terms' + ]) + # remove missing qualifier rows + df = df[df['QualifierUI'].notna()] + # adds quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['QualifierName', 'Annotation', 'HistoryNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['QualifierName'] = df['QualifierName'].fillna(df['QualifierUI']) + # creates qualifier dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierUI'].astype(str) + # drops the duplicate rows + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_QUALIFIER, doublequote=False, escapechar='\\') + return df + + +def format_concept_df(df): + # create csv specific to concept nodes and their properties + # drop columns not required for the qualifier file + df = df.drop(columns=[ + 'QualifierName', 'Annotation', 'HistoryNote', 'TreeNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'Terms', + 'AssociatedConcepts' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'ConceptName', 'ScopeNote', 'IsPreferredConcept'] + df = df.explode(explode_cols) + # adds quote from text type columns and replaces "nan" with qualifier ID + col_names_quote = 
['ConceptName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['ConceptName'] = df['ConceptName'].fillna(df['ConceptUI']) + # create qualifier and concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Qualifier_dcid'] = 'bio/' + df['QualifierUI'].astype(str) + # drop the duplicate rows + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_CONCEPT, doublequote=False, escapechar='\\') + return df + + +def format_concept_relations_df(df): + # create csv specific to mapping concept to other mesh data types + # drop columns not required for the concept mapping file + df = df.drop(columns=[ + 'QualifierName', 'Annotation', 'TreeNumber', 'HistoryNote', + 'DateCreated', 'DateRevised', 'DateEstablished', 'Terms', + 'ScopeNote', 'ConceptName' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'IsPreferredConcept', 'AssociatedConcepts'] + df = df.explode(explode_cols) + df = df.explode('AssociatedConcepts') + df = df.explode('AssociatedConcepts') + df = df[df['ConceptUI'] != df['AssociatedConcepts']] + df = df[df['IsPreferredConcept'] == False] + # create concept and preferred concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Preferred_Concept_dcid'] = 'bio/' + df['AssociatedConcepts'].astype(str) + # drop the duplicate rows and extra columns + df = df.drop(['QualifierUI', 'IsPreferredConcept', 'ConceptUI', 'AssociatedConcepts'], axis=1) + df = df.drop_duplicates() + rows_to_drop = df['Concept_dcid'] == df['Preferred_Concept_dcid'] + df = df[~rows_to_drop] + # write df to csv + df.to_csv(FILEPATH_MESH_CONCEPT_MAPPING, doublequote=False, escapechar='\\') + return df + + +def format_terms_df(df): + # create formatted csv specific to Term nodes and their properties + # drop columns not required for the term file + df = df.drop(columns=[ + 'QualifierName', 'Annotation',
'HistoryNote', 'TreeNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'ScopeNote', + 'ConceptName', 'IsPreferredConcept', 'AssociatedConcepts' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'Terms'] + df = df.explode(explode_cols).reset_index() + df2 = pd.json_normalize(df['Terms']) + df = pd.concat([df.drop(['Terms'], axis=1), df2], axis=1) + df = df.drop(['index'], axis=1) + explode_cols = list(df2.columns) + df = df.explode(explode_cols) + # adds quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['TermName', 'Abbreviation', 'Display'] + df = format_text_strings(df, col_names_quote) + # creates qualifier and concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Term_dcid'] = 'bio/' + df['TermUI'].astype(str) + # drops the duplicate rows and extra columns + df = df.drop(['QualifierUI', 'is_permuted_term'], axis=1) + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_TERM, doublequote=False, escapechar='\\') + return + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml file to pandas df + df = format_mesh_xml(file_input) + # format CSV files for each level of the xml file + format_qualifier_df(df) + format_concept_df(df) + format_concept_relations_df(df) + format_terms_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py new file mode 100644 index 0000000000..e5bd649639 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py @@ -0,0 +1,174 @@ +# Copyright 2022 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Suhana Bedi +Date: 08/02/2022 +Edited By: Samantha Piekos +Last Edited: 03/06/2024 +Name: format_mesh_supp.py +Description: converts nested .xml to .csv and the csv entails the relationship between +the descriptor record ID and descriptor ID for MESH terms +@file_input: input .xml downloaded from NCBI +''' + + +# set up environment +import sys +import pandas as pd +import numpy as np +from xml.etree.ElementTree import parse + + +# declare universal variables +FILEPATH_MESH_PUBCHEM_MAPPING = 'CSVs/mesh_pubchem_mapping.csv' +FILEPATH_RECORD = 'CSVs/mesh_record.csv' + +def read_mesh_record(mesh_record_xml): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + mesh_xml = xml file to be parsed + Returns: + df = pandas df after parsing + """ + document = parse(mesh_record_xml) + d = [] + dfcols = [ + 'RecordID', 'RecordName', 'DateCreated-Year', + 'DateCreated-Month', 'DateCreated-Day', 'DateRevised-Year', + 'DateRevised-Month', 'DateRevised-Day', 'DescriptorID' + ] + df = pd.DataFrame(columns=dfcols) + for item in document.iterfind('SupplementalRecord'): + d1 = item.findtext('SupplementalRecordUI') + elem = item.find(".//SupplementalRecordName") + d1_name = elem.findtext("String") + date_created = item.find(".//DateCreated") + if date_created is None: + d1_created_year = np.nan + d1_created_month = np.nan + d1_created_day = np.nan + else: + d1_created_year = date_created.findtext("Year") + d1_created_month = date_created.findtext("Month") + d1_created_day = date_created.findtext("Day") + date_revised = 
item.find(".//DateRevised") + if date_revised is None: + d1_revised_year = np.nan + d1_revised_month = np.nan + d1_revised_day = np.nan + else: + d1_revised_year = date_revised.findtext("Year") + d1_revised_month = date_revised.findtext("Month") + d1_revised_day = date_revised.findtext("Day") + heading_list = item.find(".//HeadingMappedToList") + headID = [] + if heading_list is None: + headID.append(np.nan) + else: + l1 = heading_list.findall(".//HeadingMappedTo") + for i in range(len(l1)): + l2 = l1[i].find(".//DescriptorReferredTo") + headID.append(l2.findtext("DescriptorUI")) + d.append({'RecordID':d1, 'RecordName':d1_name, 'DateCreated-Year':d1_created_year, 'DateCreated-Month':d1_created_month, 'DateCreated-Day':d1_created_day, + 'DateRevised-Year':d1_revised_year, 'DateRevised-Month':d1_revised_month, 'DateRevised-Day':d1_revised_day, + 'DescriptorID':headID}) + df = pd.DataFrame(d) + return df + + +def format_dates(df): + """ + Modifies the dates in a df, into an ISO format + Args: + df1 = df with date columns + Returns: + df with modified date columns + + """ + df['DateCreated'] = df['DateCreated-Year'].astype( + str) + "-" + df['DateCreated-Month'].astype( + str) + "-" + df['DateCreated-Day'].astype(str) + df['DateRevised'] = df['DateRevised-Year'].astype( + str) + "-" + df['DateRevised-Month'].astype( + str) + "-" + df['DateRevised-Day'].astype(str) + col_names_quote = ['DateCreated', 'DateRevised'] + ## adds quotes from text type columns and replaces "nan" with np.nan + for col in col_names_quote: + df[col] = df[col].replace(["nan-nan-nan"],np.nan) + ## drop repetitive column values + df = df.drop(columns=[ + 'DateCreated-Year', 'DateCreated-Month', 'DateCreated-Day', + 'DateRevised-Year', 'DateRevised-Month', 'DateRevised-Day' + ]) + return df + + +def format_record_csv(df): + """ + Formats the MESH record ID, record name and corresponding descriptor IDs and DCIDs + Args: + df: pandas dataframe with zipped and unformatted descriptor IDs + + Returns: + 
df : pandas dataframe with formatted and unzipped descriptor IDs corresponding to record ID + """ + # Explode the DescriptorID column + df = df.explode('DescriptorID') + # Clean up DescriptorID values (remove leading/trailing '*') + df['DescriptorID'] = df['DescriptorID'].str.strip('*') + ## removes special characters from the descriptor column + df['DescriptorID'] = df['DescriptorID'].str.replace(r'\W', '', regex=True) + ## puts quotes around record name string values + df['RecordName'] = '"' + df.RecordName + '"' + ## generates record and descriptor dcids + df['Record_dcid'] = 'bio/' + df['RecordID'].astype(str) + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + df.to_csv(FILEPATH_RECORD, doublequote=False, escapechar='\\') + return df + + +def format_pubchem_mesh_mapping(pubchem_file, df_mesh): + # read in pubchem mesh mapping csv file + df_pubchem = pd.read_csv(pubchem_file, on_bad_lines='skip', sep='\t', header = None, names = ['CID', 'CompoundName']) + # add quotes to compound names so they match the quoted record names + df_pubchem['CompoundName'] = '"' + df_pubchem.CompoundName + '"' + # merge with mesh record df on names + df_match = pd.merge(df_mesh, df_pubchem, left_on='RecordName', right_on='CompoundName', how = 'inner') + # filter for desired columns in output csv + df_match = df_match.filter(['CID', 'RecordID', 'RecordName', 'Record_dcid'], axis=1) + # format compound dcids + df_match['CID_dcid'] = 'chem/CID' + df_match['CID'].astype(str) + # drop duplicates + df_match = df_match.drop_duplicates() + # write df to csv + df_match.to_csv(FILEPATH_MESH_PUBCHEM_MAPPING, doublequote=False, escapechar='\\') + return + + +def main(): + # read in files + file_input = sys.argv[1] + file_pubchem = sys.argv[2] + # convert mesh record xml file to pandas df + df = read_mesh_record(file_input) + df = format_dates(df) + df_mesh = format_record_csv(df) + # create pubchem mesh mapping csv and mesh record csv + format_pubchem_mesh_mapping(file_pubchem, df_mesh) + + +if __name__ == "__main__": +
main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh new file mode 100644 index 0000000000..77c66a9f6c --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +mkdir -p CSVs + +# extracts the mesh descriptor, term, concept, qualifier data into 8 csvs +python3 scripts/format_mesh_desc.py input/mesh-desc.xml +echo "MeSH descriptor file processed" + +# extracts pharmacological actions associated with substances +python3 scripts/format_mesh_pa.py input/mesh-pa.xml +echo "MeSH pharmacological action file processed" + +# extracts qualifier data +python3 scripts/format_mesh_qual.py input/mesh-qual.xml +echo "MeSH qualifier file processed" + +# extracts the mesh records and maps them to pubchem IDs +python3 scripts/format_mesh_supp.py input/mesh-supp.xml input/mesh-pubchem.csv +echo "MeSH record file and pubchem mappings processed" diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh new file mode 100644 index 0000000000..4ae2f6b959 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+# +# Author: Samantha Piekos +# Date: 03/11/2024 +# Name: tests +# Description: This file runs the Data Commons Java tool to run standard +# tests on tmcf + CSV pairs for the NIH NLM MeSH import. This assumes that +# the user has Java Runtime Environment (JRE) installed, which is needed to +# locally install Data Commons test tool (v. 0.1-alpha.1k) prior to calling +# the tool to evaluate tmcf + CSV pairs. +# + +# download data commons java test tool version 0.1-alpha.1k +mkdir -p tmp; cd tmp +wget https://github.com/datacommonsorg/import/releases/download/0.1-alpha.1k/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar +cd .. + +# run tests on desc file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_concept.tmcf CSVs/mesh_desc_concept.csv +mv dc_generated mesh_desc_concept + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_concept_mapping.tmcf CSVs/mesh_desc_concept_mapping.csv +mv dc_generated mesh_desc_concept_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_descriptor.tmcf CSVs/mesh_desc_descriptor.csv +mv dc_generated mesh_desc_descriptor + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_descriptor_mapping.tmcf CSVs/mesh_desc_descriptor_mapping.csv +mv dc_generated mesh_desc_descriptor_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_qualifier.tmcf CSVs/mesh_desc_qualifier.csv +mv dc_generated mesh_desc_qualifier + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_qualifier_mapping.tmcf CSVs/mesh_desc_qualifier_mapping.csv +mv dc_generated mesh_desc_qualifier_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_term.tmcf CSVs/mesh_desc_term.csv +mv dc_generated
mesh_desc_term + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_term_mapping.tmcf CSVs/mesh_desc_term_mapping.csv +mv dc_generated mesh_desc_term_mapping + + +# run tests on pa file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pharmacological_action_descriptor.tmcf CSVs/mesh_pharmacological_action_descriptor.csv +mv dc_generated mesh_pharmacological_action_descriptor + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pharmacological_action_record.tmcf CSVs/mesh_pharmacological_action_record.csv +mv dc_generated mesh_pharmacological_action_record + + +# run tests on qual file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_concept.tmcf CSVs/mesh_qual_concept.csv +mv dc_generated mesh_qual_concept + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_concept_mapping.tmcf CSVs/mesh_qual_concept_mapping.csv +mv dc_generated mesh_qual_concept_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_qualifier.tmcf CSVs/mesh_qual_qualifier.csv +mv dc_generated mesh_qual_qualifier + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_term.tmcf CSVs/mesh_qual_term.csv +mv dc_generated mesh_qual_term + + +# run tests on record and pubchem files csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_record.tmcf CSVs/mesh_record.csv +mv dc_generated mesh_record + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pubchem_mapping.tmcf CSVs/mesh_pubchem_mapping.csv +mv dc_generated mesh_pubchem_mapping From 61878012ba727f023794025d7f7173d442fee1eb Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 
16:40:30 -0700 Subject: [PATCH 08/19] Update README.md fix typo --- scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index efb6857c72..bda70a2e7a 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -85,7 +85,7 @@ Any works found on National Library of Medicine (NLM) Web sites may be freely us [`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into eight formatted csv files, which each can be imported alongside it's matching tMCF. -[`format_mesh_pan.py`](scripts/format_mesh_pa.py) converts the original csv file into two formatted csv files, which can be imported alongside it's matching tMCF. +[`format_mesh_pa.py`](scripts/format_mesh_pa.py) converts the original csv file into two formatted csv files, which can be imported alongside it's matching tMCF. [`format_mesh_qual.py`](scripts/format_mesh_qual.py) converts the original xml into four formatted csv files, which each can be imported alongside it's matching tMCF. 
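Note on tests.sh (added earlier in this series): the script repeats the same `java -jar ... lint` call and `mv dc_generated <name>` step for every tMCF + CSV pair. The following is a minimal Python sketch of building those calls in a loop. The helper names `lint_jobs` and `run_lint_jobs` are hypothetical and not part of the import; the jar path, the `lint` subcommand, and the `tMCFs/` and `CSVs/` directories are taken from tests.sh as written.

```python
import shutil
import subprocess

# Jar path as downloaded by tests.sh.
JAR = "tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar"


def lint_jobs(names, jar=JAR):
    """Build one (command, output_dir) pair per tMCF + CSV pair.

    Hypothetical helper: mirrors the repeated lines in tests.sh.
    """
    jobs = []
    for name in names:
        cmd = ["java", "-jar", jar, "lint",
               f"tMCFs/{name}.tmcf", f"CSVs/{name}.csv"]
        jobs.append((cmd, name))
    return jobs


def run_lint_jobs(jobs):
    """Run each lint command, then rename dc_generated as tests.sh does."""
    for cmd, out_dir in jobs:
        subprocess.run(cmd, check=True)  # requires Java and the jar
        shutil.move("dc_generated", out_dir)
```

For example, `run_lint_jobs(lint_jobs(["mesh_record", "mesh_pubchem_mapping"]))` would cover the last two pairs in tests.sh. Unlike the shell script, `check=True` stops at the first command that exits with a non-zero status.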
From 570bcc5344f77a2d2c7d516e430e60e1534db0f0 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 16:41:01 -0700 Subject: [PATCH 09/19] Update mesh_qual_term.tmcf --- .../NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf | 1 + 1 file changed, 1 insertion(+) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf index 753cca26ba..c001bd1132 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf @@ -1,5 +1,6 @@ Node: E:mesh_qual_term->E1 typeOf: dcs:MeSHConcept +dcid: C:mesh_qual_term->Concept_dcid identifier: C:mesh_qual_term->ConceptUI Node: E:mesh_qual_term->E2 From 2611aec1b524f86673ffb16f8eb31953d2e38e67 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 11 Mar 2024 22:50:58 -0700 Subject: [PATCH 10/19] Update format_mesh_desc.py fix quote error in name property of MeSHDescriptor nodes --- .../Medical_Subject_Headings/scripts/format_mesh_desc.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py index 0827920488..fbfff0046a 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py @@ -256,8 +256,7 @@ def format_descriptor_df(df): df = df.explode('TreeNumber') # create descriptor dcid df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) - # add quotes for Descriptor Name text type column - df['DescriptorName'] = '"' + df.DescriptorName + '"' + # add quotes from text type columns and replaces "nan" with np.nan col_names_quote = ['DescriptorName', 'ScopeNote'] df = format_text_strings(df, 
col_names_quote) # replace missing names with ID @@ -285,7 +284,7 @@ def format_qualifier_df(df): df = df.explode(explode_cols) # remove missing qualifier rows df = df[df['QualifierID'].notna()] - # add quotes from text type columns and replaces "nan" with qualifier ID + # add quotes from text type columns and replaces "nan" with np.nan col_names_quote = ['QualifierName'] df = format_text_strings(df, col_names_quote) # replace missing names with ID From e7a60f232f5199330080a21a2276e964529457f0 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Tue, 2 Apr 2024 17:13:14 -0700 Subject: [PATCH 11/19] update formatting scripts fix a bug in the output of a column in format_mesh_qual.py and remove the extra character '^' from the supplementary concept record names in format_mesh_pa.py --- .../Medical_Subject_Headings/scripts/format_mesh_pa.py | 1 + .../Medical_Subject_Headings/scripts/format_mesh_qual.py | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py index 64bf86bf46..d033f782af 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py @@ -54,6 +54,7 @@ def extract_data_from_xml(xml_filepath): for substance in action.find('PharmacologicalActionSubstanceList'): record_ui = substance.find('RecordUI').text record_name = substance.find('RecordName/String').text + record_name = record_name.strip('^') # remove bad character record_ui_data.append(record_ui) record_name_data.append(record_name) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py index 36fff9bc06..b9f6f61b6e 100644 ---
a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py @@ -13,7 +13,7 @@ # limitations under the License. ''' Author: Samantha Piekos -Date: 03/07/24 +Date: 04/02/24 Name: format_mesh_qual.py Description: converts nested .xml to .csv and further breaks down the csv into four csvs containing invormation on qualifiers, concepts, terms or @@ -228,7 +228,7 @@ def format_concept_df(df): df['ConceptName'] = df['ConceptName'].fillna(df['ConceptUI']) # create qualifier and concept dcids df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) - df['Qualifier_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Qualifier_dcid'] = 'bio/' + df['QualifierUI'].astype(str) # drop the duplicate rows df = df.drop_duplicates() # write df to csv From c82d24a7c3877ead5b5a1e5d2087513d918b25a6 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Tue, 2 Apr 2024 17:14:29 -0700 Subject: [PATCH 12/19] fix typos in 4 tmcf files --- .../tMCFs/mesh_pubchem_mapping.tmcf | 2 +- .../Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf | 2 +- .../Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf | 6 +++--- .../NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf index 1bd6d1a962..7d81a05c26 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf @@ -4,7 +4,7 @@ dcid: C:mesh_pubchem_mapping->CID_dcid pubChemCompoundID: C:mesh_pubchem_mapping->CID Node: E:mesh_pubchem_mapping->E2 -typeOf: dcs:MeSHSupplementaryRecord +typeOf: dcs:MeSHSupplementaryConceptRecord dcid: C:mesh_pubchem_mapping->Record_dcid name: 
C:mesh_pubchem_mapping->RecordName identifier: C:mesh_pubchem_mapping->RecordID diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf index e46cff01b5..df1940e9ac 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf @@ -10,4 +10,4 @@ name: C:mesh_qual_concept->ConceptName description: C:mesh_qual_concept->ScopeNote identifier: C:mesh_qual_concept->ConceptUI isPreferredConcept: C:mesh_qual_concept->IsPreferredConcept -hasMeSHQualifer: E:mesh_qual_concept->E1 +hasMeSHQualifier: E:mesh_qual_concept->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf index c001bd1132..74df8e9bba 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf @@ -5,12 +5,12 @@ identifier: C:mesh_qual_term->ConceptUI Node: E:mesh_qual_term->E2 typeOf: dcs:MeSHTerm -dcid: C:mesh_qual_term->Concept_dcid +dcid: C:mesh_qual_term->Term_dcid name: C:mesh_qual_term->TermName abbreviation: C:mesh_qual_term->Abbreviation abbreviation: C:mesh_qual_term->Display dateCreated: C:mesh_qual_term->DateCreated identifier: C:mesh_qual_term->TermUI -isConceptPreferedTerm: C:mesh_qual_term->is_concept_preferred_term -isRecordPreferedTerm: C:mesh_qual_term->is_record_preferred_term +isConceptPreferredTerm: C:mesh_qual_term->is_concept_preferred_term +isRecordPreferredTerm: C:mesh_qual_term->is_record_preferred_term parent: E:mesh_qual_term->E1 diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf index 
19970af422..65e18bdc9a 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf @@ -4,7 +4,7 @@ dcid: C:mesh_record->Descriptor_dcid identifier: C:mesh_record->DescriptorID Node: E:mesh_record->E2 -typeOf: dcs:MeSHSupplementaryConcept +typeOf: dcs:MeSHSupplementaryConceptRecord dcid: C:mesh_record->Record_dcid identifier: C:mesh_record->RecordID name: C:mesh_record->RecordName From 5353c2470f91b9e25189aa5f6e4b4b8e85e85136 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 24 Jun 2024 20:31:10 -0700 Subject: [PATCH 13/19] Update README.md remove unused subsection from About the Dataset section --- scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md index bda70a2e7a..14ee5dc505 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md @@ -8,7 +8,6 @@ 3. [Notes and Caveats](#notes-and-caveats) 4. [dcid Generation](#dcid-generation) 5. [License](#license) - 6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links) 2. [About the import](#about-the-import) 1. [Artifacts](#artifacts) 1. [Scripts](#scripts) @@ -65,8 +64,6 @@ The dcids for ChemicalCompounds were generated using the PubChem compound ID wit Any works found on National Library of Medicine (NLM) Web sites may be freely used or reproduced without permission in the U.S. More information about the license can be found [here](https://www.nlm.nih.gov/web_policies.html). 
-### Dataset Documentation and Relevant Links - ## About the import ### Artifacts From 844aab4dc25575b2b968b37477bbed1349d47443 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Fri, 1 Nov 2024 17:31:34 -0700 Subject: [PATCH 14/19] Update mesh_desc_concept.tmcf Declare scopeNote property now that it's added to the schema --- .../Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf index 10525e0b22..e418966a3e 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf @@ -2,6 +2,5 @@ Node: E:mesh_desc_concept->E1 typeOf: dcs:MeSHConcept dcid: C:mesh_desc_concept->Concept_dcid name: C:mesh_desc_concept->ConceptName -description: C:mesh_desc_concept->ScopeNote identifier: C:mesh_desc_concept->ConceptID - +scopeNote: C:mesh_desc_concept->ScopeNote From 198ab5bb1a75f645072bfd68a4a87509d35ab8e0 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Fri, 1 Nov 2024 17:32:22 -0700 Subject: [PATCH 15/19] Update mesh_desc_descriptor.tmcf Declare scopeNote property now that it's added to the schema --- .../Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf index 8283498f26..11f073a786 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf @@ -5,7 +5,7 @@ name: C:mesh_desc_descriptor->DescriptorName dateCreated: C:mesh_desc_descriptor->DateCreated dateRevised: 
C:mesh_desc_descriptor->DateRevised dateEstablished: C:mesh_desc_descriptor->DateEstablished -description: C:mesh_desc_descriptor->ScopeNote identifier: C:mesh_desc_descriptor->DescriptorID medicalSubjectHeadingTreeNumber: C:mesh_desc_descriptor->TreeNumber nationalLibraryOfMedicineClassificationNumber: C:mesh_desc_descriptor->NLMClassificationNumber +scopeNote: C:mesh_desc_descriptor->ScopeNote From 61d8c2043bc244a25a681e6a314826e730ae0d03 Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Fri, 1 Nov 2024 17:33:50 -0700 Subject: [PATCH 16/19] Update mesh_qual_concept.tmcf Declare scopeNote property now that it's added to the schema; alphabetize property names in tMCF --- .../Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf index df1940e9ac..82bc358222 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf @@ -7,7 +7,7 @@ Node: E:mesh_qual_concept->E2 typeOf: dcs:MeSHConcept dcid: C:mesh_qual_concept->Concept_dcid name: C:mesh_qual_concept->ConceptName -description: C:mesh_qual_concept->ScopeNote +hasMeSHQualifier: E:mesh_qual_concept->E1 identifier: C:mesh_qual_concept->ConceptUI isPreferredConcept: C:mesh_qual_concept->IsPreferredConcept -hasMeSHQualifier: E:mesh_qual_concept->E1 +scopeNote: C:mesh_qual_concept->ScopeNote From 9a095c4634f4439ac914aba25df2f825980946fd Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 16 Jun 2025 13:55:20 -0500 Subject: [PATCH 17/19] Update mesh_pharmacological_action_record.tmcf Remove "typeOf: schema:Drug" from mesh_pharmacological_action_record.tmcf --- .../tMCFs/mesh_pharmacological_action_record.tmcf | 1 - 1 file changed, 1 deletion(-) diff --git 
a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf index f915fca71b..63b7ae86c0 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf @@ -5,7 +5,6 @@ identifier: C:mesh_pharmacological_action_record->DescriptorUI Node: E:mesh_pharmacological_action_record->E2 typeOf: dcs:MeSHSupplementaryConceptRecord -typeOf: schema:Drug dcid: C:mesh_pharmacological_action_record->dcid name: C:mesh_pharmacological_action_record->RecordName identifier: C:mesh_pharmacological_action_record->RecordUI From 99c3981ff47d03a92120aadaa023dc0920bafe4c Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 16 Jun 2025 13:56:22 -0500 Subject: [PATCH 18/19] Update mesh_pharmacological_action_descriptor.tmcf Remove "typeOf: schema:Drug" from mesh_pharmacological_action_descriptor.tmcf --- .../tMCFs/mesh_pharmacological_action_descriptor.tmcf | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf index 712948c4ee..301cefe6f3 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf @@ -6,9 +6,8 @@ identifier: C:mesh_pharmacological_action_descriptor->DescriptorUI Node: E:mesh_pharmacological_action_descriptor->E2 typeOf: dcs:MeSHDescriptor -typeOf: schema:Drug dcid: C:mesh_pharmacological_action_descriptor->dcid name: C:mesh_pharmacological_action_descriptor->RecordName identifier: 
C:mesh_pharmacological_action_descriptor->RecordUI mechanismOfAction: E:mesh_pharmacological_action_descriptor->E1 - \ No newline at end of file + From 9486856a3e1a8fad5817a6f7e98514636e1b766e Mon Sep 17 00:00:00 2001 From: Samantha Piekos Date: Mon, 16 Jun 2025 13:57:01 -0500 Subject: [PATCH 19/19] Update mesh_pharmacological_action_descriptor.tmcf Remove extra line from mesh_pharmacological_action_descriptor.tmcf --- .../tMCFs/mesh_pharmacological_action_descriptor.tmcf | 1 - 1 file changed, 1 deletion(-) diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf index 301cefe6f3..c58c658c44 100644 --- a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf @@ -10,4 +10,3 @@ dcid: C:mesh_pharmacological_action_descriptor->dcid name: C:mesh_pharmacological_action_descriptor->RecordName identifier: C:mesh_pharmacological_action_descriptor->RecordUI mechanismOfAction: E:mesh_pharmacological_action_descriptor->E1 -
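Note on PATCH 11/19: the line added to format_mesh_pa.py calls `record_name.strip('^')`. As a minimal sketch of what that call does, assuming plain Python strings, `str.strip` only trims the given character from the two ends of the string and leaves any interior `^` in place. The helper name `clean_record_name` and the `None` guard are hypothetical; the patch itself calls `.strip('^')` directly on the element text.

```python
def clean_record_name(name):
    """Trim '^' from both ends of a record name; interior carets are kept."""
    if name is None:
        # Hypothetical guard: an empty XML element has .text == None,
        # and calling .strip() on None would raise AttributeError.
        return None
    return name.strip('^')


if __name__ == "__main__":
    for raw in ["^abc", "abc^", "a^b", None]:
        print(repr(raw), "->", repr(clean_record_name(raw)))
```

If carets inside a name also need to go, `name.replace('^', '')` would be one option, though whether that is wanted for MeSH record names is not something this patch series states.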