83 changes: 52 additions & 31 deletions README.md
@@ -14,88 +14,107 @@ conventions established by the [biolink model](https://github.com/biolink/biolin
at runtime by querying the [Biolink Model service](https://github.com/TranslatorIIPrototypes/bl_lookup). Each semantic type (such as
chemical substance) requires specialized processing, but in each case, a
JSON-formatted compendium is written to disk. This compendium can be used
directly, but it can also be served by the [Node Normalization service](https://github.com/TranslatorSRI/NodeNormalization)
or another frontend.

We anticipate that the simple approach taken here will soon be superseded by
more advanced probabilistic procedures, so be cautious about building
strong dependencies on the Babel code.

## Configuration

The [`./kubernetes`](./kubernetes/README.md) directory contains Kubernetes manifest files
that can be used to set up a Pod to run Babel in. They'll give you an idea of the disk
space and memory requirements needed to run this pipeline.

Before running, read through `config.json` and make sure that the settings look correct.
You may need to update the version numbers of some of the databases to be downloaded, or to change
the download and output directories.

A UMLS API key is required in order to download the UMLS and RxNorm databases. You will need
to set the `UMLS_API_KEY` environment variable to a UMLS API key, which you can obtain
by creating a profile on the [UMLS Terminology Services website](https://uts.nlm.nih.gov/uts).
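
For example, assuming you have already obtained a key, you can export it in your shell before starting a build (the value shown is a placeholder):

```shell
export UMLS_API_KEY=<your-umls-api-key>
```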

## Building Compendia

Compendia building is managed by Snakemake. To build, for example, the anatomy-related compendia, run

```snakemake --cores 1 anatomy```

Currently, the following targets build compendia and synonym files:
* anatomy
* chemical
* disease
* gene
* genefamily
* protein
* macromolecular_complex
* taxon
* process
* publications

And these two build conflations:
* geneprotein
* drugchemical

Each target builds one or more compendia corresponding to a biolink model category. For instance, the anatomy target
builds compendia for `biolink:AnatomicalEntity`, `biolink:Cell`, `biolink:CellularComponent`, and `biolink:GrossAnatomicalStructure`.

You can also just run:

```snakemake --cores 1```

without a target to create all the files that are produced as part of Babel, including all reports and
alternate exports.

If you have multiple CPUs available, you can increase the number of `--cores` to run multiple steps in parallel.
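
For example, to build the anatomy-related compendia on four cores (use whatever core count your machine has available), run

```snakemake --cores 4 anatomy```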

## Build Process

The information contained here is not required to create the compendia, but it may be useful background. The build process is
divided into two parts:

1. Pulling data from external sources and parsing it independent of use.
2. Extracting and combining entities for specific types from these downloaded data sets.

This distinction is made because a single data set, such as MeSH or UMLS, may contain entities of many different types and may be
used by many downstream targets.

### Pulling Data

The datacollection snakemake file coordinates pulling data from external sources into a local filesystem. Each data source
has a module in `src/datahandlers`. Data goes into the `babel_downloads` directory, in subdirectories named by the curie prefix
for that data set. If the directory is misnamed and does not match the prefix, then labels will not be added to the identifiers
in the final compendium.
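
For illustration only (the actual set of subdirectories depends on which data sources are configured), the download directory might end up looking something like:

```
babel_downloads/
├── CHEBI/        # downloads for the CHEBI prefix
├── MESH/         # downloads for the MESH prefix
└── NCBIGene/     # downloads for the NCBIGene prefix
```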

Once data is assembled, we attempt to create two extra files for each data source: `labels` and `synonyms`. `labels` is
a two-column tab-delimited file. The first column is a CURIE identifier from the data source, and the second column is the
label from that data set. Each entity should only appear once in the `labels` file. The `labels` file for a data set
does not subset the data for a specific purpose, but contains all labels for any entity in that data set.
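
For illustration, a `labels` file for a chemical data source might contain tab-separated lines like these (example entries, not taken from an actual Babel build):

```
CHEBI:15377	water
CHEBI:16236	ethanol
```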

`synonyms` contains other lexical names for the entity and is a three-column tab-delimited file, with the second
column indicating the type of synonym (exact, related, xref, etc.).
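
Continuing the illustrative example above, a `synonyms` file might contain tab-separated lines like these (the synonym types shown are examples of the values described above):

```
CHEBI:16236	exact	ethyl alcohol
CHEBI:16236	related	drinking alcohol
```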

### Creating compendia

The individual details of creating a compendium vary, but all follow the same essential pattern.

First, we extract the identifiers that will be used in the compendia from each data source that will contribute, and
place them into a directory. For instance, in the build of the chemical compendium, these ids are placed into
`/babel_downloads/chemical/ids`. Each file is a two-column file containing curie identifiers in column 1, and the
Biolink type for that entity in column 2.
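
For illustration, an id file in `/babel_downloads/chemical/ids` might contain tab-separated lines like these (example entries only):

```
CHEBI:15377	biolink:SmallMolecule
CHEBI:16236	biolink:SmallMolecule
```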

Second, we create pairwise concords across vocabularies. These are placed in e.g. `babel_downloads/chemical/concords`.
Each concord is a three-column file of the format:

`<curie1> <relation> <curie2>`

While the relation is currently unused, future versions of Babel may use the relation in building cliques.
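
For illustration, a concord line linking a CHEBI identifier to the corresponding MeSH identifier might look like this (the relation shown is a hypothetical placeholder, since the relation column is currently unused):

```
CHEBI:15377	skos:exactMatch	MESH:D014867
```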

Third, the compendia are built by bringing together the ids and concords, pulling in the categories from the id files,
and the labels from the label files.

Fourth, the compendia are assessed to make sure that all the ids in the id files made it into one of the possibly multiple
compendia. The compendia are further assessed to locate large cliques and display the level of vocabulary merging.

## Building with Docker
@@ -135,20 +135,22 @@ create three resources:
the internet.
* `kubernetes/babel-outputs.k8s.yaml` creates a PVC for storing the output files generated by Babel. This includes
compendia, synonym files, reports and intermediate files.
* `kubernetes/babel.k8s.yaml` creates a pod running the latest Docker image from ggvaidya/babel. Rather than running
the data generation automatically, you are expected to SSH into this pod and start the build process by:
1. Edit the script `scripts/babel-build.sh` to clear the `DRY_RUN` property so that it doesn't do a dry run, i.e.:
```shell
export DRY_RUN=
```
2. Creating a [screen](https://www.gnu.org/software/screen/) to run the program in. You can start a Screen by
running:

```shell
$ screen
```
3. Starting the Babel build process by running:

```shell
$ bash scripts/babel-build.sh
```

Ideally, this should produce the entire Babel output in a single run. You can also add `--rerun-incomplete` if you
@@ -157,6 +157,6 @@ create three resources:
To help with debugging, the Babel image includes .git information. You can switch branches, or fetch new branches
from GitHub by running `git fetch origin-https`.

4. Press `Ctrl+A D` to "detach" the screen. You can reconnect to a detached screen by running `screen -r`.
You can also see a list of all running screens by running `screen -ls`.
5. Once the generation completes, all output files should be in the `babel_outputs` directory.
30 changes: 30 additions & 0 deletions docs/Conflation.md
@@ -0,0 +1,30 @@
# Babel Conflation

Babel is designed to produce cliques of _identical_ identifiers, but our users would sometimes like to combine
identifiers that are similar in some other way. Babel generates "conflations" to support this.

Babel currently generates two conflations:
1. GeneProtein conflates a gene with the protein produced from it.
The gene identifier will always be returned.
2. DrugChemical conflates drugs with their active ingredients as chemicals. For each conflation we attempt to
determine a Biolink type, and arrange the identifiers in order of (1) preferred prefix order for that Biolink
type, followed by (2) ordering identifiers from the numerically smallest suffix to the numerically largest
suffix (a sketch of this ordering follows the list).
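
A minimal Python sketch of that ordering, assuming that every identifier suffix is purely numeric and that `prefix_order` holds the preferred prefixes for the relevant Biolink type (the function and the prefix order shown are illustrative, not taken from the Babel codebase):

```python
def conflation_sort_key(curie: str, prefix_order: list[str]):
    """Sort key: (1) position of the prefix in the preferred prefix order for the
    Biolink type, (2) numeric value of the suffix, smallest first."""
    prefix, suffix = curie.split(":", 1)
    return (prefix_order.index(prefix), int(suffix))

# Hypothetical identifiers and preferred prefix order.
ids = ["PUBCHEM.COMPOUND:962", "CHEBI:100", "CHEBI:3"]
print(sorted(ids, key=lambda c: conflation_sort_key(c, ["CHEBI", "PUBCHEM.COMPOUND"])))
# -> ['CHEBI:3', 'CHEBI:100', 'PUBCHEM.COMPOUND:962']
```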

## How are conflations generated in Babel and used in NodeNorm?

Each conflation file is a JSON-Lines (JSONL) file, where every line is a JSON list of clique identifiers; these lists
are stored in Redis databases in NodeNorm (an example line is shown after the list). If a particular conflation is turned on, NodeNorm will:
1. Normalize the input identifier to a clique identifier.
2. If the clique identifier is not part of any conflation, return it as-is.
3. If the clique identifier is part of a conflation, construct a new clique whose preferred identifier is the first
identifier in the clique, and which consists of all the identifiers from all the cliques included in that conflation.
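
For example, a single line of the GeneProtein conflation file might look like the following illustrative gene/protein pair (one JSON list per line, gene identifier first):

```json
["NCBIGene:1017", "UniProtKB:P24941"]
```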

## How are types handled for conflated cliques?

Babel does not assign a type to any conflations. When NodeNorm is called with a particular conflation turned on,
it determines the types of a conflated clique by:
1. Starting with the most specific type of the first identifier in the conflation.
2. Adding all the supertypes of that most specific type, as determined
by the [Biolink Model Toolkit](https://github.com/biolink/biolink-model-toolkit).
3. Adding all the types and ancestors of all the other identifiers in the conflation, without duplication
(a minimal sketch of this procedure follows).
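
A minimal Python sketch of that procedure, assuming a `get_ancestors` callable that returns the supertypes of a Biolink type (for example, backed by the Biolink Model Toolkit); the function and variable names are illustrative, not taken from the NodeNorm codebase:

```python
def conflated_clique_types(most_specific_types, get_ancestors):
    """most_specific_types: the most specific Biolink type of each identifier in the
    conflated clique, in clique order. get_ancestors: callable returning the supertypes
    of a Biolink type. Returns the combined type list without duplicates, preserving
    the order in which types are first encountered."""
    types = []

    def add(biolink_type):
        if biolink_type not in types:
            types.append(biolink_type)

    # Steps 1 and 2: the first identifier's most specific type, then its supertypes.
    add(most_specific_types[0])
    for ancestor in get_ancestors(most_specific_types[0]):
        add(ancestor)

    # Step 3: types and ancestors of all other identifiers, without duplication.
    for specific_type in most_specific_types[1:]:
        add(specific_type)
        for ancestor in get_ancestors(specific_type):
            add(ancestor)

    return types

# Toy usage with a hand-written (illustrative) ancestor table:
toy_ancestors = {
    "biolink:Gene": ["biolink:GenomicEntity", "biolink:NamedThing"],
    "biolink:Protein": ["biolink:GeneProductMixin", "biolink:NamedThing"],
}
print(conflated_clique_types(["biolink:Gene", "biolink:Protein"],
                             lambda t: toy_ancestors.get(t, [])))
```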