ThomasChatzopoulos/weak_labeling_snorkel_xgboost

A Weakly-Supervised Machine Learning Method for fine-grained semantic indexing of biomedical literature


About this repository

This repository includes the final (third) part of my Thesis [1] on the fine-grained semantic indexing of biomedical literature.

The first and second parts involve the development of a large-scale dataset and its enhancement with weak supervision and are part of a paper in the Journal of Biomedical Informatics (JBI) [2] (an extended version of this study [3] is also available). The implementation of the previous parts is available here.

In the last part of the paper, Anastasios Nentidis uses a Deep Learning approach, while in my thesis I focus on a Machine Learning method, an XGBoost model (as well as a logistic regression model as a baseline).

Below, I've included the summary of the work with brief explanations about the repository code. For more details and analysis, please take a look at the above references and sources.

Requirements

  • These scripts are written in Python 3.8.2.
  • Libraries and versions required are listed in requirements.txt.
  • Available disk space for the entire project: at least 350 GB.

Introduction

Abstract

The semantic indexing of the biomedical literature in MEDLINE/PubMed is performed with descriptors from the MeSH thesaurus, which represent specific concepts of the biomedical community. Synonymous or related biomedical concepts are often grouped together and represented only by a coarse-grained descriptor, based on which the corresponding bibliography is also indexed. In this work, a method is developed for the automated refinement of the indexing of such concepts by exploring machine learning approaches. Due to the absence of labeled data, weak supervision techniques are used, based on the occurrence of the concept in the text of the articles. The method is evaluated retrospectively, on data for concepts that have been gradually promoted to fine-grained descriptors in the MeSH thesaurus and have since been used to annotate and index articles. Although concept occurrence in article text is a powerful heuristic for fine-grained article indexing, experiments show that combining it with other, simpler heuristics can, in some cases, further strengthen it. Using these heuristics to develop weakly supervised machine learning models can further improve the results. Overall, the proposed method succeeds in improving the indexing of biomedical literature with fine-grained concepts in an automated manner for most of the use cases.

Overview

Semantic indexing of biomedical literature refers to the annotation of articles with labels from a thesaurus of biomedical terminology. The article (citation) database used here is MEDLINE/PubMed, in which articles are indexed with topic descriptors from the Medical Subject Headings (MeSH) thesaurus.

Fig 1. Annotation of PubMed articles with descriptors from the MeSH thesaurus.

Among other factors, the considerable growth in the volume of bibliographic references in recent years has accentuated the necessity for annotating articles with fine-grained labels (MeSH Headings), in contrast to the previously utilized coarse-grained ones. Furthermore, there is a growing imperative to advance the automation of annotation and related processes.

The work is structured as follows:

1. Dataset development: The dataset is created following a retrospective scenario, using concept occurrence in the title or abstract of an article as a heuristic. A previous method [4] based on the same concept-occurrence heuristic was also evaluated on a small dataset.

2. Weakly-supervised dataset enhancement: The weakly-supervised enhancement of the dataset is achieved by combining a number of heuristics, beyond the concept occurrence.

3. ML models development: The development of machine learning models (XGBoost & Logistic Regression) for automated suggestion of fine-grained headings in biomedical literature, instead of coarse-grained ones.

Fig 2. Work overview: (1) The large-scale ground-truth dataset development based on the Retrospective Beyond MeSH method [4], the weakly-labeled training dataset development with concept-occurrence heuristic, (2) the weakly-supervised enhancement of the datasets using dictionary-based approaches, and finally (3) the Machine Learning models development.

The methods & how to run

1. Dataset development

The analysis of this part has been carried out in another repository. The implementation of the development of the datasets has been carried out in Java, while the datasets are stored in JSON format.

In summary:

i. For a given range of years, the MeSH versions are compared and the concepts that have been promoted to fine-grained descriptors are returned.

ii. The use cases are selected by applying constraints, so that each new descriptor expresses a concept in a more detailed but distinct way compared to the previous version of the MeSH thesaurus.

iii. The relevant articles are selected: for the training dataset, the Concept-Occurrence (CO) heuristic in the title or abstract is used, while the articles of the test dataset are already strongly indexed with the desired concept (ground-truth data).
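
Step i can be illustrated with a minimal Python sketch. The actual implementation lives in the other repository and is written in Java; the function name `promoted_concepts`, the dictionary layout (descriptor name mapped to the set of concept names it covers), and the toy MeSH snapshots below are all hypothetical and only serve to show the idea of comparing two versions.

```python
from typing import Dict, List, Set, Tuple


def promoted_concepts(
    old_mesh: Dict[str, Set[str]],
    new_mesh: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (new_descriptor, old_parent_descriptor) pairs.

    A concept counts as "promoted" here if it is a descriptor in the new
    version, was not a descriptor in the old version, and was covered by
    some (coarse-grained) descriptor in the old version.
    """
    # Invert the old version: concept name -> descriptor that covered it.
    concept_to_parent = {
        concept: descriptor
        for descriptor, concepts in old_mesh.items()
        for concept in concepts
    }
    promoted = []
    for descriptor in sorted(new_mesh):
        if descriptor in old_mesh:
            continue  # already a descriptor before, so not a promotion
        parent = concept_to_parent.get(descriptor)
        if parent is not None:
            promoted.append((descriptor, parent))
    return promoted


# Hypothetical toy snapshots of two MeSH versions (not real MeSH data files).
mesh_old = {"Dolphins": {"Dolphins", "Bottle-Nosed Dolphin"}}
mesh_new = {
    "Dolphins": {"Dolphins"},
    "Bottle-Nosed Dolphin": {"Bottle-Nosed Dolphin"},
}

print(promoted_concepts(mesh_old, mesh_new))
```

The constraints of step ii would then be applied as filters on the returned pairs.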

An example of indexing with the CO heuristic is described in Fig. 3, while a portion of a dataset is captured in Fig. 4.

Note: this is a multi-label problem.

Fig 3. The CO label of the article with PMID: 15855629, which was published in PubMed in 2004 and is indexed under the "Dolphins" MeSH Heading, is the descriptor “Bottle-Nosed Dolphin”, which belongs to the detailed descriptors of 2006, and is also identified as a concept in the text of the article. The concepts of the article are identified using the MetaMap tool.

Fig 4. An article with its weak label in the training dataset in JSON format.

2. Weakly-supervised dataset enhancement

The second part of the work enhances the original weak supervision provided by CO by investigating a range of dictionary-based variants and whether their combination can improve the quality of the weak labels. Each dictionary-based labeling function (LF), based on the name or synonyms of the concept of the fine-grained descriptor, assigns a label for concept c to an article if any dictionary element associated with c literally occurs in the title or abstract of the article. This yields a total of 8 alternative heuristics in addition to CO; Table 1 shows them for one example label.

| Labeling Function (LF) | Dictionary-based variant | Example |
|---|---|---|
| NE | name exact | "Bottle-Nosed Dolphin" |
| SE | synonyms exact | "Tursiops truncatus", ... |
| NL | name lowercase | "bottle-nosed dolphin" |
| SL | synonyms lowercase | "tursiops truncatus", ... |
| NNP | name no punctuation | "bottle nosed dolphin" |
| SNP | synonyms no punctuation | "tursiops truncatus", ... |
| NT | name tokens | "bottle", "nosed", "dolphin" |
| ST | synonyms tokens | "tursiops", "truncatus", ... |

Table 1. Example of applying the labeling functions for the label "Bottle-Nosed Dolphin".
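
The variants in Table 1 can be sketched in plain Python as follows. This is an illustrative approximation, not the code in "snorkel_labeling/LFs.py": the helper names `normalize` and `dictionary_lf`, the variant keys, and the sample sentence are hypothetical. It also assumes that the article text is normalized the same way as the dictionary entries, and that the token variants fire when any single token occurs as a word in the text.

```python
import string
from typing import List

VARIANTS = ("exact", "lowercase", "no_punctuation", "tokens")

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text: str, variant: str) -> str:
    """Normalize a text or a dictionary entry for the given variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if variant == "exact":
        return text
    lowered = text.lower()
    if variant == "lowercase":
        return lowered
    # "no_punctuation" and "tokens": lowercase, drop punctuation, collapse spaces.
    return " ".join(lowered.translate(_PUNCT_TO_SPACE).split())


def dictionary_lf(text: str, terms: List[str], variant: str) -> int:
    """Return 1 if any dictionary element occurs in the text, else 0."""
    haystack = normalize(text, variant)
    entries = [normalize(term, variant) for term in terms]
    if variant == "tokens":
        words = set(haystack.split())
        tokens = {token for entry in entries for token in entry.split()}
        return int(any(token in words for token in tokens))
    return int(any(entry in haystack for entry in entries))


# Hypothetical title used only for illustration.
text = "Acoustic behaviour of bottle-nosed dolphins (Tursiops truncatus)."
name = ["Bottle-Nosed Dolphin"]
synonyms = ["Tursiops truncatus"]

votes = {
    "NE": dictionary_lf(text, name, "exact"),
    "SE": dictionary_lf(text, synonyms, "exact"),
    "NL": dictionary_lf(text, name, "lowercase"),
    "SL": dictionary_lf(text, synonyms, "lowercase"),
    "NNP": dictionary_lf(text, name, "no_punctuation"),
    "SNP": dictionary_lf(text, synonyms, "no_punctuation"),
    "NT": dictionary_lf(text, name, "tokens"),
    "ST": dictionary_lf(text, synonyms, "tokens"),
}
print(votes)
```

In this toy sentence only NE abstains, because the name appears in lowercase in the text; the looser variants all fire, which shows why they can add recall (and noise) on top of exact matching.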

The next challenge is how these 9 weak labels can be combined into 1 strong label for each concept of an article. The 2 main questions here are:

  1. How will the heuristics be combined to produce a strong label?
  2. Will all 9 heuristics be used? If not, then which subset of them will be used?

Three approaches were considered for combining the labeling functions:

  • Majority Voting (MV): assigns a label to an article if most of the LFs (voters) assign this label to this article
  • At-Least-One Voting (ALOV): assigns a label to an article if any of the LFs assign this label to this article
  • Label Model of Snorkel (LM): a state-of-the-art probabilistic approach for combining noisy heuristics, introduced in the context of Snorkel; it considers the true label for an article as a latent variable in a probabilistic generative model, which is estimated based on a weighted combination of the LFs

Figure 5 shows an example of how these 3 approaches work to combine 3 labeling functions (CO, SL, ST) into 1 strong label.

Fig 5. The operation of the 3 approaches (MV, ALOV, LM) for combining 3 labeling functions (CO, SL, ST) into 1 strong label on four biomedical articles & the "Bottle-Nosed Dolphin" concept. The value '1' indicates that the label is assigned, while the value '0' indicates that it is not assigned.
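
The two voting schemes can be written in a few lines of Python. This is a minimal sketch and not the repository code: the function names and the vote matrix are hypothetical (the values are not taken from Fig 5), and Majority Voting is assumed to require a strict majority, so a tie does not assign the label. The Snorkel Label Model is left out because it needs the Snorkel library and a fitted generative model.

```python
from typing import Dict, List


def majority_vote(votes: List[int]) -> int:
    """Assign the label (1) only if more than half of the LFs voted 1."""
    return int(sum(votes) > len(votes) / 2)


def at_least_one_vote(votes: List[int]) -> int:
    """Assign the label (1) if any LF voted 1."""
    return int(any(votes))


# Hypothetical votes of three LFs, in the order (CO, SL, ST), for four articles.
lf_votes: Dict[str, List[int]] = {
    "article_1": [1, 1, 1],
    "article_2": [1, 0, 0],
    "article_3": [0, 1, 1],
    "article_4": [0, 0, 0],
}

mv_labels = {pmid: majority_vote(v) for pmid, v in lf_votes.items()}
alov_labels = {pmid: at_least_one_vote(v) for pmid, v in lf_votes.items()}

print("MV:  ", mv_labels)
print("ALOV:", alov_labels)
```

The two schemes differ only for `article_2`, where a single LF fires: ALOV assigns the label while MV does not, so ALOV trades some precision for higher recall.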


The 2006 dataset was used for validation experiments, in order to make design choices regarding the architecture and configurations of the method, while the datasets for unseen fine-grained labels introduced in the subsequent years (2007–2019 datasets) were used in the “evaluation experiments” (Section 4.3), which aimed to confirm the choices made on the 2006 datasets.

Validation experiments on the 2006 datasets showed that 3 LFs are sufficient, and ALOV was chosen as the better approach for combining them.

All the scripts for this part of the method can be found under the "snorkel_labeling" folder.

The main file of the method is "snorkel_labeling/snorkel_model.py".

The labeling functions are implemented in "snorkel_labeling/LFs.py".

3. ML models development


References

[1] Chatzopoulos, T. (2024). Fine-grained semantic indexing of biomedical literature (thesis, in Greek). Nemertes, Institutional Repository of the University of Patras, 2024, https://hdl.handle.net/10889/27628

[2] Nentidis, A., Chatzopoulos, T., Krithara, A., Tsoumakas, G., & Paliouras, G. (2023). Large-scale investigation of weakly-supervised deep learning for the fine-grained semantic indexing of biomedical literature. Journal of Biomedical Informatics, Volume 146, 2023, 104499, ISSN 1532-0464, https://doi.org/10.1016/j.jbi.2023.104499.

[3] Nentidis, A., Chatzopoulos, T., Krithara, A., Tsoumakas, G., & Paliouras, G. (2023). Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning. arXiv preprint (An extended version of [2]). https://arxiv.org/pdf/2301.09350v1.pdf

[4] Nentidis, A., Chatzopoulos, T., Krithara, A., Tsoumakas, G., & Paliouras, G. (2020). Beyond MeSH: Fine-grained semantic indexing of biomedical literature based on weak supervision. Information Processing & Management, Volume 57, Issue 5, 2020, 102282, ISSN 0306-4573, https://doi.org/10.1016/j.ipm.2020.102282
