Text to Knowledge Graph

Extract entities and relationships from biomedical text and build a knowledge graph.

Article Classification

python3 text2knowledge.py classify-article --input-file ./classfication/example.json --output-file ./classfication/results/mixtral_8x22b.json -m mixtral:8x22b

Strategy 1: Employ an LLM to extract entities and relations directly

Please refer to Prompts for more details.

If you want to extract all entities from the text, you can use the following command.

python3 text2knowledge.py extract-entities --text-file examples/text2knowledge/abstract.txt --output-file examples/text2knowledge/entities.json --model-name mistral:latest

If you want to extract all relations from the text, you can use the following command.

python3 text2knowledge.py extract-relations --text-file examples/text2knowledge/abstract.txt --output-file examples/text2knowledge/relationships.json --model-name mistral:latest
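
Once both files exist, you may want to assemble them into a graph. The sketch below is hypothetical: the field names it assumes ("name", "source", "target", "relation") may not match the JSON actually written by the commands above, so adjust them to the real schema. It simply illustrates one way to combine the two outputs with networkx.

import json
import networkx as nx

def build_graph(entities_file, relations_file):
    with open(entities_file) as f:
        entities = json.load(f)
    with open(relations_file) as f:
        relations = json.load(f)

    graph = nx.DiGraph()
    for entity in entities:
        # Assumed: each entity record carries at least a "name" field.
        graph.add_node(entity["name"], **entity)
    for rel in relations:
        # Assumed: each relation record names its two endpoints and a relation type.
        graph.add_edge(rel["source"], rel["target"], relation=rel.get("relation"))
    return graph

graph = build_graph("examples/text2knowledge/entities.json",
                    "examples/text2knowledge/relationships.json")
print(graph.number_of_nodes(), "entities,", graph.number_of_edges(), "relations")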

Issues

  • How to improve the accuracy of entity extraction?

  • How to align the entities and relations? In the current version, we extract entities and relations separately.

  • How to align all entities to ontology items, e.g. Hepatocellular carcinoma --> MONDO:0007256? You can access BioPortal to learn more about the ontology items.

Strategy 2: Employ an LLM to extract entities and relations by asking choice questions [Not Ready Yet]

Introduction

A new solution for converting text to a knowledge graph (a minimal code sketch of the embedding and question-generation steps follows the list):

  1. Extract all biomedical entities from the text by using a large language model (e.g. ChatGPT4, Vicuna, etc.)
  2. Convert all preset ontology items to embeddings
  3. Map all extracted entities to the ontology items by computing the similarity between the embeddings, and then pick up the top N similar ontology items for each entity
  4. Use a more precise method to re-rank the top N similar ontology items for each entity and pick up the top 1
  5. Generate questions from the mapped ontology items. If we have ten entities, we can generate C(10, 2) = 10! / [2!(10-2)!] = (10 × 9) / (2 × 1) = 45 questions. We can reduce the number of questions based on our needs, for example by only considering the specific entities we care about.
  6. Pick up the answer for each question from the text by using a large language model (e.g. ChatGPT4, Vicuna, etc.)
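
As a rough, illustrative sketch of steps 2, 3, and 5 (not the project's actual implementation), the snippet below embeds extracted entities and preset ontology labels with sentence-transformers, maps each entity to its most similar ontology item by cosine similarity, and generates one question per entity pair with itertools.combinations. The embedding model name and the ontology entries are placeholders; the re-ranking of step 4 is omitted.

from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

entities = ["hepatocellular carcinoma", "sorafenib", "TP53"]  # step 1 output (example values)
ontology = {
    "MONDO:0007256": "hepatocellular carcinoma",  # preset ontology items (placeholders)
    "CHEBI:50924": "sorafenib",
    "HGNC:11998": "TP53",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
entity_vecs = model.encode(entities, normalize_embeddings=True)
ontology_vecs = model.encode(list(ontology.values()), normalize_embeddings=True)

# Step 3: cosine similarity (vectors are normalized, so a dot product suffices);
# keep only the single best ontology item per entity here.
similarity = entity_vecs @ ontology_vecs.T
ontology_ids = list(ontology.keys())
mapped = {entity: ontology_ids[int(np.argmax(row))]
          for entity, row in zip(entities, similarity)}

# Step 5: one question per unordered pair of mapped entities, C(n, 2) in total.
questions = [f"According to the text, what is the relationship between {a} and {b}?"
             for a, b in combinations(mapped, 2)]
print(mapped)
print(len(questions), "questions")  # C(3, 2) = 3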

Improvement plan

  1. Fine-tune embedding algorithm for biomedical entities
  2. Select the most suitable similarity algorithm
  3. Select a suitable re-ranking algorithm
  4. Improve the prompts for generating questions based on the characteristics of the large language model

Launch a Chatbot Server for Text2Knowledge

NOTE: Read the README.md in the chatbot folder for more details [Not Ready Yet]. Alternatively, you can use the open source project Ollama (see the Ollama GitHub repository) instead of our chatbot.

After you install Ollama, you can run the following commands to pull the models and launch the Ollama server.

Pull the models

ollama pull mistral-openorca:latest

# or
bash pull_models.sh

Launch the Ollama server

This step might not be required for you. If you have installed Ollama on macOS, you can also click the Ollama icon in the Applications folder to launch the Ollama server.

ollama serve

After you launch the Ollama server, you can open the following link in your browser to list all the available models.

http://127.0.0.1:11434/api/tags
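
If you prefer to check this from a script instead of the browser, a small standard-library snippet like the following (not part of this repository) queries the same endpoint; it assumes the response contains a "models" list, as the Ollama API currently returns.

import json
from urllib.request import urlopen

# Ask the local Ollama server which models are available.
with urlopen("http://127.0.0.1:11434/api/tags") as response:
    tags = json.load(response)

for model in tags.get("models", []):
    print(model["name"])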

[Optional] Change the storage path

If you have limited storage space on your computer, you can change the storage path to another disk. For more details on how to change the storage path, please refer to the Ollama FAQ.

# Persist the new model storage path for future shells (note the export).
echo 'export OLLAMA_MODELS=/path/to/your/disk' >> ~/.bashrc
source ~/.bashrc

Benchmarking

Datasets

Benchmarking Datasets and Tools for Biomedical NLP

  1. Biomedical Datasets: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04688-w/tables/2
  2. N2C2 NLP Dataset: https://portal.dbmi.hms.harvard.edu
  3. BC5CDR (BioCreative V CDR corpus): https://paperswithcode.com/dataset/bc5cdr
  4. BC4CHEMD (BioCreative IV Chemical compound and drug name recognition): https://paperswithcode.com/dataset/bc4chemd
  5. BioNLP: https://aclanthology.org/venues/bionlp/
  6. PubTator: https://www.ncbi.nlm.nih.gov/research/pubtator3/
  7. BioNLP-Corpus: https://github.com/bionlp-hzau/BioNLP-Corpus
  8. BioBERT & Bern: https://github.com/dmis-lab/bern
  9. BioRED: https://academic.oup.com/bib/article/23/5/bbac282/6645993

References

You can refer to these papers/models/companies for more details.

Contribution Guidelines

We welcome and appreciate any contributions from the community members. If you wish to contribute to Text2Knowledge, please follow these steps:

  1. Fork the repository and create your branch.
  2. Make changes in your branch.
  3. Submit a Pull Request.

Please ensure that your code adheres to the project's coding style and quality standards before submitting your contribution.

License

Text2Knowledge is released under the MIT License. For more details, please refer to the LICENSE.md file in the repository.
