Ronen Huang
October 2023 to December 2023, March 2025 to Present (deploy as package)
The web pages for each book of the New International Version Bible can be accessed via https://www.biblestudytools.com/.
The Bible Search Engine can be installed with pip.
pip install bible_search_engine
The Bible search engine can be created with the create_bible_search_engine function from pipeline.py.
For example,
from bible_search_engine.pipeline import create_bible_search_engine
bible_search_engine = create_bible_search_engine()
The engine is an instance of the BibleSearchEngine class, whose search function returns a ranked list of results for a given query. Each result is a dictionary of the form
{'chapterid': chapter id, 'score': chapter score, 'chapter': chapter title, 'verses': chapter verses}
For example, searching for the query "How does God show his mercy?" with
bible_search_engine.search('How does God show his mercy?')
yields
[
{'chapterid': 1164, 'score': 0.7548669934272766, 'chapter': '1 John 5',
'verses': {
'1': 'Everyone who believes that Jesus is the Christ is born of God, and everyone who loves the father loves his child as well.',
'2': 'This is how we know that we love the children of God: by loving God and carrying out his commands.',
...
}
},
{'chapterid': 1162, 'score': 0.7545916557312011, 'chapter': '1 John 3',
'verses': {
'1': 'See what great love the Father has lavished on us, that we should be called children of God! And that is what we are! The reason the world does not know us is that it did not know him.',
'2': 'Dear friends, now we are children of God, and what we will be has not yet been made known. But we know that when Christ appears, we shall be like him, for we shall see him as he is.',
...
}
},
...
]
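Since each result is a plain dictionary, the ranked list can be post-processed with ordinary Python. The following is a minimal sketch, not part of the package: the helper name summarize_results and the hand-written sample_results list are hypothetical, and the sample only mirrors the structure and the two top entries shown above so that the snippet runs without loading the search engine.

```python
from typing import Any


def summarize_results(results: list[dict[str, Any]], top_k: int = 3) -> list[str]:
    """Return one line per result: rank, chapter title, score, and a verse preview."""
    lines = []
    for rank, result in enumerate(results[:top_k], start=1):
        verses = result["verses"]
        # Verse keys are strings ('1', '2', ...), so compare them numerically.
        first_key = min(verses, key=int) if verses else None
        preview = verses[first_key][:60] if first_key else ""
        lines.append(
            f"{rank}. {result['chapter']} (score {result['score']:.4f}): {preview}"
        )
    return lines


# Hypothetical sample that copies the shape of the output above (first verse only).
sample_results = [
    {
        "chapterid": 1164,
        "score": 0.7548669934272766,
        "chapter": "1 John 5",
        "verses": {
            "1": "Everyone who believes that Jesus is the Christ is born of God, "
                 "and everyone who loves the father loves his child as well.",
        },
    },
    {
        "chapterid": 1162,
        "score": 0.7545916557312011,
        "chapter": "1 John 3",
        "verses": {
            "1": "See what great love the Father has lavished on us, "
                 "that we should be called children of God!",
        },
    },
]

for line in summarize_results(sample_results):
    print(line)
```

With the real engine, the same helper could be applied to the list returned by bible_search_engine.search(...), assuming the result structure matches the one documented above.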
- The get_bible_data function from get_bible_data.py retrieves the web page of each Bible chapter and saves the text as Old Testament chapters data and New Testament chapters data.
  - Libraries:
    - requests to retrieve HTML.
    - BeautifulSoup to retrieve text.
  - Usage Example
        from bible_search_engine.components.get_bible_data import get_bible_data
        get_bible_data()
- The NLPTokenizer class is in preprocess.py. The tokenize function splits the Bible chapter text into tokens for indexing.
  - Libraries:
    - transformers with google/flan-t5-large model to expand chapters and queries with keywords.
    - spacy to tokenize text.
    - nltk to expand queries with synonyms.
  - Usage Example
        from bible_search_engine.components.preprocess import NLPTokenizer
        query = "Who is Jesus?"
        nlp_tokenizer = NLPTokenizer()
        query_tokens = nlp_tokenizer.tokenize(query)
- The BibleChapterIndex class is in indexing.py. The create_bible_index function creates an inverted index by chapter.
  - Usage Example
        from bible_search_engine.components.indexing import create_bible_index
        old_testament_path = 'bible_search_engine/bible_data/old_testament_niv.jsonl'
        new_testament_path = 'bible_search_engine/bible_data/new_testament_niv.jsonl'
        bible_chapter_index_path = 'bible_search_engine/bible_index_niv'
        bible_chapter_index = create_bible_index(old_testament_path, new_testament_path, nlp_tokenizer)
        bible_chapter_index.save()
- The rankers RandomRanker, TFIDFRanker, BM25Ranker, DirichletLMRanker, BiEncoderRanker, and ColbertRanker in ranker.py have a query function and a score function.
  - Libraries:
    - sentence_transformers with msmarco-distilbert-dot-v5 model to retrieve initial results with bi-encoder.
    - qdrant with colbert-ir/colbertv2.0 model to add ColBERT embedding similarity feature for learning to rank.
  - Usage Example
        from bible_search_engine.components.ranker import (TFIDFRanker, BM25Ranker,
            DirichletLMRanker, BiEncoderRanker, ColbertRanker)
        import orjson
        import numpy as np
        tf_idf_ranker = TFIDFRanker(bible_chapter_index, nlp_tokenizer)
        bm25_ranker = BM25Ranker(bible_chapter_index, nlp_tokenizer)
        dirichlet_lm_ranker = DirichletLMRanker(bible_chapter_index, nlp_tokenizer)
        colbert_ranker = ColbertRanker("colbert-ir/colbertv2.0", "bible_chapters")
        encoded_chapters_path = 'bible_search_engine/bible_data/encoded_chapters_niv.npy'
        chapter_ids_path = 'bible_search_engine/bible_data/chapterids_niv.json'
        encoded_chapters = np.load(encoded_chapters_path)
        with open(chapter_ids_path, 'rb') as chapter_ids_file:
            chapter_ids = orjson.loads(chapter_ids_file.readline())
        bi_encoder_ranker = BiEncoderRanker('msmarco-distilbert-dot-v5', encoded_chapters, chapter_ids)
        tf_idf_results = tf_idf_ranker.query(query)
        bm25_results = bm25_ranker.query(query)
        dirichlet_lm_results = dirichlet_lm_ranker.query(query)
        bi_encoder_results = bi_encoder_ranker.query(query)
        colbert_results = colbert_ranker.query(query)
- The L2RRanker class in l2r.py has a train function and a query function. The L2RFeatureExtractor has a get_features function.
  - Libraries:
  - Usage Example
        from bible_search_engine.components.l2r import L2RRanker, L2RFeatureExtractor
        train_queries_path = 'bible_search_engine/bible_queries_relevances/train_queries_relevances.csv'
        test_queries_path = 'bible_search_engine/bible_queries_relevances/test_queries_relevances.csv'
        l2r_feature_extractor = L2RFeatureExtractor(bible_chapter_index, nlp_tokenizer,
            tf_idf_ranker, bm25_ranker, dirichlet_lm_ranker, colbert_ranker)
        l2r_ranker = L2RRanker(bible_chapter_index, nlp_tokenizer, bi_encoder_ranker, l2r_feature_extractor)
        l2r_ranker.train(train_queries_path, test_queries_path)
        l2r_results = l2r_ranker.query(query)
- The Relevance class in relevance.py has a precision function and an evaluate_ranker_results function. The default is precision at 15 Bible chapters.
  - Usage Example
        from bible_search_engine.components.relevance import Relevance
        # test_queries_data: the test queries relevance data, assumed to be loaded beforehand
        test_eval = Relevance(test_queries_data)
        tf_idf_eval_results = test_eval.evaluate_ranker_results(tf_idf_ranker)
        bm25_eval_results = test_eval.evaluate_ranker_results(bm25_ranker)
        dirichlet_lm_eval_results = test_eval.evaluate_ranker_results(dirichlet_lm_ranker)
        bi_encoder_eval_results = test_eval.evaluate_ranker_results(bi_encoder_ranker)
        l2r_eval_results = test_eval.evaluate_ranker_results(l2r_ranker)
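To illustrate how an inverted index by chapter and a BM25-style ranker fit together, here is a small self-contained sketch. It is not the code of BibleChapterIndex or BM25Ranker: the function names, the toy chapter ids, the hand-tokenized chapters, and the parameter values k1=1.2 and b=0.75 are all assumptions chosen for illustration, and the scoring formula is one common BM25 variant.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(chapters):
    """Map each token to {chapter id: term frequency in that chapter}."""
    index = defaultdict(dict)
    for chapter_id, tokens in chapters.items():
        for token, count in Counter(tokens).items():
            index[token][chapter_id] = count
    return dict(index)


def bm25_rank(query_tokens, index, chapter_lengths, k1=1.2, b=0.75):
    """Return (chapter id, score) pairs sorted by descending BM25 score."""
    num_chapters = len(chapter_lengths)
    avg_length = sum(chapter_lengths.values()) / num_chapters
    scores = defaultdict(float)
    for token in set(query_tokens):
        postings = index.get(token, {})
        if not postings:
            continue  # token does not occur in any chapter
        # Smoothed IDF that stays positive even for very common tokens.
        idf = math.log(1 + (num_chapters - len(postings) + 0.5) / (len(postings) + 0.5))
        for chapter_id, tf in postings.items():
            # Length normalization: longer chapters need more matches to score well.
            norm = 1 - b + b * chapter_lengths[chapter_id] / avg_length
            scores[chapter_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, hand-tokenized "chapters" keyed by hypothetical chapter ids.
toy_chapters = {
    1: ["god", "is", "love"],
    2: ["god", "created", "the", "heavens"],
    3: ["love", "is", "patient", "love", "is", "kind"],
}
toy_index = build_inverted_index(toy_chapters)
toy_lengths = {chapter_id: len(tokens) for chapter_id, tokens in toy_chapters.items()}

ranking = bm25_rank(["love"], toy_index, toy_lengths)
print(ranking)
```

Only chapters that contain at least one query token receive a score, which is why an inverted index keeps query time proportional to the postings of the query tokens rather than to the number of chapters.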
- The bible_basic_statistics directory contains the Bible chapter statistics chapter_statistics.csv and the train and test query relevance score distributions train_rel_scores_dist.jpg and test_rel_scores_dist.jpg respectively. The process to obtain them is in basic_bible_statistics.ipynb.
- The bible_data directory contains the Old Testament chapters data old_testament_niv.jsonl, the New Testament chapters data new_testament_niv.jsonl, the Bible chapters encoded by the bi-encoder encoded_chapters_niv.npy, and the chapter ids in the order of the encoded Bible chapters chapterids_niv.json.
- The bible_index_niv directory is the stored Bible chapter index.
- The bible_queries_relevances directory contains the train queries relevance scores data train_queries_relevances.csv and the test queries relevance scores data test_queries_relevances.csv. The process to create them is in query_results.ipynb.
- The rankers directory contains the performances of the baselines and the Bible search engine on the test data results.csv and the bar plot of it results.jpg. The process to evaluate them is in evaluate_trained_ranker.ipynb.
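The Relevance class described earlier defaults to precision at 15 Bible chapters. As a rough illustration of that metric, the sketch below computes precision at k for a single query. It assumes binary relevance (a chapter is either relevant or not), which may differ from how the package treats its relevance scores; the function name and the example chapter ids are hypothetical.

```python
def precision_at_k(ranked_chapter_ids, relevant_chapter_ids, k=15):
    """Fraction of the top-k ranked chapters that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_chapter_ids[:k]
    hits = sum(1 for chapter_id in top_k if chapter_id in relevant_chapter_ids)
    # Divide by k rather than len(top_k) so that short result lists are penalized.
    return hits / k


# Hypothetical ranking and relevance judgments for one query.
ranked = [1164, 1162, 5, 9]
relevant = {1164, 9, 77}

print(precision_at_k(ranked, relevant, k=4))
print(precision_at_k(ranked, relevant))  # default cutoff of 15 chapters
```

Averaging this value over all test queries gives a single number per ranker, which is one plausible way to compare the baselines with the learning-to-rank model.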
- New International Version for Modern English
- Predicted Relevance from Annotated Queries Data for Completeness
- Incorporated Large Language Model for Chapter and Query Expansion
- ColBERT Learning to Rank Feature for Efficiency
- Tuned Hyperparameters Including Feature Extractor for Optimal Ranking
- Add Support for Different Bible Versions
- Use Online Learning
Bible Study Tools. (2025). Retrieved from https://www.biblestudytools.com
Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., . . . Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 3149–3157). Curran Associates Inc.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Retrieved from https://arxiv.org/abs/1908.10084
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Retrieved from https://arxiv.org/abs/2112.01488