Bible Chapter Search Engine

Author

Ronen Huang

Time Frame

October 2023 to December 2023, March 2025 to Present (deploy as package)

Data Source

The web pages for each book of the New International Version Bible can be accessed via https://www.biblestudytools.com/.

Install Bible Search Engine

The Bible Search Engine can be downloaded by pip.

pip install bible_search_engine

How to Run Bible Search Engine

The Bible search engine can be created with the create_bible_search_engine function from pipeline.py.

For example,

from bible_search_engine.pipeline import create_bible_search_engine

bible_search_engine = create_bible_search_engine()

.

It is represented by the BibleSearchEngine class which has a search function that returns a list of ranked results for the given query. Each result is in dictionary form as

{'chapterid': chapter id, 'score': chapter score, 'chapter': chapter title, 'verses': chapter verses}

.

For example, to search for the query "How does God show his mercy?"

bible_search_engine.search('How does God show his mercy?')

yields

[
    {'chapterid': 1164, 'score': 0.7548669934272766, 'chapter': '1 John 5',
        'verses': {
            '1': 'Everyone who believes that Jesus is the Christ is born of God, and everyone who loves the father loves his child as well.',
            '2': 'This is how we know that we love the children of God: by loving God and carrying out his commands.',
            ...
        }
    },
    {'chapterid': 1162, 'score': 0.7545916557312011, 'chapter': '1 John 3',
        'verses': {
            '1': 'See what great love the Father has lavished on us, that we should be called children of God! And that is what we are! The reason the world does not know us is that it did not know him.',
            '2': 'Dear friends, now we are children of God, and what we will be has not yet been made known. But we know that when Christ appears, we shall be like him, for we shall see him as he is.',
            ...
        }
    },
    ...
]

.

File Information

The get_bible_data function from get_bible_data.py obtains the web pages of each Bible chapter to Old Testament chapters data and New Testament chapters data.
- Libraries:
  - requests to retrieve HTML.
  - BeautifulSoup to retrieve text.
- Usage Example
```
from bible_search_engine.components.get_bible_data import get_bible_data

get_bible_data()
```
The NLPTokenizer class is in preprocess.py. The tokenize function splits the Bible chapter text into tokens for indexing.
- Libraries:
  - transformers with google/flan-t5-large model to expand chapters and queries with keywords.
  - spacy to tokenize text.
  - nltk to expand queries with synonyms.
- Usage Example
```
from bible_search_engine.components.preprocess import get_bible_data

query = "Who is Jesus?"

nlp_tokenizer = NLPTokenizer()
query_tokens = nlp_tokenizer.tokenize(query)
```

The BibleChapterIndex class is in indexing.py. The create_bible_index function creates an inverted index by chapter.

Usage Example

from bible_search_engine.components.indexing import create_bible_index

old_testament_path = 'bible_search_engine/bible_data/old_testament_niv.jsonl'
new_testament_path = 'bible_search_engine/bible_data/new_testament_niv.jsonl'

bible_chapter_index_path = 'bible_search_engine/bible_index_niv'
bible_chapter_index = create_bible_index(old_testament_path, new_testament_path, nlp_tokenizer)
bible_chapter_index.save()

The rankers RandomRanker, TFIDFRanker, BM25Ranker, DirichletLMRanker, BiEncoderRanker, ColbertRanker in ranker.py have a query function and a score function.

Libraries:
- sentence_transformers with msmarco-distilbert-dot-v5 model to retrieve initial results with bi-encoder.
- qdrant with colbert-ir/colbertv2.0 model to add ColBERT embedding similarity feature for learning to rank.

Usage Example

from bible_search_engine.components.ranker import (TFIDFRanker, BM25Ranker, DirichletLMRanker,
                                                   BiEncoderRanker, ColbertRanker)
import orjson
import numpy as np

tf_idf_ranker = TFIDFRanker(bible_chapter_index, nlp_tokenizer)
bm25_ranker = BM25Ranker(bible_chapter_index, nlp_tokenizer)
dirichlet_lm_ranker = DirichletLMRanker(bible_chapter_index, nlp_tokenizer)

colbert_ranker = ColbertRanker("colbert-ir/colbertv2.0", "bible_chapters")

encoded_chapters_path = 'bible_search_engine/bible_data/encoded_chapters_niv.npy'
chapter_ids_path = 'bible_search_engine/bible_data/chapterids_niv.json'
encoded_chapters = np.load(encoded_chapters_path)
with open(chapter_ids_path, 'rb') as chapter_ids_file:
    chapter_ids = orjson.loads(chapter_ids_file.readline())

bi_encoder_ranker = BiEncoderRanker('msmarco-distilbert-dot-v5', encoded_chapters, chapter_ids)

tf_idf_results = tf_idf_ranker.query(query)
bm25_results = bm25_ranker.query(query)
dirichlet_lm_results = dirichlet_lm_ranker.query(query)
bi_encoder_results = bi_encoder_ranker.query(query)
colbert_results = colbert_ranker.query(query)

The L2RRanker class in l2r.py has a train function and a query function. The L2RFeatureExtractor has a get_features function.

Libraries:
- lightgbm to implement learning to rank for reranking of top 200 initial results.
- optuna to tune hyperparameters learning_rate, num_iterations, max_depth, num_leaves, min_data_in_leaf.

Usage Example

from bible_search_engine.components.l2r import L2RRanker, L2RFeatureExtractor

train_queries_path = 'bible_search_engine/bible_queries_relevances/train_queries_relevances.csv'
test_queries_path = 'bible_search_engine/bible_queries_relevances/test_queries_relevances.csv'

l2r_feature_extractor = L2RFeatureExtractor(bible_chapter_index, nlp_tokenizer, tf_idf_ranker,
                                            bm25_ranker, dirichlet_lm_ranker, colbert_ranker)
l2r_ranker = L2RRanker(bible_chapter_index, nlp_tokenizer, bi_encoder_ranker, l2r_feature_extractor)
l2r_ranker.train(train_queries_path, test_queries_path)

l2r_results = l2r_ranker.query(query)

The Relevance class in relevance.py has a precision function and evaluate_ranker_results function. The default is precision at 15 Bible chapters.

Usage Example

from bible_search_engine.components.relevance import Relevance

test_eval = Relevance(test_queries_data)
tf_idf_eval_results = test_eval.evaluate_ranker_results(tf_idf_ranker)
bm25_eval_results = test_eval.evaluate_ranker_results(bm25_ranker)
dirichlet_lm_eval_results = test_eval.evaluate_ranker_results(dirichlet_lm_ranker)
bi_encoder_eval_results = test_eval.evaluate_ranker_results(bi_encoder_ranker)
l2r_eval_results = test_eval.evaluate_ranker_results(l2r_ranker)

The bible_basic_statistics directory contains the Bible chapter statistics chapter_statistics.csv and the train and test query relevance score distribution train_rel_scores_dist.jpg and test_rel_scores_dist.jpg respectively. The process to obtain them is in basic_bible_statistics.ipynb.
The bible_data directory contains the Old Testament chapters data old_testament_niv.jsonl, the New Testament chapters data new_testament_niv.jsonl, the encoded bible chapters by the bi-encoder encoded_chapters_niv.npy, and the chapter ids in order of the encoded bible chapters chapterids_niv.json.
The bible_index_niv directory is the stored Bible chapter index.
The bible_queries_relevances directories contains the train queries relevance scores data train_queries_relevances.csv and the test queries relevance scores data test_queries_relevances.csv. The process to create them is in query_results.ipynb.
The rankers directory contains the performances of the baselines and Bible search engine on the test data results.csv and the bar plot of it results.jpg. The process to evaluate them is in evaluate_trained_ranker.ipynb.

Release Notes

New International Version for Modern English
Predicted Relevance from Annotated Queries Data for Completeness
Incorporated Large Language Model for Chapter and Query Expansion
Colbert Learning to Rank Feature for Efficiency
Tuned Hyperparameters Including Feature Extractor for Optimal Ranking

Planned Improvements

Add Support for Different Bible Versions
Use Online Learning

References

Bible Study Tools. (2025). Retrieved from https://www.biblestudytools.com

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., . . . Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 3149–3157). Curran Associates Inc.

Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Retrieved from https://arxiv.org/abs/1908.10084

Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Retrieved from https://arxiv.org/abs/2112.01488

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
pypi_page.png		pypi_page.png
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Bible Chapter Search Engine

Author

Time Frame

Data Source

Install Bible Search Engine

How to Run Bible Search Engine

File Information

Release Notes

Planned Improvements

References

About

Uh oh!

Releases 6

Uh oh!

Languages

License

ronenh24/bible_search_engine

Folders and files

Latest commit

History

Repository files navigation

Bible Chapter Search Engine

Author

Time Frame

Data Source

Install Bible Search Engine

How to Run Bible Search Engine

File Information

Release Notes

Planned Improvements

References

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases 6

Uh oh!

Languages