Note
This repository has been archived; a newer, improved version of it is here.
Click the thumbnail below to watch the full demo.

This project aims to design and implement a book recommendation system using machine learning. The trained models are then served to a consumer-facing web application that recommends books to users based on their reading history.
[Report](report.pdf)
- Python (for the ML pipeline)
- FastAPI (service that exposes the ML models to the web-app)
- React (frontend for web-app)
- Go (backend for web-app)
- Qdrant (for storing book and user embeddings)
- Postgres (for storing user data and book metadata)
- Docker (for containerization)
This project mainly uses the book metadata and user-book interaction datasets from Goodreads, available at Goodreads Dataset. Additional datasets covering authors, book works, and CSV ID mapping files were used during preprocessing to prepare the data for training the models.
The required datasets are automatically downloaded into the data directory during the first stage of the pipeline (data preprocessing) if they are not already present.
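That download-if-missing step can be sketched as follows. The dataset names and URLs below are placeholders for illustration, not the actual ones the pipeline uses:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical dataset list; the real pipeline defines its own names and URLs.
DATASETS = {
    "goodreads_books.json.gz": "https://example.com/goodreads_books.json.gz",
    "goodreads_interactions.csv": "https://example.com/goodreads_interactions.csv",
}

def ensure_datasets(data_dir: str = "data") -> list[str]:
    """Download each dataset into data_dir unless it is already present."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name, url in DATASETS.items():
        path = target / name
        if not path.exists():          # skip files that are already there
            urlretrieve(url, path)
            downloaded.append(name)
    return downloaded
```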
The best-performing model for content-based filtering is our finetuned SBERT model in sbert-output. The pipeline uses it to generate book embeddings in embeddings/sbert_embeddings.parquet, which are later uploaded to a Qdrant collection for similarity search (with cosine similarity as the metric).
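The cosine-similarity search Qdrant performs over those embeddings boils down to normalizing the vectors and taking dot products; a minimal numpy sketch, with made-up 4-dimensional embeddings:

```python
import numpy as np

def top_k_cosine(query: np.ndarray, book_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k books most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    b = book_embs / np.linalg.norm(book_embs, axis=1, keepdims=True)
    scores = b @ q                      # cosine similarity per book
    return np.argsort(-scores)[:k]      # highest scores first

# Toy example with three book embeddings
books = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_cosine(query, books))       # → [0 1]
```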
The best-performing model for collaborative filtering surprisingly turned out to be the Generalized Matrix Factorization (GMF) model. The pipeline uses it to generate user and item embeddings in embeddings/gmf_user_embeddings.parquet and embeddings/gmf_item_embeddings.parquet, respectively. These embeddings are later saved to Qdrant collections for similarity search (with the dot product as the metric).
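Since the exported embeddings are scored with a dot product, producing recommendations amounts to a maximum-inner-product search over the item embeddings; a minimal sketch with made-up vectors:

```python
import numpy as np

def recommend(user_emb: np.ndarray, item_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Rank items for a user by dot product (the metric used in Qdrant)."""
    scores = item_embs @ user_emb
    return np.argsort(-scores)[:k]      # highest dot products first

user = np.array([0.5, 1.0, -0.5])
items = np.array([[1.0, 1.0, 0.0],     # score 1.5
                  [0.0, 0.5, 1.0],     # score 0.0
                  [1.0, 2.0, 0.0]])    # score 2.5
print(recommend(user, items))           # → [2 0]
```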
For the reranking stage, we finetuned a cross-encoder model, available in the reranker directory, although we didn't get the chance to integrate it into the pipeline.
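Reranking would take the candidates retrieved by the embedding stages and reorder them with a finer-grained relevance score. A generic sketch, where the scoring function stands in for the cross-encoder (whose exact interface we leave out):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float]) -> list[str]:
    """Reorder candidates by descending relevance score to the query."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

# Toy scorer standing in for the cross-encoder: shared-word overlap.
def overlap(q: str, c: str) -> float:
    return len(set(q.lower().split()) & set(c.lower().split()))

print(rerank("space opera epic",
             ["Epic fantasy saga", "A space opera", "Cookbook"],
             overlap))
# → ['A space opera', 'Epic fantasy saga', 'Cookbook']
```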
The entire pipeline, from downloading the datasets to training the models and saving the embeddings, is orchestrated by the main.py script. Just set up your Python environment, install the dependencies, and run the script.
Since we were able to export static embeddings from both of our models, we saved them to Qdrant and queried them from the Go backend to produce recommendations.
After experimentation and exploration in Jupyter notebooks, we created object-oriented pipeline step classes in the pipeline directory to make the code more modular and easier to maintain. In the main script, a pipeline object orchestrates the steps and runs them in order.
The configuration of the pipeline is set in pipeline/config.py and can be changed easily.
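The shape of that design can be sketched as follows; the class, step, and config field names here are illustrative, not the actual ones in the pipeline directory:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Config:
    # Illustrative settings; the real options live in pipeline/config.py.
    data_dir: str = "data"
    embedding_dim: int = 64

class Step(ABC):
    """One stage of the pipeline; receives the config and the prior state."""
    @abstractmethod
    def run(self, config: Config, state: dict) -> dict: ...

class Preprocess(Step):
    def run(self, config: Config, state: dict) -> dict:
        state["preprocessed"] = True      # stand-in for the real work
        return state

class Pipeline:
    """Threads a shared state dict through the steps, in order."""
    def __init__(self, config: Config, steps: list[Step]):
        self.config, self.steps = config, steps

    def run(self) -> dict:
        state: dict = {}
        for step in self.steps:
            state = step.run(self.config, state)
        return state

result = Pipeline(Config(), [Preprocess()]).run()
print(result)   # → {'preprocessed': True}
```

Keeping each stage behind a common `run` interface is what lets the main script swap, reorder, or skip stages without touching the step internals.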
We iterated on a PyTorch implementation of the Neural Collaborative Filtering (NCF) model in the neural-collaborative-filtering directory.
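NCF fuses a GMF branch (elementwise product of user and item embeddings) with an MLP branch over their concatenation. A minimal numpy sketch of the forward pass with randomly initialized toy weights, not the actual PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 8, 4

# Separate embedding tables for the GMF and MLP branches, as in the NCF paper.
P_gmf, Q_gmf = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))
P_mlp, Q_mlp = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))
W1 = rng.normal(size=(2 * dim, dim))   # MLP hidden layer
h = rng.normal(size=(2 * dim,))        # final prediction weights

def predict(u: int, i: int) -> float:
    """Score of item i for user u in (0, 1)."""
    gmf = P_gmf[u] * Q_gmf[i]                                       # elementwise product
    mlp = np.maximum(0, np.concatenate([P_mlp[u], Q_mlp[i]]) @ W1)  # ReLU layer
    fused = np.concatenate([gmf, mlp])
    return float(1 / (1 + np.exp(-fused @ h)))                      # sigmoid score

score = predict(0, 3)
assert 0.0 < score < 1.0
```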
We also created a Discord notifier function to send notifications to a Discord channel at different stages of the pipeline.
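Such a notifier can be as small as a POST to a Discord webhook; a sketch using only the standard library, with an illustrative message format (not our actual notifier code):

```python
import json
from urllib import request

def build_payload(stage: str, status: str) -> bytes:
    """Format a pipeline event as a Discord webhook message body."""
    return json.dumps({"content": f"[pipeline] {stage}: {status}"}).encode()

def notify(webhook_url: str, stage: str, status: str) -> None:
    """Send the event to the channel's webhook URL."""
    req = request.Request(
        webhook_url,
        data=build_payload(stage, status),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)   # Discord returns 204 No Content on success

# notify("https://discord.com/api/webhooks/...", "training", "finished")
```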
