Note
This repository has been archived; a newer, improved version of it is here.
Click the thumbnail below to watch the full demo.

This project aims to design and implement a book recommendation system using machine learning. The trained models are then served to a consumer-facing web application that recommends books to users based on their reading history.
[Report](report.pdf)
- Python (for the ML pipeline)
- FastAPI (service that exposes the ML models to the web-app)
- React (frontend for web-app)
- Go (backend for web-app)
- Qdrant (for storing book and user embeddings)
- Postgres (for storing user data and book metadata)
- Docker (for containerization)
This project mainly uses the book metadata and user-book interaction datasets from Goodreads, available at Goodreads Dataset. Additional datasets covering authors, book works, and CSV ID mapping files were used during preprocessing to prepare the data for training the models.
The required datasets are automatically downloaded into the data directory during the first stage of the pipeline (data preprocessing) if they are not already present.
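That download-if-missing step can be sketched as follows. The dataset names and URLs below are placeholders for illustration, not the actual ones the pipeline uses:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical dataset list; the real pipeline defines its own names and URLs.
DATASETS = {
    "goodreads_books.json.gz": "https://example.com/goodreads_books.json.gz",
    "goodreads_interactions.csv": "https://example.com/goodreads_interactions.csv",
}

def ensure_datasets(data_dir: str = "data") -> list[str]:
    """Download each dataset into data_dir unless it is already present."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name, url in DATASETS.items():
        path = target / name
        if not path.exists():          # skip files that are already there
            urlretrieve(url, path)
            downloaded.append(name)
    return downloaded
```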
The best-performing model for content-based filtering is our finetuned SBERT model in sbert-output. The pipeline uses it to generate book embeddings in embeddings/sbert_embeddings.parquet, which are later uploaded to a Qdrant collection for similarity search (with cosine similarity as the metric).
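The cosine-similarity search Qdrant performs over those embeddings boils down to normalizing the vectors and taking dot products; a minimal numpy sketch, with made-up 4-dimensional embeddings:

```python
import numpy as np

def top_k_cosine(query: np.ndarray, book_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k books most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    b = book_embs / np.linalg.norm(book_embs, axis=1, keepdims=True)
    scores = b @ q                      # cosine similarity per book
    return np.argsort(-scores)[:k]      # highest scores first

# Toy example with three book embeddings
books = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_cosine(query, books))       # → [0 1]
```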
The best-performing model for collaborative filtering surprisingly turned out to be the Generalized Matrix Factorization (GMF) model. The pipeline uses it to generate user and item embeddings in embeddings/gmf_user_embeddings.parquet and embeddings/gmf_item_embeddings.parquet, respectively. These embeddings are later saved to Qdrant collections for similarity search (with the dot product as the metric).
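Since the exported embeddings are scored with a dot product, producing recommendations amounts to a maximum-inner-product search over the item embeddings; a minimal sketch with made-up vectors:

```python
import numpy as np

def recommend(user_emb: np.ndarray, item_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Rank items for a user by dot product (the metric used in Qdrant)."""
    scores = item_embs @ user_emb
    return np.argsort(-scores)[:k]      # highest dot products first

user = np.array([0.5, 1.0, -0.5])
items = np.array([[1.0, 1.0, 0.0],     # score 1.5
                  [0.0, 0.5, 1.0],     # score 0.0
                  [1.0, 2.0, 0.0]])    # score 2.5
print(recommend(user, items))           # → [2 0]
```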
For the reranking stage, we finetuned a cross-encoder model, available in the reranker directory, although we didn't get the chance to integrate it into the pipeline.
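Reranking would take the candidates retrieved by the embedding stages and reorder them with a finer-grained relevance score. A generic sketch, where the scoring function stands in for the cross-encoder (whose exact interface we leave out):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float]) -> list[str]:
    """Reorder candidates by descending relevance score to the query."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

# Toy scorer standing in for the cross-encoder: shared-word overlap.
def overlap(q: str, c: str) -> float:
    return len(set(q.lower().split()) & set(c.lower().split()))

print(rerank("space opera epic",
             ["Epic fantasy saga", "A space opera", "Cookbook"],
             overlap))
# → ['A space opera', 'Epic fantasy saga', 'Cookbook']
```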
The entire pipeline, from downloading the datasets to training the models and saving the embeddings, is orchestrated by the main.py script. Just set up your Python environment, install the dependencies, and run the script.
Since we were able to export static embeddings from both of our models, we saved them to Qdrant and queried them from the Go backend to produce recommendations.
After experimentation and exploration in Jupyter notebooks, we created object-oriented pipeline step classes in the pipeline directory to make the code more modular and easier to maintain. In the main script, a pipeline object orchestrates the steps and runs them in order.
The configuration of the pipeline is set in pipeline/config.py and can be changed easily.
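The shape of that design can be sketched as follows; the class, step, and config field names here are illustrative, not the actual ones in the pipeline directory:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Config:
    # Illustrative settings; the real options live in pipeline/config.py.
    data_dir: str = "data"
    embedding_dim: int = 64

class Step(ABC):
    """One stage of the pipeline; receives the config and the prior state."""
    @abstractmethod
    def run(self, config: Config, state: dict) -> dict: ...

class Preprocess(Step):
    def run(self, config: Config, state: dict) -> dict:
        state["preprocessed"] = True      # stand-in for the real work
        return state

class Pipeline:
    """Threads a shared state dict through the steps, in order."""
    def __init__(self, config: Config, steps: list[Step]):
        self.config, self.steps = config, steps

    def run(self) -> dict:
        state: dict = {}
        for step in self.steps:
            state = step.run(self.config, state)
        return state

result = Pipeline(Config(), [Preprocess()]).run()
print(result)   # → {'preprocessed': True}
```

Keeping each stage behind a common `run` interface is what lets the main script swap, reorder, or skip stages without touching the step internals.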
We iterated on a PyTorch implementation of the Neural Collaborative Filtering (NCF) model in the neural-collaborative-filtering directory.
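NCF fuses a GMF branch (elementwise product of user and item embeddings) with an MLP branch over their concatenation. A minimal numpy sketch of the forward pass with randomly initialized toy weights, not the actual PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 8, 4

# Separate embedding tables for the GMF and MLP branches, as in the NCF paper.
P_gmf, Q_gmf = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))
P_mlp, Q_mlp = rng.normal(size=(n_users, dim)), rng.normal(size=(n_items, dim))
W1 = rng.normal(size=(2 * dim, dim))   # MLP hidden layer
h = rng.normal(size=(2 * dim,))        # final prediction weights

def predict(u: int, i: int) -> float:
    """Score of item i for user u in (0, 1)."""
    gmf = P_gmf[u] * Q_gmf[i]                                       # elementwise product
    mlp = np.maximum(0, np.concatenate([P_mlp[u], Q_mlp[i]]) @ W1)  # ReLU layer
    fused = np.concatenate([gmf, mlp])
    return float(1 / (1 + np.exp(-fused @ h)))                      # sigmoid score

score = predict(0, 3)
assert 0.0 < score < 1.0
```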
We also created a Discord notifier function to send notifications to a Discord channel at different stages of the pipeline.
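Such a notifier can be as small as a POST to a Discord webhook; a sketch using only the standard library, with an illustrative message format (not our actual notifier code):

```python
import json
from urllib import request

def build_payload(stage: str, status: str) -> bytes:
    """Format a pipeline event as a Discord webhook message body."""
    return json.dumps({"content": f"[pipeline] {stage}: {status}"}).encode()

def notify(webhook_url: str, stage: str, status: str) -> None:
    """Send the event to the channel's webhook URL."""
    req = request.Request(
        webhook_url,
        data=build_payload(stage, status),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)   # Discord returns 204 No Content on success

# notify("https://discord.com/api/webhooks/...", "training", "finished")
```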
