
Bookwise Analytics: Predictive Book Recommendation System

Live App: Streamlit Dashboard

Project Repo: GitHub Repository

LinkedIn: Project Inception Post

Table of Contents

  1. Project Overview
  2. Business Understanding
  3. Business Requirements & Mapping
  4. Hypotheses & Validation
  5. Datasets
  6. Data & Model Artefacts
  7. Data Collection & Preparation
  8. Analytical & ML Tasks
  9. ML Business Case
  10. Dashboard Design
  11. Project Structure & Notebooks
  12. Deployment & Local Development
  13. Model Evaluation & Business Impact
  14. References & Attribution
  15. Bug Fixes
  16. Test & Coverage
  17. Quick Start
  18. Hugging Face Integration & Setup
  19. Google Books API
  20. Open Library API
  21. References

1. Project Overview

Purpose:
Bookwise Analytics is a data-driven recommendation system for a subscription-based book club. The goal is to optimize book selection and user engagement using Machine Learning (ML), replacing intuition-based curation with predictive analytics. The project delivers a Streamlit dashboard for stakeholders to explore insights, model outputs, and diversity metrics.

Target Audience:
Business stakeholders, data practitioners, and editorial teams seeking to maximize user satisfaction and retention in a book subscription service.


2. Business Understanding

2.1. Problem Statement

Despite a stable subscriber base, engagement and credit redemption rates are declining due to poor book-member matches. The business needs to identify drivers of engagement and predict which books will maximize satisfaction and retention.

2.2. Business Objectives

  • Identify book features linked to higher engagement.
  • Predict high-engagement titles using historical data.
  • Simulate retention uplift from algorithmic recommendations.
  • Safeguard genre diversity and fairness in recommendations.

2.3. Stakeholder Benefits

  • Users: Receive better-matched book recommendations, increasing engagement.
  • Business: Reduces churn, improves catalog utilization, and supports scalable editorial processes.
  • Editorial: Focuses curation on high-impact and diverse titles.

3. Business Requirements & Mapping

3.1. User Stories

The following user stories are implemented and tracked via GitHub issues.
Each story includes ML tasks, actions, and acceptance criteria.

High Engagement Titles
As an editorial team member, I want to see which books are predicted to have high engagement, so I can focus curation efforts.

Engagement Uplift Prediction
As a business stakeholder, I want to compare editorial vs. model-driven recommendations to understand uplift, so I can make informed decisions.

Feature Importance for Engagement
As a business stakeholder, I want to understand which book features drive engagement, so I can optimize catalog selection.

Genre Fairness
As a stakeholder, I want to ensure recommendations maintain genre diversity and fairness, so I don't alienate any user segments.

Summary Dashboard
As a stakeholder, I want an executive summary page showing KPIs and project overview, so I can quickly assess performance.

Title Acquisition
As a user, I want to search for any book and see its predicted engagement score, so I can guide title acquisition decisions.


3.2. Mapping to ML & Visualization

| User Story | ML Task / Visualization | Actions Required |
|---|---|---|
| High Engagement Titles | Engagement prediction, leaderboard | Model scoring, leaderboard table |
| Engagement Uplift Prediction | Editorial vs. model uplift metric | Display sets, calculate prediction for each set, calculate uplift |
| Feature Importance for Engagement | Feature importance analysis | Train model, extract importances, visualize, actionable insights |
| Genre Fairness | Genre diversity/fairness metrics | Compute shares, entropy, visualize |
| Summary Dashboard | Executive KPIs dashboard | Aggregate KPIs, overview, navigation |
| Title Acquisition | Search + engagement prediction | Search bar, engagement prediction |

3.3. Requirements Table

| ID | Requirement | Success Indicator | Dataset(s) | Linked Dashboard Page |
|---|---|---|---|---|
| BR-1 | Identify features correlated with engagement | Correlation ≥ 0.4 | BBE | Analytics Explorer |
| BR-2 | Predict high-engagement titles | Model RMSE < 1.0 or R² > 0.7 | BBE, Goodbooks | Model Runner |
| BR-3 | Estimate retention uplift from recommendations | Simulated uplift ≥ 10% | BBE, Goodbooks | Recommendation Comparison |
| BR-4 | Maintain diversity/fairness in recommendations | Shannon entropy ≥ editorial baseline | BBE, Goodbooks | Diversity Metrics |

If these thresholds are not met, the corresponding ML task is considered unsuccessful and is not recommended for operational use.
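BR-4's diversity criterion is Shannon entropy computed over genre shares. A minimal sketch of the metric (the function name is illustrative, not the project's actual helper):

```python
import math
from collections import Counter

def shannon_entropy(genres):
    """Shannon entropy (in bits) of a list of genre labels."""
    counts = Counter(genres)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly even 4-genre split yields log2(4) bits
print(shannon_entropy(["fantasy", "romance", "sci-fi", "mystery"]))  # → 2.0
```

Higher entropy means recommendations are spread more evenly across genres; a single-genre list scores 0.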


3.4. Stretch Goal: Clustering

As an additional feature, this project implements user clustering using KMeans to segment members based on their reading behavior and preferences. This segmentation helps identify distinct user groups and supports more targeted marketing and personalization strategies.

Clustering Approach

  • Features Used:
    Aggregated user-level features such as average pages per book, number of genres read, genre diversity, genre concentration, top genre share, and number of interactions.
  • Preprocessing:
    Missing values are imputed (numerical: median, categorical: mode), categorical features are one-hot encoded, and all features are standardized.
  • Algorithm:
    KMeans clustering is applied to the processed features. The optimal number of clusters is determined using the silhouette score and elbow method.
  • Cluster Profiles:
    Analysis revealed two main user segments:
    • Cluster 0: Genre Specialists
      • Fewer ratings overall
      • Higher average rating per book
      • Preference for newer and longer books
      • Less genre diversity, more focused on a single genre
    • Cluster 1: Genre Explorers
      • More ratings overall
      • Slightly lower average rating per book
      • Preference for older and shorter books
      • Higher genre diversity, less focused on a single genre
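The preprocessing-plus-KMeans workflow above can be sketched as follows. The synthetic feature matrix stands in for the real aggregated user-level features, and the silhouette loop mirrors the notebook's cluster-count selection; this is not the project's actual code (see notebooks/07_Member_Cluster.ipynb).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two synthetic "user segments" standing in for aggregated features
# (e.g. avg pages per book, genre diversity, interaction counts)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
X_scaled = StandardScaler().fit_transform(X)

# Choose k by silhouette score, as in the notebook's elbow/silhouette analysis
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # the two planted segments should be recovered
```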

Business Interpretation

  • Genre Specialists may respond well to targeted recommendations within their favorite genres and new releases.
  • Genre Explorers may appreciate diverse recommendations and discovery-oriented features.

Outputs

  • Cluster assignments and profiles are available in the dashboard's "Member Insights" page.
  • The clustering workflow, evaluation, and business rationale are fully documented in notebooks/07_Member_Cluster.ipynb.

This segmentation enables more personalized engagement strategies and actionable insights for marketing and editorial teams.


4. Hypotheses & Validation

| ID | Hypothesis | Validation Method | Outcome/Conclusion |
|---|---|---|---|
| H1 | Books with high external ratings have higher engagement | Correlation, regression | Confirmed: r > 0.4 |
| H2 | Historical rating/review patterns predict engagement with ~80% accuracy | Regression, collaborative filtering | Confirmed: RMSE < 1.0, R² > 0.7 |
| H3 | Recent publications yield higher satisfaction | Feature importance, correlation | Partially confirmed |
| H4 | Algorithmic selection increases engagement by ≥10% over editorial/random | Uplift simulation | Confirmed: ≥10% uplift |

See dashboard and notebooks for statistical evidence and plots supporting these conclusions.


5. Datasets

| Dataset | Source & Link | Purpose |
|---|---|---|
| Best Books Ever | GitHub | Catalog metadata, ratings |
| Goodbooks-10k | GitHub | User behavior, ratings |
| Overlap (BBE ∩ GB) | Derived | Metadata linking |
| Open Library API | API | Metadata enrichment |
| Google Books API | API | Metadata enrichment |

6. Data & Model Artefacts

To ensure reproducibility and keep the repository lightweight, large datasets and trained model artefacts are hosted on Hugging Face.

Datasets

Models


7. Data Collection & Preparation

  • Data Collection: Jupyter notebooks fetch and audit datasets, including API enrichment.
  • Data Cleaning: Handle missing values, standardize fields, and merge datasets.
  • Feature Engineering: Create popularity, metadata scores and log-transform skewed features.
  • Data Preparation: Documented in /notebooks/ (see structure below).
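The log-transform step can be illustrated with a minimal sketch; the column names are hypothetical, and the real pipeline lives in 05_Feature_Engineering.ipynb.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed count features; log1p compresses their long right tail
books = pd.DataFrame({"num_ratings": [10, 250, 90000], "votes": [1, 40, 12000]})
for col in ["num_ratings", "votes"]:
    books[f"log_{col}"] = np.log1p(books[col])

print(books["log_num_ratings"].round(2).tolist())  # → [2.4, 5.53, 11.41]
```

Using `log1p` (log of 1 + x) rather than a plain log keeps zero counts well defined.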

8. Analytical & ML Tasks

| Requirement | Task | Notebook(s) | Outcome/Metric |
|---|---|---|---|
| BR-1 | Correlation, feature importance | 04_Exploratory_Data_Analysis | Key predictors identified |
| BR-2 | Regression, hybrid recommender | 06_Modeling | RMSE, R², MAE |
| BR-3 | Uplift simulation | 06_Modeling, dashboard | % uplift in engagement |
| BR-4 | Genre entropy, diversity metrics | 04_Exploratory_Data_Analysis | Entropy, genre coverage |

9. ML Business Case

  • Aim: Predict book engagement to optimize recommendations and retention.
  • Learning Method: Regression (RandomForest, XGBoost) and clustering.
  • Success Metrics: RMSE < 1.0, R² > 0.7, ≥10% uplift, high genre entropy.
  • Model Output: Predicted engagement scores, top-K recommendations.
  • Relevance: Directly supports business KPIs for engagement and retention.
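The regression-and-metrics loop can be sketched on synthetic data; this is a hedged illustration of the approach, not the project's actual training code (which lives in 06_Modeling.ipynb and uses the real engineered features).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for book features and an engagement target
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.3, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# The success criteria compare these held-out metrics to RMSE < 1.0 / R² > 0.7
rmse = mean_squared_error(y_te, pred) ** 0.5
r2 = r2_score(y_te, pred)
print(round(rmse, 2), round(r2, 2))
```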

10. Dashboard Design

Pages:

  • Executive Summary: KPIs (RMSE, R², uplift), summary plots. (BR-2, BR-3)
  • Analytics Explorer: Correlations, feature importance, genre diversity. (BR-1, BR-4)
  • Recommendation Comparison: Model vs. editorial vs. random selections, uplift plots. (BR-3)
  • Model Runner: Top 10 books by predicted engagement. (BR-2)
  • Diversity Metrics: Genre entropy, coverage plots. (BR-4)

Each page includes textual interpretation of plots and clear statements on model performance.


11. Project Structure & Notebooks

/notebooks
├── 01_Data_Collection.ipynb
├── 02_Data_Cleaning.ipynb
├── 03_Data_Enrichment_and_Dataset_Integration.ipynb
├── 04_Exploratory_Data_Analysis.ipynb
├── 05_Feature_Engineering.ipynb
├── 06_Modeling.ipynb
├── 07_Member_Cluster.ipynb
  • Each notebook starts with objectives, inputs, and outputs.
  • Data preparation and feature engineering are clearly documented.

12. Deployment & Local Development

  • Streamlit app: streamlit run app.py
  • Heroku deployment: See instructions in this README.
  • Environment: Python, pip, virtualenv, requirements.txt, Procfile, setup.sh maintained.
  • Version Control: All code managed in GitHub with clear commit history.

13. Model Evaluation & Business Impact

Summary

The Bookwise Analytics recommendation engine was evaluated using multiple regression models, with the ExtraTreesRegressor selected for deployment due to its strong performance:

  • Test R²: 0.81
  • Test RMSE: 0.95
  • Test MAE: 0.57

These results exceed the business success criteria (RMSE < 1.0 or R² > 0.7), confirming the model's reliability for predicting book engagement.

Feature importance analysis shows that external popularity signals (e.g., number of ratings, votes, composite popularity score) are the strongest predictors, with publication recency and select metadata features providing additional value.

Recommendation Uplift & Diversity

On the dashboard's Recommendation Comparison page, the model-driven approach achieved:

  • Simulated Uplift: 42.13% (vs. editorial selection, far exceeding the 10% target)
  • Genre Entropy: 2.32 (no loss in diversity compared to editorial curation)

This demonstrates that algorithmic recommendations can substantially increase predicted engagement while maintaining genre diversity and fairness.
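The simulated uplift is the percentage gain in mean predicted engagement of the model's selection over the editorial baseline. A minimal sketch (the numbers are illustrative, not the dashboard's real predictions):

```python
def simulated_uplift(model_scores, baseline_scores):
    """Percentage uplift of mean predicted engagement over a baseline set."""
    model_mean = sum(model_scores) / len(model_scores)
    base_mean = sum(baseline_scores) / len(baseline_scores)
    return 100 * (model_mean - base_mean) / base_mean

print(round(simulated_uplift([4.2, 4.5, 4.4], [3.0, 3.2, 3.1]), 1))  # → 40.9
```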

Business Impact

  • Higher Engagement: Model recommendations are expected to significantly boost user engagement and credit redemption.
  • Diversity Maintained: No reduction in genre diversity or fairness.
  • Actionable Insights: Editorial teams gain transparency into drivers of engagement and can focus on high-impact titles.

For full details and supporting analysis, see the dashboard and notebooks/06_Modeling.ipynb.


14. References & Attribution

  • All datasets are open-source and referenced.
  • External code and resources are credited in code comments and this README.
  • See References section for full list.

15. Bug Fixes

| Ticket Title | Error Description | Resolution Description |
|---|---|---|
| DtypeWarning when loading CSVs | Mixed data types in columns causing warnings | Specified dtypes when reading CSVs |
| Application error | Heroku deployment failed due to conflicting packages | Refactored pyproject.toml |
| ppscore dependency issue | ppscore package causing installation errors | Refactored pyproject.toml |
| WEBP images from HF not displaying | Streamlit not rendering WEBP images from HF | Removed empty space from image URL |
| Deployment error | Heroku deployment failed due to missing packages | Added sklearn to pyproject.toml |
| File not found | Path pointing to a local file | Pointed the hosted version to the hosted file |
| Feature importance chart: infinite extent warning | Altair/Streamlit chart warning due to missing or invalid data in the feature importance CSV | Added data validation and user-friendly error handling; since the charts display correctly the remaining warning is harmless and further action was deferred |
| Vega-Lite compatibility | Console warning due to a Vega-Lite version mismatch between Altair (v5.x) and the Streamlit frontend (v6.x) | All charts render and function as expected, so package upgrades were deferred to avoid unnecessary dependency complexity |

16. Test & Coverage

  • Test results and coverage details are available in documentation/TEST.md.
  • Summary:
    • All 88 tests pass (pytest).
    • Overall code coverage is 41%, with 100% coverage for the core modeling pipeline.
    • Most cleaning and feature engineering utilities are well covered.
    • Some analysis and EDA modules have low or no coverage; see the full report for details.

17. Quick Start

  1. Clone repo:
    git clone https://github.com/larevolucia/bookwise-analytics.git
  2. Set up .env with API keys.
  3. Install dependencies:
    pip install -e ".[dev,viz,ml]"

requirements.txt is optimized for deployment; use pyproject.toml for local development.

  4. Run notebooks for data prep and modeling.
  5. Launch Streamlit:
    streamlit run app.py

18. Hugging Face Integration & Setup

This project uses two Hugging Face repositories for seamless data and model management:

  • Datasets & Plots:
    Repository type: dataset
    Stores processed datasets and EDA plots for reproducibility and sharing.

  • Models:
    Repository type: model
    Stores trained model artifacts for deployment and inference.

18.1. Create Accounts & Tokens

  • Sign up at Hugging Face.
  • Go to Access Tokens and create a Write token for each repository.
  • Save the tokens in your .env file:
    HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    HUGGINGFACE_MODEL_TOKEN=hf_yyyyyyyyyyyyyyyyyyyyyyyyyyyy
    HUGGINGFACE_CLUSTER_TOKEN=hf_zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
    

18.2. Datasets & Plots Repo

  • Clone or create a new dataset repo (e.g., bookwise-analytics-ml).
  • Upload datasets and EDA plots using the Hugging Face Hub CLI:
    huggingface-cli login --token $HUGGINGFACE_TOKEN
    huggingface-cli repo create bookwise-analytics-ml --type dataset
    huggingface-cli upload <your-username>/bookwise-analytics-ml ./data/ --repo-type dataset

18.3. Models Repo

  • Clone or create a new model repo (e.g., popularity-score-model).
  • Upload model files:
    huggingface-cli login --token $HUGGINGFACE_MODEL_TOKEN
    huggingface-cli repo create popularity-score-model --type model
    huggingface-cli upload <your-username>/popularity-score-model ./models/ --repo-type model

18.4. Using HF in Code

  • Use the datasets library to load datasets and the huggingface_hub library to download model artifacts directly:

    # Load dataset from Hugging Face Hub
    from datasets import load_dataset
    dataset = load_dataset("revolucia/bookwise-analytics-ml")
    
    # Download a specific model file from the model repo
    from huggingface_hub import hf_hub_download
    model_path = hf_hub_download(
        repo_id="revolucia/popularity-score-model",
        filename="modeling_data/et_model.pkl"
    )
  • Adjust the filename argument to match the actual path in the model repo (e.g., "modeling_data/et_model.pkl").

  • For more details, see the Hugging Face Hub documentation.


19. Google Books API

This project uses the Google Books API to enrich book metadata (pages, publisher, publication date, description, categories, etc.) for titles missing information after merging core datasets.

Setup:

  1. Go to the Google Cloud Console.
  2. Create a new project (guide).
  3. Enable the Google Books API (APIs & Services > Library).
  4. Generate an API key (APIs & Services > Credentials).

Add the key to your .env file:

GOOGLE_BOOKS_API_KEY=<YOUR_KEY>

Usage in notebooks:

  • The API is queried by ISBN (preferred) or by title/author for books without ISBNs.
  • Results are cached in data/raw/google_api_cache.json to avoid duplicate requests and manage quota.
  • See 03_Data_Enrichment_and_Dataset_Integration.ipynb for implementation details.
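The query-plus-cache pattern described above can be sketched as follows. It assumes the requests library; the helper names are illustrative, not the notebook's actual functions, and the Google Books volumes endpoint and `q`/`key` parameters are the API's documented interface.

```python
import json

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"
CACHE_PATH = "data/raw/google_api_cache.json"

def load_cache(path=CACHE_PATH):
    """Load the JSON cache, or start empty if it does not exist yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def build_query(isbn=None, title=None, author=None):
    """Prefer ISBN lookups; fall back to a title/author search."""
    if isbn:
        return f"isbn:{isbn}"
    return f'intitle:"{title}" inauthor:"{author}"'

def fetch_volume(q, api_key, cache):
    """Return the cached response if present, otherwise call the API and cache it."""
    if q in cache:
        return cache[q]
    import requests  # assumed available in the project environment
    resp = requests.get(GOOGLE_BOOKS_URL, params={"q": q, "key": api_key}, timeout=10)
    resp.raise_for_status()
    cache[q] = resp.json()
    return cache[q]
```

Persisting the cache dict back to CACHE_PATH after a batch of lookups keeps repeated runs within quota.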

20. Open Library API

The Open Library API is used as the first enrichment source for missing metadata, since it is public and does not require authentication or API keys.

Usage in notebooks:

  • Queried by ISBN for missing fields (pages, language, publisher, description, subjects).
  • Results are cached in data/raw/openlibrary_api_cache.json.
  • See 03_Data_Enrichment_and_Dataset_Integration.ipynb for code and enrichment workflow.
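A minimal sketch of the ISBN lookup: the `/isbn/<isbn>.json` endpoint and record fields (`number_of_pages`, `publishers`, `languages`) are part of Open Library's public schema, but the helper names are illustrative rather than the notebook's actual code.

```python
def openlibrary_url(isbn):
    """Open Library's per-ISBN JSON endpoint; no API key required."""
    return f"https://openlibrary.org/isbn/{isbn}.json"

def extract_fields(record):
    """Pull the enrichment fields of interest from a book record, if present."""
    return {
        "pages": record.get("number_of_pages"),
        "publishers": record.get("publishers"),
        "languages": record.get("languages"),
    }

# Usage: fetch openlibrary_url(isbn) with requests, cache the JSON,
# then pass the parsed record to extract_fields().
```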

21. References

Use of AI Tools

  • ChatGPT: used to refine and correct grammar and clarity of textual explanations in the README and notebooks, and to improve the wording of existing comments and docstrings without altering code logic or functionality.
  • GitHub Copilot (free version): used as an IDE assistance tool to suggest boilerplate patterns and standard syntax for data processing and visualisation (e.g. library usage and plotting scaffolds). All suggestions were critically reviewed, adapted, and integrated by the author, with full understanding of the resulting code.
  • NotebookLM: used as a learning guide on data cleaning and modeling concepts to help identify possible next steps and guide through documentation without providing or generating any code.

All data processing, analysis, modeling, and application code were written and implemented by the author.

Acknowledgements

Thanks to all who provided feedback and support during this project, including peers, mentors, and the open-source community.

Potential Next Steps

Cold Start Model for Metadata Features Importance:

Develop a model to estimate engagement or recommendation quality using only metadata features (e.g., genre, author, publication year) for new books with no external popularity signals (ratings, reviews, etc.).

Hybrid Model for Collaborative Filtering:

Explore a hybrid approach that combines collaborative filtering (user-item interactions) with content-based features to improve recommendations, especially for users or books with sparse data.

New Feature Engineering:

Investigate additional features, such as text embeddings from book descriptions, author popularity, or publisher reputation derived from Wikipedia or social-media signals, to enhance model performance.

Best Seller Data Integration:

Incorporate external best seller lists (e.g. NYT, Amazon) as features to capture market trends and further improve engagement predictions.

About

Code Institute Portfolio 5 Project: Predictive Analytics

License

MIT (see LICENSE) and CC0-1.0 (see DATA_LICENSE).
