Live App: Streamlit Dashboard
Project Repo: GitHub Repository
LinkedIn: Project Inception Post
- 3.1 User Stories
- 3.2 Mapping to ML & Visualization
- 3.3 Requirements Table
- 3.4 Stretch Goal: Clustering
- Hypotheses & Validation
- Datasets
- Data & Model Artefacts
- Data Collection & Preparation
- Analytical & ML Tasks
- ML Business Case
- Dashboard Design
- Project Structure & Notebooks
- Deployment & Local Development
- Model Evaluation & Business Impact
- References & Attribution
- Bug Fixes
- Test & Coverage
- Quick Start
- Hugging Face Integration & Setup
Purpose:
Bookwise Analytics is a data-driven recommendation system for a subscription-based book club. The goal is to optimize book selection and user engagement using Machine Learning (ML), replacing intuition-based curation with predictive analytics. The project delivers a Streamlit dashboard for stakeholders to explore insights, model outputs, and diversity metrics.
Target Audience:
Business stakeholders, data practitioners, and editorial teams seeking to maximize user satisfaction and retention in a book subscription service.
Despite a stable subscriber base, engagement and credit redemption rates are declining due to poor book-member matches. The business needs to identify drivers of engagement and predict which books will maximize satisfaction and retention.
- Identify book features linked to higher engagement.
- Predict high-engagement titles using historical data.
- Simulate retention uplift from algorithmic recommendations.
- Safeguard genre diversity and fairness in recommendations.
- Users: Receive better-matched book recommendations, increasing engagement.
- Business: Reduces churn, improves catalog utilization, and supports scalable editorial processes.
- Editorial: Focuses curation on high-impact and diverse titles.
The following user stories are implemented and tracked via GitHub issues.
Each story includes ML tasks, actions, and acceptance criteria.
High Engagement Titles
As an editorial team member, I want to see which books are predicted to have high engagement, so I can focus curation efforts.
Engagement Uplift Prediction
As a business stakeholder, I want to compare editorial vs. model-driven recommendations to understand uplift, so I can make informed decisions.
Feature Importance for Engagement
As a business stakeholder, I want to understand which book features drive engagement, so I can optimize catalog selection.
Genre Fairness
As a stakeholder, I want to ensure recommendations maintain genre diversity and fairness, so I don't alienate any user segments.
Summary Dashboard
As a stakeholder, I want an executive summary page showing KPIs and project overview, so I can quickly assess performance.
Title Acquisition
As a user, I want to search for any book and see its predicted engagement score, so I can guide title acquisition decisions.
| User Story | ML Task / Visualization | Actions Required |
|---|---|---|
| High Engagement Titles | Engagement prediction, leaderboard | Model scoring, leaderboard table |
| Engagement Uplift Prediction | Editorial vs. model uplift metric | Display sets, calculate prediction for each set, calculate uplift |
| Feature Importance for Engagement | Feature importance analysis | Train model, extract importances, visualize, actionable insights |
| Genre Fairness | Genre diversity/fairness metrics | Compute shares, entropy, visualize |
| Summary Dashboard | Executive KPIs dashboard | Aggregate KPIs, overview, navigation |
| Title Acquisition | Search + engagement prediction | Search bar, engagement prediction |
| ID | Requirement | Success Indicator | Dataset(s) | Linked Dashboard Page |
|---|---|---|---|---|
| BR-1 | Identify features correlated with engagement | Correlation ≥ 0.4 | BBE | Analytics Explorer |
| BR-2 | Predict high-engagement titles | Model RMSE < 1.0 or R² > 0.7 | BBE, Goodbooks | Model Runner |
| BR-3 | Estimate retention uplift from recommendations | Simulated uplift ≥ 10% | BBE, Goodbooks | Recommendation Comparison |
| BR-4 | Maintain diversity/fairness in recommendations | Shannon Entropy ≥ editorial baseline | BBE, Goodbooks | Diversity Metrics |
If these thresholds are not met, the corresponding ML task is considered unsuccessful and is not recommended for operational use.
As an additional feature, this project implements user clustering using KMeans to segment members based on their reading behavior and preferences. This segmentation helps identify distinct user groups and supports more targeted marketing and personalization strategies.
- Features Used: Aggregated user-level features such as average pages per book, number of genres read, genre diversity, genre concentration, top genre share, and number of interactions.
- Preprocessing: Missing values are imputed (numerical: median, categorical: mode), categorical features are one-hot encoded, and all features are standardized.
- Algorithm: KMeans clustering is applied to the processed features. The optimal number of clusters is determined using the silhouette score and the elbow method.
- Cluster Profiles: Analysis revealed two main user segments:
- Cluster 0: Genre Specialists
- Fewer ratings overall
- Higher average rating per book
- Preference for newer and longer books
- Less genre diversity, more focused on a single genre
- Cluster 1: Genre Explorers
- More ratings overall
- Slightly lower average rating per book
- Preference for older and shorter books
- Higher genre diversity, less focused on a single genre
- Genre Specialists may respond well to targeted recommendations within their favorite genres and new releases.
- Genre Explorers may appreciate diverse recommendations and discovery-oriented features.
- Cluster assignments and profiles are available in the dashboard's "Member Insights" page.
- The clustering workflow, evaluation, and business rationale are fully documented in `notebooks/07_Member_Cluster.ipynb`.
This segmentation enables more personalized engagement strategies and actionable insights for marketing and editorial teams.
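The workflow above (standardize user-level features, fit KMeans, choose k via the silhouette score) can be sketched as follows. The feature values are synthetic stand-ins for the real member aggregates; only the pipeline shape reflects the notebook.

```python
# Sketch of the member-clustering workflow: standardize aggregated user
# features, then pick the number of clusters by silhouette score.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical user aggregates: [avg pages per book, genres read, top-genre share]
users = np.vstack([
    rng.normal([250, 2, 0.8], 0.1 * np.array([250, 2, 0.8]), size=(50, 3)),  # "specialists"
    rng.normal([180, 8, 0.3], 0.1 * np.array([180, 8, 0.3]), size=(50, 3)),  # "explorers"
])

X = StandardScaler().fit_transform(users)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # highest silhouette wins
print(best_k, round(scores[best_k], 3))
```

On well-separated synthetic groups like these, the silhouette criterion recovers the two-segment structure described above; the elbow method used in the notebook serves as a cross-check.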
| ID | Hypothesis | Validation Method | Outcome/Conclusion |
|---|---|---|---|
| H1 | Books with high external ratings have higher engagement | Correlation, regression | Confirmed: r > 0.4 |
| H2 | Historical rating/review patterns predict engagement with ~80% accuracy | Regression, collaborative filtering | Confirmed: RMSE < 1.0, R² > 0.7 |
| H3 | Recent publications yield higher satisfaction | Feature importance, correlation | Partially confirmed |
| H4 | Algorithmic selection increases engagement by ≥10% over editorial/random | Uplift simulation | Confirmed: ≥10% uplift |
See dashboard and notebooks for statistical evidence and plots supporting these conclusions.
| Dataset | Source & Link | Purpose |
|---|---|---|
| Best Books Ever | GitHub | Catalog metadata, ratings |
| Goodbooks-10k | GitHub | User behavior, ratings |
| Overlap (BBE ∩ GB) | Derived | Metadata linking |
| Open Library API | API | Metadata enrichment |
| Google Books API | API | Metadata enrichment |
To ensure reproducibility and keep the repository lightweight, large datasets and trained model artefacts are hosted on Hugging Face.
- Bookwise Analytics – Modeling Dataset
  https://huggingface.co/datasets/revolucia/bookwise-analytics-ml
  Cleaned and feature-engineered dataset used for engagement modeling.
- Book Club User Clusters
  https://huggingface.co/revolucia/bookclub-cluster
  Precomputed user segmentation results for member insights.
- Popularity Score Model (ExtraTreesRegressor)
  https://huggingface.co/revolucia/popularity-score-model
  Trained regression model and evaluation metrics used by the Streamlit application.
- Data Collection: Jupyter notebooks fetch and audit datasets, including API enrichment.
- Data Cleaning: Handle missing values, standardize fields, and merge datasets.
- Feature Engineering: Create popularity and metadata scores, and log-transform skewed features.
- Data Preparation: Documented in `/notebooks/` (see structure below).
| Requirement | Task | Notebook(s) | Outcome/Metric |
|---|---|---|---|
| BR-1 | Correlation, feature importance | 04_Exploratory_Data_Analysis | Key predictors identified |
| BR-2 | Regression, hybrid recommender | 06_Modeling | RMSE, R², MAE |
| BR-3 | Uplift simulation | 06_Modeling, dashboard | % uplift in engagement |
| BR-4 | Genre entropy, diversity metrics | 04_Exploratory_Data_Analysis | Entropy, genre coverage |
- Aim: Predict book engagement to optimize recommendations and retention.
- Learning Method: Regression (RandomForest, XGBoost) and clustering.
- Success Metrics: RMSE < 1.0, R² > 0.7, ≥10% uplift, high genre entropy.
- Model Output: Predicted engagement scores, top-K recommendations.
- Relevance: Directly supports business KPIs for engagement and retention.
Pages:
- Executive Summary: KPIs (RMSE, R², uplift), summary plots. (BR-2, BR-3)
- Analytics Explorer: Correlations, feature importance, genre diversity. (BR-1, BR-4)
- Recommendation Comparison: Model vs. editorial vs. random selections, uplift plots. (BR-3)
- Model Runner: Top 10 books by predicted engagement. (BR-2)
- Diversity Metrics: Genre entropy, coverage plots. (BR-4)
Each page includes textual interpretation of plots and clear statements on model performance.
/notebooks
├── 01_Data_Collection.ipynb
├── 02_Data_Cleaning.ipynb
├── 03_Data_Enrichment_and_Dataset_Integration.ipynb
├── 04_Exploratory_Data_Analysis.ipynb
├── 05_Feature_Engineering.ipynb
├── 06_Modeling.ipynb
├── 07_Member_Cluster.ipynb
- Each notebook starts with objectives, inputs, and outputs.
- Data preparation and feature engineering are clearly documented.
- Streamlit app: `streamlit run app.py`
- Heroku deployment: See instructions in this README.
- Environment: Python, pip, virtualenv, requirements.txt, Procfile, setup.sh maintained.
- Version Control: All code managed in GitHub with clear commit history.
The Bookwise Analytics recommendation engine was evaluated using multiple regression models, with the ExtraTreesRegressor selected for deployment due to its strong performance:
- Test R²: 0.81
- Test RMSE: 0.95
- Test MAE: 0.57
These results exceed the business success criteria (RMSE < 1.0 or R² > 0.7), confirming the model's reliability for predicting book engagement.
Feature importance analysis shows that external popularity signals (e.g., number of ratings, votes, composite popularity score) are the strongest predictors, with publication recency and select metadata features providing additional value.
On the dashboard's Recommendation Comparison page, the model-driven approach achieved:
- Simulated Uplift: 42.13% (vs. editorial selection, far exceeding the 10% target)
- Genre Entropy: 2.32 (no loss in diversity compared to editorial curation)
This demonstrates that algorithmic recommendations can substantially increase predicted engagement while maintaining genre diversity and fairness.
- Higher Engagement: Model recommendations are expected to significantly boost user engagement and credit redemption.
- Diversity Maintained: No reduction in genre diversity or fairness.
- Actionable Insights: Editorial teams gain transparency into drivers of engagement and can focus on high-impact titles.
For full details and supporting analysis, see:
- `notebooks/06_Modeling.ipynb` – model training, evaluation, and feature importance
- Dashboard > Recommendation Comparison – simulated uplift and genre entropy metrics
- `app_pages/page_recommendation_comparison.py` – implementation of comparison logic
- All datasets are open-source and referenced.
- External code and resources are credited in code comments and this README.
- See References section for full list.
| Ticket Title | Error Description | Resolution Description |
|---|---|---|
| DtypeWarning when loading CSVs | Mixed data types in columns causing warnings | Specified dtypes when reading CSVs |
| Application error | Heroku deployment failed due to conflicting packages | Refactored pyproject.toml |
| ppscore dependency issue | ppscore package causing installation errors | Refactored pyproject.toml |
| WEBP images from HF not displaying | Streamlit not rendering WEBP images from HF | Removed empty space from image URL |
| Deployment error | Heroku deployment failed due to missing packages | Added scikit-learn to pyproject.toml |
| File not found | Path pointing to a local file | Updated path to use the hosted file |
| Feature importance chart: Infinite extent warning | Altair/Streamlit chart warning due to missing or invalid data in feature importance CSV | Data validation and user-friendly error handling added. Warning is harmless since charts display correctly, so further action was deferred. |
| Vega-Lite compatibility | Console warning due to Vega-Lite version mismatch between Altair (v5.x) and Streamlit frontend (v6.x). | All charts render and function as expected, so package upgrades were deferred to avoid unnecessary dependency complexity. The warning can be safely ignored unless future issues arise. |
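The first fix in the table, specifying dtypes to avoid `DtypeWarning`, follows this pattern. Declaring column dtypes up front stops pandas from inferring mixed types chunk by chunk; the column names here are illustrative:

```python
# Sketch: declaring dtypes on read to avoid pandas DtypeWarning
# on columns with mixed-looking values.
import io
import pandas as pd

csv = io.StringIO("isbn,title,pages\n0439023483,The Hunger Games,374\n,Unknown,\n")
df = pd.read_csv(
    csv,
    dtype={"isbn": "string", "title": "string", "pages": "Float64"},  # nullable dtypes
)
print(df.dtypes)
```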
- Test results and coverage details are available in `documentation/TEST.md`.
- Summary:
  - All 88 tests pass (pytest).
  - Overall code coverage is 41%, with 100% coverage for the core modeling pipeline code.
  - Most cleaning and feature engineering utilities are well covered.
  - Some analysis and EDA modules have low or no coverage; see the full report for details.
- Clone the repo: `git clone https://github.com/larevolucia/bookwise-analytics.git`
- Set up `.env` with API keys.
- Install dependencies: `pip install -e ".[dev,viz,ml]"`
  (`requirements.txt` is optimized for deployment; use `pyproject.toml` for local development.)
- Run notebooks for data prep and modeling.
- Launch Streamlit: `streamlit run app.py`
This project uses two Hugging Face repositories for seamless data and model management:
- Datasets & Plots:
  Repository type: `dataset`
  Stores processed datasets and EDA plots for reproducibility and sharing.
- Models:
  Repository type: `model`
  Stores trained model artifacts for deployment and inference.
- Sign up at Hugging Face.
- Go to Access Tokens and create a Write token for each repository.
- Save the tokens in your `.env` file:

  ```
  HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  HUGGINGFACE_MODEL_TOKEN=hf_yyyyyyyyyyyyyyyyyyyyyyyyyyyy
  HUGGINGFACE_CLUSTER_TOKEN=hf_zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
  ```
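These variables can be read into the environment at runtime. Below is a minimal stdlib-only `.env` reader as an illustration; many projects use the python-dotenv package instead, and the helper name here is hypothetical:

```python
# Sketch: minimal .env loader (KEY=VALUE lines) using only the stdlib.
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines into os.environ without overwriting existing vars."""
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, malformed lines
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env present; rely on the ambient environment
```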
- Clone or create a new dataset repo (e.g., `bookwise-analytics-ml`).
- Upload datasets and EDA plots using the Hugging Face Hub CLI:

  ```
  huggingface-cli login --token $HUGGINGFACE_TOKEN
  huggingface-cli repo create bookwise-analytics-ml --type dataset
  huggingface-cli upload <your-username>/bookwise-analytics-ml ./data/ --repo-type dataset
  ```
- Clone or create a new model repo (e.g., `popularity-score-model`).
- Upload model files:

  ```
  huggingface-cli login --token $HUGGINGFACE_MODEL_TOKEN
  huggingface-cli repo create popularity-score-model --type model
  huggingface-cli upload <your-username>/popularity-score-model ./models/ --repo-type model
  ```
- Use the `datasets` library to load datasets and the `huggingface_hub` library to download model artifacts directly:

  ```python
  # Load dataset from Hugging Face Hub
  from datasets import load_dataset

  dataset = load_dataset("revolucia/bookwise-analytics-ml")

  # Download a specific model file from the model repo
  from huggingface_hub import hf_hub_download

  model_path = hf_hub_download(
      repo_id="revolucia/popularity-score-model",
      filename="modeling_data/et_model.pkl",
  )
  ```

- Adjust the `filename` argument to match the actual path in the model repo (e.g., `"modeling_data/et_model.pkl"`).
- For more details, see the Hugging Face Hub documentation.
This project uses the Google Books API to enrich book metadata (pages, publisher, publication date, description, categories, etc.) for titles missing information after merging core datasets.
Setup:
- Go to the Google Cloud Console.
- Create a new project (guide).
- Enable the Google Books API (APIs & Services > Library).
- Generate an API key (APIs & Services > Credentials).
Add the key and URLs to your `.env` file:

```
GOOGLE_BOOKS_API_KEY=<YOUR_KEY>
```

Usage in notebooks:
- The API is queried by ISBN (preferred) or by title/author for books without ISBNs.
- Results are cached in `data/raw/google_api_cache.json` to avoid duplicate requests and manage quota.
- See `03_Data_Enrichment_and_Dataset_Integration.ipynb` for implementation details.
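The query-with-cache pattern described above can be sketched as follows. The function name and cache layout are illustrative, not the notebook's actual implementation; only the endpoint and the `isbn:` query syntax come from the Google Books API:

```python
# Sketch: Google Books lookup by ISBN with a local JSON cache,
# so repeated runs don't re-spend API quota.
import json
import os
import requests

CACHE_PATH = "data/raw/google_api_cache.json"

def fetch_google_books(isbn, api_key, cache_path=CACHE_PATH):
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if isbn in cache:
        return cache[isbn]  # cache hit: no network request

    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"isbn:{isbn}", "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    cache[isbn] = resp.json()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w") as fh:
        json.dump(cache, fh)
    return cache[isbn]
```

Falling back to a title/author query (`q="intitle:... inauthor:..."`) for books without ISBNs follows the same shape.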
The Open Library API is used as the first enrichment source for missing metadata, since it is public and does not require authentication or API keys.
Usage in notebooks:
- Queried by ISBN for missing fields (pages, language, publisher, description, subjects).
- Results are cached in `data/raw/openlibrary_api_cache.json`.
- See `03_Data_Enrichment_and_Dataset_Integration.ipynb` for code and enrichment workflow.
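An Open Library lookup by ISBN uses the public Books API, which needs no key. This is a sketch of the request shape, not the notebook's exact code; the helper names are hypothetical:

```python
# Sketch: Open Library Books API lookup by ISBN (no authentication needed).
import requests

OPENLIB_URL = "https://openlibrary.org/api/books"

def openlibrary_params(isbn):
    """Query parameters for the Books API; jscmd=data returns rich metadata."""
    return {"bibkeys": f"ISBN:{isbn}", "format": "json", "jscmd": "data"}

def fetch_openlibrary(isbn):
    resp = requests.get(OPENLIB_URL, params=openlibrary_params(isbn), timeout=10)
    resp.raise_for_status()
    # Response is keyed by the bibkey; missing ISBNs yield an empty dict.
    return resp.json().get(f"ISBN:{isbn}", {})
```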
- Regex101: Online regex tester and debugger.
- Text Cleaning in Python: A guide on cleaning text data using Python.
- Pandas Documentation: datetime, combine_first
- NumPy: exponential, logarithm, arange
- DateUtils Documentation: for advanced date parsing.
- TQDM Documentation: for progress bars in loops.
- Requests Documentation: for making HTTP requests.
- Pandas Merging: for combining DataFrames.
- Scikit-Learn Documentation: for machine learning algorithms and evaluation metrics.
- Conventional Commits: for consistent commit messages.
- On Gaussian Distribution: Free Code Camp, Quantinsti, GeeksForGeeks Machine Learning, GeeksForGeeks Python, PennState College
- On Binning Data: GeeksForGeeks
- Sentence Transformers: for generating text embeddings.
- Pytest Documentation: for testing framework in Python.
- geeksforgeeks.org: Combinations: for generating combinations.
- Scikit-learn: for ML models and pipelines.
- Displayr: Learn What Are Residuals
- Medium: Understanding Residual Analysis in Regression
- GeeksForGeeks: Residual Analysis
- Introduction to SHAP Values for Machine Learning Interpretability
- SHAP Documentation
- Hugging Face
- Open Library API documentation
- Google Books API documentation
- Testing Models with Pytest
- ChatGPT: used to refine and correct grammar and clarity of textual explanations in the README and notebooks, and to improve the wording of existing comments and docstrings without altering code logic or functionality.
- GitHub Copilot (free version): used as an IDE assistance tool to suggest boilerplate patterns and standard syntax for data processing and visualisation (e.g. library usage and plotting scaffolds). All suggestions were critically reviewed, adapted, and integrated by the author, with full understanding of the resulting code.
- NotebookLM: used as a learning guide on data cleaning and modeling concepts to help identify possible next steps and guide through documentation without providing or generating any code.
All data processing, analysis, modeling, and application code were written and implemented by the author.
Thanks to all who provided feedback and support during this project, including peers, mentors, and the open-source community.
Cold Start Model for Metadata Features Importance:
Develop a model to estimate engagement or recommendation quality using only metadata features (e.g., genre, author, publication year) for new books with no external popularity signals (ratings, reviews, etc.).
Hybrid Model for Collaborative Filtering:
Explore a hybrid approach that combines collaborative filtering (user-item interactions) with content-based features to improve recommendations, especially for users or books with sparse data.
New Feature Engineering:
Investigate additional features such as text embeddings from book descriptions, author popularity, or publisher reputation via Wikipedia or social media signals to enhance model performance.
Best Seller Data Integration:
Incorporate external best seller lists (e.g. NYT, Amazon) as features to capture market trends and further improve engagement predictions.