You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Web Scraping, Data Wrangling and Analysis on Goodreads List
Goodreads is a website where readers can rate and review the books they read. My interest in reading is what motivated me to scrape Goodreads. The aim of this project was to scrape, clean and analyse data from the Top 1000 books of the Decade: 2010's. Data has been scraped from the ‘Best Books of the Decade: 2010's’ list using the python library BeautifulSoup.
This project will be a multi-parter :
Data Collection : Building a scraper to collect and organize data from Reading Lists on Goodreads.
Data Cleaning : Cleaning and organising the scraped data.
Visualization and Analysis: Detailed visualization and analysis of the cleaned books data.
Code and Resources Used
Python Version : 3.8.2
Libraries Used : pandas, numpy, matplotlib, seaborn, plotly, cufflinks, chart_studio, bs4, random, time
We can see that the majority of the books have received a rating between 4.03 and 4.07. We can also see that the average ratings follow a normal distribution.
Above histogram give us a good indication of the distribution of the data but it don't give us much information about the outliers. We can use Boxplot to examine the outliers.
From the above histogram we can infer that the most popular genres are :
Fantasy - 23.5%
Fiction - 22.1%
Young Adult - 13.8%
These 3 Genres account for nearly 60% of total books.
Wordcloud :
Book Title :
We set the minimum word length to 4 to discard words such as 'the, is, it, an, etc'
Author :
The abover wordcloud contains names of some legendary authors such as : Stephen King and Rick Riordan and certain generic names such as John, David, Robert and James.
Description:
Usage and Interactive plots:
This project is best viewed in a notebook viewer, which can be accessed here. In this notebook, you will find a walk through of the work done, interactive plots and the respective code.
About
Web Scraping, Data Wrangling and Analysis on Top 1000 books of the decade using data from Goodreads List.