Data for this project can be found on Kaggle.
Blue Bikes (formerly Hubway) is a public bike-sharing system serving the greater Boston metropolitan area. Users can participate either as members (annual or monthly) or as casual riders purchasing single-trip or day passes. Blue Bikes publishes trip-level usage data on a quarterly basis, along with a separate dataset containing station metadata.
This analysis uses a compiled version of the 2019 Blue Bikes trip data obtained from Kaggle, which aggregates the original quarterly releases published by Blue Bikes. In addition, I incorporate the official Blue Bikes station dataset to enable municipality-level analysis. Both datasets were downloaded as CSV files and imported into R as data frames for preprocessing and analysis.
The primary objective of this project is to identify and analyze patterns in bike-share usage across Boston using the 2019 Blue Bikes dataset. The analysis focuses specifically on short trips, defined as rides lasting two hours or less, which represent the vast majority of observed trips. Trips exceeding this duration account for approximately 0.65% of the data and are therefore excluded to maintain analytical focus and reduce the influence of outliers.
Key areas of investigation include:
- Variations in trip duration across different municipalities
- Identification of districts that contribute most significantly to overall ridership
- Seasonal and temporal patterns in total trip volume
- Statistical exploration of trip duration, including applications of the Central Limit Theorem
- Evaluation of different sampling techniques to assess how well they represent the underlying population
The Analysis section presents a series of visualizations and statistical summaries that examine these topics in detail, highlighting notable trends and insights within the dataset.
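As a rough illustration of the Central Limit Theorem item above, the sketch below draws repeated samples from a right-skewed set of durations and compares how the sample means spread at two sample sizes. The durations are synthetic (generated with `rexp`), and the helper `sample_means` is a hypothetical name; none of it is taken from the project's own code.

```r
# Synthetic, right-skewed "trip durations" in minutes (NOT the Blue Bikes data)
set.seed(2019)
durations <- rexp(100000, rate = 1 / 15)

# Hypothetical helper: draw `reps` samples of size `n` and return their means
sample_means <- function(x, n, reps = 1000) {
  replicate(reps, mean(sample(x, n)))
}

means_10  <- sample_means(durations, 10)
means_100 <- sample_means(durations, 100)

# The CLT suggests the spread of the sample means shrinks roughly like
# sd(durations) / sqrt(n) as the sample size n grows
observed <- c(n10 = sd(means_10), n100 = sd(means_100))
expected <- sd(durations) / sqrt(c(n10 = 10, n100 = 100))
print(rbind(observed, expected))
```

Plotting `hist(means_10)` next to `hist(means_100)` should show the distribution of means becoming narrower and more symmetric as the sample size increases, even though the underlying durations are skewed.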
The trip dataset contains 17 variables and over 2 million observations, while the station dataset includes 5 variables across 421 stations. To align with the project’s analytical goals, the data was filtered to include only short-duration trips (≤ 2 hours). Longer trips were excluded to improve consistency, minimize skewness, and ensure the results accurately reflect typical bike-share usage patterns.
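The two-hour filter described above can be sketched in base R as follows. The data frame here is a toy stand-in with made-up values, and the names `max_seconds`, `share_long`, and `short_trips` are illustrative assumptions; only the `tripduration` column (in seconds) comes from the dataset description.

```r
# Toy stand-in for the trip data (made-up values); tripduration is in seconds
trips <- data.frame(
  tripduration = c(300, 1250, 7200, 7201, 86400),
  usertype     = c("Subscriber", "Customer", "Subscriber", "Customer", "Customer"),
  stringsAsFactors = FALSE
)

max_seconds <- 2 * 60 * 60  # two hours expressed in seconds

# Share of trips longer than two hours in this toy data
share_long <- mean(trips$tripduration > max_seconds)

# Keep only trips lasting two hours or less
short_trips <- trips[trips$tripduration <= max_seconds, ]
```

Computing the excluded share before filtering makes it easy to check that the cutoff removes only a small fraction of trips.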
New columns I will be adding:
- tripduration_minutes: the dataset only has a trip duration column in seconds, so I will create a new column with the duration in minutes
- day: the data only has month and year columns, so I will create a day column
- date: the date the trip started, without a timestamp
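One way these three columns could be derived is sketched below in base R. It assumes `starttime` is stored as text beginning with a `YYYY-MM-DD` date; the sample rows are made up for illustration, and the project's actual code may do this differently (for example with tidyverse functions).

```r
# Toy rows (made-up values); assumes starttime text begins with YYYY-MM-DD
trips <- data.frame(
  tripduration = c(371, 264, 5400),
  starttime    = c("2019-01-01 00:09:13", "2019-06-15 08:30:00", "2019-12-31 23:59:59"),
  year         = 2019L,
  month        = c(1L, 6L, 12L),
  stringsAsFactors = FALSE
)

# Duration in minutes, from the original seconds column
trips$tripduration_minutes <- trips$tripduration / 60

# Date of the trip start without the time of day, taken from the first 10 characters
trips$date <- as.Date(substr(trips$starttime, 1, 10))

# Day of the month as an integer
trips$day <- as.integer(format(trips$date, "%d"))
```

Taking the date from the text prefix avoids any time zone conversion that could shift trips starting near midnight onto a different day.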
Prior to analysis, several preprocessing steps were performed to ensure data consistency and analytical clarity:
- Converted relevant variables to appropriate data types
- Removed trips exceeding two hours to focus on typical ride behavior and reduce the impact of outliers
- Joined the 2019 trip dataset with the Blue Bikes station dataset using the station name as the key
- Removed redundant columns created during the join process
- Created derived features (e.g., trip duration in minutes, day, and date) to support temporal and statistical analysis
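The join and cleanup steps above might look roughly like the following base R sketch. The station column names (`name`, `latitude`, `longitude`, `district`, `total_docks`) and all values are hypothetical placeholders, and `merge` stands in for whatever join function the project actually uses. Only the start-station join is shown; the end-station join would follow the same pattern.

```r
# Toy trip data (made-up values); column names with spaces mirror the trip data
trips <- data.frame(
  `start station name` = c("Station A", "Station B", "Station C"),
  tripduration         = c(300, 900, 600),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Toy station data; these column names are assumptions, not the real headers
stations <- data.frame(
  name        = c("Station A", "Station B"),
  latitude    = c(1.0, 2.0),
  longitude   = c(-1.0, -2.0),
  district    = c("Boston", "Cambridge"),
  total_docks = c(15L, 19L),
  stringsAsFactors = FALSE
)

# Left join on station name: every trip is kept, unmatched stations get NA
joined <- merge(trips, stations,
                by.x = "start station name", by.y = "name", all.x = TRUE)

# Drop columns that duplicate coordinates already present in the trip data
joined <- joined[, !(names(joined) %in% c("latitude", "longitude"))]

# Rename the joined attributes to mark them as start-station values
names(joined)[names(joined) == "district"]    <- "start_district"
names(joined)[names(joined) == "total_docks"] <- "start_total_docks"
```

Here "Station C" has no match in the toy station table, so its `start_district` and `start_total_docks` come out as NA.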
I used the following R libraries/packages to conduct this analysis: readr, knitr, kableExtra, tidyverse, plotly, sampling, and leaflet.
Note: Some values may be NA after the join. This may be because the station dataset has been updated more recently than the 2019 trip dataset, so some station names in the trip data may have no match. NA values are omitted later in the analysis.
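A minimal sketch of how unmatched rows could be counted and then omitted before a district-level summary is shown below. The data frame is a made-up stand-in for the joined data, and the names `n_unmatched`, `with_district`, and `avg_by_district` are illustrative assumptions.

```r
# Toy stand-in for the joined data (made-up values); NA marks an unmatched station
joined <- data.frame(
  tripduration_minutes = c(5, 15, 10, 40),
  start_district       = c("Boston", "Cambridge", NA, "Boston"),
  stringsAsFactors = FALSE
)

# How many trips have no district after the join
n_unmatched <- sum(is.na(joined$start_district))

# Omit those rows before summarizing by district
with_district <- joined[!is.na(joined$start_district), ]

# Mean trip duration per start district
avg_by_district <- aggregate(tripduration_minutes ~ start_district,
                             data = with_district, FUN = mean)
```

Counting the unmatched rows first gives a quick check on how much data the NA omission removes.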
Attributes in the data being analyzed:
Source: Information on the attributes found on Blue Bikes and Kaggle
- tripduration: duration of the bike trip in seconds
- tripduration_minutes: created column, duration of the bike trip in minutes
- starttime: timestamp of the start of the trip
- stoptime: timestamp of the end of the trip
- start station id: unique ID of the station where the trip started
- start station name: name of the station where the trip started
- start station latitude: latitude of the start station
- start station longitude: longitude of the start station
- end station id: unique ID of the station where the trip ended
- end station name: name of the station where the trip ended
- end station latitude: latitude of the end station
- end station longitude: longitude of the end station
- bikeid: unique ID of the bike used for the trip
- usertype: Customer (casual single-trip or day-pass user) or Subscriber (annual or monthly member)
- year: year the trip took place; always 2019 in this data
- month: month the trip took place (numerical, 1-12)
- day: created column, day of the month the trip took place
- date: created column, date of the trip in YYYY-MM-DD format
- birth year: birth year of the user (self-reported)
- gender: gender of the user (self-reported)
- start_district: district where the trip started (Boston, Brookline, Cambridge, Everett, Somerville, or NA)
- start_total_docks: total number of docks at the start station
- end_district: district where the trip ended (Boston, Brookline, Cambridge, Everett, Somerville, or NA)
- end_total_docks: total number of docks at the end station
Below I included some of the plots generated in the markdown file


