ismahahmed/Analyzing-Blue-Bike-Trips

Analysis of Boston Blue Bikes Data in 2019

Data for this project can be found on Kaggle.

Introduction

Data Origin and Overview

Blue Bikes (formerly Hubway) is a public bike-sharing system serving the greater Boston metropolitan area. Users can participate either as members (annual or monthly) or as casual riders purchasing single-trip or day passes. Blue Bikes publishes trip-level usage data on a quarterly basis, along with a separate dataset containing station metadata.

This analysis uses a compiled version of the 2019 Blue Bikes trip data obtained from Kaggle, which aggregates the original quarterly releases published by Blue Bikes. In addition, I incorporate the official Blue Bikes station dataset to enable municipality-level analysis. Both datasets were downloaded as CSV files and imported into R as data frames for preprocessing and analysis.

Goal of Analysis

The primary objective of this project is to identify and analyze patterns in bike-share usage across Boston using the 2019 Blue Bikes dataset. The analysis focuses specifically on short trips, defined as rides lasting two hours or less, which represent the vast majority of observed trips. Trips exceeding this duration account for approximately 0.65% of the data and are excluded to maintain analytical focus and reduce the influence of outliers.

Key areas of investigation include:

  • Variations in trip duration across different municipalities
  • Identification of districts that contribute most significantly to overall ridership
  • Seasonal and temporal patterns in total trip volume
  • Statistical exploration of trip duration, including applications of the Central Limit Theorem
  • Evaluation of different sampling techniques to assess how well they represent the underlying population
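The Central Limit Theorem item above can be illustrated with a minimal base-R sketch. It does not use the Blue Bikes data: the durations are simulated from a right-skewed log-normal distribution, and the distribution parameters, sample sizes, and the helper name `sample_means` are all assumptions made for illustration.

```r
# Simulated, right-skewed "trip durations" in minutes (NOT the real data).
set.seed(42)
durations <- rlnorm(100000, meanlog = log(12), sdlog = 0.8)
durations <- durations[durations <= 120]  # mimic the two-hour cutoff

# Hypothetical helper: draw `reps` samples of size `n` without replacement
# and return the mean of each sample.
sample_means <- function(x, n, reps = 1000) {
  replicate(reps, mean(sample(x, n)))
}

means_10  <- sample_means(durations, 10)
means_100 <- sample_means(durations, 100)

# The spread of the sample means shrinks roughly like sigma / sqrt(n).
spread <- data.frame(
  n        = c(10, 100),
  observed = c(sd(means_10), sd(means_100)),
  expected = sd(durations) / sqrt(c(10, 100))
)
print(spread)
```

Plotting `hist(means_10)` next to `hist(means_100)` should show the sample means becoming more symmetric and more tightly centered on `mean(durations)` as the sample size grows, even though the underlying durations are skewed.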

The Analysis section presents a series of visualizations and statistical summaries that examine these topics in detail, highlighting notable trends and insights within the dataset.

Data Prep

The trip dataset contains 17 variables and over 2 million observations, while the station dataset includes 5 variables across 421 stations. To align with the project’s analytical goals, the data was filtered to include only short-duration trips (≤ 2 hours). Longer trips were excluded to improve consistency, minimize skewness, and ensure the results accurately reflect typical bike-share usage patterns.
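As a rough sketch of the filtering step, the base-R snippet below keeps trips of two hours or less and reports the share removed. The data frame is a made-up toy example; only the `tripduration` column (in seconds) and the two-hour threshold come from this README.

```r
# Hypothetical toy data; the real trip file has 17 columns and over 2 million rows.
trips <- data.frame(
  tripduration = c(300, 5400, 7200, 7201, 86400),  # seconds
  usertype = c("Subscriber", "Customer", "Subscriber", "Customer", "Customer"),
  stringsAsFactors = FALSE
)

max_seconds <- 2 * 60 * 60  # two hours, expressed in seconds

# Keep rides lasting two hours or less.
short_trips <- trips[trips$tripduration <= max_seconds, ]

# Share of trips removed by the filter, as a percentage.
pct_removed <- 100 * mean(trips$tripduration > max_seconds)
```

Computing the removed share alongside the filter makes it easy to confirm how much data the cutoff discards.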

New columns I will be adding:

  • tripduration_minutes : The dataset only records trip duration in seconds, so I will be creating a new column with the duration in minutes
  • day : The data only has month and year columns, so I will be creating a day column for the day of the month
  • date : The date on which the trip started, without a timestamp
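The three derived columns could be built along these lines in base R. This is a sketch on toy rows, and the timestamp format string and the UTC time zone are assumptions about how `starttime` is stored.

```r
# Hypothetical toy rows with the two source columns needed here.
trips <- data.frame(
  tripduration = c(300, 5400),  # seconds
  starttime = c("2019-01-01 00:09:13", "2019-07-15 17:45:02"),
  stringsAsFactors = FALSE
)

# Parse the start timestamp (format and time zone are assumptions).
start <- as.POSIXct(trips$starttime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

trips$tripduration_minutes <- trips$tripduration / 60   # seconds to minutes
trips$date <- as.Date(start, tz = "UTC")                # date without time
trips$day  <- as.integer(format(start, "%d"))           # day of month
```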

Prior to analysis, several preprocessing steps were performed to ensure data consistency and analytical clarity:

  • Converted relevant variables to appropriate data types
  • Removed trips exceeding two hours to focus on typical ride behavior and reduce the impact of outliers
  • Joined the 2019 trip dataset with the Blue Bikes station dataset using the station name as the key
  • Removed redundant columns created during the join process
  • Created derived features (e.g., trip duration in minutes, day, and date) to support temporal and statistical analysis
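The join on station name can be sketched with base R's `merge()`. The station names, dock counts, and the simplified column names (`start_station_name`, `name`, `district`, `total_docks`) are hypothetical stand-ins, not the actual headers of either file.

```r
# Hypothetical trips; "Station Z" has no match in the station table.
trips <- data.frame(
  start_station_name = c("Station A", "Station B", "Station Z"),
  tripduration = c(420, 900, 600),
  stringsAsFactors = FALSE
)

# Hypothetical station metadata.
stations <- data.frame(
  name = c("Station A", "Station B"),
  district = c("Boston", "Cambridge"),
  total_docks = c(15, 19),
  stringsAsFactors = FALSE
)

# Left join: keep every trip, even when the station is not in the metadata.
joined <- merge(trips, stations,
                by.x = "start_station_name", by.y = "name",
                all.x = TRUE)

# Rename the joined columns so the start-station role is explicit.
names(joined)[names(joined) == "district"] <- "start_district"
names(joined)[names(joined) == "total_docks"] <- "start_total_docks"
```

With `by.x`/`by.y`, `merge()` keeps a single key column, so no duplicate name column needs to be dropped in this sketch. Repeating the same join on the end station name would produce the corresponding `end_district` and `end_total_docks` columns.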

I used the following R packages to conduct this analysis: readr, knitr, kableExtra, tidyverse, plotly, sampling, and leaflet.

Note: Some rows may contain NA values in the joined columns. This may be because the station dataset was updated more recently than the 2019 trip data, so a station that appears in the trips may be missing from the station list. These NA values are omitted later in the analysis.
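One way to handle these NA values, sketched in base R on made-up rows, is to count the unmatched trips first and then drop them before a district-level summary. The column names follow the attribute list in this README, and the values are hypothetical.

```r
# Hypothetical joined rows; the second trip had no matching station.
joined <- data.frame(
  tripduration_minutes = c(7, 15, 22, 9),
  start_district = c("Boston", NA, "Cambridge", "Boston"),
  stringsAsFactors = FALSE
)

# How many trips failed to match a station?
n_unmatched <- sum(is.na(joined$start_district))

# Drop unmatched trips, then summarize duration by district.
matched <- joined[!is.na(joined$start_district), ]
mean_by_district <- tapply(matched$tripduration_minutes,
                           matched$start_district, mean)
```

Checking `n_unmatched` before dropping rows helps confirm that the omitted trips are a small share of the data.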


Attributes in the data we will be analyzing

Source: Information on the attributes found on Blue Bikes and Kaggle

  • tripduration : duration of a bike trip in seconds
  • tripduration_minutes : created column, duration of a bike trip in minutes
  • starttime : timestamp of the start of the trip
  • stoptime : timestamp of the end of the trip
  • start station id : unique ID of the station where the trip started
  • start station name : name of the station where the trip started
  • start station latitude : latitude of the start station
  • start station longitude : longitude of the start station
  • end station id : unique ID of the station where the trip ended
  • end station name : name of the station where the trip ended
  • end station latitude : latitude of the end station
  • end station longitude : longitude of the end station
  • bikeid : unique ID of the bike used for the trip
  • usertype : Customer (casual single-trip or day-pass user) or Subscriber (annual or monthly member)
  • year : year when the trip took place; for our data, this is always 2019
  • month : month when the trip took place (numerical, 1-12)
  • day : created column, day of the month when the trip took place
  • date : created column, date of the trip in YYYY-MM-DD format
  • birth year : birth year of the user (self-reported)
  • gender : gender of the user (self-reported)
  • start_district : district where the trip started: Boston, Brookline, Cambridge, Everett, Somerville, or NA
  • start_total_docks : total number of docks at the start station
  • end_district : district where the trip ended: Boston, Brookline, Cambridge, Everett, Somerville, or NA
  • end_total_docks : total number of docks at the end station

Below I included some of the plots generated in the markdown file

[Plot images: newplot, image, newplot (2)]
