This project uses unsupervised machine learning, specifically K-means clustering, to identify betting patterns and categorise players by playing style in the online game Bustabit.
- Data
- Feature Engineering
- Data Cleaning & Preprocessing
- Principal Component Analysis
- Clustering Analysis
- Dependencies
- The game: Bustabit is a Bitcoin crash game launched in 2014. Players choose how much to wager before each game starts, then watch a multiplier increase and attempt to cash out at the highest multiplier before the game randomly busts. The player wins their stake multiplied by the multiplier at the point they cashed out; if the game busts before they cash out, they lose their stake.
- The dataset was sourced from Kaggle and covers games between October and December 2016 – in total, just over 42,000 unique games and 4,000 players are included.
- Each row in the dataset represents one player’s result in a single game; consequently, a game with multiple players is represented by multiple rows. Data include the amount wagered, cash-out multiplier, profit and the eventual bust multiplier for the game.
- Since the data simply represent one player’s outcome in one game, the data were manipulated to engineer a range of numeric features per player, which could be used as inputs for machine learning models to discern patterns.
- These features included totals and averages for games and sessions played, wins and bet size, as well as number of games played before and after the first bust per day, and average cash-out and bust multipliers.
- Data cleaning: null values are an inherent feature of the data, as some players never lost and others never won, and were imputed as zero to capture this information.
- The data were scaled using the sklearn StandardScaler to account for unit scale differences between features.
- Principal Component Analysis was performed to see if the features could be condensed into categories with minimal loss of information, with the benefit of reducing multicollinearity.
- Ultimately too many components were required to ensure limited information loss (i.e. an explained variance ratio >0.9), and so the analysis was performed without decomposition.
- 4 clusters: evaluation of the sum-of-squared-errors (plotting an elbow curve) suggested the dataset of players could best be summarised by 4 categories.
- Using simple visualisations to compare the clusters across different metrics provided clear insights into the distinct player types (see presentation).
- The player types identified by clustering were labelled:
- Addicts
- Suckers
- One-shot wonders
- High-rollers
Example visualisations of key cluster characteristics:
- In contrast, identifying clear patterns without the use of clusters would be much more difficult, even between key features:
Overall, this project provided an instructive example of how unsupervised machine learning can be used to categorise behavioral patterns otherwise difficult to extract from raw data.
- Python 3.x
- Sklearn
- Data preprocessing
- PCA
- K-means clustering
- Pandas
- Numpy
- Datetime
- Matplotlib & seaborn




