celina-breuer/machine-learning-salary-determinants

Salary Drivers in the Data Science Industry

Machine Learning & Statistical Analysis (2020–2024)

Project Overview

This project analyzes key salary determinants in the global data science labor market using statistical and machine learning methods.
The analysis focuses on how experience level, job specialization, company size, remote work and macroeconomic indicators influence salaries in data science-related roles.

Multiple modeling approaches are applied to identify robust drivers of salary differences and to evaluate the explanatory power and limitations of common machine learning models in an applied labor market context.


Project Status

Completed (2025). Academic research project.


Research Question

What factors significantly influence salaries in the data science industry, and what insights can be derived for career decisions? The analysis examines how experience, job role, and location relate to salaries in roles such as data scientist, data engineer, and AI specialist. The findings are intended to support career planning and competitive compensation strategies.


Data Sources

Salary Data

  • Source: Kaggle – “Latest Data Science Job Salaries 2020–2024”
  • Original provider: ai-jobs.net
  • Scope: Global data science job listings
  • Size: ~14,800 observations, 11 variables
  • Key variables:
    • Salary in USD
    • Experience level
    • Job title
    • Employment type
    • Company size
    • Remote work ratio
    • Work year

Macroeconomic Data

  • Source: World Bank / ILO
  • Coverage: Global unemployment rates (1990–2023)
  • Integration:
    • Matched by country and year
    • Used as a macroeconomic control variable
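
The country-and-year match described above can be sketched with base R's `merge()`. This is a minimal illustration, not the project's actual code: the column names (`company_location`, `work_year`, `country_code`, `year`, `unemployment_rate`) and all values are assumptions made up for the example.

```r
# Toy salary records; column names are illustrative, not the dataset's confirmed schema.
salaries <- data.frame(
  company_location = c("US", "US", "DE", "GB"),
  work_year        = c(2022, 2023, 2023, 2021),
  salary_in_usd    = c(150000, 165000, 95000, 88000)
)

# Toy unemployment table keyed by country code and year (values are made up).
unemployment <- data.frame(
  country_code      = c("US", "US", "DE"),
  year              = c(2022, 2023, 2023),
  unemployment_rate = c(3.6, 3.6, 3.0)
)

# Left join: keep every salary record and attach the rate where a match exists.
merged <- merge(
  salaries, unemployment,
  by.x  = c("company_location", "work_year"),
  by.y  = c("country_code", "year"),
  all.x = TRUE
)

# Rows without a matching country-year pair get NA and can be inspected or dropped.
unmatched <- merged[is.na(merged$unemployment_rate), ]
print(merged)
```

Using `all.x = TRUE` keeps unmatched salary rows visible, so gaps in the macroeconomic coverage show up as `NA` instead of silently shrinking the sample.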

Methodology

  1. Data Cleaning

    • Removal of missing and inconsistent entries
    • Standardization of job titles
    • Checking for multicollinearity with variance inflation factors (VIF) and alias diagnostics
  2. Feature Engineering

    • Log-transformation of salary
    • Aggregation of job titles into four categories:
      • Data Analyst
      • Data Scientist
      • Data Engineer
      • AI Engineer
    • Creation of binary and count-based target variables for alternative models
  3. Data Integration

    • Merging salary data with unemployment rates
    • Filtering to U.S.-based companies to avoid sample imbalance
    • Final modeling dataset: >10,000 observations
  4. Analysis & Visualization

    • Linear Regression (with interactions)
    • Generalized Linear Models (Poisson, Quasi-Poisson, Binomial)
    • Generalized Additive Models (GAM)
    • Neural Networks with cross-validation
    • Diagnostic plots and residual analysis
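
The feature engineering and linear regression steps listed above could look roughly like the sketch below. It uses simulated data and base R only. The keyword rules in `categorize_title()`, the experience codes (`EN`, `MI`, `SE`, `EX`), and the column names are assumptions for illustration and may differ from the project's actual mapping.

```r
set.seed(42)

# Hypothetical raw records; titles, levels, and salaries are simulated.
n <- 200
titles <- c("Data Analyst", "BI Analyst", "Data Scientist", "Research Scientist",
            "Data Engineer", "Analytics Engineer", "AI Engineer",
            "Machine Learning Engineer")
df <- data.frame(
  job_title        = sample(titles, n, replace = TRUE),
  experience_level = sample(c("EN", "MI", "SE", "EX"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
df$salary_in_usd <- round(exp(rnorm(n, mean = 11.5, sd = 0.4)))

# Map free-text titles to four categories with simple keyword rules.
# The order of the checks matters: the first matching rule wins.
categorize_title <- function(title) {
  t <- tolower(title)
  if (grepl("machine learning|\\bai\\b", t, perl = TRUE)) return("AI Engineer")
  if (grepl("engineer", t))  return("Data Engineer")
  if (grepl("scientist", t)) return("Data Scientist")
  "Data Analyst"
}

df$job_category <- factor(vapply(df$job_title, categorize_title, character(1)))
df$experience_level <- factor(df$experience_level,
                              levels = c("EN", "MI", "SE", "EX"))

# Log-transform salary to reduce right skew before fitting.
df$log_salary <- log(df$salary_in_usd)

# Linear model with an experience-by-category interaction.
fit <- lm(log_salary ~ experience_level * job_category, data = df)
print(summary(fit)$r.squared)
```

Because the simulated salaries do not depend on the predictors, the fit here is expected to be weak. The sketch only shows the mechanics of the transformation and the model formula.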

Getting Started

Prerequisites

  • R (≥ 4.2 recommended)
  • RStudio (optional, but recommended)

Required R Packages

install.packages(c(
  "car", "caret", "countrycode", "dplyr", "e1071", "ggplot2",
  "kernlab", "lmtest", "multcomp", "neuralnet", "nnet", "plotly",
  "purrr", "randomForest", "readxl", "tidyr", "tools", "VGAM"  
))
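
As an optional alternative, the sketch below checks which of the listed packages are missing before installing anything. The helper `missing_packages()` is a hypothetical name introduced here, not part of the repository. One practical reason for the check: `tools` ships with base R, so it never needs to be installed from CRAN.

```r
required <- c(
  "car", "caret", "countrycode", "dplyr", "e1071", "ggplot2",
  "kernlab", "lmtest", "multcomp", "neuralnet", "nnet", "plotly",
  "purrr", "randomForest", "readxl", "tidyr", "tools", "VGAM"
)

# Return the subset of `pkgs` whose namespace cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```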

Key Findings (Short Summary)

  • Experience level is the strongest salary determinant, with senior and executive roles earning substantially more.
  • Job specialization matters: AI Engineers earn the highest salaries, followed by Data Scientists, Data Engineers, and Data Analysts.
  • Company size has a modest effect, with small companies paying less on average.
  • Remote work ratio and unemployment rate contribute little explanatory power.
  • All tested models explain only about 28–30% of salary variance, which points to substantial unobserved factors.
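
For readers unfamiliar with the "variance explained" figure, the sketch below shows how R² is computed for a log-salary model and how a log-scale coefficient converts to a percentage difference. All data are simulated and the effect sizes are invented. None of the numbers are results from this project.

```r
set.seed(1)

# Simulated log salaries with a made-up experience effect plus large noise.
n <- 500
levels_exp <- c("EN", "MI", "SE", "EX")
experience <- factor(sample(levels_exp, n, replace = TRUE), levels = levels_exp)
effect     <- c(EN = 0, MI = 0.2, SE = 0.4, EX = 0.6)
log_salary <- 11 + unname(effect[as.character(experience)]) + rnorm(n, sd = 0.5)

fit <- lm(log_salary ~ experience)

# R^2 = 1 - SSE / SST: the share of variance in log salary the model accounts for.
sse <- sum(residuals(fit)^2)
sst <- sum((log_salary - mean(log_salary))^2)
r_squared <- 1 - sse / sst

# On the log scale, a coefficient b corresponds to a (exp(b) - 1) * 100 percent
# difference relative to the reference level (here: entry level, "EN").
pct_senior <- (exp(coef(fit)[["experienceSE"]]) - 1) * 100

print(c(r_squared = r_squared, pct_senior = pct_senior))
```

A low R² alongside clearly non-zero coefficients is consistent with the pattern described above: the predictors matter, but most of the individual variation in salary stays unexplained.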

Limitations

  • Salary data is self-reported and aggregated from multiple sources.
  • Strong sample imbalance toward U.S.-based companies.
  • Some models rely on artificial target transformations (Poisson, Binomial).
  • Neural networks showed limited performance improvements despite tuning.
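
To make the point about artificial target transformations concrete, here is a minimal sketch of how a continuous salary might be recast as a binary or count target for a GLM. The median split and the rounding to thousands of USD are assumptions chosen for illustration. The source does not state how the project constructed its targets.

```r
set.seed(7)

# Simulated salaries with a made-up experience effect.
n <- 400
levels_exp <- c("EN", "MI", "SE", "EX")
experience <- factor(sample(levels_exp, n, replace = TRUE), levels = levels_exp)
shift      <- c(EN = 0, MI = 0.2, SE = 0.4, EX = 0.6)
salary     <- exp(11 + unname(shift[as.character(experience)]) + rnorm(n, sd = 0.4))

# Binary target (assumption): 1 if salary is above the sample median.
high_salary <- as.integer(salary > median(salary))
fit_bin <- glm(high_salary ~ experience, family = binomial())

# Count-style target (assumption): salary in whole thousands of USD.
salary_k <- round(salary / 1000)
fit_qp <- glm(salary_k ~ experience, family = quasipoisson())

# A dispersion estimate well above 1 indicates overdispersion, which is
# why a Quasi-Poisson fit may be preferred over a plain Poisson fit.
dispersion <- summary(fit_qp)$dispersion
print(dispersion)
```

Both transformations discard information: the binary split ignores how far a salary lies from the threshold, and the count target treats a continuous quantity as if it were a tally of events.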

License

Code

The code in this repository is provided for academic and analytical purposes.

Data

The Kaggle salary dataset is subject to Kaggle’s licensing terms.

World Bank / ILO unemployment data is subject to the original data provider’s license and attribution requirements.


Author

Celina Breuer (part of a group project)

About

This project applies machine learning models to identify salary drivers in the data science labor market. Methods: data cleaning, feature engineering, multicollinearity diagnostics, linear models, GLMs, GAMs, and neural networks.
