Salary Drivers in the Data Science Industry

Machine Learning & Statistical Analysis (2020–2024)

Project Overview

This project analyzes key salary determinants in the global data science labor market using statistical and machine learning methods.
The analysis focuses on how experience level, job specialization, company size, remote work, and macroeconomic indicators influence salaries in data science roles.

Multiple modeling approaches are applied to identify robust drivers of salary differences and to evaluate the explanatory power and limitations of common machine learning models in an applied labor market context.

Project Status

Completed (2025). Academic research project.

Research Question

What factors significantly influence salaries in the data science industry, and what insights can be derived for career decisions? The analysis examines how experience, job role, and location relate to salaries, focusing on roles such as data scientist, data engineer, and AI specialist. The insights are intended to support career planning and competitive compensation strategies.

Data Sources

Salary Data

Source: Kaggle – “Latest Data Science Job Salaries 2020–2024”
Original provider: ai-jobs.net
Scope: Global data science job listings
Size: ~14,800 observations, 11 variables
Key variables:
- Salary in USD
- Experience level
- Job title
- Employment type
- Company size
- Remote work ratio
- Work year

Macroeconomic Data

Source: World Bank / ILO
Coverage: Global unemployment rates (1990–2023)
Integration:
- Matched by country and year
- Used as a macroeconomic control variable

Methodology

Data Cleaning
- Removal of missing and inconsistent entries
- Standardization of job titles
- Handling multicollinearity using VIF and alias diagnostics
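
The multicollinearity check can be sketched as follows. This is a minimal illustration on simulated data, not the project's actual code: every column name is an assumption, and the VIF is computed by hand in base R so the block runs without extra packages (the `car` package from the package list provides `vif()` for the same purpose).

```r
# Simulated toy data; all column names are hypothetical.
set.seed(42)
n <- 200
toy <- data.frame(
  remote_ratio = sample(c(0, 50, 100), n, replace = TRUE),
  work_year    = sample(2020:2024, n, replace = TRUE)
)
# Unemployment is deliberately constructed to correlate with work_year.
toy$unemployment  <- 8 - 0.9 * (toy$work_year - 2020) + rnorm(n, sd = 0.3)
# Exact linear copy of work_year, to produce an aliased coefficient.
toy$year_centered <- toy$work_year - 2022
toy$log_salary    <- 11.5 + 0.05 * (toy$work_year - 2020) + rnorm(n, sd = 0.4)

# 1) Alias diagnostics: find terms that are perfectly collinear with others.
fit_full <- lm(log_salary ~ remote_ratio + work_year + unemployment + year_centered,
               data = toy)
print(alias(fit_full))
aliased <- names(coef(fit_full))[is.na(coef(fit_full))]

# 2) VIF of a predictor: 1 / (1 - R^2) from regressing it on the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_manual(toy, c("remote_ratio", "work_year", "unemployment"))
print(round(vifs, 2))
```

Aliased terms have to be dropped before VIFs can be computed at all. A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.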

Feature Engineering
- Log-transformation of salary
- Aggregation of job titles into four categories:
  - Data Analyst
  - Data Scientist
  - Data Engineer
  - AI Engineer
- Creation of binary and count-based target variables for alternative models
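
The feature engineering steps above might look roughly like this. The keyword rules, the column names (`job_title`, `salary_in_usd`), and the way the binary and count targets are constructed are all assumptions made for illustration; the README does not state the rules the project actually used.

```r
# Toy rows standing in for the salary data; column names are assumptions.
jobs <- data.frame(
  job_title     = c("Data Analyst", "BI Analyst", "Data Scientist",
                    "Research Scientist", "Data Engineer",
                    "Machine Learning Engineer", "AI Engineer"),
  salary_in_usd = c(85000, 90000, 150000, 170000, 140000, 190000, 200000)
)

# Hypothetical keyword rules; the project's real mapping may differ.
categorize_title <- function(title) {
  t <- tolower(title)
  ifelse(grepl("machine learning|\\bml\\b|\\bai\\b", t, perl = TRUE), "AI Engineer",
  ifelse(grepl("engineer|architect", t), "Data Engineer",
  ifelse(grepl("analyst|analytics", t), "Data Analyst",
         "Data Scientist")))
}

jobs$job_category <- factor(
  categorize_title(jobs$job_title),
  levels = c("Data Analyst", "Data Scientist", "Data Engineer", "AI Engineer")
)

# Log-transform salary to reduce right skew.
jobs$log_salary <- log(jobs$salary_in_usd)

# Binary target (assumed definition): 1 if salary is above the sample median.
jobs$high_salary <- as.integer(jobs$salary_in_usd > median(jobs$salary_in_usd))

# Count-style target (assumed definition): salary in whole units of USD 10,000.
jobs$salary_10k <- as.integer(jobs$salary_in_usd %/% 10000)

print(jobs)
```

The order of the rules matters: "Machine Learning Engineer" has to be matched before the generic "engineer" rule, otherwise it would be classified as a Data Engineer.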

Data Integration
- Merging salary data with unemployment rates
- Filtering to U.S.-based companies to avoid sample imbalance
- Final modeling dataset: >10,000 observations
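
A minimal sketch of the integration step, using base R `merge()`. The column names are assumptions, and the unemployment values are placeholders, not real World Bank / ILO figures. If the two sources encode countries differently, a conversion step would be needed before the join; the `countrycode` package in the package list is presumably used for that, but this is an assumption.

```r
# Toy salary rows; column names are assumptions.
salaries <- data.frame(
  company_location = c("US", "US", "DE", "US", "GB"),
  work_year        = c(2021, 2022, 2022, 2023, 2023),
  salary_in_usd    = c(120000, 135000, 90000, 150000, 95000)
)

# Placeholder unemployment rates (NOT real data), one row per country and year.
unemployment <- data.frame(
  country_code      = c("US", "US", "US", "DE", "GB"),
  year              = c(2021, 2022, 2023, 2022, 2023),
  unemployment_rate = c(5.0, 4.0, 3.5, 3.0, 4.0)
)

# Left join: keep every salary row and attach the matching country-year rate.
merged <- merge(
  salaries, unemployment,
  by.x  = c("company_location", "work_year"),
  by.y  = c("country_code", "year"),
  all.x = TRUE
)

# Restrict to U.S.-based companies, as described in the list above.
us_only <- subset(merged, company_location == "US")
print(us_only)
```

Using `all.x = TRUE` keeps salary rows that have no matching unemployment value, so missing matches show up as `NA` instead of silently dropping observations.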

Analysis & Visualization
- Linear Regression (with interactions)
- Generalized Linear Models (Poisson, Quasi-Poisson, Binomial)
- Generalized Additive Models (GAM)
- Neural Networks with cross-validation
- Diagnostic plots and residual analysis
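
The regression-type models in this list could be fitted along the following lines. The data are simulated, the variable names and experience codes (`EN`, `MI`, `SE`, `EX`) are assumptions, and the formulas are illustrative rather than the project's actual specifications. GAMs and neural networks are left out because they need additional packages.

```r
# Simulated data; names, codes, and effect sizes are invented for illustration.
set.seed(1)
n <- 500
sim <- data.frame(
  experience   = factor(sample(c("EN", "MI", "SE", "EX"), n, replace = TRUE),
                        levels = c("EN", "MI", "SE", "EX")),
  job_category = factor(sample(c("Data Analyst", "Data Scientist",
                                 "Data Engineer", "AI Engineer"),
                               n, replace = TRUE)),
  remote_ratio = sample(c(0, 50, 100), n, replace = TRUE)
)
sim$log_salary  <- 11 + 0.25 * (as.integer(sim$experience) - 1) + rnorm(n, sd = 0.4)
sim$high_salary <- as.integer(sim$log_salary > median(sim$log_salary))
sim$salary_10k  <- as.integer(exp(sim$log_salary) %/% 10000)

# Linear regression with an experience x job-category interaction.
fit_lm <- lm(log_salary ~ experience * job_category + remote_ratio, data = sim)

# Binomial GLM (logistic regression) on the binary target.
fit_bin <- glm(high_salary ~ experience + job_category + remote_ratio,
               family = binomial, data = sim)

# Quasi-Poisson GLM on the count-style target; the dispersion is estimated
# from the data instead of being fixed at 1 as in a plain Poisson model.
fit_qp <- glm(salary_10k ~ experience + job_category + remote_ratio,
              family = quasipoisson, data = sim)

r2_lm         <- summary(fit_lm)$r.squared
dispersion_qp <- summary(fit_qp)$dispersion
print(c(r2_lm = r2_lm, dispersion_qp = dispersion_qp))
```

Calling `plot(fit_lm)` on the fitted object produces the standard residual diagnostics (residuals vs. fitted, Q-Q plot, scale-location, leverage).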

Getting Started

Prerequisites

R (≥ 4.2 recommended)
RStudio (optional, but recommended)

Required R Packages

install.packages(c(
  "car", "caret", "countrycode", "dplyr", "e1071", "ggplot2",
  "kernlab", "lmtest", "multcomp", "neuralnet", "nnet", "plotly",
  "purrr", "randomForest", "readxl", "tidyr", "VGAM"
))
# "tools" ships with base R and does not need to be installed.

Key Findings (Short Summary)

- Experience level is the strongest salary determinant, with senior and executive roles earning substantially more.
- Job specialization matters: AI Engineers earn the highest salaries, followed by Data Scientists, Data Engineers, and Data Analysts.
- Company size has a modest effect, with small companies paying less on average.
- Remote work ratio and unemployment rate contribute little explanatory power.
- All tested models explain only about 28–30% of salary variance, which suggests that unobserved factors account for most of the variation.
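
One way to estimate the share of variance a model explains on unseen data is k-fold cross-validation. The sketch below uses simulated data and a single hypothetical predictor; it is not the project's evaluation code, and the README does not state how the figures above were computed.

```r
# Simulated data; variable names and effect sizes are invented for illustration.
set.seed(7)
n <- 300
d <- data.frame(
  experience = factor(sample(c("EN", "MI", "SE", "EX"), n, replace = TRUE),
                      levels = c("EN", "MI", "SE", "EX"))
)
d$log_salary <- 11 + 0.25 * (as.integer(d$experience) - 1) + rnorm(n, sd = 0.4)

# Assign each row to one of k folds of equal size.
k    <- 5
fold <- sample(rep(1:k, length.out = n))
pred <- numeric(n)

# Fit on k - 1 folds, predict the held-out fold, and collect the predictions.
for (i in 1:k) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit   <- lm(log_salary ~ experience, data = train)
  pred[fold == i] <- predict(fit, newdata = test)
}

# Out-of-sample R^2: 1 - residual sum of squares / total sum of squares.
ss_res <- sum((d$log_salary - pred)^2)
ss_tot <- sum((d$log_salary - mean(d$log_salary))^2)
cv_r2  <- 1 - ss_res / ss_tot
print(cv_r2)
```

Because every prediction comes from a model that never saw the corresponding row, the cross-validated value is usually somewhat lower than the in-sample R² and gives a fairer basis for comparing models of different flexibility.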

Limitations

- Salary data is self-reported and aggregated from multiple sources.
- The sample is strongly imbalanced toward U.S.-based companies.
- The Poisson and Binomial models rely on artificially constructed count and binary targets derived from salary, not on naturally occurring outcomes.
- Neural networks showed limited performance improvements despite tuning.

License

Code

The code in this repository is provided for academic and analytical purposes.

Data

The Kaggle salary dataset is subject to Kaggle’s licensing terms.

World Bank / ILO unemployment data is subject to the original data provider’s license and attribution requirements.

Author

Celina Breuer (part of a group project)
