This project analyzes key salary determinants in the global data science labor market using statistical and machine learning methods.
The analysis focuses on how experience level, job specialization, company size, remote work, and macroeconomic indicators influence salaries in data science-related roles.
Multiple modeling approaches are applied to identify robust drivers of salary differences and to evaluate the explanatory power and limitations of common machine learning models in an applied labor market context.
Completed (2025). Academic research project.
Which factors significantly influence salaries in the data science industry, and what insights follow for career decisions? The analysis examines how experience, job role, and location relate to data science salaries, focusing on roles such as data scientist, data engineer, and AI specialist. The insights are intended to support career planning and competitive compensation strategies.
- Source: Kaggle – “Latest Data Science Job Salaries 2020–2024”
- Original provider: ai-jobs.net
- Scope: Global data science job listings
- Size: ~14,800 observations, 11 variables
- Key variables:
  - Salary in USD
  - Experience level
  - Job title
  - Employment type
  - Company size
  - Remote work ratio
  - Work year
- Source: World Bank / ILO
- Coverage: Global unemployment rates (1990–2023)
- Integration:
  - Matched by country and year
  - Used as a macroeconomic control variable
Data Cleaning
- Removal of missing and inconsistent entries
- Standardization of job titles
- Handling multicollinearity using VIF and alias diagnostics
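
The multicollinearity check can be sketched as follows. This is a minimal illustration in base R on made-up toy data, not the project's actual script: the column names (`experience_years`, `seniority_score`, `remote_share`) and the helper `vif_manual()` are assumptions introduced here. The package list suggests `car` was used for VIFs in practice; the manual version below only shows the definition VIF_j = 1 / (1 - R²_j).

```r
# Toy data: all values are made up for illustration only.
set.seed(1)
n <- 200
toy <- data.frame(
  experience_years = runif(n, 0, 20),
  remote_ratio     = sample(c(0, 50, 100), n, replace = TRUE)
)
# One nearly collinear predictor and one exactly collinear predictor
toy$seniority_score <- 2 * toy$experience_years + rnorm(n, sd = 0.5)
toy$remote_share    <- toy$remote_ratio / 100
toy$log_salary      <- 11 + 0.03 * toy$experience_years + rnorm(n, sd = 0.3)

# alias() reports predictors that are exact linear combinations of others
fit_full <- lm(log_salary ~ experience_years + seniority_score +
                 remote_ratio + remote_share, data = toy)
aliased <- rownames(alias(fit_full)$Complete)

# Hypothetical helper: regress each predictor on the others and
# convert the resulting R^2 into a variance inflation factor.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Aliased terms must be dropped first; VIF is undefined for them
vifs <- vif_manual(toy, c("experience_years", "seniority_score", "remote_ratio"))
print(aliased)
print(round(vifs, 1))
```

Exactly aliased predictors are removed before computing VIFs, since a perfect linear dependency makes the VIF infinite. Which threshold counts as "too high" is a judgment call that this sketch leaves open.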
Feature Engineering
- Log-transformation of salary
- Aggregation of job titles into four categories:
  - Data Analyst
  - Data Scientist
  - Data Engineer
  - AI Engineer
- Creation of binary and count-based target variables for alternative models
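
The feature engineering steps above might look roughly like this in base R. Everything specific here is an assumption for illustration: the toy rows, the keyword rules in the hypothetical helper `categorize_title()`, the median split for the binary target, and the rounding to units of 10,000 USD for the count-like target. The project's actual mapping and target definitions may differ.

```r
# Toy rows: titles and salaries are made up for illustration only.
jobs <- data.frame(
  job_title = c("Senior Data Analyst", "Data Scientist", "Data Engineer",
                "Machine Learning Engineer", "AI Engineer", "BI Analyst"),
  salary_in_usd = c(90000, 140000, 135000, 170000, 180000, 85000)
)

# Hypothetical keyword rules, checked in order from most to least specific.
categorize_title <- function(title) {
  t <- tolower(title)
  has <- function(pattern) grepl(pattern, t, perl = TRUE)
  ifelse(has("\\bai\\b|machine learning|\\bml\\b"), "AI Engineer",
  ifelse(has("engineer|architect"),                 "Data Engineer",
  ifelse(has("scientist|research"),                 "Data Scientist",
  ifelse(has("analyst|analytics"),                  "Data Analyst",
         NA_character_))))
}

jobs$job_category <- categorize_title(jobs$job_title)

# Log-transform the salary to reduce right skew
jobs$log_salary <- log(jobs$salary_in_usd)

# Assumed binary target: above-median salary (for a binomial model)
jobs$high_salary <- as.integer(jobs$salary_in_usd > median(jobs$salary_in_usd))

# Assumed count-like target: salary in units of 10,000 USD (for a Poisson model)
jobs$salary_10k <- round(jobs$salary_in_usd / 10000)

print(jobs)
```

Titles that match none of the rules become `NA`, which makes unmapped titles easy to spot and review before modeling.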
Data Integration
- Merging salary data with unemployment rates
- Filtering to U.S.-based companies to avoid sample imbalance
- Final modeling dataset: >10,000 observations
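
A minimal sketch of the integration step, assuming the salary data identifies countries by two-letter codes in a `company_location` column and the unemployment table uses three-letter codes. The column names, the small `iso_lookup` table, and all numbers are illustrative assumptions, not real statistics. The package list includes `countrycode`, which may have handled this conversion in the project.

```r
# Toy salary rows (made up)
salaries <- data.frame(
  work_year        = c(2022, 2023, 2023, 2022),
  company_location = c("US", "US", "DE", "GB"),
  salary_in_usd    = c(150000, 160000, 95000, 88000)
)

# Toy unemployment table (placeholder values, not real statistics)
unemployment <- data.frame(
  iso3c             = c("USA", "USA", "DEU", "GBR"),
  year              = c(2022, 2023, 2023, 2022),
  unemployment_rate = c(3.5, 3.7, 3.0, 3.9)
)

# Hypothetical lookup from two-letter to three-letter country codes
iso_lookup <- c(US = "USA", DE = "DEU", GB = "GBR")
salaries$iso3c <- unname(iso_lookup[salaries$company_location])

# Left join on country and year: every salary row is kept,
# and rows without a matching rate would get NA
merged <- merge(salaries, unemployment,
                by.x = c("iso3c", "work_year"),
                by.y = c("iso3c", "year"),
                all.x = TRUE)

# Restrict to U.S.-based companies
us_only <- merged[merged$company_location == "US", ]
print(us_only)
```

Comparing `nrow(merged)` with `nrow(salaries)` after the join is a quick way to confirm that no rows were duplicated by repeated country-year keys.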
Analysis & Visualization
- Linear Regression (with interactions)
- Generalized Linear Models (Poisson, Quasi-Poisson, Binomial)
- Generalized Additive Models (GAM)
- Neural Networks with cross-validation
- Diagnostic plots and residual analysis
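
As a rough illustration of two of the model families listed above, the following base R sketch fits a linear regression with an interaction and a binomial GLM on simulated data. The variable names, factor levels, and simulated coefficients are arbitrary assumptions, so the output says nothing about the project's actual results.

```r
# Simulated data: levels and effect sizes are arbitrary assumptions.
set.seed(42)
n <- 500
sim <- data.frame(
  experience_level = factor(sample(c("EN", "MI", "SE", "EX"), n, replace = TRUE),
                            levels = c("EN", "MI", "SE", "EX")),
  company_size = factor(sample(c("S", "M", "L"), n, replace = TRUE))
)
sim$log_salary <- 11 + 0.25 * (as.integer(sim$experience_level) - 1) +
  rnorm(n, sd = 0.4)

# Linear regression on log salary with an experience x company size interaction
fit_lm <- lm(log_salary ~ experience_level * company_size, data = sim)
r2 <- summary(fit_lm)$r.squared

# F-test: does the interaction improve on the main-effects model?
fit_main <- lm(log_salary ~ experience_level + company_size, data = sim)
print(anova(fit_main, fit_lm))

# Binomial GLM on an assumed binary target (above-median salary)
sim$high_salary <- as.integer(sim$log_salary > median(sim$log_salary))
fit_glm <- glm(high_salary ~ experience_level + company_size,
               data = sim, family = binomial())

# Residuals for diagnostics; plot(fit_lm, which = 1:2) would draw the
# residuals-vs-fitted and normal Q-Q plots
res <- residuals(fit_lm)
print(round(r2, 3))
```

Comparing the nested linear models with `anova()` is one simple way to judge whether interaction terms are worth keeping; it is shown here as an option, not as the procedure the project followed.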
- R (≥ 4.2 recommended)
- RStudio (optional, but recommended)
install.packages(c(
  "car", "caret", "countrycode", "dplyr", "e1071", "ggplot2",
  "kernlab", "lmtest", "multcomp", "neuralnet", "nnet", "plotly",
  "purrr", "randomForest", "readxl", "tidyr", "tools", "VGAM"
))

- Experience level is the strongest salary determinant, with senior and executive roles earning substantially more.
- Job specialization matters: AI Engineers earn the highest salaries, followed by Data Scientists, Data Engineers, and Data Analysts.
- Company size has a modest effect, with small companies paying less on average.
- Remote work ratio and unemployment rate contribute little explanatory power.
- All tested models explain only about 28–30% of salary variance, which suggests that unobserved factors account for much of the variation.
- Salary data is self-reported and aggregated from multiple sources.
- Strong sample imbalance toward U.S.-based companies.
- Some models rely on artificial target transformations (Poisson, Binomial).
- Neural networks showed limited performance improvements despite tuning.
The code in this repository is provided for academic and analytical purposes.
The Kaggle salary dataset is subject to Kaggle’s licensing terms.
World Bank / ILO unemployment data is subject to the original data provider’s license and attribution requirements.
Celina Breuer (part of a group project)