A comprehensive data science project analyzing the AI job market using machine learning techniques for salary prediction, job classification, and market segmentation.
This project analyzes AI job market data to provide insights into:
- Salary Prediction: Predict salaries based on job characteristics
- Job Classification: Categorize jobs by experience level, company size, etc.
- Market Segmentation: Identify distinct job market clusters
- Trend Analysis: Understand market dynamics and patterns
Source: Kaggle AI Job Market Dataset
Size: 15,000 job postings
Features: 19 columns including salary, experience, location, company details, and job requirements
- salary_usd: Annual salary in USD (target for regression)
- experience_level: EN (Entry), MI (Mid), SE (Senior), EX (Executive)
- employment_type: FT, PT, CT, FL
- company_size: S (Small), M (Medium), L (Large)
- remote_ratio: 0, 50, 100 (percentage remote work)
- job_title, company_location, employee_residence
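Since the models depend on these columns and category codes, it can help to check a freshly loaded table against them before any cleaning. The following is a minimal sketch, not the project's own loader: the helper name `validate_schema` is hypothetical, and only the column names and codes listed above are taken from the dataset description.

```python
import pandas as pd

# Category codes as listed in the dataset description above.
EXPECTED_CATEGORIES = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"FT", "PT", "CT", "FL"},
    "company_size": {"S", "M", "L"},
    "remote_ratio": {0, 50, 100},
}
REQUIRED_COLUMNS = [
    "salary_usd", "job_title", "company_location", "employee_residence",
    *EXPECTED_CATEGORIES,
]


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems (empty if none).

    Hypothetical helper for illustration; not part of the repository.
    """
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col, allowed in EXPECTED_CATEGORIES.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - allowed
            if unexpected:
                shown = sorted(str(value) for value in unexpected)
                problems.append(f"{col}: unexpected values {shown}")
    if "salary_usd" in df.columns and (df["salary_usd"] <= 0).any():
        problems.append("salary_usd: non-positive values found")
    return problems
```

Calling `validate_schema(df)` on the loaded DataFrame returns an empty list when every required column is present and every coded column uses only the documented values.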
├── src/                      # Source code (modular architecture)
│   ├── data/                 # Data processing modules
│   │   ├── loader.py         # Database operations and data loading
│   │   └── cleaner.py        # Data cleaning and preprocessing
│   ├── analysis/             # Analysis modules
│   │   ├── eda.py            # Exploratory data analysis
│   │   ├── outliers.py       # Outlier detection and analysis
│   │   └── cleaning.py       # Advanced data cleaning
│   ├── models/               # Machine learning modules
│   │   ├── regression.py     # Salary prediction models
│   │   ├── classification.py # Job classification models
│   │   └── clustering.py     # Market segmentation models
│   └── utils/                # Utility modules
│       ├── config.py         # Configuration management
│       └── helpers.py        # Helper functions
├── scripts/                  # Pipeline scripts
│   ├── run_pipeline.py       # Complete data pipeline
│   ├── run_analysis.py       # Analysis pipeline
│   └── run_modeling.py       # Modeling pipeline
├── config/                   # Configuration files
│   └── config.yaml           # Main configuration
├── outputs/                  # Generated outputs
│   ├── plots/                # Visualizations
│   ├── models/               # Trained models
│   └── reports/              # Analysis reports
├── data/                     # Data files
├── tests/                    # Unit tests
├── run_modeling.py           # Main launcher script
└── main.py                   # CLI entry point
- Clean separation of data processing, analysis, and modeling
- Reusable components with consistent interfaces
- Configuration-driven with YAML config and environment variables
- Comprehensive error handling and logging
- Exploratory Data Analysis: Distribution analysis, correlation matrices, trend visualization
- Outlier Detection: Statistical and ML-based outlier identification
- Data Quality Assessment: Missing values, duplicates, data type validation
- Regression: Salary prediction using multiple algorithms (Linear, Ridge, Lasso, Random Forest, Gradient Boosting)
- Classification: Job categorization (99%+ accuracy achieved)
- Clustering: Market segmentation using K-Means, DBSCAN, Agglomerative, Gaussian Mixture
- Automated visualizations with professional plots
- Detailed reports for each analysis phase
- Model performance metrics and comparisons
- Business insights and recommendations
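To illustrate the model comparison mentioned above for salary regression, here is a minimal sketch that scores the five listed algorithm families by cross-validated R². It assumes scikit-learn and uses synthetic data as a stand-in for the encoded job features; the function name `compare_regressors` and all hyperparameters are illustrative choices, not the settings used in `src/models/regression.py`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def compare_regressors(X, y, cv=5):
    """Return {model name: mean cross-validated R^2}, best model first."""
    # Hyperparameters are illustrative defaults, not the project's tuned values.
    candidates = {
        "Linear": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.1),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
        "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    }
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in candidates.items()
    }
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Synthetic stand-in for the encoded job features and the salary_usd target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
for name, r2 in compare_regressors(X, y).items():
    print(f"{name:<18} R^2 = {r2:.3f}")
```

Cross-validation is used here so that each model is scored on data it was not fitted on; a single train/test split would also work but gives a noisier ranking.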
# Install dependencies
pip install -r requirements.txt
# Set up database credentials (optional)
cp .env.template .env
# Edit .env with your database credentials
# Run everything (recommended for first-time users)
python run_modeling.py --all
# Or run specific components
python run_modeling.py --regression
python run_modeling.py --classification
python run_modeling.py --clustering
- Plots: outputs/plots/ - Visualizations and charts
- Models: outputs/models/ - Trained ML models (.joblib files)
- Reports: outputs/reports/ - Detailed analysis reports
python scripts/run_pipeline.py  # Complete data processing
python scripts/run_analysis.py  # EDA, outliers, cleaning
python scripts/run_modeling.py  # ML models and predictions
- Salary Prediction: R² = 0.66 (Gradient Boosting)
- Job Classification: 100% Accuracy (Random Forest)
- Market Segmentation: 2 optimal clusters identified
- Experience level is the strongest predictor of salary
- Remote work ratio significantly impacts compensation
- Company size correlates with salary ranges
- Geographic location affects market dynamics
- Job market segments into distinct clusters with different characteristics
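The README does not state how the optimal number of clusters was selected. One common approach, sketched below as an assumption rather than the project's actual method, is to fit K-Means for several values of k and keep the one with the highest silhouette score. The helper name `pick_n_clusters` and the synthetic two-blob data are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def pick_n_clusters(X, k_values=range(2, 7), random_state=0):
    """Return (best_k, {k: silhouette score}) for K-Means over k_values."""
    # Scale features so no single column dominates the distance metric.
    X_scaled = StandardScaler().fit_transform(X)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the job feature matrix: two well-separated groups.
X, _ = make_blobs(
    n_samples=400, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)
best_k, scores = pick_n_clusters(X)
print(f"best k by silhouette: {best_k}")
```

The silhouette score only needs the fitted labels, so the same loop can be reused with the other listed algorithms that take a cluster count, such as Agglomerative clustering or Gaussian Mixture.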
- Python 3.8+
- Data Science: pandas, numpy, scipy
- Machine Learning: scikit-learn
- Visualization: matplotlib, seaborn
- Database: SQLite
- Configuration: PyYAML, python-dotenv
- Testing: pytest
- run_modeling.py - Main launcher script
- src/models/regression.py - Salary prediction models
- src/models/classification.py - Job classification models
- src/models/clustering.py - Market segmentation models
- config/config.yaml - Configuration settings
- requirements.txt - Python dependencies
python run_modeling.py --classification --target-column company_size
python run_modeling.py --clustering --n-clusters 5
python -c "from src.models.regression import run_salary_prediction; run_salary_prediction()"
After running the pipeline, you'll have:
- Model performance comparisons
- Prediction vs actual plots
- Confusion matrices
- Cluster visualizations
- Feature importance plots
- Data distribution charts
- Salary predictor (regression)
- Experience level classifier
- Job market clusterer
- Model performance metrics
- Feature importance analysis
- Cluster characteristics
- Business recommendations
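Because trained models are written as .joblib files, they can be reloaded later without retraining. The sketch below shows the general save-and-restore round trip with joblib; the file name `salary_predictor.joblib`, the temporary directory, and the synthetic training data are hypothetical, so check `outputs/models/` for the names the pipeline actually writes.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the prepared salary training data.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the pipeline's own naming may differ.
    model_path = Path(tmp) / "salary_predictor.joblib"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored estimator predicts without being refitted.
predictions = restored.predict(X[:3])
print(predictions)
```

A restored model expects the same feature columns, in the same order and with the same encoding, as the data it was trained on.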
The project uses a centralized configuration system:
- config/config.yaml - Main settings
- .env - Environment-specific variables (database credentials, API keys)
- Command-line arguments for runtime customization
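How these three sources combine is not spelled out here. A common convention, shown below purely as an assumption, is that command-line flags override environment variables, which in turn override the YAML file. In this sketch a plain dict stands in for the parsed `config/config.yaml`, and the keys, default values, and the `N_CLUSTERS` variable name are hypothetical; only the `--n-clusters` and `--target-column` flags come from the usage examples.

```python
import argparse
import os

# Stand-in for values that would be parsed from config/config.yaml.
# Keys and defaults are hypothetical.
YAML_DEFAULTS = {"n_clusters": 2, "target_column": "experience_level"}


def resolve_settings(argv=None, env=None):
    """Merge settings with precedence: CLI flags > environment > YAML defaults."""
    env = os.environ if env is None else env
    settings = dict(YAML_DEFAULTS)

    # Environment override (hypothetical variable name).
    if "N_CLUSTERS" in env:
        settings["n_clusters"] = int(env["N_CLUSTERS"])

    parser = argparse.ArgumentParser()
    parser.add_argument("--n-clusters", type=int, default=None)
    parser.add_argument("--target-column", default=None)
    args = parser.parse_args(argv)

    # Only flags that were actually passed override earlier layers.
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings


print(resolve_settings(["--n-clusters", "5"], env={"N_CLUSTERS": "4"}))
```

Defaulting every flag to `None` is what lets the function tell "flag not given" apart from "flag given", so an unset flag never clobbers a value from the environment or the file.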
# Run unit tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_models.py
- Web interface for interactive analysis
- Real-time data pipeline integration
- Advanced feature engineering
- Deep learning models
- API for model serving
- Dashboard development
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for data science and AI job market analysis