Multi-Dimensional Resume-Job Ranking System

A comprehensive AI-powered system for matching resumes to job descriptions using advanced natural language processing, multi-dimensional scoring, and machine learning models. This project provides complete pipelines for resume parsing, job analysis, and intelligent ranking with research-grade analysis capabilities.

πŸš€ Key Features

πŸ€– AI-Powered Resume Processing

  • GPT-4-mini Integration: Intelligent parsing of resumes into structured JSON format
  • PII Removal: Comprehensive removal of personally identifiable information
  • Multi-format Support: Handles both text and HTML resume formats
  • Structured Data Extraction: Extracts experience, education, skills, and personal information
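
The parsed output is held in Pydantic models (core/models.py). The exact schema lives there; the following is only a minimal sketch, with assumed field names, of what a parsed, structured resume could look like:

# Illustrative only: field names are assumptions, not the schema in core/models.py.
from typing import List, Optional
from pydantic import BaseModel

class ExperienceEntry(BaseModel):
    title: str
    company: str
    start_year: Optional[int] = None
    end_year: Optional[int] = None

class EducationEntry(BaseModel):
    degree: str           # e.g. "Bachelor of Science"
    field_of_study: str   # e.g. "Computer Science"

class ParsedResume(BaseModel):
    category: str                   # e.g. "INFORMATION-TECHNOLOGY"
    skills: List[str] = []
    experience: List[ExperienceEntry] = []
    education: List[EducationEntry] = []
    location: Optional[str] = None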

πŸ“Š Multi-Dimensional Matching Engine

  • 5-Dimensional Scoring: General, Skills, Experience, Location, Education matching
  • Advanced Education Matcher: 177 curated field mappings + 64k Hugging Face academic subjects
  • Skills Matching: Specialized matching for technical and professional skills
  • Experience Matching: Analyzes work experience and career progression
  • Location Matching: Geographic compatibility assessment with semantic similarity
  • Weighted Scoring: Configurable weights for different matching dimensions

πŸ”¬ Research & Analysis Tools

  • Model Comparison: Compare CareerBERT vs general models (all-mpnet-base-v2)
  • SHAP-Enhanced Explainable AI: Feature contribution analysis and what-if scenarios
  • Learning-to-Rank: Machine learning models with adaptive cross-validation
  • Category Analysis: Comprehensive performance analysis by resume categories
  • Diversity Analytics: Bias detection and gender representation analysis
  • πŸ†• Gaucher et al. (2011) Gender Bias Detection: Research-grade analysis of gender-coded language in job descriptions

🌍 Cross-Platform Support

  • Linux/macOS: rank.sh shell script
  • Windows: rank.bat batch file + rank.ps1 PowerShell script
  • Enhanced Windows Integration: Parameter validation, dependency checking, colorized output

🎯 Advanced Features

  • Statistical Analysis: Confidence intervals, significance testing, correlation analysis
  • Bias Mitigation: Gender bias detection using Gaucher et al. methodology
  • Export Capabilities: Multiple output formats for research documentation
  • Batch Processing: Efficient processing of large datasets

πŸ“‹ Prerequisites

  • Python 3.8+ (Python 3.11+ recommended)
  • OpenAI API key (for resume parsing)
  • Git
  • 4GB+ RAM recommended for model processing
  • 2GB+ disk space for models and data

Platform-Specific Requirements

Windows

  • Windows 10+ or Windows Server 2019+
  • PowerShell 5.1+ (for rank.ps1)
  • Command Prompt (for rank.bat)

Linux/macOS

  • Bash shell
  • Standard UNIX utilities

πŸ› οΈ Installation

1. Clone the Repository

git clone <your-repository-url>
cd model-training

2. Create Virtual Environment

Linux/macOS:

python -m venv .venv
source .venv/bin/activate

Windows (Command Prompt):

python -m venv .venv
.venv\Scripts\activate

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\Activate.ps1

3. Install Dependencies

pip install -r requirements.txt

4. Download spaCy Language Model

python -m spacy download en_core_web_lg

⚠️ Important: This downloads the large English language model (774MB) required for:

  • Named Entity Recognition (NER) for PII removal (see the sketch after this list)
  • Advanced text processing and entity extraction
  • Resume parsing and analysis
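
As an illustration of the first point, a minimal sketch of NER-based PII scrubbing with this model (not the project's exact implementation; the real pipeline may redact additional entity types and patterns such as emails and phone numbers):

# Sketch: redact person names and locations detected by spaCy NER.
import spacy

nlp = spacy.load("en_core_web_lg")

def scrub_pii(text: str) -> str:
    doc = nlp(text)
    redacted = text
    for ent in reversed(doc.ents):  # reversed so character offsets stay valid while slicing
        if ent.label_ in {"PERSON", "GPE", "LOC"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(scrub_pii("Jane Doe is a software engineer based in Toronto."))
# -> "[PERSON] is a software engineer based in [GPE]."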

5. Environment Configuration

Create a .env file in the root directory:

cp .env.example .env  # Linux/macOS
copy .env.example .env  # Windows

Edit the .env file and add your OpenAI API key:

OPENAI_SECRET=sk-your-actual-openai-api-key-here
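
The key is read from this file at runtime with python-dotenv; a minimal sketch of the lookup (variable name taken from the example above):

# Sketch: load the OpenAI key from .env with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("OPENAI_SECRET")
if not api_key:
    raise RuntimeError("OPENAI_SECRET is not set; check your .env file")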

6. Download Required Models

Download the sentence transformer models:

python -m core.preprocessors.download_sentence_transformer

This downloads:

  • sentence-transformers/all-mpnet-base-v2 - General purpose model
  • sentence-transformers/careerbert-jg - Job/career specialized model
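
Both are standard SentenceTransformer encoders. A minimal sketch of the cosine-similarity scoring that the matching dimensions build on (illustrative, not the engine's exact code):

# Sketch: score a resume snippet against a job description with a downloaded model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

resume_text = "Five years building Python data pipelines and ML models."
job_text = "Seeking a data engineer with strong Python and machine learning experience."

embeddings = model.encode([resume_text, job_text], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # in [-1, 1]; higher = closer
print(f"semantic similarity: {similarity:.3f}")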

🎯 Quick Start

Linux/macOS

# Make executable and run
chmod +x rank.sh
./rank.sh

Windows - Batch File (Simple)

# Double-click rank.bat or run from Command Prompt
rank.bat

Windows - PowerShell (Recommended)

# Basic run
.\rank.ps1

# Environment check only
.\rank.ps1 -CheckDependencies

# Custom configuration with auto-open results
.\rank.ps1 -NumJobs 5 -NumResumes 20 -OpenResults

# Full research setup
.\rank.ps1 -NumJobs 10 -NumResumes 100 -ModelComparison -ExplainableAI -DiversityAnalysis -LearningToRank -Verbose -OpenResults

Direct Python Execution (All Platforms)

# Basic resume-job ranking
python runners/rank.py --num-jobs 5 --num-resumes 20

# Category comparison analysis
python runners/rank.py --resume-categories INFORMATION-TECHNOLOGY HR --category-analysis

# Model comparison research
python runners/rank.py --model-comparison --explainable-ai --diversity-analysis

πŸ“ Project Structure

model-training/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ matching_engine/          # Multi-dimensional matching system
β”‚   β”‚   β”œβ”€β”€ base.py              # Base matching engine
β”‚   β”‚   β”œβ”€β”€ general.py           # General semantic matching
β”‚   β”‚   β”œβ”€β”€ skills.py            # Skills-specific matching
β”‚   β”‚   β”œβ”€β”€ experience.py        # Experience matching
β”‚   β”‚   β”œβ”€β”€ location.py          # Location matching with semantic similarity
β”‚   β”‚   β”œβ”€β”€ education.py         # πŸ†• Advanced education matching with 177 field mappings
β”‚   β”‚   └── engine.py            # Main engine coordination
β”‚   β”œβ”€β”€ explainable_ai.py        # πŸ†• SHAP-enhanced explainable AI
β”‚   β”œβ”€β”€ learning_to_rank.py      # πŸ†• Learning-to-rank ML models
β”‚   β”œβ”€β”€ diversity_analytics.py   # πŸ†• Comprehensive bias analysis + Gaucher et al.
β”‚   β”œβ”€β”€ models.py                # Pydantic data models
β”‚   β”œβ”€β”€ resume_parser.py         # AI-powered resume parsing
β”‚   β”œβ”€β”€ openai/                  # OpenAI integration
β”‚   └── utils.py                 # Utility functions
β”œβ”€β”€ runners/
β”‚   └── rank.py                  # Main ranking script with advanced features
β”œβ”€β”€ datasets/                    # Data files
β”‚   β”œβ”€β”€ resumes_final.csv        # Processed resume data
β”‚   └── job_descriptions.csv     # Job descriptions data
β”œβ”€β”€ logs/                        # Generated output files
β”‚   β”œβ”€β”€ ranking_results_*.csv    # Ranking results
β”‚   β”œβ”€β”€ diversity_analysis_*.json # πŸ†• Comprehensive diversity analysis
β”‚   β”œβ”€β”€ bias_report_*.txt        # πŸ†• Bias reports with gender-coded language analysis
β”‚   β”œβ”€β”€ explanations_*.json      # πŸ†• SHAP-enhanced explanations
β”‚   β”œβ”€β”€ ml_ranking_results_*.csv # πŸ†• Learning-to-rank results
β”‚   └── feature_importance_*.txt # πŸ†• Feature importance reports
β”œβ”€β”€ rank.sh                      # Linux/macOS configuration script
β”œβ”€β”€ rank.bat                     # πŸ†• Windows batch file
β”œβ”€β”€ rank.ps1                     # πŸ†• Windows PowerShell script (recommended)
β”œβ”€β”€ METHODOLOGY.md               # Detailed methodology
└── README.md                    # This file

πŸ”§ Configuration Options

πŸ“Š Data Configuration

  • --resumes-file: Path to resumes CSV file
  • --jobs-file: Path to job descriptions CSV file
  • --num-resumes: Number of resumes to process
  • --num-jobs: Number of jobs to process

🎯 Filtering Options

  • --resume-categories: Filter by categories (INFORMATION-TECHNOLOGY, HR, AUTOMOBILE)
  • --exclude-resume-categories: Exclude specific categories
  • --job-keywords: Filter jobs by keywords
  • --balanced-categories: Enable balanced category sampling
  • --category-analysis: Enable detailed category analysis

βš–οΈ Scoring Configuration

  • --general-weight: Weight for general semantic matching (default: 8.0)
  • --skills-weight: Weight for skills matching (default: 1.0)
  • --experience-weight: Weight for experience matching (default: 1.0)
  • --location-weight: Weight for location matching (default: 1.0)
  • --education-weight: Weight for education matching (default: 1.0)
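
With these defaults the general dimension dominates the final score. Assuming the total is a weighted average of the five dimension scores (the exact aggregation lives in the matching engine), the effect of the weights looks like this:

# Sketch: weighted aggregation of dimension scores, assumed to be a weighted average.
weights = {"general": 8.0, "skills": 1.0, "experience": 1.0, "location": 1.0, "education": 1.0}
scores = {"general": 72.0, "skills": 85.0, "experience": 60.0, "location": 90.0, "education": 75.0}

total = sum(weights[d] * scores[d] for d in weights) / sum(weights.values())
print(f"total score: {total:.2f}")  # general counts 8x as much as each other dimension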

πŸ€– Model Configuration

  • --general-model: Model for general matching
  • --skills-model: Model for skills matching
  • --model-comparison: Enable model comparison mode
  • --models-to-compare: List of models to compare

🧠 Advanced Features

  • --explainable-ai: Generate SHAP-enhanced explanations
  • --diversity-analysis: Perform comprehensive bias analysis (includes Gaucher et al.)
  • --learning-to-rank: Use ML models for ranking improvement
  • --ltr-model-type: Learning-to-rank model (linear, random_forest, gradient_boosting)

πŸ“ˆ Output Options

  • --output-file: Custom output file path
  • --top-k: Number of top matches per job
  • --verbose: Enable detailed logging

πŸ“Š Understanding the Output

Main Results File

job_id,job_position,job_company,rank,resume_id,resume_category,total_score,general_score,skills_score,experience_score,location_score,education_score
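
Each row is one (job, resume) pair with its rank and per-dimension scores. A quick way to pull the top matches per job (replace the filename with an actual logs/ranking_results_*.csv):

# Sketch: inspect ranking results with pandas.
import pandas as pd

df = pd.read_csv("logs/ranking_results_example.csv")  # placeholder filename
top3 = df[df["rank"] <= 3].sort_values(["job_id", "rank"])
print(top3[["job_position", "rank", "resume_id", "total_score", "skills_score"]])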

πŸ†• Diversity Analysis (JSON)

{
  "summary": {
    "total_candidates": 100,
    "gender_diversity_index": 0.87,
    "diversity_assessment": "High Diversity"
  },
  "gender_coded_language": {
    "methodology": "Gaucher et al. (2011)",
    "overall_statistics": {
      "masculine_bias_percentage": 15.0,
      "feminine_bias_percentage": 8.0,
      "neutral_percentage": 77.0
    },
    "job_analyses": [
      {
        "job_title": "Software Engineer",
        "gender_polarity": 2,
        "bias_classification": "Moderate Masculine Bias",
        "masculine_words_found": ["competitive", "dominant"],
        "recommendations": ["Replace 'competitive' with 'collaborative'"]
      }
    ],
    "bias_assessment": "Low Bias Risk"
  }
}

πŸ†• SHAP Explanations (JSON)

{
  "rank": 1,
  "resume_id": "12345",
  "job_position": "Data Scientist",
  "explanation": {
    "feature_contributions": {
      "general_score": 0.35,
      "skills_score": 0.40,
      "experience_score": 0.15
    },
    "what_if_analysis": {
      "if_skills_improved_10%": {
        "new_total_score": 87.5,
        "rank_change": "+2 positions"
      }
    }
  }
}

πŸ§ͺ Research Applications

πŸ†• Gender Bias Analysis

Analyze job descriptions for gender-coded language using Gaucher et al. (2011) methodology:

# Basic diversity analysis
python runners/rank.py --diversity-analysis

# Focus on bias detection
python runners/rank.py --diversity-analysis --num-jobs 20

Sample Output:

  • Masculine Bias Detected: 15% of jobs contain masculine-coded words
  • Bias Classification: "Competitive", "dominant", "aggressive" language detected
  • Recommendations: Replace biased language with neutral alternatives

Model Performance Evaluation

Compare fine-tuned models (CareerBERT) against general models:

python runners/rank.py --model-comparison --category-analysis

Sample Results:

  • CareerBERT: 59.00 Β± 2.37 average score
  • All-MPNet-Base-v2: 31.34 Β± 0.89 average score
  • 88% improvement with domain-specific model

πŸ†• Explainable AI Analysis

Generate detailed explanations with SHAP:

python runners/rank.py --explainable-ai --num-jobs 5

Features:

  • Feature Contribution Analysis: Which dimensions drive rankings
  • What-if Scenarios: Impact of changing candidate profiles
  • Global Feature Importance: Overall system behavior insights
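
Under the hood this relies on the shap package. A minimal sketch of how per-dimension contributions can be attributed with a KernelExplainer, assuming a simple weighted-average scoring function (illustrative, not core/explainable_ai.py verbatim):

# Sketch: attribute a total score to the five dimension scores with SHAP.
import numpy as np
import shap

FEATURES = ["general", "skills", "experience", "location", "education"]
WEIGHTS = np.array([8.0, 1.0, 1.0, 1.0, 1.0])

def total_score(X: np.ndarray) -> np.ndarray:
    return X @ WEIGHTS / WEIGHTS.sum()  # assumed aggregation for the sketch

background = np.random.default_rng(0).uniform(0, 100, size=(50, 5))  # reference score profiles
candidate = np.array([[72.0, 85.0, 60.0, 90.0, 75.0]])

explainer = shap.KernelExplainer(total_score, background)
contributions = explainer.shap_values(candidate)[0]
for name, value in zip(FEATURES, contributions):
    print(f"{name:<12} contribution: {value:+.2f}")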

πŸ†• Learning-to-Rank Enhancement

Use machine learning to improve ranking quality:

python runners/rank.py --learning-to-rank --ltr-model-type gradient_boosting

Capabilities:

  • Adaptive Cross-Validation: Works with any dataset size
  • Multiple ML Models: Linear, Random Forest, Gradient Boosting
  • Feature Importance: Identifies most important ranking factors
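
A minimal sketch of the pointwise learning-to-rank idea with scikit-learn, including an adaptive fold count that shrinks for small datasets (names and details are illustrative, not core/learning_to_rank.py verbatim):

# Sketch: pointwise learning-to-rank over the five dimension scores with adaptive CV.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(40, 5))                          # dimension scores per (job, resume) pair
y = X @ np.array([8, 1, 1, 1, 1]) / 12 + rng.normal(0, 2, 40)  # synthetic relevance target

n_splits = max(2, min(5, len(X) // 5))  # adaptive CV: never more folds than the data supports

model = GradientBoostingRegressor(random_state=0)
cv_scores = cross_val_score(model, X, y, cv=n_splits, scoring="r2")
print(f"{n_splits}-fold CV R^2: {cv_scores.mean():.3f}")

model.fit(X, y)
for name, importance in zip(["general", "skills", "experience", "location", "education"],
                            model.feature_importances_):
    print(f"{name:<12} importance: {importance:.3f}")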

πŸ” Enhanced Features Deep Dive

πŸ†• Gaucher et al. (2011) Gender Bias Detection

Based on the landmark research: "Evidence That Gendered Wording in Job Advertisements Exists and Sustains Gender Inequality"

Methodology:

  • 42 Masculine-Coded Words: competitive, aggressive, dominant, decisive, etc.
  • 39 Feminine-Coded Words: collaborative, supportive, nurturing, empathetic, etc.
  • Gender Polarity Score: masculine_score - feminine_score

Classification:

  • +3 or higher: Strong Masculine Bias ⚠️
  • +1 to +2: Moderate Masculine Bias
  • -1 to +1: Gender Neutral βœ…
  • -2 to -1: Moderate Feminine Bias
  • -3 or lower: Strong Feminine Bias ⚠️
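
A minimal sketch of the counting and classification (word lists abbreviated; the full 42/39-word lexicons and exact boundary handling in core/diversity_analytics.py may differ):

# Sketch of the Gaucher et al. polarity score; word lists here are small samples.
MASCULINE = {"competitive", "aggressive", "dominant", "decisive", "ambitious"}
FEMININE = {"collaborative", "supportive", "nurturing", "empathetic", "interpersonal"}

def gender_polarity(job_text: str) -> int:
    words = [w.strip(".,;:!?") for w in job_text.lower().split()]
    masculine_score = sum(w in MASCULINE for w in words)
    feminine_score = sum(w in FEMININE for w in words)
    return masculine_score - feminine_score

def classify(polarity: int) -> str:
    # One reading of the thresholds above; boundary cases at +/-1 are treated as moderate bias.
    if polarity >= 3:
        return "Strong Masculine Bias"
    if polarity >= 1:
        return "Moderate Masculine Bias"
    if polarity <= -3:
        return "Strong Feminine Bias"
    if polarity <= -1:
        return "Moderate Feminine Bias"
    return "Gender Neutral"

text = "We want a competitive, dominant self-starter for a collaborative team."
print(gender_polarity(text), classify(gender_polarity(text)))  # 1 Moderate Masculine Bias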

Real-World Impact:

Tools based on this methodology are used by LinkedIn, Indeed, Glassdoor, and Fortune 500 companies to reduce bias in job postings.

πŸ†• Advanced Education Matching

Comprehensive Field Coverage:

  • 64k+ Academic Subjects: Automatically loaded from Hugging Face datasets
  • 177 Local Mappings: Technology, Business, Engineering, Science, Healthcare, etc.
  • Hierarchical Matching: Field categories and subcategories
  • Degree Level Analysis: Certificate β†’ PhD progression
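
A minimal sketch of the degree-level half of this comparison (the level table is illustrative; the field mappings and Hugging Face lookup in core/matching_engine/education.py are not shown):

# Sketch: ordinal degree-level comparison, Certificate -> PhD.
DEGREE_LEVELS = {"certificate": 1, "diploma": 2, "associate": 3,
                 "bachelor": 4, "master": 5, "phd": 6, "doctorate": 6}

def degree_level(degree_name: str) -> int:
    name = degree_name.lower()
    return max((lvl for key, lvl in DEGREE_LEVELS.items() if key in name), default=0)

def degree_score(candidate_degree: str, required_degree: str) -> float:
    # Full credit when the candidate meets or exceeds the requirement, partial credit otherwise.
    have, need = degree_level(candidate_degree), degree_level(required_degree)
    if need == 0:
        return 1.0  # job states no recognizable degree requirement
    return 1.0 if have >= need else max(0.0, have / need)

print(degree_score("Master of Science", "Bachelor's degree"))  # 1.0
print(degree_score("Associate degree", "Master's degree"))     # 0.6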

Data Sources:

  • Primary: WikiAcademicSubjects (Hugging Face)
  • Fallback: Comprehensive local mappings
  • Production Safeguards: Automatic fallback if datasets unavailable

πŸ’» Platform-Specific Usage

Windows PowerShell Examples

# Quick environment check
.\rank.ps1 -CheckDependencies

# Basic analysis with results viewing
.\rank.ps1 -NumJobs 5 -NumResumes 20 -OpenResults

# Research configuration
.\rank.ps1 -NumJobs 10 -NumResumes 100 -ResumeCategories "INFORMATION-TECHNOLOGY","HR" -CategoryAnalysis -ModelComparison -ExplainableAI -DiversityAnalysis -LearningToRank -Verbose

# Custom weights (emphasize skills)
.\rank.ps1 -SkillsWeight 3.0 -GeneralWeight 1.0 -ExperienceWeight 1.0 -LocationWeight 0.5

# Gender bias analysis focus
.\rank.ps1 -DiversityAnalysis -NumJobs 20 -Verbose

Linux/macOS Examples

# Basic run with current configuration
./rank.sh

# Custom configuration (edit rank.sh or run directly)
python runners/rank.py --num-jobs 10 --num-resumes 100 --diversity-analysis --explainable-ai --learning-to-rank

# Model comparison
python runners/rank.py --model-comparison --models-to-compare sentence-transformers/careerbert-jg sentence-transformers/all-mpnet-base-v2

πŸ“ Key Dependencies

# Core ML & NLP
sentence-transformers>=2.2.2
torch>=2.0.0
spacy>=3.7.0
datasets>=2.14.0  # πŸ†• For Hugging Face integration

# Advanced Features
shap==0.43.0      # πŸ†• For explainable AI
scikit-learn>=1.3.0  # πŸ†• For learning-to-rank

# Data Processing  
pandas>=2.0.0
numpy>=1.24.0

# AI Integration
openai>=1.0.0
pydantic>=2.0.0

# Utilities
python-dotenv>=1.0.0
tqdm>=4.65.0

Critical Installation Steps:

  1. Download spaCy model:
python -m spacy download en_core_web_lg
  2. Verify installation (Windows PowerShell):
.\rank.ps1 -CheckDependencies
  3. Test basic functionality:
python runners/rank.py --num-jobs 1 --num-resumes 2

🚨 Important Considerations

Performance & Costs

  • Processing Time: ~30 seconds per resume for AI parsing
  • API Costs: GPT-4-mini usage (~$0.01 per resume)
  • Memory Usage: 2-4GB RAM for model processing
  • Storage: Models require ~2GB disk space

Privacy & Ethics

  • PII Removal: Comprehensive removal of personal identifiers
  • πŸ†• Gender Bias Detection: Research-grade analysis prevents discriminatory language
  • Fair Sampling: Balanced category sampling prevents algorithmic bias
  • Transparency: Full explainability with SHAP analysis

Research Validity

  • Statistical Rigor: Confidence intervals and significance testing
  • Reproducibility: Deterministic processing with comprehensive logging
  • Peer-Reviewed Methods: Implements Gaucher et al. (2011) methodology
  • Academic Standards: Suitable for research publication

πŸ” Troubleshooting

Windows-Specific Issues

PowerShell Execution Policy

# If PowerShell blocks script execution
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Python Not Found (Windows)

# Check if Python is in PATH
python --version

# If not found, reinstall Python with "Add to PATH" option
# Or add manually: System Properties > Environment Variables

Cross-Platform Issues

OpenAI API Errors

# Check API key configuration
cat .env | grep OPENAI_SECRET  # Linux/macOS
type .env | findstr OPENAI_SECRET  # Windows

# Verify account credits and rate limits at platform.openai.com

Model Loading Failures

# Re-download models
python -m core.preprocessors.download_sentence_transformer

# Check disk space (need 2GB+)
df -h          # Linux/macOS
dir C:\        # Windows (free space shown in the summary line)

Memory Issues

# Reduce dataset size
python runners/rank.py --num-resumes 10 --num-jobs 5

# Monitor memory usage
htop           # Linux/macOS
taskmgr        # Windows

πŸ†• Dependency Issues

# Windows: Comprehensive environment check
.\rank.ps1 -CheckDependencies

# Linux/macOS: Manual check
python -c "import torch, transformers, sentence_transformers, spacy, datasets, shap; print('All dependencies OK')"

πŸ“„ Citation

If you use this system in research, please cite:

@software{resume_job_ranking_system,
  title={Multi-Dimensional Resume-Job Ranking System with Gender Bias Detection},
  author={[Your Name]},
  year={2024},
  url={[Repository URL]},
  note={Implements Gaucher et al. (2011) gender bias detection methodology}
}

@article{gaucher2011evidence,
  title={Evidence that gendered wording in job advertisements exists and sustains gender inequality},
  author={Gaucher, Danielle and Friesen, Justin and Kay, Aaron C.},
  journal={Journal of Personality and Social Psychology},
  volume={101},
  number={1},
  pages={109--128},
  year={2011},
  publisher={American Psychological Association}
}

🀝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/new-feature)
  3. Commit changes (git commit -am 'Add new feature')
  4. Push branch (git push origin feature/new-feature)
  5. Create Pull Request

Areas for contribution:

  • Additional bias detection methods
  • New matching dimensions
  • Enhanced explainability features
  • Performance optimizations
  • Cross-platform improvements

πŸ“ž Support

  • Documentation: See METHODOLOGY.md for detailed technical information
  • Issues: Use GitHub Issues for bug reports and feature requests
  • Configuration:
    • See rank.sh (Linux/macOS)
    • See rank.bat and rank.ps1 (Windows)
    • Use .\rank.ps1 -CheckDependencies for Windows diagnostics

πŸ“„ License

MIT License


πŸŽ‰ What's New in v2.0

πŸ†• Enhanced Features

  • βœ… Cross-Platform Support: Windows batch + PowerShell scripts
  • βœ… Gaucher et al. Gender Bias Detection: Research-grade bias analysis
  • βœ… SHAP Explainable AI: Feature contribution analysis
  • βœ… Learning-to-Rank: Machine learning ranking enhancement
  • βœ… Advanced Education Matching: 64k+ academic subjects + 177 local mappings
  • βœ… Comprehensive Diversity Analytics: Gender representation + bias detection
  • βœ… Enhanced Error Handling: Null-safe processing throughout
  • βœ… Adaptive Cross-Validation: Works with any dataset size
  • βœ… Production Safeguards: Robust fallback mechanisms

πŸ”§ Technical Improvements

  • Better memory management for large datasets
  • Improved error messages and diagnostics
  • Enhanced logging and debugging capabilities
  • Optimized model loading and caching
  • Cross-platform path handling

πŸ“Š Research Capabilities

  • Publication-ready statistical analysis
  • Peer-reviewed bias detection methodology
  • Comprehensive explainability features
  • Advanced ML model comparison
  • Industry-standard diversity metrics
