A comprehensive AI-powered system for matching resumes to job descriptions using advanced natural language processing, multi-dimensional scoring, and machine learning models. This project provides complete pipelines for resume parsing, job analysis, and intelligent ranking with research-grade analysis capabilities.
- GPT-4-mini Integration: Intelligent parsing of resumes into structured JSON format
- PII Removal: Comprehensive removal of personally identifiable information
- Multi-format Support: Handles both text and HTML resume formats
- Structured Data Extraction: Extracts experience, education, skills, and personal information
- 5-Dimensional Scoring: General, Skills, Experience, Location, Education matching
- Advanced Education Matcher: 177 curated field mappings + 64k Hugging Face academic subjects
- Skills Matching: Specialized matching for technical and professional skills
- Experience Matching: Analyzes work experience and career progression
- Location Matching: Geographic compatibility assessment with semantic similarity
- Weighted Scoring: Configurable weights for different matching dimensions
- Model Comparison: Compare CareerBERT vs general models (all-mpnet-base-v2)
- SHAP-Enhanced Explainable AI: Feature contribution analysis and what-if scenarios
- Learning-to-Rank: Machine learning models with adaptive cross-validation
- Category Analysis: Comprehensive performance analysis by resume categories
- Diversity Analytics: Bias detection and gender representation analysis
- Gaucher et al. (2011) Gender Bias Detection: Research-grade analysis of gender-coded language in job descriptions
- Linux/macOS: `rank.sh` shell script
- Windows: `rank.bat` batch file + `rank.ps1` PowerShell script
- Enhanced Windows Integration: Parameter validation, dependency checking, colorized output
- Statistical Analysis: Confidence intervals, significance testing, correlation analysis
- Bias Mitigation: Gender bias detection using Gaucher et al. methodology
- Export Capabilities: Multiple output formats for research documentation
- Batch Processing: Efficient processing of large datasets
- Python 3.8+ (Python 3.11+ recommended)
- OpenAI API key (for resume parsing)
- Git
- 4GB+ RAM recommended for model processing
- 2GB+ disk space for models and data
- Windows 10+ or Windows Server 2019+
- PowerShell 5.1+ (for `rank.ps1`)
- Command Prompt (for `rank.bat`)
- Bash shell
- Standard UNIX utilities
```
git clone <your-repository-url>
cd model-training
```

Create and activate a virtual environment:

```
# Linux/macOS
python -m venv .venv
source .venv/bin/activate

# Windows (Command Prompt)
python -m venv .venv
.venv\Scripts\activate

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
```

Install the dependencies and the spaCy language model:

```
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```

The `en_core_web_lg` model provides:

- Named Entity Recognition (NER) for PII removal
- Advanced text processing and entity extraction
- Resume parsing and analysis

Create a `.env` file in the root directory:

```
cp .env.example .env    # Linux/macOS
copy .env.example .env  # Windows
```

Edit the `.env` file and add your OpenAI API key:

```
OPENAI_SECRET=sk-your-actual-openai-api-key-here
```

Download the sentence transformer models:

```
python -m core.preprocessors.download_sentence_transformer
```

This downloads:

- `sentence-transformers/all-mpnet-base-v2` - General purpose model
- `sentence-transformers/careerbert-jg` - Job/career specialized model
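For orientation, the `en_core_web_lg` model installed above is what powers the NER-based PII removal. A minimal sketch of that idea, using a hypothetical `mask_pii` helper rather than the project's actual `resume_parser` code:

```
import spacy

# Load the large English model installed in the step above
nlp = spacy.load("en_core_web_lg")

# The entity labels treated as PII here are an illustrative choice, not the project's exact set
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG", "DATE"}

def mask_pii(text: str) -> str:
    """Replace PII-like entities with placeholder tokens such as [PERSON]."""
    doc = nlp(text)
    masked = text
    # Replace from the end of the string so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in PII_LABELS:
            masked = masked[:ent.start_char] + f"[{ent.label_}]" + masked[ent.end_char:]
    return masked

print(mask_pii("Jane Doe worked at Acme Corp in Berlin from 2018 to 2021."))
```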
Run the pipeline (Linux/macOS):

```
# Make executable and run
chmod +x rank.sh
./rank.sh
```

Windows (Command Prompt):

```
# Double-click rank.bat or run from Command Prompt
rank.bat
```

Windows (PowerShell):

```
# Basic run
.\rank.ps1

# Environment check only
.\rank.ps1 -CheckDependencies

# Custom configuration with auto-open results
.\rank.ps1 -NumJobs 5 -NumResumes 20 -OpenResults

# Full research setup
.\rank.ps1 -NumJobs 10 -NumResumes 100 -ModelComparison -ExplainableAI -DiversityAnalysis -LearningToRank -Verbose -OpenResults
```

Direct Python usage:

```
# Basic resume-job ranking
python runners/rank.py --num-jobs 5 --num-resumes 20

# Category comparison analysis
python runners/rank.py --resume-categories INFORMATION-TECHNOLOGY HR --category-analysis

# Model comparison research
python runners/rank.py --model-comparison --explainable-ai --diversity-analysis
```

Project structure:

```
model-training/
├── core/
│   ├── matching_engine/            # Multi-dimensional matching system
│   │   ├── base.py                 # Base matching engine
│   │   ├── general.py              # General semantic matching
│   │   ├── skills.py               # Skills-specific matching
│   │   ├── experience.py           # Experience matching
│   │   ├── location.py             # Location matching with semantic similarity
│   │   ├── education.py            # Advanced education matching with 177 field mappings
│   │   └── engine.py               # Main engine coordination
│   ├── explainable_ai.py           # SHAP-enhanced explainable AI
│   ├── learning_to_rank.py         # Learning-to-rank ML models
│   ├── diversity_analytics.py      # Comprehensive bias analysis + Gaucher et al.
│   ├── models.py                   # Pydantic data models
│   ├── resume_parser.py            # AI-powered resume parsing
│   ├── openai/                     # OpenAI integration
│   └── utils.py                    # Utility functions
├── runners/
│   └── rank.py                     # Main ranking script with advanced features
├── datasets/                       # Data files
│   ├── resumes_final.csv           # Processed resume data
│   └── job_descriptions.csv        # Job descriptions data
├── logs/                           # Generated output files
│   ├── ranking_results_*.csv       # Ranking results
│   ├── diversity_analysis_*.json   # Comprehensive diversity analysis
│   ├── bias_report_*.txt           # Bias reports with gender-coded language analysis
│   ├── explanations_*.json         # SHAP-enhanced explanations
│   ├── ml_ranking_results_*.csv    # Learning-to-rank results
│   └── feature_importance_*.txt    # Feature importance reports
├── rank.sh                         # Linux/macOS configuration script
├── rank.bat                        # Windows batch file
├── rank.ps1                        # Windows PowerShell script (recommended)
├── METHODOLOGY.md                  # Detailed methodology
└── README.md                       # This file
```
Data options:

- `--resumes-file`: Path to resumes CSV file
- `--jobs-file`: Path to job descriptions CSV file
- `--num-resumes`: Number of resumes to process
- `--num-jobs`: Number of jobs to process

Category options:

- `--resume-categories`: Filter by categories (INFORMATION-TECHNOLOGY, HR, AUTOMOBILE)
- `--exclude-resume-categories`: Exclude specific categories
- `--job-keywords`: Filter jobs by keywords
- `--balanced-categories`: Enable balanced category sampling
- `--category-analysis`: Enable detailed category analysis

Weight options (see the scoring sketch after this list):

- `--general-weight`: Weight for general semantic matching (default: 8.0)
- `--skills-weight`: Weight for skills matching (default: 1.0)
- `--experience-weight`: Weight for experience matching (default: 1.0)
- `--location-weight`: Weight for location matching (default: 1.0)
- `--education-weight`: Weight for education matching (default: 1.0)

Model options:

- `--general-model`: Model for general matching
- `--skills-model`: Model for skills matching
- `--model-comparison`: Enable model comparison mode
- `--models-to-compare`: List of models to compare

Advanced analysis options:

- `--explainable-ai`: Generate SHAP-enhanced explanations
- `--diversity-analysis`: Perform comprehensive bias analysis (includes Gaucher et al.)
- `--learning-to-rank`: Use ML models for ranking improvement
- `--ltr-model-type`: Learning-to-rank model (linear, random_forest, gradient_boosting)

Output options:

- `--output-file`: Custom output file path
- `--top-k`: Number of top matches per job
- `--verbose`: Enable detailed logging
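As a rough illustration of how the weight flags interact, the engine combines the five dimension scores into a weighted total; the exact formula lives in the matching engine and METHODOLOGY.md, so treat this as a sketch of the idea rather than the implementation:

```
# Sketch: weighted combination of per-dimension scores (0-100 scale assumed)
DEFAULT_WEIGHTS = {
    "general": 8.0,
    "skills": 1.0,
    "experience": 1.0,
    "location": 1.0,
    "education": 1.0,
}

def combine_scores(scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of dimension scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

scores = {"general": 72.0, "skills": 85.0, "experience": 60.0, "location": 40.0, "education": 90.0}
print(round(combine_scores(scores), 2))  # the general dimension dominates with weight 8.0
```

Raising `--skills-weight` (or `-SkillsWeight` on Windows) shifts the balance toward the skills dimension in the same way.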
Ranking results CSV columns:

```
job_id,job_position,job_company,rank,resume_id,resume_category,total_score,general_score,skills_score,experience_score,location_score,education_score
```

Diversity analysis JSON (excerpt):

```
{
  "summary": {
    "total_candidates": 100,
    "gender_diversity_index": 0.87,
    "diversity_assessment": "High Diversity"
  },
  "gender_coded_language": {
    "methodology": "Gaucher et al. (2011)",
    "overall_statistics": {
      "masculine_bias_percentage": 15.0,
      "feminine_bias_percentage": 8.0,
      "neutral_percentage": 77.0
    },
    "job_analyses": [
      {
        "job_title": "Software Engineer",
        "gender_polarity": 2,
        "bias_classification": "Moderate Masculine Bias",
        "masculine_words_found": ["competitive", "dominant"],
        "recommendations": ["Replace 'competitive' with 'collaborative'"]
      }
    ],
    "bias_assessment": "Low Bias Risk"
  }
}
```

SHAP explanation JSON (excerpt):

```
{
  "rank": 1,
  "resume_id": "12345",
  "job_position": "Data Scientist",
  "explanation": {
    "feature_contributions": {
      "general_score": 0.35,
      "skills_score": 0.40,
      "experience_score": 0.15
    },
    "what_if_analysis": {
      "if_skills_improved_10%": {
        "new_total_score": 87.5,
        "rank_change": "+2 positions"
      }
    }
  }
}
```

Analyze job descriptions for gender-coded language using the Gaucher et al. (2011) methodology:
```
# Basic diversity analysis
python runners/rank.py --diversity-analysis

# Focus on bias detection
python runners/rank.py --diversity-analysis --num-jobs 20
```

Sample Output:
- Masculine Bias Detected: 15% of jobs contain masculine-coded words
- Bias Classification: "Competitive", "dominant", "aggressive" language detected
- Recommendations: Replace biased language with neutral alternatives
Compare fine-tuned models (CareerBERT) against general models:
```
python runners/rank.py --model-comparison --category-analysis
```

Sample Results:

- CareerBERT: 59.00 ± 2.37 average score
- All-MPNet-Base-v2: 31.34 ± 0.89 average score
- 88% improvement with domain-specific model
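Conceptually, the comparison encodes the same resume/job pair with each model and compares similarity scores. A small sketch using the model IDs from the download step (illustrative; the actual comparison logic is in `runners/rank.py`):

```
from sentence_transformers import SentenceTransformer, util

resume = "Senior data scientist with six years of Python, scikit-learn, and NLP experience."
job = "We are hiring a data scientist to build NLP models in Python."

for model_id in [
    "sentence-transformers/careerbert-jg",      # domain-specialized
    "sentence-transformers/all-mpnet-base-v2",  # general purpose
]:
    model = SentenceTransformer(model_id)
    embeddings = model.encode([resume, job], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{model_id}: cosine similarity = {similarity:.3f}")
```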
Generate detailed explanations with SHAP:
```
python runners/rank.py --explainable-ai --num-jobs 5
```

Features:
- Feature Contribution Analysis: Which dimensions drive rankings
- What-if Scenarios: Impact of changing candidate profiles
- Global Feature Importance: Overall system behavior insights
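A condensed sketch of the underlying idea: fit a model on the five dimension scores and ask SHAP for per-feature contributions (toy data and a stand-in model; the project's `explainable_ai.py` may structure this differently):

```
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["general", "skills", "experience", "location", "education"]

# Toy data: dimension scores -> total ranking score (stand-in for real pipeline output)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, len(FEATURES)))
y = X @ np.array([8.0, 1.0, 1.0, 1.0, 1.0]) / 12.0  # mirrors the default weights

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Explain one candidate's predicted score in terms of feature contributions
explainer = shap.Explainer(model, X)
explanation = explainer(X[:1])
for name, value in zip(FEATURES, explanation.values[0]):
    print(f"{name:12s} contribution: {value:+.2f}")
```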
Use machine learning to improve ranking quality:
```
python runners/rank.py --learning-to-rank --ltr-model-type gradient_boosting
```

Capabilities:
- Adaptive Cross-Validation: Works with any dataset size
- Multiple ML Models: Linear, Random Forest, Gradient Boosting
- Feature Importance: Identifies most important ranking factors
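A rough sketch of pointwise learning-to-rank with adaptive cross-validation, where the fold count shrinks for small runs so cross-validation never fails (illustrative; the actual logic is in `core/learning_to_rank.py`):

```
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def fit_ranker(X, y):
    """Fit a pointwise ranking model; adapt the number of CV folds to the dataset size."""
    n_samples = len(y)
    cv = max(2, min(5, n_samples // 10))  # adaptive: between 2 and 5 folds
    model = GradientBoostingRegressor(random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{cv}-fold CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
    return model.fit(X, y)

# Toy example: 60 candidates, each described by the five dimension scores
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(60, 5))
y = X.mean(axis=1) + rng.normal(0, 5, size=60)
ranker = fit_ranker(X, y)
```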
Based on the landmark research: "Evidence That Gendered Wording in Job Advertisements Exists and Sustains Gender Inequality"
- 42 Masculine-Coded Words: competitive, aggressive, dominant, decisive, etc.
- 39 Feminine-Coded Words: collaborative, supportive, nurturing, empathetic, etc.
- Gender Polarity Score: `masculine_score - feminine_score` (see the sketch after this list)
- +3 or higher: Strong Masculine Bias ⚠️
- +1 to +2: Moderate Masculine Bias
- -1 to +1: Gender Neutral ✅
- -2 to -1: Moderate Feminine Bias
- -3 or lower: Strong Feminine Bias ⚠️
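A minimal sketch of the polarity scoring above, with the word lists heavily truncated (the full Gaucher et al. lists contain 42 masculine-coded and 39 feminine-coded terms):

```
import re

# Abbreviated excerpts of the Gaucher et al. (2011) word lists
MASCULINE_CODED = {"competitive", "aggressive", "dominant", "decisive", "ambitious"}
FEMININE_CODED = {"collaborative", "supportive", "nurturing", "empathetic", "interpersonal"}

def gender_polarity(job_text):
    """Return (polarity, label) where polarity = masculine hits - feminine hits."""
    words = re.findall(r"[a-z]+", job_text.lower())
    masculine = sum(w in MASCULINE_CODED for w in words)
    feminine = sum(w in FEMININE_CODED for w in words)
    polarity = masculine - feminine
    # Boundary handling below is one reasonable reading of the ranges listed above
    if polarity >= 3:
        label = "Strong Masculine Bias"
    elif polarity >= 1:
        label = "Moderate Masculine Bias"
    elif polarity <= -3:
        label = "Strong Feminine Bias"
    elif polarity <= -2:
        label = "Moderate Feminine Bias"
    else:
        label = "Gender Neutral"
    return polarity, label

print(gender_polarity("We want a competitive, dominant engineer who is decisive under pressure."))
```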
Used by LinkedIn, Indeed, Glassdoor, and Fortune 500 companies for bias-free job postings.
- 64k+ Academic Subjects: Automatically loaded from Hugging Face datasets
- 177 Local Mappings: Technology, Business, Engineering, Science, Healthcare, etc.
- Hierarchical Matching: Field categories and subcategories
- Degree Level Analysis: Certificate → PhD progression
- Primary: WikiAcademicSubjects (Hugging Face)
- Fallback: Comprehensive local mappings
- Production Safeguards: Automatic fallback if datasets unavailable
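A sketch of the load-with-fallback pattern described above; the Hugging Face dataset ID and column name below are placeholders, and the real matcher lives in `core/matching_engine/education.py`:

```
from datasets import load_dataset

# Small excerpt of the curated local field mappings used as a fallback
LOCAL_FIELD_MAPPINGS = {
    "computer science": "Technology",
    "software engineering": "Technology",
    "business administration": "Business",
    "mechanical engineering": "Engineering",
    "nursing": "Healthcare",
}

def load_academic_subjects():
    """Prefer the Hugging Face subject list; fall back to local mappings if unavailable."""
    try:
        # Placeholder dataset ID and column name -- substitute the actual WikiAcademicSubjects dataset
        subjects = load_dataset("your-org/wiki-academic-subjects", split="train")
        return {row["subject"].lower() for row in subjects}
    except Exception:
        # Production safeguard: offline or missing dataset falls back to the curated mappings
        return set(LOCAL_FIELD_MAPPINGS)

print(f"Loaded {len(load_academic_subjects())} academic subjects")
```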
Windows (PowerShell):

```
# Quick environment check
.\rank.ps1 -CheckDependencies

# Basic analysis with results viewing
.\rank.ps1 -NumJobs 5 -NumResumes 20 -OpenResults

# Research configuration
.\rank.ps1 -NumJobs 10 -NumResumes 100 -ResumeCategories "INFORMATION-TECHNOLOGY","HR" -CategoryAnalysis -ModelComparison -ExplainableAI -DiversityAnalysis -LearningToRank -Verbose

# Custom weights (emphasize skills)
.\rank.ps1 -SkillsWeight 3.0 -GeneralWeight 1.0 -ExperienceWeight 1.0 -LocationWeight 0.5

# Gender bias analysis focus
.\rank.ps1 -DiversityAnalysis -NumJobs 20 -Verbose
```

Linux/macOS:

```
# Basic run with current configuration
./rank.sh

# Custom configuration (edit rank.sh or run directly)
python runners/rank.py --num-jobs 10 --num-resumes 100 --diversity-analysis --explainable-ai --learning-to-rank

# Model comparison
python runners/rank.py --model-comparison --models-to-compare sentence-transformers/careerbert-jg sentence-transformers/all-mpnet-base-v2
```

Key dependencies (`requirements.txt`):

```
# Core ML & NLP
sentence-transformers>=2.2.2
torch>=2.0.0
spacy>=3.7.0
datasets>=2.14.0       # For Hugging Face integration

# Advanced Features
shap==0.43.0           # For explainable AI
scikit-learn>=1.3.0    # For learning-to-rank
# Data Processing
pandas>=2.0.0
numpy>=1.24.0
# AI Integration
openai>=1.0.0
pydantic>=2.0.0
# Utilities
python-dotenv>=1.0.0
tqdm>=4.65.0
```

Critical Installation Steps:

- Download the spaCy model: `python -m spacy download en_core_web_lg`
- Verify the installation (Windows PowerShell): `.\rank.ps1 -CheckDependencies`
- Test basic functionality: `python runners/rank.py --num-jobs 1 --num-resumes 2`

Performance notes:

- Processing Time: ~30 seconds per resume for AI parsing
- API Costs: GPT-4-mini usage (~$0.01 per resume)
- Memory Usage: 2-4GB RAM for model processing
- Storage: Models require ~2GB disk space
- PII Removal: Comprehensive removal of personal identifiers
- Gender Bias Detection: Research-grade analysis flags gender-coded, potentially discriminatory language
- Fair Sampling: Balanced category sampling prevents algorithmic bias
- Transparency: Full explainability with SHAP analysis
- Statistical Rigor: Confidence intervals and significance testing
- Reproducibility: Deterministic processing with comprehensive logging
- Peer-Reviewed Methods: Implements Gaucher et al. (2011) methodology
- Academic Standards: Suitable for research publication
PowerShell Execution Policy
```
# If PowerShell blocks script execution
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Python Not Found (Windows)

```
# Check if Python is in PATH
python --version

# If not found, reinstall Python with "Add to PATH" option
# Or add manually: System Properties > Environment Variables
```

OpenAI API Errors

```
# Check API key configuration
cat .env | grep OPENAI_SECRET      # Linux/macOS
type .env | findstr OPENAI_SECRET  # Windows

# Verify account credits and rate limits at platform.openai.com
```

Model Loading Failures

```
# Re-download models
python -m core.preprocessors.download_sentence_transformer

# Check disk space (need 2GB+)
df -h       # Linux/macOS
dir C:\ /s  # Windows
```

Memory Issues

```
# Reduce dataset size
python runners/rank.py --num-resumes 10 --num-jobs 5

# Monitor memory usage
htop     # Linux/macOS
taskmgr  # Windows
```

Dependency Issues

```
# Windows: Comprehensive environment check
.\rank.ps1 -CheckDependencies

# Linux/macOS: Manual check
python -c "import torch, transformers, sentence_transformers, spacy, datasets, shap; print('All dependencies OK')"
```

If you use this system in research, please cite:
```
@software{resume_job_ranking_system,
  title={Multi-Dimensional Resume-Job Ranking System with Gender Bias Detection},
  author={[Your Name]},
  year={2024},
  url={[Repository URL]},
  note={Implements Gaucher et al. (2011) gender bias detection methodology}
}

@article{gaucher2011evidence,
  title={Evidence that gendered wording in job advertisements exists and sustains gender inequality},
  author={Gaucher, Danielle and Friesen, Justin and Kay, Aaron C},
  journal={Journal of Personality and Social Psychology},
  volume={101},
  number={1},
  pages={109},
  year={2011},
  publisher={American Psychological Association}
}
```

To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push the branch (`git push origin feature/new-feature`)
- Create a Pull Request
Areas for contribution:
- Additional bias detection methods
- New matching dimensions
- Enhanced explainability features
- Performance optimizations
- Cross-platform improvements
- Documentation: See METHODOLOGY.md for detailed technical information
- Issues: Use GitHub Issues for bug reports and feature requests
- Configuration:
  - See `rank.sh` (Linux/macOS)
  - See `rank.bat` and `rank.ps1` (Windows)
  - Use `.\rank.ps1 -CheckDependencies` for Windows diagnostics
MIT License
- ✅ Cross-Platform Support: Windows batch + PowerShell scripts
- ✅ Gaucher et al. Gender Bias Detection: Research-grade bias analysis
- ✅ SHAP Explainable AI: Feature contribution analysis
- ✅ Learning-to-Rank: Machine learning ranking enhancement
- ✅ Advanced Education Matching: 64k+ academic subjects + 177 local mappings
- ✅ Comprehensive Diversity Analytics: Gender representation + bias detection
- ✅ Enhanced Error Handling: Null-safe processing throughout
- ✅ Adaptive Cross-Validation: Works with any dataset size
- ✅ Production Safeguards: Robust fallback mechanisms
- Better memory management for large datasets
- Improved error messages and diagnostics
- Enhanced logging and debugging capabilities
- Optimized model loading and caching
- Cross-platform path handling
- Publication-ready statistical analysis
- Peer-reviewed bias detection methodology
- Comprehensive explainability features
- Advanced ML model comparison
- Industry-standard diversity metrics