This project features a multilabel text classification model to extract hard and soft skills from tech job descriptions. The model was trained using data scraped from Indeed. The data was collected in two steps:
- Job URL collection: Due to scraping limitations, I could only collect about 26k job links, taken from the first page of listings for various job locations. The URL scraper can be found in `scraper\link_scraper.py`.
- Job description scraping: Using the job URLs, the job descriptions were scraped with `scraper\job_description_scraper.py`.

All the scraped data can be found in the `data` folder.

- Detects 83 different skills (hard + soft skills)
  - Hard skills: Python, SQL, JavaScript, React, Node.js, AWS, Docker, Git, Java, C++, etc.
  - Soft skills: Teamwork, Collaboration, Communication, Critical Thinking, Problem Solving, Leadership, Adaptability, etc.
- State-of-the-art transformer-based multilabel classification
- Best performing model: ModernBERT, with near-perfect results
- Fast inference using ONNX Runtime
- Live interactive demo on Hugging Face Spaces
- Full-featured Flask web application including:
  - Job Description → Skills Extraction
  - Resume ↔ Job Description → Skill Matching (shows matching & missing skills + confidence scores)
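
As a rough illustration of the extraction step, a multilabel classifier typically emits one score (logit) per skill, and each score is passed through a sigmoid and compared with a threshold independently. The sketch below assumes that setup; the skill names, logit values, and the 0.5 threshold are made up for the example and are not taken from this project's code.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def decode_skills(logits, skill_names, threshold=0.5):
    """Turn one logit per skill into (skill, confidence) pairs above the threshold."""
    scored = [(name, sigmoid(z)) for name, z in zip(skill_names, logits)]
    kept = [(name, p) for name, p in scored if p >= threshold]
    # Highest-confidence skills first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Made-up names and logits for illustration; the real model scores 83 skills.
skill_names = ["Python", "SQL", "Docker", "Teamwork"]
logits = [3.2, -1.5, 0.4, 2.0]
extracted = decode_skills(logits, skill_names)
print([name for name, _ in extracted])  # ['Python', 'Teamwork', 'Docker']
```

Because every skill is thresholded on its own, a single job description can yield any number of skills, which is what distinguishes multilabel from multiclass classification.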

Manually labeling all 22k job descriptions was impractical, so I used a rule-based multi-labeling system built on regular expressions.
- Comprehensive Skills Dictionary: A total of 83 target skills (hard + soft) were defined, covering the most frequently mentioned competencies in tech job postings.
- Regex-Based Pattern Matching: Each skill is associated with a carefully crafted regular expression that captures:
  - Common abbreviations used for the skill
  - Different spellings and variations
  - Full names and short forms
  - Case-insensitive matches
- One-Hot Encoding: Each job description was checked against every skill's pattern; a skill's label is 1 if its pattern matches the text and 0 otherwise.
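
The labeling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual labeling code: the six skills and their patterns are hypothetical stand-ins for the real 83-skill dictionary, and `label_description` is a name invented for this example.

```python
import re

# Hypothetical subset of the skills dictionary. The patterns are illustrative
# only and are not the ones used to label the real dataset.
SKILL_PATTERNS = {
    "Python": r"\bpython\b",
    # Accept "JavaScript" or "JS", but not the "js" inside "Node.js".
    "JavaScript": r"\b(?:javascript|(?<!\.)js)\b",
    "Node.js": r"\bnode(?:\.js|js)?\b",
    # "+" is not a word character, so \b cannot be used around "C++".
    "C++": r"(?<![\w+])c\+\+(?![\w+])",
    "AWS": r"\b(?:aws|amazon web services)\b",
    "Communication": r"\bcommunicat(?:ion|e|ing)\b",
}

# Compile once, case-insensitively.
COMPILED = {
    skill: re.compile(pattern, re.IGNORECASE)
    for skill, pattern in SKILL_PATTERNS.items()
}
SKILLS = list(COMPILED)

def label_description(text):
    """Return a 0/1 vector with one entry per skill, in SKILLS order."""
    return [1 if COMPILED[skill].search(text) else 0 for skill in SKILLS]

example = (
    "We need a JS developer with Node experience, C++ a plus. "
    "Strong communication skills."
)
labels = label_description(example)
print(dict(zip(SKILLS, labels)))
```

Each job description ends up as one row of 0s and 1s, and stacking those rows gives the multilabel target matrix used for training. Word boundaries and lookarounds matter here: without them, short patterns such as `js` or `c++` would match inside unrelated tokens.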
I initially started with distilroberta-base, then explored three more transformer-based models and found ModernBERT to be the best among them, with an accuracy of 0.9996. Finally, I converted the trained model to ONNX.

| Model | Accuracy | F1-Samples | F1-Macro | F1-Micro |
|---|---|---|---|---|
| distilroberta-base | 0.9700 | 0.9446 | 0.9243 | 0.9498 |
| modernbert (best) | 0.9996 | 0.9928 | 0.9969 | 0.9990 |
| all-MiniLM-L6-v2 | 0.9700 | 0.9028 | 0.8782 | 0.9081 |
| bert-base-uncased | 0.9800 | 0.9304 | 0.9110 | 0.9348 |
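
For readers unfamiliar with the column names, the sketch below shows one common way these multilabel metrics are defined. It is written in plain Python on toy data and is not the evaluation code used for the table. In particular, it assumes "Accuracy" means exact-match (subset) accuracy, which the table does not confirm; the figure may instead have been computed per label.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined here as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(true, pred):
    """True positives, false positives, false negatives for two 0/1 sequences."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, fp, fn

def multilabel_scores(y_true, y_pred):
    n_samples = len(y_true)
    n_labels = len(y_true[0])

    # Exact-match accuracy: a sample counts only if every label is right.
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_samples

    # F1-Samples: F1 of each row (sample), then averaged.
    f1_samples = sum(f1(*counts(t, p)) for t, p in zip(y_true, y_pred)) / n_samples

    # Per-label counts, taken column by column.
    per_label = [counts(ct, cp) for ct, cp in zip(zip(*y_true), zip(*y_pred))]

    # F1-Macro: F1 of each label, then averaged (rare labels weigh as much as common ones).
    f1_macro = sum(f1(*c) for c in per_label) / n_labels

    # F1-Micro: pool the counts over all labels, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    f1_micro = f1(tp, fp, fn)

    return {"accuracy": accuracy, "f1_samples": f1_samples,
            "f1_macro": f1_macro, "f1_micro": f1_micro}

# Toy data: 3 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

A macro score that sits below the micro score, as it does for three of the four models in the table, usually means some less frequent skills are predicted worse than the common ones.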

The model is deployed as a Gradio app on Hugging Face Spaces. The implementation is available in the deployment folder and in the Gradio app itself.

The Flask-based web app lets users extract skills from job descriptions and match their resumes against the required skills. Try the Skill Extractor and Resume-Matcher on Render.
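
One way to picture the resume matching is as a set comparison over the skills detected in each text. The sketch below is an assumption about how such a comparison could work, not the app's actual code: `match_skills`, the input dictionaries of skill to confidence, and the 0.5 threshold are all hypothetical.

```python
def match_skills(resume_scores, job_scores, threshold=0.5):
    """Split the job's required skills into those the resume has and those it lacks.

    Both inputs map skill name -> confidence score in [0, 1].
    """
    required = {s for s, p in job_scores.items() if p >= threshold}
    present = {s for s, p in resume_scores.items() if p >= threshold}

    matching = sorted(required & present)
    missing = sorted(required - present)

    return {
        # Confidence that the resume shows the skill.
        "matching": {s: resume_scores[s] for s in matching},
        # Confidence that the job asks for the skill.
        "missing": {s: job_scores[s] for s in missing},
    }

# Made-up scores for illustration.
job_scores = {"Python": 0.98, "SQL": 0.91, "Docker": 0.77, "Leadership": 0.12}
resume_scores = {"Python": 0.95, "SQL": 0.40, "Git": 0.88}

result = match_skills(resume_scores, job_scores)
print(result)
# {'matching': {'Python': 0.95}, 'missing': {'Docker': 0.77, 'SQL': 0.91}}
```

Skills that appear only in the resume (Git here) are ignored, since the comparison is driven by what the job description requires.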

To run the Flask app locally:

    # Clone the repo
    git clone https://github.com/Naawshin/Multilabel-Skill-Classifier.git
    cd Multilabel-Skill-Classifier
    # Switch to the flask branch
    git switch flask
    # Create and activate a virtual environment
    # (Windows shown; on macOS/Linux use: source venv/bin/activate)
    python -m venv venv
    venv\Scripts\activate
    # Install dependencies
    pip install -r requirements.txt
    # Start the app
    python app.py

Feedback, feature requests, bug reports, and pull requests are very welcome!
Feel free to reach out:
- ✉️ Email: nawshintabassum88@gmail.com
- 🔗 LinkedIn: https://www.linkedin.com/in/nowshin-tabasum/

If this project helped you or you just found it interesting, please consider giving it a star ⭐. It really helps the project grow!


