This project provides a workflow for coding job descriptions with O*NET-SOC codes and their corresponding Job Zones (related to Holland Codes). It combines Python and R scripts to automate the process, from translating job titles into English to fetching O*NET data, and includes a manual verification step.
- Automated O*NET Coding: Fetches O*NET-SOC codes for a list of job titles using the O*NET API.
- Translation Support: Can translate job titles from other languages to English using the DeepL API before coding.
- Interactive QA: Includes a manual verification step in R for low-confidence matches, ensuring data quality.
- Configuration-based: API keys and settings are managed in a `config.json` file, not hardcoded in the scripts.
- Modular Design: Separates the O*NET API interaction (Python) from data processing and user interaction (R).
- `ONET_coder.py`: A Python script that connects to the O*NET API and retrieves SOC codes for a given list of queries.
- `job_description_coding.R`: The main R script that orchestrates the entire workflow, including data preparation, translation, calling the Python script, and manual data verification.
- `config.json.template`: A template for the configuration file. You need to copy this to `config.json` and add your API keys.
- `.gitignore`: Excludes sensitive and intermediate files from version control.
- `README.md`: This file.
- R and RStudio: You will need R and preferably RStudio installed on your system.
- Python 3: The `ONET_coder.py` script is written in Python 3.
- API Keys:
  - O*NET Web Services: You need to register for a free account at the O*NET Developer Center to get a username and password.
  - DeepL API (Optional): If you need to translate job descriptions, you'll need a DeepL API key. A free plan is available.
- Clone the repository and change into its directory: run `git clone <repository-url>`, then `cd ONET_Holland_Codes`.
- Install R Packages: The `job_description_coding.R` script will automatically try to install the required R packages when you run it for the first time. The required packages are `reticulate`, `deeplr`, `jsonlite`, and `rstudioapi`.
- Create the Configuration File:
  - Make a copy of the `config.json.template` file and rename it to `config.json`.
  - Open `config.json` and fill in your O*NET username/password and your DeepL API key.

        {
          "onet_username": "YOUR_ONET_USERNAME",
          "onet_password": "YOUR_ONET_PASSWORD",
          "deepl_auth_key": "YOUR_DEEPL_AUTH_KEY"
        }

    If you don't need translation, you can leave `deepl_auth_key` empty.
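A minimal sketch of how a script might load and validate `config.json`, assuming only the three keys shown in the template above. The `load_config` and `translation_enabled` helpers are hypothetical names used for illustration; they are not taken from the repository's scripts.

```python
import json
from pathlib import Path

# Keys from config.json.template; the DeepL key may be left empty.
REQUIRED_KEYS = ("onet_username", "onet_password")
OPTIONAL_KEYS = ("deepl_auth_key",)


def load_config(path="config.json"):
    """Read the config file and fail early if O*NET credentials are missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config is missing required values: {', '.join(missing)}")
    # Normalize optional keys (absent or null) to an empty string.
    for key in OPTIONAL_KEYS:
        config[key] = config.get(key) or ""
    return config


def translation_enabled(config):
    """Assumption: translation is attempted only when a DeepL key is present."""
    return bool(config["deepl_auth_key"].strip())
```

Failing early on missing credentials avoids starting a long coding run that would only break at the first API request.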
- Prepare the O*NET Reference Data: The script requires a file named `onet_data_reference.csv` in the root directory. The file is used to map O*NET codes to Job Zones. Although a 2024 version is provided with the repository, please create a current version of this file yourself from the official O*NET database.
  - Go to the O*NET Database download page.
  - Download the following files in Excel format:
    - Job Zones: Under the "Education, Experience, Training" section.
    - Interests: Under the "Interests" section. This file contains the Holland Codes (RIASEC).
  - Create a new CSV file named `onet_data_reference.csv`.
  - The CSV file must have at least two columns:
    - `onet_code`: The O*NET-SOC code (e.g., "15-1252.00").
    - `job_zone`: The Job Zone number.
    - You can include other columns from the downloaded files as needed.
  - The script will look up the `job_zone` based on the `onet_code`.
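The lookup described above can be sketched in Python as follows. This assumes only the two required columns, `onet_code` and `job_zone`; the helper names `load_job_zones` and `lookup_job_zone` are hypothetical, and the repository's R script may perform the lookup differently.

```python
import csv


def load_job_zones(path="onet_data_reference.csv"):
    """Return a dict mapping each onet_code to its job_zone as an int."""
    job_zones = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            code = (row.get("onet_code") or "").strip()
            zone = (row.get("job_zone") or "").strip()
            if code and zone:
                # Spreadsheet exports sometimes write 4 as "4.0".
                job_zones[code] = int(float(zone))
    return job_zones


def lookup_job_zone(job_zones, onet_code):
    """Return the Job Zone for a code, or None when the code is not listed."""
    return job_zones.get(onet_code.strip())
```

Extra columns in the CSV are ignored, so the same file can also carry the Interests data without affecting the lookup.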
- Prepare your data: The `job_description_coding.R` script currently uses a sample `job_data` dataframe. You should replace this with your own data. Your dataframe must have a column named `job_description`.
- Run the R script:
  - Open `job_description_coding.R` in RStudio.
  - Run the script line by line or all at once.
- Manual Verification:
  - The script will pause and ask for your input if it finds an uncertain match for a job title.
  - Follow the prompts in the R console to select the correct O*NET code or provide one manually.
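This README does not state how a match is judged uncertain. As a purely illustrative sketch, one possible rule flags a query when its best candidate scores low or when the top two candidates are nearly tied. The `(onet_code, score)` pair format, the `needs_manual_review` name, and both thresholds are assumptions, not the criteria used by the R script.

```python
def needs_manual_review(candidates, min_score=50.0, min_gap=10.0):
    """Decide whether a query's matches should go to a human reviewer.

    candidates: list of (onet_code, score) pairs; a higher score means a
    better match. Thresholds are illustrative defaults, not project values.
    """
    if not candidates:
        return True  # nothing found, so a person must supply a code
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    best_score = ranked[0][1]
    if best_score < min_score:
        return True  # the best match is weak
    if len(ranked) > 1 and best_score - ranked[1][1] < min_gap:
        return True  # the top two matches are too close to call
    return False
```

Only queries for which this returns `True` would be shown to the reviewer; the rest would keep their top-ranked code.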
- Get the results:
  - After the script finishes, your `job_data` dataframe will be updated with two new columns: `onet_code` and `job_zone`.
  - The script will also generate `input.json` and `output.json` as intermediate files. These are safe to delete after the script has finished.
- The R script reads your job descriptions.
- If a DeepL API key is provided, it translates the job descriptions to English.
- It creates an `input.json` file containing your O*NET credentials and the list of job titles to query.
- It calls the `ONET_coder.py` script, which reads `input.json`, queries the O*NET API, and saves the results to `output.json`.
- The R script then reads `output.json` and starts the interactive QA process.
- Finally, it maps the verified O*NET codes to Job Zones using your `onet_data_reference.csv` file.
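The file hand-off between the two scripts can be sketched as below. The JSON layout (`username`, `password`, `queries` keys, and a query-to-candidates mapping in the output) is an assumption made for illustration; the actual structure used by `ONET_coder.py` is not documented here. The `search` argument is a stand-in for the real O*NET API request so that the sketch runs without network access.

```python
import json


def write_input(path, username, password, queries):
    """Write the hand-off file for the coder step (hypothetical layout)."""
    payload = {"username": username, "password": password, "queries": list(queries)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)


def run_coder(input_path, output_path, search):
    """Read the input file, look up each query, and write the output file.

    `search` takes (query, username, password) and returns a list of
    candidate codes; it replaces the real API call in this sketch.
    """
    with open(input_path, encoding="utf-8") as handle:
        payload = json.load(handle)
    results = {
        query: search(query, payload["username"], payload["password"])
        for query in payload["queries"]
    }
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, ensure_ascii=False, indent=2)
    return results
```

Because the input file holds credentials, it makes sense that the intermediate files are excluded from version control and can be deleted once the run has finished.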