timseidel/onet_holland_codes

O*NET Holland Codes Coder

This project provides a workflow for coding job descriptions with O*NET-SOC codes and their corresponding Job Zones (related to Holland Codes). It combines Python and R scripts to automate the process, from translating job titles into English to fetching O*NET data and manually verifying uncertain matches.

Features

  • Automated O*NET Coding: Fetches O*NET-SOC codes for a list of job titles using the O*NET API.
  • Translation Support: Can translate job titles from other languages to English using the DeepL API before coding.
  • Interactive QA: Includes a manual verification step in R for low-confidence matches, ensuring data quality.
  • Configuration-based: API keys and settings are managed in a config.json file, not hardcoded in the scripts.
  • Modular Design: Separates the O*NET API interaction (Python) from data processing and user interaction (R).

File Structure

  • ONET_coder.py: A Python script that connects to the O*NET API and retrieves SOC codes for a given list of queries.
  • job_description_coding.R: The main R script that orchestrates the entire workflow, including data preparation, translation, calling the Python script, and manual data verification.
  • config.json.template: A template for the configuration file. You need to copy this to config.json and add your API keys.
  • .gitignore: Excludes sensitive and intermediate files from version control.
  • README.md: This file.

Prerequisites

  • R and RStudio: You will need R and preferably RStudio installed on your system.
  • Python 3: The ONET_coder.py script is written in Python 3.
  • API Keys:
    • O*NET Web Services: You need to register for a free account at the O*NET Developer Center to get a username and password.
    • DeepL API (Optional): If you need to translate job descriptions, you'll need a DeepL API key. A free plan is available.

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd onet_holland_codes

  2. Install R Packages: The job_description_coding.R script tries to install the required R packages automatically the first time you run it. The required packages are: reticulate, deeplr, jsonlite, and rstudioapi.

  3. Create the Configuration File:

    • Make a copy of the config.json.template file and rename it to config.json.
    • Open config.json and fill in your O*NET username/password and your DeepL API key.

    {
      "onet_username": "YOUR_ONET_USERNAME",
      "onet_password": "YOUR_ONET_PASSWORD",
      "deepl_auth_key": "YOUR_DEEPL_AUTH_KEY"
    }

    If you don't need translation, you can leave deepl_auth_key empty.
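
    As an illustration of how a script might consume this file, here is a minimal Python sketch. The function name load_config and the extra "translate" flag are hypothetical and not part of the repository; only the three key names come from the template above.

```python
import json
from pathlib import Path

# The O*NET credentials are required; deepl_auth_key may stay empty.
REQUIRED_KEYS = ("onet_username", "onet_password")


def load_config(path="config.json"):
    """Read the config file and check that the O*NET credentials are filled in."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [
        key for key in REQUIRED_KEYS
        if not config.get(key) or str(config[key]).startswith("YOUR_")
    ]
    if missing:
        raise ValueError(f"config.json is missing values for: {', '.join(missing)}")
    # Translation is optional: an empty or placeholder key disables it.
    deepl_key = config.get("deepl_auth_key") or ""
    config["translate"] = bool(deepl_key) and not deepl_key.startswith("YOUR_")
    return config
```

    Treating the unedited YOUR_... placeholders as missing values is a design choice of this sketch; it catches the common mistake of copying the template without editing it.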

  4. Prepare the O*NET Reference Data: The script requires a file named onet_data_reference.csv in the repository root, which it uses to map O*NET codes to Job Zones. The repository ships with a 2024 version of this file, but you should create a current version yourself from the official O*NET database.

    • Go to the O*NET Database download page.
    • Download the following files in Excel format:
      • Job Zones: Under the "Education, Experience, Training" section.
      • Interests: Under the "Interests" section. This file contains the Holland Codes (RIASEC).
    • Create a new CSV file named onet_data_reference.csv with at least these two columns:
      • onet_code: The O*NET-SOC code (e.g., "15-1252.00").
      • job_zone: The Job Zone number.
    • You can include other columns from the downloaded files (for example, the Holland Code interest data) as needed.
    • The script looks up the job_zone for each onet_code in this file.
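
    The reference file can also be produced programmatically. The following Python sketch writes and reads a CSV with the two required columns. It assumes the downloaded Job Zones rows carry the headers "O*NET-SOC Code" and "Job Zone"; check these against your download. The function names are hypothetical and not part of the repository.

```python
import csv


def write_reference_csv(rows, path="onet_data_reference.csv"):
    """Write rows to a CSV with the onet_code and job_zone columns.

    rows: iterable of dicts keyed by the (assumed) O*NET headers
    "O*NET-SOC Code" and "Job Zone".
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["onet_code", "job_zone"])
        writer.writeheader()
        for row in rows:
            writer.writerow({
                "onet_code": str(row["O*NET-SOC Code"]).strip(),
                "job_zone": int(row["Job Zone"]),
            })


def load_job_zones(path="onet_data_reference.csv"):
    """Return a dict mapping each onet_code to its job_zone."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["onet_code"]: int(row["job_zone"]) for row in csv.DictReader(handle)}
```

    If pandas and an Excel reader are installed, the rows could come from something like pandas.read_excel(...).to_dict("records") applied to the downloaded Job Zones file. A lookup is then load_job_zones().get(code), which returns None for codes missing from the reference file.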

How to Run

  1. Prepare your data: The job_description_coding.R script currently uses a sample job_data dataframe. You should replace this with your own data. Your dataframe must have a column named job_description.

  2. Run the R script:

    • Open job_description_coding.R in RStudio.
    • Run the script line by line or all at once.

  3. Manual Verification:

    • The script pauses and asks for your input when it finds an uncertain match for a job title.
    • Follow the prompts in the R console to select the correct O*NET code or enter one manually.

  4. Get the results:

    • After the script finishes, your job_data data frame has two new columns: onet_code and job_zone.
    • The script also generates input.json and output.json as intermediate files. You can delete them once the script has finished. Because input.json contains your O*NET credentials, do not share or commit it.
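
The manual verification step depends on separating confident matches from uncertain ones. The README does not state how the script decides this, so the Python sketch below is only one plausible approach: it assumes each match carries a numeric score and sends anything below a threshold to review. The keys "query", "onet_code", and "score", the function name, and the default threshold are all hypothetical.

```python
def split_by_confidence(matches, threshold=50.0):
    """Partition matches into auto-accepted and needs-review lists.

    matches: list of dicts with the hypothetical keys
    "query", "onet_code", and "score".
    """
    accepted, review = [], []
    for match in matches:
        # A missing code or a score below the threshold goes to manual review.
        if match.get("onet_code") and match.get("score", 0.0) >= threshold:
            accepted.append(match)
        else:
            review.append(match)
    return accepted, review
```

Only the entries in the review list would then be shown to the user in the console prompt.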

How it Works

  1. The R script reads your job descriptions.
  2. If a DeepL API key is provided, it translates the job descriptions to English.
  3. It creates an input.json file containing your O*NET credentials and the list of job titles to query.
  4. It calls the ONET_coder.py script, which reads input.json, queries the O*NET API, and saves the results to output.json.
  5. The R script then reads output.json and starts the interactive QA process.
  6. Finally, it maps the verified O*NET codes to Job Zones using your onet_data_reference.csv file.
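
Steps 3 to 5 describe a file-based handoff between R and Python. The sketch below shows what the Python side of such a handoff could look like. It is not the actual ONET_coder.py: the JSON keys ("onet_username", "onet_password", "queries") and the shape of each result are assumptions, and the O*NET request itself is passed in as a callable named search so the example runs without network access.

```python
import json


def code_queries(input_path, output_path, search):
    """Read queries from input.json, look each one up, write output.json.

    search: any callable taking (query, username, password) and returning
    a list of candidate dicts, best match first. In a real run it would
    wrap the O*NET API request.
    """
    with open(input_path, encoding="utf-8") as handle:
        payload = json.load(handle)

    results = []
    for query in payload["queries"]:
        candidates = search(query, payload["onet_username"], payload["onet_password"])
        best = candidates[0] if candidates else None
        results.append({
            "query": query,
            "onet_code": best.get("code") if best else None,
            "candidates": candidates,  # kept so the QA step can offer alternatives
        })

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    return results
```

Keeping every candidate in the output, not only the top one, is a choice made here so that an interactive review step has alternatives to offer.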
