SG4NLP: Synthetic Data Generation for NLP

This repository provides tools for generating synthetic datasets with LLMs for Natural Language Processing (NLP) tasks such as Intent Detection, Named Entity Recognition (NER), and Semantic Textual Similarity (STS). More specifically, we focus on evaluating the efficacy of synthetic datasets as benchmarks. For more information, please refer to our paper.


Repository Structure

SG4NLP/
│
├── data/
│   ├── generated/                 # Generated datasets
│   ├── intent_dataset/            # Datasets for intent recognition
│   ├── ner_dataset/               # Datasets for named entity recognition
│   └── sts_dataset/               # Datasets for text similarity
│
├── src/
│   ├── generate_dataset/          # Dataset generation modules
│   ├── method/                    # Prediction methods
│   └── parse_dataset/             # Dataset parsing utilities
│
├── mlflow.zip                     # Zip file containing predictions for different methods and tasks (must be unzipped)
└── config.py                      # Configuration file for running the pipelines

Running the Pipeline

  • To run a specific configuration, modify the parameters in the config.py file and execute the pipeline with runner.py (see the sketch after this list).
  • To run multiple configurations and tasks (including data generation, testing, and evaluation on the original datasets), use the respective runner scripts:
    • ner_runner.py for Named Entity Recognition
    • intent_detection_runner.py for Intent Detection
    • text_similarity_runner.py for Text Similarity
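A minimal sketch of a single run. The parameter names below are hypothetical placeholders for illustration, not the ones actually defined in config.py; check that file for the real configuration options:

```python
# config.py -- hypothetical parameter names, for illustration only
TASK = "ner"                 # task to run: e.g. "ner", "intent", "sts"
MODEL = "gpt-4o-mini"        # LLM used for generation
EXAMPLES_PER_CLASS = 1       # examples shown to the LLM per class
TOTAL_EXAMPLES = 200         # total number of examples to generate

# then, from the repository root:
#   python runner.py
```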

Benchmarking

  • All predictions are stored in the mlflow folder. After unzipping mlflow.zip, you can analyze the results with the Jupyter notebook mlflow_results_analysis.ipynb.
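Any unzip tool works; for example, using Python's standard zipfile module from the repository root:

```python
# Extract mlflow.zip so the predictions land under ./mlflow/
import zipfile

with zipfile.ZipFile("mlflow.zip") as zf:
    zf.extractall(".")
```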

Generated Datasets

  • Synthetic datasets are saved in the data/generated/ folder in pickle format.
  • The filename encodes the key properties of the generated dataset:
    DatasetName_LLM_ExamplesShownPerClass_TotalNumberOfExamplesGenerated.pkl

For example:

  • atis_gpt-4o-mini_1_200.pkl: the atis dataset, generated by gpt-4o-mini, with 1 example shown per class, and 200 examples generated in total

To load and explore the generated datasets, use the generated_dataset_access.py script in the src directory, or load a file directly as sketched below.
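A minimal sketch for loading a generated file directly with pickle; the internal structure of the stored object is not documented here, so inspect it after loading:

```python
import pickle

# Load one of the generated datasets (path assumes the example above)
with open("data/generated/atis_gpt-4o-mini_1_200.pkl", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))  # inspect the object to see how it is structured
```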

Prompts

All prompts used for generation and benchmarking are available in the prompts.py file.

Notes

  • The Anyscale endpoint stopped working after the project ended. We recommend using grok, hyperscale, or a custom endpoint as a replacement.
  • Add all API access keys to a .env file (see the sketch below).
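A minimal sketch, assuming the code reads keys via python-dotenv; the variable names are placeholders, not necessarily the exact names the code expects:

```python
# .env (placeholder variable names -- check the source for the exact ones)
#   OPENAI_API_KEY=sk-...
#   CUSTOM_ENDPOINT_API_KEY=...

from dotenv import load_dotenv   # pip install python-dotenv
import os

load_dotenv()                          # reads KEY=value pairs from .env
api_key = os.getenv("OPENAI_API_KEY")  # access a key by name
```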
