SG4NLP: Synthetic Data Generation for NLP

This repository provides tools for generating synthetic datasets with LLMs for Natural Language Processing (NLP) tasks such as Intent Detection, Named Entity Recognition (NER), and Semantic Textual Similarity (STS). More specifically, we focus on evaluating the efficacy of synthetic datasets as benchmarks. For more information, please refer to our paper.


Repository Structure

SG4NLP/
│
├── data/
│   ├── generated/                 # Generated datasets
│   ├── intent_dataset/            # Datasets for intent recognition
│   ├── ner_dataset/               # Datasets for named entity recognition
│   └── sts_dataset/               # Datasets for text similarity
│
├── src/
│   ├── generate_dataset/          # Dataset generation modules
│   ├── method/                    # Prediction methods
│   └── parse_dataset/             # Dataset parsing utilities
│
├── mlflow.zip                     # Zip file containing predictions for different methods and tasks (must be unzipped)
└── config.py                      # Configuration file for running the pipelines

Running the Pipeline

  • To run a specific configuration, modify the parameters in the config.py file and execute the pipeline with runner.py (see the sketch after this list).
  • To run multiple configurations and tasks (including data generation, testing, and evaluation on the original datasets), use the respective runner scripts:
    • ner_runner.py for Named Entity Recognition
    • intent_detection_runner.py for Intent Detection
    • text_similarity_runner.py for Text Similarity
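A minimal sketch of a single run. The parameter names below are hypothetical placeholders for illustration, not the ones actually defined in config.py; check that file for the real configuration options:

```python
# config.py -- hypothetical parameter names, for illustration only
TASK = "ner"                 # task to run: e.g. "ner", "intent", "sts"
MODEL = "gpt-4o-mini"        # LLM used for generation
EXAMPLES_PER_CLASS = 1       # examples shown to the LLM per class
TOTAL_EXAMPLES = 200         # total number of examples to generate

# then, from the repository root:
#   python runner.py
```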

Benchmarking

  • All predictions are stored in the mlflow folder. After unzipping mlflow.zip, you can analyze the results with the Jupyter notebook mlflow_results_analysis.ipynb.
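Any unzip tool works; for example, using Python's standard zipfile module from the repository root:

```python
# Extract mlflow.zip so the predictions land under ./mlflow/
import zipfile

with zipfile.ZipFile("mlflow.zip") as zf:
    zf.extractall(".")
```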

Generated Datasets

  • Synthetic datasets are saved in the data/generated/ folder in pickle format.
  • The filename encodes the key properties of the generated dataset:
    DatasetName_LLM_ExamplesShownPerClass_TotalNumberOfExamplesGenerated.pkl

For example:

  • atis_gpt-4o-mini_1_200.pkl: the atis dataset, generated by gpt-4o-mini, with 1 example shown per class, and 200 examples generated in total

To load and explore the generated datasets, use the generated_dataset_access.py script in the src directory, or load a file directly as sketched below.
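A minimal sketch for loading a generated file directly with pickle; the internal structure of the stored object is not documented here, so inspect it after loading:

```python
import pickle

# Load one of the generated datasets (path assumes the example above)
with open("data/generated/atis_gpt-4o-mini_1_200.pkl", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))  # inspect the object to see how it is structured
```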

Prompts

All prompts used for generation and benchmarking are available in the prompts.py file.

Notes

  • The Anyscale endpoint stopped working after the project ended. We recommend using grok, hyperscale, or a custom endpoint as a replacement.
  • Add all API access keys to a .env file (see the sketch below).
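A minimal sketch, assuming the code reads keys via python-dotenv; the variable names are placeholders, not necessarily the exact names the code expects:

```python
# .env (placeholder variable names -- check the source for the exact ones)
#   OPENAI_API_KEY=sk-...
#   CUSTOM_ENDPOINT_API_KEY=...

from dotenv import load_dotenv   # pip install python-dotenv
import os

load_dotenv()                          # reads KEY=value pairs from .env
api_key = os.getenv("OPENAI_API_KEY")  # access a key by name
```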
