This repository provides tools for generating synthetic datasets with LLMs for Natural Language Processing (NLP) tasks such as Intent Detection, Named Entity Recognition (NER), and Text Similarity (STS). More specifically, we focus on evaluating the efficacy of synthetic datasets as benchmarks. For more information, please refer to our paper.
```
SG4NLP/
│
├── data/
│   ├── generated/           # Generated datasets
│   ├── intent_dataset/      # Datasets for intent recognition
│   ├── ner_dataset/         # Datasets for named entity recognition
│   └── sts_dataset/         # Datasets for text similarity
│
├── src/
│   ├── generate_dataset/    # Dataset generation modules
│   ├── method/              # Prediction methods
│   └── parse_dataset/       # Dataset parsing utilities
│
├── mlflow.zip               # Zip file containing predictions for different methods and tasks (must be unzipped)
└── config.py                # Configuration file for running the pipelines
```

- To run a specific configuration, modify the parameters in the `config.py` file and execute the pipeline using `runner.py`.
- To run multiple configurations and tasks (including data generation, testing, and evaluation over the original datasets), use the respective runner scripts (a minimal launch sketch follows this list):
  - `ner_runner.py` for Named Entity Recognition
  - `intent_detection_runner.py` for Intent Detection
  - `text_similarity_runner.py` for Text Similarity
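A minimal launch sketch, assuming each runner is a plain Python script executed from the repository root (the invocation style is an assumption):

```python
import subprocess

# Run each task-specific runner as a separate process; adjust the
# interpreter or working directory to match your environment.
for script in ["ner_runner.py", "intent_detection_runner.py", "text_similarity_runner.py"]:
    subprocess.run(["python", script], check=True)
```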
- All predictions are stored in the `mlflow` folder. After unzipping `mlflow.zip` (see the extraction sketch after this list), you can analyze the results using the Jupyter notebook `mlflow_results_analysis.ipynb`.
- Synthetic datasets are saved in the `data/generated/` folder in pickle format.
- The filename encodes details about the generated dataset: `DatasetName_LLM_ExamplesShownPerClass_TotalNumberOfExamplesGenerated.pkl`. For example, `atis_gpt-4o-mini_1_200.pkl` is the ATIS dataset, generated by gpt-4o-mini, with 1 example shown per class and 200 examples generated in total.
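A minimal sketch for extracting the bundled predictions before opening the notebook:

```python
import zipfile

# Unpack mlflow.zip in place; this creates the mlflow folder referenced above.
with zipfile.ZipFile("mlflow.zip") as zf:
    zf.extractall(".")
```

To illustrate the naming convention, here is a hypothetical helper (not part of the repository) that splits a generated-dataset filename into its parts:

```python
from pathlib import Path

def parse_generated_filename(path: str) -> dict:
    """Split DatasetName_LLM_ExamplesShownPerClass_TotalExamples.pkl into fields."""
    # Assumes the dataset name itself contains no underscores.
    dataset, llm, shown, total = Path(path).stem.split("_")
    return {
        "dataset": dataset,
        "llm": llm,
        "examples_shown_per_class": int(shown),
        "total_examples_generated": int(total),
    }

print(parse_generated_filename("data/generated/atis_gpt-4o-mini_1_200.pkl"))
# {'dataset': 'atis', 'llm': 'gpt-4o-mini', 'examples_shown_per_class': 1, 'total_examples_generated': 200}
```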
To load and explore the generated datasets, use the `generated_dataset_access.py` script in the `src` directory.
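If you want to peek at a file directly, a minimal sketch with the standard `pickle` module (the stored object type is not documented here, so inspect it after loading):

```python
import pickle

# Load one generated dataset; generated_dataset_access.py remains the supported interface.
with open("data/generated/atis_gpt-4o-mini_1_200.pkl", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))
```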
All the prompts used for dataset generation and benchmarking are available in the `prompts.py` file.
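As a rough sketch, assuming `prompts.py` exposes its templates as module-level strings (an assumption about its layout), you can list them like this:

```python
import prompts  # assumes src/prompts.py is on the import path

# Print the name and a preview of every module-level string constant,
# which we assume are the prompt templates.
for name, value in vars(prompts).items():
    if isinstance(value, str) and not name.startswith("_"):
        print(f"{name}: {value[:60]}...")
```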
- The Anyscale endpoint stopped working after the project concluded; we recommend using grok, hyperscale, or a custom endpoint as a replacement (see the sketch below).
- Add all API access keys to the `.env` file.
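A minimal sketch wiring both together, assuming an OpenAI-compatible replacement endpoint and the `python-dotenv` package (the package choice and the variable names are assumptions, not part of the repository):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import OpenAI       # pip install openai

# Read keys such as CUSTOM_API_KEY from the local .env file (names are illustrative).
load_dotenv()

# Point an OpenAI-compatible client at the replacement endpoint.
client = OpenAI(
    base_url=os.environ["CUSTOM_BASE_URL"],  # e.g. your custom endpoint URL
    api_key=os.environ["CUSTOM_API_KEY"],
)
```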