This repository contains the official implementation for the SIGMOD 2026 paper:
AutoDDG: Automated Dataset Description Generation using Large Language Models
AutoDDG is an automated system for generating comprehensive, accurate, readable, and concise dataset descriptions. The framework combines a data-driven approach to summarize dataset contents with large language models (LLMs) to enrich summaries with semantic information and produce human-readable descriptions. AutoDDG supports both API-based (OpenAI) and local LLM (transformers) modes, providing flexibility for different deployment scenarios.
Clone the repository and install dependencies via uv (recommended):

```bash
git clone https://github.com/VIDA-NYU/AutoDDG.git
cd AutoDDG
uv sync

# If you do not have uv installed:
# * `curl -LsSf https://astral.sh/uv/install.sh | sh`
# * or see https://docs.astral.sh/uv/getting-started/installation/
```

Then launch Jupyter Lab to explore:

```bash
uv run --with jupyter jupyter lab
```

Alternatively, install directly via pip:

```bash
pip install git+https://github.com/VIDA-NYU/AutoDDG@main
```

For local LLM support (Qwen, Llama, etc.), install with the optional dependencies:
Using uv (recommended):

```bash
uv sync --extra local-llm
```

Using pip:

```bash
pip install "autoddg[local-llm] @ git+https://github.com/VIDA-NYU/AutoDDG@main"
# or
pip install git+https://github.com/VIDA-NYU/AutoDDG@main transformers torch
```

> [!CAUTION]
> This installation method is temporary. A PyPI release of AutoDDG will be available soon, after which the git+https method will be deprecated in favor of the PyPI package.
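After installing, a quick way to confirm that the optional dependencies are importable is a generic Python check (this helper is ours, not part of the AutoDDG API):

```python
# Sanity check: confirm the local-LLM extras are importable.
# (Generic Python; not part of the AutoDDG API.)
import importlib.util

def missing_extras(packages=("transformers", "torch")):
    """Return the optional packages that are not importable."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

missing = missing_extras()
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("Local-LLM extras are installed.")
```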
AutoDDG supports both API-based (OpenAI) and local LLM (transformers) modes.
The simplest way to use AutoDDG is with an OpenAI API client:

```python
from openai import OpenAI

from autoddg import AutoDDG

# Set up the OpenAI client
client = OpenAI(api_key="sk-...")

# Initialize AutoDDG
autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")

# Generate a description from a small CSV sample
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.describe_dataset(dataset_sample=sample_csv)
print(description)
# >>> This dataset contains medical information about patients, including their
# unique Case_ID, Age, and Body Mass Index (BMI). ...
```

AutoDDG also supports local LLMs via transformers (Qwen, Llama, etc.):
```python
from autoddg import AutoDDG

# Initialize AutoDDG with a local LLM
autoddg = AutoDDG(
    client=None,
    model_name="Qwen/Qwen2.5-7B-Instruct",  # or any HuggingFace model
    use_local_llm=True,
    local_llm_device="cuda",  # or "cpu" if no GPU
    local_llm_dtype="bfloat16",  # or "float16", "float32"
)

# Generate a description
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.describe_dataset(dataset_sample=sample_csv)
print(description)
```

Note: for local LLM support, ensure you have installed the optional dependencies:

```bash
pip install transformers torch
```

AutoDDG provides multiple processing modes for semantic profiling, so you can optimize performance for your use case:
| Mode | OpenAI API | Local LLM | Description |
|---|---|---|---|
| Sequential | ✅ | ✅ | Default mode, processes columns one by one |
| Multi-threading | ✅ | ❌ | Concurrent processing for faster execution |
| Group-prompting | ✅ | ✅ | Processes multiple columns in one prompt |
| Batch processing | ❌ | ✅ | Efficient GPU utilization for local models |
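The compatibility rules in the table can be captured in a small guard before calling `analyze_semantics`. A minimal sketch (the helper and the mode names are ours, not part of the AutoDDG API):

```python
# Which semantic-profiling modes each backend supports, per the table above.
# (Helper and mode names are illustrative, not part of the AutoDDG API.)
SUPPORTED_MODES = {
    "openai": {"sequential", "multi_threading", "group_prompting"},
    "local": {"sequential", "group_prompting", "batch_processing"},
}

def check_mode(backend: str, mode: str) -> str:
    """Raise early instead of failing mid-run on an unsupported combination."""
    if mode not in SUPPORTED_MODES[backend]:
        raise ValueError(f"{mode!r} is not supported with the {backend!r} backend")
    return mode

check_mode("openai", "multi_threading")  # fine
# check_mode("local", "multi_threading") would raise ValueError
```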
The default mode processes columns sequentially and works with both API and local LLMs:

```python
# Sequential mode (default)
semantic_profile = autoddg.analyze_semantics(dataframe)
```

Use multi-threading to process columns concurrently for faster execution. Only available for OpenAI API clients:
```python
# Multi-threading mode (OpenAI API only)
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_multi_threading=True,
    max_workers=32,  # Optional: number of concurrent workers
)
```

Process multiple columns in a single prompt to reduce the number of API calls. Efficient for both API and local LLMs:
```python
# Group-prompting: process all columns at once
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_group_prompting=True,
    group_size=0,  # 0 = all columns at once, >0 = group size
)

# Or process in groups of 5 columns
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_group_prompting=True,
    group_size=5,
)
```

For local LLMs, use batch processing for efficient GPU utilization. Only available with local LLMs:
```python
# Batch processing mode (local LLM only)
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_batch_processing=True,
    batch_size=32,  # Number of columns to process per batch
)
```

Important notes:
- Multi-threading is only available for OpenAI API clients
- Batch processing is only available for local LLMs
- Group-prompting works with both API and local LLMs
- Batch processing takes precedence over other modes if enabled
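The precedence rule can be made concrete with a tiny dispatch sketch. Batch processing wins, as stated above; the relative order of the remaining flags is our assumption for illustration, not AutoDDG's documented behavior:

```python
def resolve_mode(use_multi_threading=False, use_group_prompting=False,
                 use_batch_processing=False):
    """Which mode effectively runs. Batch processing takes precedence (per the
    notes above); the ordering of the other flags is an illustrative
    assumption, not AutoDDG's documented behavior."""
    if use_batch_processing:
        return "batch_processing"
    if use_multi_threading:
        return "multi_threading"
    if use_group_prompting:
        return "group_prompting"
    return "sequential"

print(resolve_mode(use_batch_processing=True, use_multi_threading=True))
# → batch_processing
```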
For a more thorough introduction, we recommend starting with the quick_start notebook, which walks through an example dataset.
If you use AutoDDG in your research, please cite our work:

```bibtex
@misc{2502.01050,
  author = {Haoxiang Zhang and Yurong Liu and Wei-Lun Hung and Aécio Santos and Juliana Freire},
  title = {AutoDDG: Automated Dataset Description Generation using Large Language Models},
  year = {2025},
  eprint = {arXiv:2502.01050},
}
```

AutoDDG is released under the Apache License 2.0.