
🏷️ AutoDDG: Automated Dataset Description Generation using Large Language Models

Requires Python >= 3.10. Supports both the OpenAI API and local LLMs.

Overview

This repository contains the official implementation for the SIGMOD 2026 paper:

AutoDDG: Automated Dataset Description Generation using Large Language Models

AutoDDG is an automated system for generating comprehensive, accurate, readable, and concise dataset descriptions. The framework combines a data-driven approach to summarize dataset contents with large language models (LLMs) to enrich summaries with semantic information and produce human-readable descriptions. AutoDDG supports both API-based (OpenAI) and local LLM (transformers) modes, providing flexibility for different deployment scenarios.

Installation

Clone the repository and install dependencies via uv (recommended):

git clone https://github.com/VIDA-NYU/AutoDDG.git
cd AutoDDG
uv sync
# If you do not have uv installed:
# * `curl -LsSf https://astral.sh/uv/install.sh | sh`
# * or look at https://docs.astral.sh/uv/getting-started/installation/

Then launch Jupyter Lab to explore:

uv run --with jupyter jupyter lab

Alternatively, install directly via pip:

pip install git+https://github.com/VIDA-NYU/AutoDDG@main

For local LLM support (Qwen, Llama, etc.), install with optional dependencies:

Using uv (recommended):

uv sync --extra local-llm

Using pip:

pip install "AutoDDG[local-llm] @ git+https://github.com/VIDA-NYU/AutoDDG@main"
# or
pip install git+https://github.com/VIDA-NYU/AutoDDG@main transformers torch

Caution

This installation method is temporary: a PyPI release of AutoDDG is coming soon, after which the git+https method will be deprecated in favor of installing from PyPI.


Getting Started

AutoDDG supports both API-based (OpenAI) and local LLM (transformers) modes.

Using OpenAI API

The simplest way to use AutoDDG is with an OpenAI API client:

from openai import OpenAI
from autoddg import AutoDDG

# Setup OpenAI client
client = OpenAI(api_key="sk-...")

# Initialize AutoDDG
autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")

# Generate description from a small CSV sample
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.describe_dataset(dataset_sample=sample_csv)

print(description)
# >>> This dataset contains medical information about patients, including their unique Case_ID, Age, and Body Mass Index (BMI). etc.
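Note that describe_dataset only needs a small text sample, not the full file. For larger datasets, one way to build such a sample is to truncate a DataFrame with pandas (a sketch; the toy data and row count are illustrative):

```python
import pandas as pd

# Toy DataFrame standing in for a larger dataset loaded from disk
df = pd.DataFrame({
    "Case_ID": ["C3L-00004", "C3L-00010", "C3L-00026"],
    "Age": [72, 30, 61],
    "BMI": [22.8, 34.15, 27.2],
})

# Keep only the first rows as a compact text sample to pass as dataset_sample
sample_csv = df.head(2).to_csv(index=False)
print(sample_csv)
```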

Using Local LLM

AutoDDG also supports local LLMs via transformers (Qwen, Llama, etc.):

from autoddg import AutoDDG

# Initialize AutoDDG with local LLM
autoddg = AutoDDG(
    client=None,
    model_name="Qwen/Qwen2.5-7B-Instruct",  # or any HuggingFace model
    use_local_llm=True,
    local_llm_device="cuda",  # or "cpu" if no GPU
    local_llm_dtype="bfloat16",  # or "float16", "float32"
)

# Generate description
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.describe_dataset(dataset_sample=sample_csv)
print(description)

Note: For local LLM support, ensure you have installed the optional dependencies:

pip install transformers torch

Semantic Profiler Processing Modes

AutoDDG provides multiple processing modes for semantic profiling to optimize performance based on your use case:

Mode             | OpenAI API | Local LLM | Description
-----------------|------------|-----------|-------------------------------------------
Sequential       | ✓          | ✓         | Default mode; processes columns one by one
Multi-threading  | ✓          | ✗         | Concurrent processing for faster execution
Group-prompting  | ✓          | ✓         | Processes multiple columns in one prompt
Batch processing | ✗          | ✓         | Efficient GPU utilization for local models

Sequential Mode (Default)

The default mode processes columns one at a time and works with both the OpenAI API and local LLMs:

# Sequential mode (default)
semantic_profile = autoddg.analyze_semantics(dataframe)
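Note that analyze_semantics takes a pandas DataFrame rather than a raw CSV string; the sample used in the earlier examples can be converted with pandas:

```python
import io
import pandas as pd

sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

# Parse the CSV sample into the DataFrame expected by analyze_semantics
dataframe = pd.read_csv(io.StringIO(sample_csv))
print(dataframe.columns.tolist())  # ['Case_ID', 'Age', 'BMI']
```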

Multi-threading Mode (OpenAI API Only)

Use multi-threading to process columns concurrently for faster execution. Only available for OpenAI API clients:

# Multi-threading mode (OpenAI API only)
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_multi_threading=True,
    max_workers=32,  # Optional: number of concurrent workers
)

Group-prompting Mode (Both API and Local LLM)

Process multiple columns in a single prompt to reduce API calls. Efficient for both API and local LLMs:

# Group-prompting: process all columns at once
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_group_prompting=True,
    group_size=0,  # 0 = all columns at once, >0 = group size
)

# Or process in groups of 5 columns
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_group_prompting=True,
    group_size=5,
)
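To make the group_size semantics concrete, here is how a hypothetical 12-column dataframe would be partitioned into groups of 5 (an illustration of the chunking arithmetic, not AutoDDG's internal code):

```python
cols = [f"col_{i}" for i in range(12)]
group_size = 5

# group_size=0 would put all columns in one group; otherwise chunk the list
groups = [cols[i:i + group_size] for i in range(0, len(cols), group_size)]
print([len(g) for g in groups])  # [5, 5, 2]
```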

Batch Processing Mode (Local LLM Only)

For local LLMs, batch processing groups column prompts together for more efficient GPU utilization:

# Batch processing mode (Local LLM only)
semantic_profile = autoddg.analyze_semantics(
    dataframe,
    use_batch_processing=True,
    batch_size=32,  # Number of columns to process per batch
)

Important Notes:

  • Multi-threading is only available for OpenAI API clients
  • Batch processing is only available for local LLMs
  • Group-prompting works with both API and local LLMs
  • Batch processing takes precedence over other modes if enabled
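The precedence rules above can be summarized in a small illustrative function (a hypothetical helper, not part of the AutoDDG API):

```python
def select_mode(use_local_llm=False, use_batch_processing=False,
                use_multi_threading=False, use_group_prompting=False):
    """Mirror the notes above: batch first, then the client-specific modes."""
    if use_batch_processing and use_local_llm:
        return "batch"            # local LLM only; wins over the other flags
    if use_multi_threading and not use_local_llm:
        return "multi-threading"  # OpenAI API only
    if use_group_prompting:
        return "group-prompting"  # works with both clients
    return "sequential"           # default

print(select_mode(use_local_llm=True, use_batch_processing=True,
                  use_group_prompting=True))  # batch
```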

Quick Jupyter Notebook Start

For a more thorough introduction, we recommend starting with the quick_start notebook, which walks through an example dataset.


How to Cite

If you use AutoDDG in your research, please cite our work:

@misc{2502.01050,
  author = {Haoxiang Zhang and Yurong Liu and Wei-Lun Hung and Aécio Santos and Juliana Freire},
  title  = {AutoDDG: Automated Dataset Description Generation using Large Language Models},
  year   = {2025},
  eprint = {arXiv:2502.01050},
}

License

AutoDDG is released under the Apache License 2.0.
