2 changes: 2 additions & 0 deletions docs/inference-providers/_toctree.yml
@@ -38,6 +38,8 @@
title: Overview
- local: integrations/adding-integration
title: Add Your Integration
- local: integrations/datadesigner
title: NeMo Data Designer
- local: integrations/macwhisper
title: MacWhisper
- local: integrations/opencode
105 changes: 105 additions & 0 deletions docs/inference-providers/integrations/datadesigner.md
@@ -0,0 +1,105 @@
# NeMo Data Designer

[DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality.

## Overview

DataDesigner supports OpenAI-compatible endpoints, making it easy to use any model available through Hugging Face Inference Providers for synthetic data generation.
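Because the router is a plain OpenAI-compatible HTTP endpoint, you can see what any such client does under the hood with a stdlib-only sketch (the model ID and prompt here are illustrative, and the request is only sent when `HF_TOKEN` is set):

```python
import json
import os
import urllib.request

# Build a standard OpenAI-style chat completion request against the HF router.
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}
req = urllib.request.Request(
    "https://router.huggingface.co/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
        "Content-Type": "application/json",
    },
)

# Only fire the request when a real token is configured.
if os.environ.get("HF_TOKEN"):
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any client that speaks this protocol, including a provider configured with `provider_type="openai"`, works against the same endpoint.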

## Prerequisites

- DataDesigner installed (`pip install data-designer`)
- A Hugging Face account and a fine-grained [API token](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained) with the "Make calls to Inference Providers" permission

## Configuration

### 1. Set your HF token

```bash
export HF_TOKEN="hf_your_token_here"
```
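If you want your script to fail fast when the variable is missing, rather than at request time, a small helper can check it up front (this is a hypothetical convenience, not part of DataDesigner):

```python
import os

def require_hf_token() -> str:
    """Return the HF token from the environment, or raise with a helpful message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export a fine-grained Hugging Face token first."
        )
    return token
```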

### 2. Configure HF as a provider

```python
from data_designer.essentials import (
CategorySamplerParams,
DataDesigner,
DataDesignerConfigBuilder,
LLMTextColumnConfig,
ModelConfig,
ModelProvider,
SamplerColumnConfig,
SamplerType,
)

# Define HF Inference Provider (OpenAI-compatible)
hf_provider = ModelProvider(
name="huggingface",
endpoint="https://router.huggingface.co/v1",
provider_type="openai",
    api_key="HF_TOKEN",  # Name of the environment variable to read the token from
)

# Define a model available via HF Inference Providers
hf_model = ModelConfig(
alias="hf-gpt-oss",
model="openai/gpt-oss-120b",
provider="huggingface",
)

# Create DataDesigner with HF provider
data_designer = DataDesigner(model_providers=[hf_provider])
config_builder = DataDesignerConfigBuilder(model_configs=[hf_model])
```

### 3. Generate synthetic data

```python
# Add a sampler column
config_builder.add_column(
SamplerColumnConfig(
name="category",
sampler_type=SamplerType.CATEGORY,
params=CategorySamplerParams(
values=["Electronics", "Books", "Clothing"],
),
)
)

# Add an LLM-generated column
config_builder.add_column(
LLMTextColumnConfig(
name="product_name",
model_alias="hf-gpt-oss",
prompt="Generate a creative product name for a {{ category }} item.",
)
)

# Preview the generated data
preview = data_designer.preview(config_builder=config_builder, num_records=5)
preview.display_sample_record()

# Access the DataFrame
df = preview.dataset
print(df)
```

## Using Different Models

You can use any model available through [Inference Providers](https://huggingface.co/models?inference_provider=all). Simply update the `model` field:

```python
# Use a different model
hf_model = ModelConfig(
alias="hf-olmo",
model="allenai/OLMo-3-7B-Instruct",
provider="huggingface",
)
```

## Resources

- [DataDesigner Documentation](https://nvidia-nemo.github.io/DataDesigner/)
- [GitHub Repository](https://github.com/NVIDIA-NeMo/DataDesigner)
- [Available Models on Inference Providers](https://huggingface.co/models?inference_provider=all&pipeline_tag=text-generation)
7 changes: 7 additions & 0 deletions docs/inference-providers/integrations/index.md
@@ -18,6 +18,7 @@ This table lists _some_ tools, libraries, and applications that work with Hugging Face Inference Providers.
| Integration | Description | Resources |
| ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| [CrewAI](https://www.crewai.com/) | Framework for orchestrating AI agent teams | [Official docs](https://docs.crewai.com/en/concepts/llms#hugging-face) |
| [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) |
| [GitHub Copilot Chat](https://docs.github.com/en/copilot) | AI pair programmer in VS Code | [HF docs](./vscode) |
| [fast-agent](https://fast-agent.ai/) | Flexible framework building MCP/ACP powered Agents, Workflows and evals | [Official docs](https://fast-agent.ai/models/llm_providers/#hugging-face) |
| [Haystack](https://haystack.deepset.ai/) | Open-source LLM framework for building production applications | [Official docs](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator) |
@@ -71,6 +72,12 @@ LLM application frameworks and orchestration platforms.
- [PydanticAI](https://ai.pydantic.dev/) - Framework for building AI agents with Python ([Official docs](https://ai.pydantic.dev/models/huggingface/))
- [smolagents](https://huggingface.co/docs/smolagents) - Framework for building LLM agents with tool integration ([Official docs](https://huggingface.co/docs/smolagents/reference/models#smolagents.InferenceClientModel))

### Synthetic Data

Tools for creating synthetic datasets.

- [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner))

<!-- ## Add Your Integration

Building something with Inference Providers? [Let us know](./adding-integration) and we'll add it to the list. -->