From 2a8c4a7b17a9f2a98d797594103933b0f1593525 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 16 Dec 2025 17:18:31 +0000 Subject: [PATCH 1/9] Add DataDesigner integration for synthetic dataset generation --- docs/inference-providers/integrations/index.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/inference-providers/integrations/index.md b/docs/inference-providers/integrations/index.md index c13a22941..01997fcb7 100644 --- a/docs/inference-providers/integrations/index.md +++ b/docs/inference-providers/integrations/index.md @@ -18,6 +18,7 @@ This table lists _some_ tools, libraries, and applications that work with Huggin | Integration | Description | Resources | | ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | | [CrewAI](https://www.crewai.com/) | Framework for orchestrating AI agent teams | [Official docs](https://docs.crewai.com/en/concepts/llms#hugging-face) | +| [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) | | [GitHub Copilot Chat](https://docs.github.com/en/copilot) | AI pair programmer in VS Code | [HF docs](./vscode) | | [fast-agent](https://fast-agent.ai/) | Flexible framework building MCP/ACP powered Agents, Workflows and evals | [Official docs](https://fast-agent.ai/models/llm_providers/#hugging-face) | | [Haystack](https://haystack.deepset.ai/) | Open-source LLM framework for building production applications | [Official docs](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator) | @@ -71,6 +72,12 @@ LLM application frameworks and orchestration platforms. - [PydanticAI](https://ai.pydantic.dev/) - Framework for building AI agents with Python ([Official docs](https://ai.pydantic.dev/models/huggingface/)) - [smolagents](https://huggingface.co/docs/smolagents) - Framework for building LLM agents with tool integration ([Official docs](https://huggingface.co/docs/smolagents/reference/models#smolagents.InferenceClientModel)) +### Synthetic Data + +Tools for creating synthetic datasets. + +- [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner)) + From b75dcecfa0dcd3c8fcda24969aa2eb45c6b8ca45 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 16 Dec 2025 17:19:51 +0000 Subject: [PATCH 2/9] Add documentation for DataDesigner integration with Hugging Face Inference Providers --- .../integrations/datadesigner.md | 105 ++++++++++++++++++ 1 file changed, 105 insertions(+) create mode 100644 docs/inference-providers/integrations/datadesigner.md diff --git a/docs/inference-providers/integrations/datadesigner.md b/docs/inference-providers/integrations/datadesigner.md new file mode 100644 index 000000000..22f3b1319 --- /dev/null +++ b/docs/inference-providers/integrations/datadesigner.md @@ -0,0 +1,105 @@ +# DataDesigner + +[DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality. + +## Overview + +DataDesigner supports OpenAI-compatible endpoints, making it easy to use any model available through Hugging Face Inference Providers for synthetic data generation. + +## Prerequisites + +- DataDesigner installed (`pip install data-designer`) +- A Hugging Face account with [API token](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained) (needs "Make calls to Inference Providers" permission) + +## Configuration + +### 1. Set your HF token + +```bash +export HF_TOKEN="hf_your_token_here" +``` + +### 2. Configure HF as a provider + +```python +from data_designer.essentials import ( + CategorySamplerParams, + DataDesigner, + DataDesignerConfigBuilder, + LLMTextColumnConfig, + ModelConfig, + ModelProvider, + SamplerColumnConfig, + SamplerType, +) + +# Define HF Inference Provider (OpenAI-compatible) +hf_provider = ModelProvider( + name="huggingface", + endpoint="https://router.huggingface.co/v1", + provider_type="openai", + api_key="HF_TOKEN", # Reads from environment variable +) + +# Define a model available via HF Inference Providers +hf_model = ModelConfig( + alias="hf-gpt-oss", + model="openai/gpt-oss-120b", + provider="huggingface", +) + +# Create DataDesigner with HF provider +data_designer = DataDesigner(model_providers=[hf_provider]) +config_builder = DataDesignerConfigBuilder(model_configs=[hf_model]) +``` + +### 3. Generate synthetic data + +```python +# Add a sampler column +config_builder.add_column( + SamplerColumnConfig( + name="category", + sampler_type=SamplerType.CATEGORY, + params=CategorySamplerParams( + values=["Electronics", "Books", "Clothing"], + ), + ) +) + +# Add an LLM-generated column +config_builder.add_column( + LLMTextColumnConfig( + name="product_name", + model_alias="hf-gpt-oss", + prompt="Generate a creative product name for a {{ category }} item.", + ) +) + +# Preview the generated data +preview = data_designer.preview(config_builder=config_builder, num_records=5) +preview.display_sample_record() + +# Access the DataFrame +df = preview.dataset +print(df) +``` + +## Using Different Models + +You can use any model available through [Inference Providers](https://huggingface.co/models?inference_provider=all). Simply update the `model` field: + +```python +# Use a different model +hf_model = ModelConfig( + alias="hf-qwen", + model="Qwen/Qwen2.5-72B-Instruct", + provider="huggingface", +) +``` + +## Resources + +- [DataDesigner Documentation](https://nvidia-nemo.github.io/DataDesigner/) +- [GitHub Repository](https://github.com/NVIDIA-NeMo/DataDesigner) +- [Available Models on Inference Providers](https://huggingface.co/models?inference_provider=all&pipeline_tag=text-generation) From ecd46aa3b9f77d6d80b288cf16a379c03d273795 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 16 Dec 2025 17:22:03 +0000 Subject: [PATCH 3/9] Add DataDesigner section to Inference Providers documentation --- docs/inference-providers/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/inference-providers/_toctree.yml b/docs/inference-providers/_toctree.yml index c2829c3c2..424b4a8f2 100644 --- a/docs/inference-providers/_toctree.yml +++ b/docs/inference-providers/_toctree.yml @@ -44,6 +44,8 @@ title: OpenCode - local: integrations/vscode title: VS Code with GitHub Copilot + - local: integrations/datadesigner + title: DataDesigner - local: tasks/index title: Inference Tasks From fc011f5e23bedbe79c4d61ec7b08184518d71dc5 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 16 Dec 2025 17:26:36 +0000 Subject: [PATCH 4/9] fix order --- docs/inference-providers/_toctree.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/inference-providers/_toctree.yml b/docs/inference-providers/_toctree.yml index 424b4a8f2..269db8cc4 100644 --- a/docs/inference-providers/_toctree.yml +++ b/docs/inference-providers/_toctree.yml @@ -38,14 +38,14 @@ title: Overview - local: integrations/adding-integration title: Add Your Integration + - local: integrations/datadesigner + title: DataDesigner - local: integrations/macwhisper title: MacWhisper - local: integrations/opencode title: OpenCode - local: integrations/vscode title: VS Code with GitHub Copilot - - local: integrations/datadesigner - title: DataDesigner - local: tasks/index title: Inference Tasks From 3a0928508d00b4e437e4928fd73adb304d7a1c60 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 16 Dec 2025 17:28:27 +0000 Subject: [PATCH 5/9] Update model configuration in DataDesigner integration example --- docs/inference-providers/integrations/datadesigner.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/inference-providers/integrations/datadesigner.md b/docs/inference-providers/integrations/datadesigner.md index 22f3b1319..07a73df1d 100644 --- a/docs/inference-providers/integrations/datadesigner.md +++ b/docs/inference-providers/integrations/datadesigner.md @@ -92,8 +92,8 @@ You can use any model available through [Inference Providers](https://huggingfac ```python # Use a different model hf_model = ModelConfig( - alias="hf-qwen", - model="Qwen/Qwen2.5-72B-Instruct", + alias="hf-olmo", + model="allenai/OLMo-3-7B-Instruct", provider="huggingface", ) ``` From f4a947a1464b0a682a1947a0aaada3044b115ce0 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 17 Dec 2025 15:36:49 +0000 Subject: [PATCH 6/9] Update docs/inference-providers/_toctree.yml Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/inference-providers/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/inference-providers/_toctree.yml b/docs/inference-providers/_toctree.yml index 269db8cc4..eb2b7ab1f 100644 --- a/docs/inference-providers/_toctree.yml +++ b/docs/inference-providers/_toctree.yml @@ -39,7 +39,7 @@ - local: integrations/adding-integration title: Add Your Integration - local: integrations/datadesigner - title: DataDesigner + title: NeMo Data Designer - local: integrations/macwhisper title: MacWhisper - local: integrations/opencode From a78547cf7ef79162639297a95d8bc344e120f446 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 17 Dec 2025 15:37:04 +0000 Subject: [PATCH 7/9] Update docs/inference-providers/integrations/datadesigner.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/inference-providers/integrations/datadesigner.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/inference-providers/integrations/datadesigner.md b/docs/inference-providers/integrations/datadesigner.md index 07a73df1d..30b5d113d 100644 --- a/docs/inference-providers/integrations/datadesigner.md +++ b/docs/inference-providers/integrations/datadesigner.md @@ -1,4 +1,4 @@ -# DataDesigner +# NeMo Data Designer [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality. From e8f2732fabc380b656da3e23e83bffceb1cdab1a Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 17 Dec 2025 15:37:15 +0000 Subject: [PATCH 8/9] Update docs/inference-providers/integrations/index.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/inference-providers/integrations/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/inference-providers/integrations/index.md b/docs/inference-providers/integrations/index.md index 01997fcb7..de6cd82c4 100644 --- a/docs/inference-providers/integrations/index.md +++ b/docs/inference-providers/integrations/index.md @@ -18,7 +18,7 @@ This table lists _some_ tools, libraries, and applications that work with Huggin | Integration | Description | Resources | | ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | | [CrewAI](https://www.crewai.com/) | Framework for orchestrating AI agent teams | [Official docs](https://docs.crewai.com/en/concepts/llms#hugging-face) | -| [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) | +| [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) | | [GitHub Copilot Chat](https://docs.github.com/en/copilot) | AI pair programmer in VS Code | [HF docs](./vscode) | | [fast-agent](https://fast-agent.ai/) | Flexible framework building MCP/ACP powered Agents, Workflows and evals | [Official docs](https://fast-agent.ai/models/llm_providers/#hugging-face) | | [Haystack](https://haystack.deepset.ai/) | Open-source LLM framework for building production applications | [Official docs](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator) | From 323f634433c745b3bb353545f503f550003589c1 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 17 Dec 2025 15:37:26 +0000 Subject: [PATCH 9/9] Update docs/inference-providers/integrations/index.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/inference-providers/integrations/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/inference-providers/integrations/index.md b/docs/inference-providers/integrations/index.md index de6cd82c4..a28cdaacd 100644 --- a/docs/inference-providers/integrations/index.md +++ b/docs/inference-providers/integrations/index.md @@ -76,7 +76,7 @@ LLM application frameworks and orchestration platforms. Tools for creating synthetic datasets. -- [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner)) +- [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner))