diff --git a/docs/inference-providers/_toctree.yml b/docs/inference-providers/_toctree.yml index c2829c3c2..eb2b7ab1f 100644 --- a/docs/inference-providers/_toctree.yml +++ b/docs/inference-providers/_toctree.yml @@ -38,6 +38,8 @@ title: Overview - local: integrations/adding-integration title: Add Your Integration + - local: integrations/datadesigner + title: NeMo Data Designer - local: integrations/macwhisper title: MacWhisper - local: integrations/opencode diff --git a/docs/inference-providers/integrations/datadesigner.md b/docs/inference-providers/integrations/datadesigner.md new file mode 100644 index 000000000..30b5d113d --- /dev/null +++ b/docs/inference-providers/integrations/datadesigner.md @@ -0,0 +1,105 @@ +# NeMo Data Designer + +[DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality. + +## Overview + +DataDesigner supports OpenAI-compatible endpoints, making it easy to use any model available through Hugging Face Inference Providers for synthetic data generation. + +## Prerequisites + +- DataDesigner installed (`pip install data-designer`) +- A Hugging Face account with [API token](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained) (needs "Make calls to Inference Providers" permission) + +## Configuration + +### 1. Set your HF token + +```bash +export HF_TOKEN="hf_your_token_here" +``` + +### 2. 
Configure HF as a provider + +```python +from data_designer.essentials import ( + CategorySamplerParams, + DataDesigner, + DataDesignerConfigBuilder, + LLMTextColumnConfig, + ModelConfig, + ModelProvider, + SamplerColumnConfig, + SamplerType, +) + +# Define HF Inference Provider (OpenAI-compatible) +hf_provider = ModelProvider( + name="huggingface", + endpoint="https://router.huggingface.co/v1", + provider_type="openai", + api_key="HF_TOKEN", # Reads from environment variable +) + +# Define a model available via HF Inference Providers +hf_model = ModelConfig( + alias="hf-gpt-oss", + model="openai/gpt-oss-120b", + provider="huggingface", +) + +# Create DataDesigner with HF provider +data_designer = DataDesigner(model_providers=[hf_provider]) +config_builder = DataDesignerConfigBuilder(model_configs=[hf_model]) +``` + +### 3. Generate synthetic data + +```python +# Add a sampler column +config_builder.add_column( + SamplerColumnConfig( + name="category", + sampler_type=SamplerType.CATEGORY, + params=CategorySamplerParams( + values=["Electronics", "Books", "Clothing"], + ), + ) +) + +# Add an LLM-generated column +config_builder.add_column( + LLMTextColumnConfig( + name="product_name", + model_alias="hf-gpt-oss", + prompt="Generate a creative product name for a {{ category }} item.", + ) +) + +# Preview the generated data +preview = data_designer.preview(config_builder=config_builder, num_records=5) +preview.display_sample_record() + +# Access the DataFrame +df = preview.dataset +print(df) +``` + +## Using Different Models + +You can use any model available through [Inference Providers](https://huggingface.co/models?inference_provider=all). 
Simply update the `model` field: + +```python +# Use a different model +hf_model = ModelConfig( + alias="hf-olmo", + model="allenai/OLMo-3-7B-Instruct", + provider="huggingface", +) +``` + +## Resources + +- [DataDesigner Documentation](https://nvidia-nemo.github.io/DataDesigner/) +- [GitHub Repository](https://github.com/NVIDIA-NeMo/DataDesigner) +- [Available Models on Inference Providers](https://huggingface.co/models?inference_provider=all&pipeline_tag=text-generation) diff --git a/docs/inference-providers/integrations/index.md b/docs/inference-providers/integrations/index.md index c13a22941..a28cdaacd 100644 --- a/docs/inference-providers/integrations/index.md +++ b/docs/inference-providers/integrations/index.md @@ -18,6 +18,7 @@ This table lists _some_ tools, libraries, and applications that work with Huggin | Integration | Description | Resources | | ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | | [CrewAI](https://www.crewai.com/) | Framework for orchestrating AI agent teams | [Official docs](https://docs.crewai.com/en/concepts/llms#hugging-face) | +| [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) | | [GitHub Copilot Chat](https://docs.github.com/en/copilot) | AI pair programmer in VS Code | [HF docs](./vscode) | | [fast-agent](https://fast-agent.ai/) | Flexible framework building MCP/ACP powered Agents, Workflows and evals | [Official docs](https://fast-agent.ai/models/llm_providers/#hugging-face) | | [Haystack](https://haystack.deepset.ai/) | Open-source LLM framework for building production applications | [Official docs](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator) | @@ -71,6 +72,12 @@ LLM 
application frameworks and orchestration platforms. - [PydanticAI](https://ai.pydantic.dev/) - Framework for building AI agents with Python ([Official docs](https://ai.pydantic.dev/models/huggingface/)) - [smolagents](https://huggingface.co/docs/smolagents) - Framework for building LLM agents with tool integration ([Official docs](https://huggingface.co/docs/smolagents/reference/models#smolagents.InferenceClientModel)) +### Synthetic Data + +Tools for creating synthetic datasets. + +- [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner)) +
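The `provider_type="openai"` setting in the new page means DataDesigner talks to the Hugging Face router over the standard OpenAI-compatible chat-completions API. As an offline sanity check of that configuration, here is a minimal sketch of the request shape it implies; the `build_request` helper and its field names are illustrative, not part of the DataDesigner API, and no network call is made:

```python
import json
import os

# Sketch: the OpenAI-compatible chat-completions request that the
# "huggingface" ModelProvider configured above would send.
# build_request is a hypothetical helper for illustration only.
ENDPOINT = "https://router.huggingface.co/v1/chat/completions"


def build_request(model: str, prompt: str) -> dict:
    """Assemble the endpoint, auth header, and body for one generation call."""
    return {
        "url": ENDPOINT,
        "headers": {
            # HF_TOKEN is read from the environment, mirroring the
            # api_key="HF_TOKEN" indirection in the ModelProvider config.
            "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
            "Content-Type": "application/json",
        },
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }


req = build_request(
    "openai/gpt-oss-120b",
    "Generate a creative product name for a Books item.",
)
print(json.dumps(req["body"], sort_keys=True))
```

Any model ID from the Inference Providers catalog can be substituted for `openai/gpt-oss-120b` in the `model` field, exactly as with the `ModelConfig` examples above.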