diff --git a/docs/devnotes/.authors.yml b/docs/devnotes/.authors.yml
index 9e8c6818e..9a6208bd4 100644
--- a/docs/devnotes/.authors.yml
+++ b/docs/devnotes/.authors.yml
@@ -19,3 +19,7 @@ authors:
     name: Dhruv Nathawani
     description: Researcher at NVIDIA
     avatar: https://avatars.githubusercontent.com/u/128275431?v=4
+  nmulepati:
+    name: Nabin Mulepati
+    description: Researcher at NVIDIA
+    avatar: https://avatars.githubusercontent.com/u/5551931?v=4
diff --git a/docs/devnotes/posts/images/push-to-hub-hero.png b/docs/devnotes/posts/images/push-to-hub-hero.png
new file mode 100644
index 000000000..037da6418
Binary files /dev/null and b/docs/devnotes/posts/images/push-to-hub-hero.png differ
diff --git a/docs/devnotes/posts/images/push-to-hub-pipeline.png b/docs/devnotes/posts/images/push-to-hub-pipeline.png
new file mode 100644
index 000000000..de8133405
Binary files /dev/null and b/docs/devnotes/posts/images/push-to-hub-pipeline.png differ
diff --git a/docs/devnotes/posts/images/push-to-hub-round-trip.png b/docs/devnotes/posts/images/push-to-hub-round-trip.png
new file mode 100644
index 000000000..cc483ef8d
Binary files /dev/null and b/docs/devnotes/posts/images/push-to-hub-round-trip.png differ
diff --git a/docs/devnotes/posts/images/push-to-hub-schema-transform.png b/docs/devnotes/posts/images/push-to-hub-schema-transform.png
new file mode 100644
index 000000000..c3c88a06f
Binary files /dev/null and b/docs/devnotes/posts/images/push-to-hub-schema-transform.png differ
diff --git a/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md b/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
new file mode 100644
index 000000000..32cf32c94
--- /dev/null
+++ b/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
@@ -0,0 +1,294 @@
+---
+date: 2026-02-26
+authors:
+  - nmulepati
+---
+
+# **Push Datasets to Hugging Face Hub**
+
+![Push to Hub Hero](images/push-to-hub-hero.png)
+
+You just generated 10k multilingual greetings (or some other cool dataset).
+Now what β€” email a parquet file?
+Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚒.
+
+
+
+Here's the full flow β€” build a multilingual greeting dataset with a conversation
+training processor, generate it, and push it to the Hub in one go:
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+
+data_designer = DataDesigner()
+config_builder = dd.DataDesignerConfigBuilder()
+
+config_builder.add_column(
+    dd.SamplerColumnConfig(
+        name="language",
+        sampler_type=dd.SamplerType.CATEGORY,
+        params=dd.CategorySamplerParams(
+            values=["English", "Spanish", "French", "German", "Italian"],
+        ),
+        drop=True,
+    )
+)
+
+config_builder.add_column(
+    dd.LLMTextColumnConfig(
+        name="greeting",
+        model_alias="nvidia-text",
+        prompt="Write a casual greeting in {{ language }}.",
+    )
+)
+config_builder.add_column(
+    dd.LLMTextColumnConfig(
+        name="response",
+        model_alias="nvidia-text",
+        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
+    )
+)
+
+# Reshape into an OpenAI-style conversation training format
+config_builder.add_processor(
+    dd.SchemaTransformProcessorConfig(
+        name="conversations",
+        template={
+            "messages": [
+                {"role": "user", "content": "{{ greeting }}"},
+                {"role": "assistant", "content": "{{ response }}"},
+            ]
+        },
+    )
+)
+
+results = data_designer.create(config_builder, num_records=10_000)
+
+# Ship it:
+url = results.push_to_hub(
+    "my-org/multilingual-greetings",
+    "10k synthetic agent/user conversations across 5 languages.",
+    tags=["greetings", "multilingual", "conversation"],
+)
+print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings
+```
+
+---
+## Two Ways In - Same Outcome
+
+**From results** (the happy path) β€” you just ran `.create()`, you have the
+results object, call `.push_to_hub()` on it.
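As a minimal sketch of that happy path (the `publish` wrapper is hypothetical, added here only to show the call in isolation; `results` is whatever `data_designer.create(...)` returned in the full example above):

```python
def publish(results):
    # `results` is the object returned by data_designer.create(...);
    # .push_to_hub(repo_id, description, tags=...) returns the dataset URL.
    return results.push_to_hub(
        "my-org/multilingual-greetings",
        "10k synthetic agent/user conversations across 5 languages.",
        tags=["greetings", "multilingual", "conversation"],
    )
```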
+
+**From a folder** (the "I closed my notebook" path) β€” you saved artifacts to
+disk earlier and want to push them later:
+
+```python
+from data_designer.integrations.huggingface import HuggingFaceHubClient
+
+url = HuggingFaceHubClient.push_to_hub_from_folder(
+    dataset_path="./my-saved-dataset",
+    repo_id="my-org/multilingual-greetings",
+    description="10k synthetic agent/user conversations across 5 languages.",
+)
+```
+
+---
+## What Gets Uploaded
+
+![Push to Hub Pipeline](images/push-to-hub-pipeline.png)
+
+Everything. The upload pipeline runs in this order:
+
+```
+1. README.md           ← auto-generated dataset card
+2. data/*.parquet      ← your main dataset (remapped from parquet-files/)
+3. images/*            ← if you have image columns (skipped otherwise)
+4. {processor}/*       ← processor outputs (remapped from processors-files/)
+5. builder_config.json
+6. metadata.json       ← paths rewritten to match HF repo layout
+```
+
+Each step is its own commit on the HF repo, so you get a clean history.
+
+This is especially nice for large datasets. Data Designer writes output in
+batched parquet partitions β€” generate 100k records and you'll have dozens of
+parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`.
+Manually uploading all of that, organizing it into the right HF repo structure,
+writing the dataset card YAML configs, and rewriting metadata paths would be
+tedious and error-prone. `push_to_hub` handles the whole thing in one call β€”
+folder uploads, path remapping, config registration, dataset card generation,
+all of it.
+
+Re-pushing to the same `repo_id` updates the existing repo β€” no need to delete
+and recreate.
+
+---
+## Processors Get First-Class Treatment
+
+![Schema Transform for Conversation Training](images/push-to-hub-schema-transform.png)
+
+Notice the `SchemaTransformProcessorConfig` in the example above.
+That's doing the heavy lifting β€” it takes the raw `greeting` and `response` columns and
+reshapes each row into an OpenAI-style `messages` array:
+
+```python
+config_builder.add_processor(
+    dd.SchemaTransformProcessorConfig(
+        name="conversations",
+        template={
+            "messages": [
+                {"role": "user", "content": "{{ greeting }}"},
+                {"role": "assistant", "content": "{{ response }}"},
+            ]
+        },
+    )
+)
+```
+
+The template is Jinja2 all the way down. Keys become columns in the output,
+values get rendered per-row with the actual column data. The template dict must
+be JSON-serializable β€” strings, lists, nested objects, all fair game. So you can
+build arbitrarily complex conversation schemas (multi-turn, system prompts,
+tool calls) just by adding more entries to the `messages` list.
+
+The processor runs after each batch and writes its output to a separate parquet
+file alongside the main dataset. The main dataset (`data/`) still has the raw
+columns β€” the processor output is an *additional* view, not a replacement.
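To make the per-row rendering concrete, here is a stdlib-only approximation of that transform. The real processor renders with Jinja2; `render_template` below is an illustrative helper (not Data Designer API) that only handles simple `{{ column }}` references:

```python
import re

def render_template(node, row):
    """Recursively render a schema-transform-style template for one row.

    Illustrative only: real rendering is Jinja2; this regex substitution
    covers just plain `{{ column }}` placeholders.
    """
    if isinstance(node, str):
        # Replace each {{ column }} placeholder with the row's value
        return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), node)
    if isinstance(node, list):
        return [render_template(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render_template(value, row) for key, value in node.items()}
    return node  # numbers, booleans, None pass through untouched

template = {
    "messages": [
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
row = {"greeting": "Hola!", "response": "Hola! En que puedo ayudarte?"}
conversation = render_template(template, row)
```

Each output row is the fully rendered dict β€” which is why keys become columns and nesting survives intact.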
+**When you push to hub, each processor gets its own top-level directory and its
+own HF dataset config.** So the `conversations` processor from our example ends
+up like this on HF:
+
+```
+my-org/multilingual-greetings/
+β”œβ”€β”€ README.md
+β”œβ”€β”€ data/
+β”‚   β”œβ”€β”€ batch_00000.parquet   ← raw columns (greeting, response)
+β”‚   └── batch_00001.parquet
+β”œβ”€β”€ conversations/
+β”‚   β”œβ”€β”€ batch_00000.parquet   ← transformed (messages array)
+β”‚   └── batch_00001.parquet
+β”œβ”€β”€ builder_config.json
+└── metadata.json
+```
+
+The dataset card YAML frontmatter registers each processor as its own named
+config:
+
+```yaml
+configs:
+- config_name: data
+  data_files: "data/*.parquet"
+  default: true
+- config_name: conversations
+  data_files: "conversations/*.parquet"
+```
+
+So consumers grab exactly the format they need:
+
+```python
+from datasets import load_dataset
+
+# Raw columns β€” good for analysis
+df = load_dataset("my-org/multilingual-greetings", "data", split="train")
+
+# Conversation format β€” ready for fine-tuning
+df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
+print(df_conv[0])
+# {'messages': [{'role': 'user', 'content': 'Hey! Como estΓ‘s?'},
+#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}
+```
+
+The Quick Start section in the generated README includes these snippets
+automatically β€” one `load_dataset` call per processor.
+
+**Metadata paths are rewritten too.** Local paths like
+`processors-files/conversations/batch_00000.parquet` become
+`conversations/batch_00000.parquet` so file references in the metadata match
+the actual HF repo structure.
+
+If there are no processors, all of this is silently skipped β€” no empty
+directories, no phantom configs.
+
+---
+## The Auto-Generated Dataset Card
+
+This is the fun part. The upload generates a full HuggingFace dataset card from
+your run metadata.
+It pulls from `metadata.json` and `builder_config.json` to
+build:
+
+- A **Quick Start** section with `load_dataset` code (including processor subsets)
+- A **Dataset Summary** with record count, column count, completion %
+- A **Schema & Statistics** table β€” per-column type, uniqueness, null rate, token stats
+- **Generation Details** β€” how many columns of each config type
+- A **Citation** block so people can cite your dataset
+
+Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
+Size category (`n<1K`, `1K