Merged
Commits
95 commits
98d2f16
feat: including prefetch_file flag
ankitmaster08 Oct 7, 2025
88ef2a0
updated msc version
ankitmaster08 Oct 7, 2025
d58b26e
Update msc to allow huggingface
voegtlel Nov 3, 2025
267147e
Fix epath test for latest msc
voegtlel Nov 3, 2025
3c3bc65
Merge remote-tracking branch 'origin/develop' into feature/update_dep…
voegtlel Nov 3, 2025
35b6238
Merge branch 'feature/update_dependencies' into prefetch-file
voegtlel Nov 3, 2025
644db4f
Merge pull request #171 from ankitmaster08/prefetch-file
voegtlel Nov 3, 2025
04fdcb9
Remove unused dependency
voegtlel Nov 3, 2025
830bef3
Merge pull request #180 from NVIDIA/feature/update_dependencies
philipp-fischer Nov 3, 2025
185694d
Format
voegtlel Nov 3, 2025
0881163
Implement proper error handler to task encoder
voegtlel Nov 3, 2025
bc3386e
Next suggestion
voegtlel Nov 3, 2025
43cabd5
Fix test
voegtlel Nov 3, 2025
dffdb1a
Move exception handler to worker config. Implement better error loggi…
voegtlel Nov 4, 2025
89a4905
Update comments
voegtlel Nov 5, 2025
3fe67f5
Fix test
voegtlel Nov 5, 2025
274f51b
Add docs for error handling
voegtlel Nov 6, 2025
88f3fd4
Update docs from comments
voegtlel Nov 7, 2025
cbe8ae3
Merge pull request #181 from NVIDIA/feature/error_handling
philipp-fischer Nov 7, 2025
ccd9955
Re-export exception handlers
voegtlel Nov 18, 2025
877c31e
Fix legacy error handler
voegtlel Nov 18, 2025
c35d6e7
First draft for media metadata
philipp-fischer Nov 18, 2025
ded4fec
Fix test
philipp-fischer Nov 18, 2025
4e28ba6
Address review comments
philipp-fischer Nov 18, 2025
45e7d31
New prepare-media command. Introduce parallelism for filesystem prepare.
philipp-fischer Nov 18, 2025
25106f9
Ruff
philipp-fischer Nov 18, 2025
043227a
Fix bug with sqlite index
philipp-fischer Nov 18, 2025
5275e44
Fix progress
philipp-fischer Nov 18, 2025
2a65194
Update existing WDS with prepare-media command
philipp-fischer Nov 18, 2025
1fc5c7f
Typing fixes, and fix source for file store
voegtlel Nov 18, 2025
b220f9c
Small fix an col rename
philipp-fischer Nov 18, 2025
f3c52d5
Disallow duplicate sample keys from now on. Add helper to path tar fi…
philipp-fischer Nov 18, 2025
ca00133
Address review comments. Parallelized TarPatcher
philipp-fischer Nov 18, 2025
04f6bb1
Ruff
philipp-fischer Nov 18, 2025
0ae1310
Small typing and parallism updates
voegtlel Nov 18, 2025
938bfe1
Better pillow-adjusted image extensions
philipp-fischer Nov 18, 2025
e3b1fc4
TarPatcher num_workers and better CLI flags
philipp-fischer Nov 18, 2025
5b93f29
Super optimized tar patcher using numba
voegtlel Nov 18, 2025
08caa1f
Deps. Formatting. Bring num_workers back. Fix bug in `split_ustar_path`
philipp-fischer Nov 18, 2025
1d2077a
Finish numba implementation of tar_patcher
voegtlel Nov 18, 2025
ac9d775
Improve lustre speed of tar patcher
voegtlel Nov 18, 2025
4e444ec
Fix sample key only once for a sample, not per part
voegtlel Nov 18, 2025
c1aa086
Add constants
philipp-fischer Nov 18, 2025
8eef6a2
Clean output
voegtlel Nov 18, 2025
a546851
Make tar patcher optional
voegtlel Nov 18, 2025
6cb96ff
Update the sample key to reflect the actual key within the shard (i.e…
voegtlel Nov 18, 2025
9a47fc6
Fix imports of config constants
philipp-fischer Nov 18, 2025
2bffa2b
Add FileStore wrapper class. Fix `aux is None` case.
philipp-fischer Nov 18, 2025
c69b2b0
Infer statelessness from inner cooking method
philipp-fischer Nov 18, 2025
c9cc494
Ruff
philipp-fischer Nov 18, 2025
c6fe8fc
Fix keys in test
philipp-fischer Nov 18, 2025
a51184e
Fix tests
voegtlel Nov 18, 2025
57f15db
Raise if key not found for get_media_metadata
voegtlel Nov 18, 2025
8c7621e
Started with docs
philipp-fischer Nov 18, 2025
88ff5bd
Typo
philipp-fischer Nov 18, 2025
cfdb420
Merge pull request #182 from NVIDIA/feature/metadata_in_sqlite
voegtlel Nov 19, 2025
6c6e964
Add more docs for media metadata
philipp-fischer Nov 20, 2025
de741bc
Fix test import
philipp-fischer Nov 20, 2025
8e9fa47
Add prog data prep docs
philipp-fischer Nov 20, 2025
c8229c9
Ruff
philipp-fischer Nov 20, 2025
0f2521c
Address review comments
philipp-fischer Nov 20, 2025
7034c1f
Simplify CLI and remove `--media-metadata`
philipp-fischer Nov 20, 2025
d97b66b
Merge pull request #187 from NVIDIA/feature/metadata_in_sqlite_docs
philipp-fischer Nov 20, 2025
5344fd8
Improve fastseek speed and fix time handling. Add fastseek detail tes…
voegtlel Dec 8, 2025
3de346a
Fix boundary conditions
voegtlel Dec 8, 2025
aa3dc38
Remove dbg comments
voegtlel Dec 9, 2025
1f7fa34
Improve test speed
voegtlel Dec 9, 2025
fea5742
Optimize test
voegtlel Dec 9, 2025
cde61d3
Fix a rare deadlock, that happens in __del__ of a cache lazy, when it…
voegtlel Dec 11, 2025
8b77616
Test fix
voegtlel Dec 12, 2025
96fc50c
Merge pull request #190 from NVIDIA/fix/cache_del_deadlock
voegtlel Dec 12, 2025
bfff755
Cleanup code for easier review
voegtlel Dec 12, 2025
4eebc7b
Improve fastseek implementation. Implement comments
voegtlel Dec 12, 2025
6b6f9a2
Fix end-of-video handling
voegtlel Dec 15, 2025
17e3aa0
Remove mpeg custom frame seeking
voegtlel Dec 15, 2025
2f00a49
Fix aggregator pool on mac not using fork.
voegtlel Dec 16, 2025
51bdc85
Implement prepare non-interactive
voegtlel Dec 16, 2025
8a0b723
Ensure we use forking for dataloader (even on mac)
voegtlel Dec 16, 2025
ba6a019
Upgrade GitHub Actions for Node 24 compatibility
salmanmkc Dec 16, 2025
af4c1a9
Upgrade GitHub Actions to latest versions
salmanmkc Dec 16, 2025
3f397e0
Small changes / fixes
voegtlel Dec 16, 2025
e0390ea
Small rename
voegtlel Dec 16, 2025
2331a1c
Fix pypa/gh-action-pypi-publish to use SHA pinning
salmanmkc Dec 17, 2025
aa681fd
Merge pull request #189 from NVIDIA/feature/improve_fastseek_speed
voegtlel Dec 18, 2025
cd690a8
Add audio_num_samples (it's slow, but needed)
voegtlel Dec 12, 2025
812d7b1
use version tag
salmanmkc Dec 18, 2025
048de7f
Merge branch 'develop' into upgrade-github-actions-node24-general
salmanmkc Dec 18, 2025
2036ebd
Merge pull request #196 from salmanmkc/upgrade-github-actions-node24-…
philipp-fischer Jan 2, 2026
f56ac13
Merge pull request #195 from salmanmkc/upgrade-github-actions-node24
philipp-fischer Jan 2, 2026
add40d1
Remove the context delay
voegtlel Jan 2, 2026
351477d
Improve test for crude preparation, add dataset yaml name
voegtlel Jan 2, 2026
d14eab0
Merge pull request #191 from NVIDIA/feature/improve_metadata
philipp-fischer Jan 2, 2026
0dff391
Merge pull request #192 from NVIDIA/fix/aggregator_pool_mac
philipp-fischer Jan 2, 2026
06d99a0
Removing shard prints, fixing gc (encountered bug with django, fixing…
voegtlel Jan 16, 2026
d5ddf11
Merge pull request #198 from NVIDIA/feature/remove_prints_fix_gc
voegtlel Jan 16, 2026
4 changes: 2 additions & 2 deletions .github/workflows/documentation.yml
@@ -21,7 +21,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
- uses: actions/checkout@v4
+ uses: actions/checkout@v6

- name: Install uv
uses: astral-sh/setup-uv@v5
@@ -38,7 +38,7 @@ jobs:
just docs

- name: Upload artifact
- uses: actions/upload-pages-artifact@v3
+ uses: actions/upload-pages-artifact@v4
with:
path: docs/build

4 changes: 2 additions & 2 deletions .github/workflows/license_headers.yml
@@ -15,9 +15,9 @@ jobs:

steps:
- name: Checkout Repository
- uses: actions/checkout@v3
+ uses: actions/checkout@v6
- name: Set up Python
- uses: actions/setup-python@v5
+ uses: actions/setup-python@v6
with:
python-version: 3.9
- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -15,7 +15,7 @@ jobs:
id-token: write # This permission is mandatory for trusted publishing
steps:
- name: Checkout code
- uses: actions/checkout@v4
+ uses: actions/checkout@v6

- name: Install uv
uses: astral-sh/setup-uv@v5
@@ -32,4 +32,4 @@ jobs:
just build

- name: Publish package
- uses: pypa/gh-action-pypi-publish@release/v1
+ uses: pypa/gh-action-pypi-publish@release/v1.13 # release/v1.13
2 changes: 1 addition & 1 deletion .github/workflows/ruff.yml
@@ -15,7 +15,7 @@ jobs:

steps:
- name: Checkout code
- uses: actions/checkout@v4
+ uses: actions/checkout@v6

- name: Install uv
uses: astral-sh/setup-uv@v5
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -15,7 +15,7 @@ jobs:

steps:
- name: Checkout code
- uses: actions/checkout@v4
+ uses: actions/checkout@v6

- name: Install uv
uses: astral-sh/setup-uv@v5
57 changes: 57 additions & 0 deletions docs/source/advanced/crude_datasets.md
@@ -74,6 +74,34 @@ All it does is to forward the key, restore key and flavors from the dict to the

In a real use case you will want to do a lot more here; we recommend keeping the cook methods in separate files and importing them where you define your TaskEncoder.

### Using Media Metadata in Monolithic Datasets
If you prepared your dataset with media metadata, you can access it in the cooker using the {py:meth}`get_media_metadata <megatron.energon.FileStore.get_media_metadata>` method of the {py:class}`FileStore <megatron.energon.FileStore>`. This is faster than decoding the media files to infer metadata such as video duration or image size.
Here we look at the use case for primary datasets, but for [auxiliary datasets](aux-data) this is even more beneficial.

#### Metadata from the Primary Dataset

```python
from megatron.energon import FileStore

# ...

@stateless
@cooker(need_primary=True)
def cook_media_metadata(sample: dict, primary: FileStore) -> TextSample:
"""This cooker loads the media metadata for the current sample
and returns a text sample with the metadata description."""

# The `.` prefix is used to access the media metadata of the primary dataset,
# referring to a part of the current sample.
primary_media_metadata = primary.get_media_metadata(".wav")

return TextSample(
**basic_sample_keys(sample),
text=f"This audio has duration {primary_media_metadata.audio_duration} seconds",
)
```


(aux-data)=
## Auxiliary Data for Polylithic Datasets

@@ -165,6 +193,35 @@ def cook_text(sample: dict, primary: FileStore, foo_bar_source: FileStore) -> Te

You can then retrieve files by their names from the primary dataset.


### Using Media Metadata in Polylithic Datasets
If you prepared your auxiliary dataset with media metadata, you can access it in the cooker using the {py:meth}`get_media_metadata <megatron.energon.FileStore.get_media_metadata>` method of the {py:class}`FileStore <megatron.energon.FileStore>`.
This is much faster than reading the media files themselves to infer metadata such as video duration or image size.
Especially if you are working with Lazy objects, you can defer loading the media files entirely until you actually need them, for example in {py:meth}`postencode_sample(self, sample: T_sample) -> T_encoded_sample <megatron.energon.TaskEncoder.postencode_sample>` when using packing.

```python
from megatron.energon import FileStore

# ...

def cook_media_metadata(sample: dict, foo_bar_source: FileStore) -> TextSample:
# Use the image filename from the primary sample to get the media metadata from the auxiliary dataset
media_metadata = foo_bar_source.get_media_metadata(sample['image'])

return TextSample(
**basic_sample_keys(sample),
text=f"This image has size {media_metadata.width}x{media_metadata.height} and format {media_metadata.format}",
)

```

The path of the media file within the auxiliary dataset is what needs to be passed to the `get_media_metadata` method, for example `0001.jpg`, or `images/0001.jpg` if the file is inside a subfolder of the WebDataset or filesystem dataset.

The return value will be either {py:class}`AVMetadata <megatron.energon.media.AVMetadata>` or {py:class}`ImageMetadata <megatron.energon.media.ImageMetadata>`.
Click on them to see the fields and their types.
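Since the return type depends on the media kind, a cooker handling mixed media may need to branch on it. The following self-contained sketch uses local dataclass stand-ins whose field names mirror the examples above (`width`, `height`, `format`, `audio_duration`); the real metadata classes may carry additional fields:

```python
from dataclasses import dataclass


@dataclass
class ImageMetadata:
    # Local stand-in for energon's ImageMetadata; fields mirror the doc examples.
    width: int
    height: int
    format: str


@dataclass
class AVMetadata:
    # Local stand-in for energon's AVMetadata.
    audio_duration: float


def describe(md) -> str:
    """Branch on the metadata type, as a cooker handling mixed media might."""
    if isinstance(md, ImageMetadata):
        return f"image {md.width}x{md.height} ({md.format})"
    return f"audio/video, {md.audio_duration:.1f} s"
```

In a real cooker, `md` would be the value returned by `get_media_metadata`, and the resulting string could feed directly into a `TextSample`.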


(cache-pools)=
## Cache Pools

70 changes: 70 additions & 0 deletions docs/source/advanced/data_prep_api.md
@@ -0,0 +1,70 @@
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->

# Programmatic Data Preparation

Below, we provide an example function that shows how to prepare a dataset programmatically.

```python
import shutil
from pathlib import Path

import click

from megatron.energon.flavors import BaseWebdatasetFactory

def prepare_one_dataset(path: Path, num_workers: int, template_dir: Path):

if (path / ".nv-meta" / "dataset.yaml").exists():
print(f"Dataset {path} already prepared. Skipping.")
return

# Fixed settings
tar_index_only = False
split_parts_ratio = [("train", 1), ("val", 0), ("test", 0)]
split_parts_patterns = None

# Get all tar files
all_tars = list(path.glob("**/*.tar")) + list(path.glob("**/*.tgz"))
all_tars = [str(p.relative_to(path)) for p in sorted(all_tars)]

if len(all_tars) == 0:
print("Did not find any tar files. Exiting.")
return

print(f"Found {len(all_tars)} tar files in total. The first and last ones are:")
print(f"- {all_tars[0]}")
print(f"- {all_tars[-1]}")

def progress_fn(els, length=None):
with click.progressbar(
els,
label="Indexing shards",
show_pos=True,
length=length,
) as bar:
for el in bar:
yield el

found_types, duplicates = BaseWebdatasetFactory.prepare_dataset(
path,
all_tars,
split_parts_ratio=split_parts_ratio,
split_parts_patterns=split_parts_patterns,
progress_fn=progress_fn,
tar_index_only=tar_index_only,
shuffle_seed=None,
workers=num_workers,
)

# Copy sample loader and dataset.yaml templates
for file in template_dir.glob("*"):
shutil.copy(file, path / ".nv-meta" / file.name)
```

Example usage:

First, create a template directory with the `dataset.yaml` file, and optionally the `sample_loader.py` file.
Let's call it `template_dir`.

Then, run the script:

```python
if __name__ == "__main__":
prepare_one_dataset(Path("/path/to/dataset"), 16, Path("/path/to/template_dir"))
```
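The `split_parts_ratio` argument above assigns shards to named splits by relative weight. As a rough, self-contained sketch of how such a ratio-based partition works (an illustration only, not energon's actual implementation, which additionally supports shuffling and pattern-based splits):

```python
from typing import Dict, List, Sequence, Tuple


def split_by_ratio(
    shards: Sequence[str], ratios: Sequence[Tuple[str, float]]
) -> Dict[str, List[str]]:
    """Partition a sorted shard list into named splits by relative weight."""
    total = sum(weight for _, weight in ratios)
    result: Dict[str, List[str]] = {}
    start = 0
    cumulative = 0.0
    for name, weight in ratios:
        cumulative += weight
        # Each split ends at its cumulative fraction of the shard list.
        end = round(len(shards) * cumulative / total)
        result[name] = list(shards[start:end])
        start = end
    return result


# With ratios [("train", 1), ("val", 0), ("test", 0)], all shards go to "train".
splits = split_by_ratio(
    [f"shard_{i:04d}.tar" for i in range(10)],
    [("train", 1), ("val", 0), ("test", 0)],
)
```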
159 changes: 159 additions & 0 deletions docs/source/advanced/error_handling.md
@@ -0,0 +1,159 @@
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->

# Error Handling

Energon provides configurable error handling to gracefully manage exceptions during data loading and processing.
By default, errors in task encoder code are logged and the sample is skipped, allowing training to continue.
Some system exceptions (clearly programming errors) always cause immediate failure.

By default, if more than 100 samples in a row raise an error, processing will also fail. For the {py:class}`~megatron.energon.task_encoder.base.TaskEncoder` user functions, this is configurable via the `failure_tolerance` parameter of the `@stateless` decorator, e.g. `@stateless(failure_tolerance=100)`.

## Error Handler Configuration

Error handlers are configured via {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`:

```python
from megatron.energon import WorkerConfig, log_exception, reraise_exception

worker_config = WorkerConfig(
rank=0,
world_size=1,
num_workers=4,
global_error_handler=log_exception, # Default: logs and skips failed samples
restore_error_handler=reraise_exception, # Default: fails fast on restore
)
```

**`global_error_handler`:** Handles exceptions during normal iteration (training/validation).
Default is {py:func}`log_exception <megatron.energon.log_exception>` which logs details and skips the sample.

**`restore_error_handler`:** Handles exceptions during checkpoint restoration.
Default is {py:func}`reraise_exception <megatron.energon.reraise_exception>` which fails fast to ensure checkpoint integrity.

## System Exceptions

Programming errors and critical system issues always cause immediate failure and are never handled by error handlers:
`SystemError`, `SyntaxError`, `ImportError`, `StopIteration`, `StopAsyncIteration`, `MemoryError`, `RecursionError`, `ReferenceError`, `NameError`, `UnboundLocalError`, and {py:exc}`FatalSampleError <megatron.energon.FatalSampleError>`.
{py:exc}`FatalSampleError <megatron.energon.FatalSampleError>` is raised automatically when consecutive failure tolerance is exceeded or when a system exception occurs during sample processing.
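Conceptually, the loader checks each raised exception against this list before consulting the configured handler. A minimal self-contained sketch of that dispatch (the names here are illustrative, not energon's internals):

```python
# Illustrative sketch only; energon's internal check may differ, and its
# FatalSampleError is also treated as a system exception.
SYSTEM_EXCEPTIONS = (
    SystemError,
    SyntaxError,
    ImportError,
    StopIteration,
    StopAsyncIteration,
    MemoryError,
    RecursionError,
    ReferenceError,
    NameError,
    UnboundLocalError,
)


def should_reraise(exc: BaseException) -> bool:
    """True if the exception must bypass any configured error handler."""
    return isinstance(exc, SYSTEM_EXCEPTIONS)
```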

## Built-in Error Handlers

### `log_exception`

Logs detailed error information and continues:
- Exception traceback
- Source information (dataset path, shard, index)
- Sample details in readable format

```python
from megatron.energon import log_exception

worker_config = WorkerConfig(
rank=0,
world_size=1,
num_workers=4,
global_error_handler=log_exception,
)
```

### `reraise_exception`

Immediately reraises the exception to halt iteration:

```python
from megatron.energon import reraise_exception

worker_config = WorkerConfig(
rank=0,
world_size=1,
num_workers=4,
global_error_handler=reraise_exception, # Fail on any error
)
```

### Custom Error Handlers

Implement custom error handlers with this signature:

```python
from typing import Any

# `log_to_monitoring` and `CriticalError` below are placeholders for your own code.
def my_error_handler(
exception: Exception,
sample: Any,
sources: list[SourceInfo] | None
) -> None:
# Log to your monitoring system
log_to_monitoring(exception, sample)

# Optionally reraise for critical errors
if isinstance(exception, CriticalError):
raise exception
```

```python
worker_config = WorkerConfig(
rank=0,
world_size=1,
num_workers=4,
global_error_handler=my_error_handler,
)
```

## Failure Tolerance for Task Encoder Functions

By default, if more than 100 samples in a row raise an error, processing will fail with a {py:exc}`FatalSampleError <megatron.energon.FatalSampleError>`.

For {py:class}`TaskEncoder <megatron.energon.TaskEncoder>` methods, configure this via the `@stateless` decorator:

```python
from megatron.energon import DefaultTaskEncoder, stateless

class MyTaskEncoder(DefaultTaskEncoder):
@stateless(failure_tolerance=50)
def encode_sample(self, sample):
# Process sample - tolerates up to 50 consecutive failures
if sample.is_corrupted():
raise ValueError("Corrupted sample")
return sample

@stateless(restore_seeds=True, failure_tolerance=200)
def pack_selected_samples(self, samples):
# Packing with higher tolerance and deterministic randomness
return pack_samples(samples)
```

Set `failure_tolerance=0` to disable tolerance checking for a specific function.

```{admonition} Note
:class: important
Tolerance limits count *consecutive* failures. A single successful sample resets the counter.
```
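The consecutive-counting behavior can be sketched as a small wrapper: the counter increments on each failure, resets on any success, and escalates once the tolerance is exceeded. This is a self-contained illustration with a local stand-in exception, not energon's actual implementation:

```python
class FatalError(Exception):
    """Local stand-in for energon's FatalSampleError."""


def process_with_tolerance(samples, process, failure_tolerance=100):
    """Yield process(sample), skipping failures; fail hard after more than
    `failure_tolerance` consecutive failures. A success resets the counter."""
    consecutive = 0
    for sample in samples:
        try:
            result = process(sample)
        except Exception as exc:
            consecutive += 1
            # failure_tolerance=0 disables the check entirely.
            if failure_tolerance and consecutive > failure_tolerance:
                raise FatalError("too many consecutive failures") from exc
            continue  # skip the failed sample
        consecutive = 0  # any success resets the counter
        yield result
```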

## Skip or Fail Explicitly

Raise {py:exc}`SkipSample <megatron.energon.SkipSample>` to explicitly skip a sample without logging an error:

```python
from megatron.energon import SkipSample

def process_sample(sample):
try:
...
except MySpecificError:
raise SkipSample()
return sample
```

Raise {py:exc}`FatalSampleError <megatron.energon.FatalSampleError>` to cause immediate failure, bypassing the error handler:

```python
from megatron.energon import FatalSampleError

def process_sample(sample):
try:
...
except MyFatalError as e:
raise FatalSampleError.from_sample(sample, "Critical corruption detected") from e
return sample
```
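Putting both together, the loader's control flow around these two exceptions looks roughly like the following self-contained sketch (with local stand-in classes; energon's actual pipeline is more involved):

```python
class SkipSample(Exception):
    """Local stand-in for energon's SkipSample: skip silently."""


class FatalSampleError(Exception):
    """Local stand-in: always re-raised, never passed to the handler."""


def iterate(samples, process, error_handler):
    """Sketch of the skip/handle control flow around sample processing."""
    for sample in samples:
        try:
            yield process(sample)
        except SkipSample:
            continue  # intentional skip, no logging
        except FatalSampleError:
            raise  # bypasses the error handler
        except Exception as exc:
            error_handler(exc, sample)  # e.g. log and skip
```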
10 changes: 10 additions & 0 deletions docs/source/api/modules_data.md
@@ -24,3 +24,13 @@ SPDX-License-Identifier: BSD-3-Clause -->
:undoc-members:
:show-inheritance:
```


# megatron.energon.media

```{eval-rst}
.. automodule:: megatron.energon.media
:members:
:undoc-members:
:show-inheritance:
```