Merged
37 commits
34988f5
#126: add empty test for metadata yaml generation
jh-RLI Jul 29, 2025
720cb12
#126: Implement oemetadata creator class to create valid oemetadata j…
jh-RLI Jul 29, 2025
ad0fe47
#126: Add utility module which offers general purpose functionality. …
jh-RLI Jul 29, 2025
a722b0b
#126: setup creation module
jh-RLI Jul 29, 2025
63b2a30
#126: add test for creation functionality
jh-RLI Jul 29, 2025
c0f559f
#126: add generator for functionality to generate metadata. Currently…
jh-RLI Jul 29, 2025
7ad24dc
#126: make linter happy
jh-RLI Jul 29, 2025
6a51fbc
#126: add entry point for metadata creation with function to create J…
jh-RLI Jul 29, 2025
ac555f6
#126: add cli function to create metadata json from yaml file
jh-RLI Jul 29, 2025
107ab33
#126: make sure when inspecting data resources to infer the fields me…
jh-RLI Jul 29, 2025
2191eac
#126: Add documentation for the creation module
jh-RLI Nov 6, 2025
79da3a2
#126: Add test for yaml based oemetadata layout assembly module
jh-RLI Nov 6, 2025
a902885
#126: Add test for yaml based oemetadata creation -> as dict or save …
jh-RLI Nov 6, 2025
8f5fc0c
#126: Add todo to extend inspection tests
jh-RLI Nov 6, 2025
1fdd91e
#126: Add test for yaml based oemetadata creation -> as dict or save …
jh-RLI Nov 6, 2025
2c18055
#126: Move all utility functionality in creation module here.
jh-RLI Nov 6, 2025
1494928
#126: Update create entrypoint to build oemetadata form yaml parts (d…
jh-RLI Nov 6, 2025
7239484
#126: Rename test for assembler and add test case to check if assembl…
jh-RLI Nov 6, 2025
5567f73
#126: Add assembler module which handles the assembling of yaml file …
jh-RLI Nov 6, 2025
eeb9c27
#126: update cli functionality to include omi creation module
jh-RLI Nov 6, 2025
5f4d3fc
#126: add method to save generated metadata to file
jh-RLI Nov 6, 2025
d4e285f
#126: Update docs
jh-RLI Nov 6, 2025
2477b1b
#126: Update the create module as entry point for the oemetadata crea…
jh-RLI Nov 6, 2025
7070d3b
#126: Add test data for "create" integration test
jh-RLI Nov 6, 2025
666242b
#126: Add test for creation module entry point "create" as integratio…
jh-RLI Nov 6, 2025
41aafda
#126: Add docs on how to use the create module (entry point for creat…
jh-RLI Nov 6, 2025
70435e9
deactivate test
jh-RLI Nov 6, 2025
59f4263
remove irritating info from example resource name
jh-RLI Nov 6, 2025
b37ecf0
#126: Update create docs
jh-RLI Nov 19, 2025
c269469
#126: Add CLI command to initialize a new metadata workspace with tem…
jh-RLI Nov 19, 2025
a4bedf2
#126: add omi scripts to project
jh-RLI Nov 19, 2025
b1dbbf8
#126: enhance docstring
jh-RLI Nov 19, 2025
47117b0
#126: Add creation init module to provide backend for CLI functionali…
jh-RLI Nov 19, 2025
2169357
#126: enhance docstings
jh-RLI Nov 19, 2025
90f4ae0
#126: add more test to creation test module
jh-RLI Nov 19, 2025
1b2a38f
126: fix test
jh-RLI Nov 19, 2025
32b3b53
#126: update changelog
jh-RLI Dec 3, 2025
3 changes: 2 additions & 1 deletion CHANGELOG.rst
@@ -4,7 +4,8 @@ Changelog

current
--------------------
*
* Add the creation module and ``create`` entry point: YAML-based metadata creation, a template feature to keep metadata creation DRY, functionality to set up the metadata structure and generate metadata from existing sources such as datapackages and CSV files, and functionality to create the full datapackage.json and save it to file (`#127 <https://github.com/rl-institut/super-repo/pull/127>`_)


1.1.0 (2025-03-25)
--------------------
159 changes: 159 additions & 0 deletions docs/create.md
@@ -0,0 +1,159 @@
# OMI “Create” Entry Point

This mini-guide explains how to use the **programmatic entry points** that turn your split YAML metadata (dataset + template + resources) into a single OEMetadata JSON document.

> If you’re looking for how to author the YAML files and how templating works, see the main **Assembly Guide** in the `creation` module directory. This page just shows how to *call* the entry points.

---

## What it does

The functions in `omi.create` wrap the full assembly pipeline:

1. **Discover / load** your YAML parts (dataset, optional template, resources).
2. **Apply the template** to each resource (deep merge; resource wins; keywords/topics/languages concatenate).
3. **Generate & validate** the final OEMetadata JSON using the official schema (via `OEMetadataCreator`).
4. **Write** the result to disk (`build_from_yaml`) or many results to a directory (`build_many_from_yaml`).
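
The template merge policy in step 2 can be sketched as follows. This is a simplified illustration, not the actual implementation (which lives in the `creation` module); in particular, whether concatenation de-duplicates entries is an assumption here:

```python
def merge_resource(template: dict, resource: dict) -> dict:
    """Deep-merge a template into a resource: resource values win on
    conflicts, and list-valued keys like keywords/topics/languages
    concatenate (de-duplicated here; an assumption of this sketch)."""
    concat_keys = {"keywords", "topics", "languages"}
    merged = dict(template)
    for key, value in resource.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            # Nested mappings merge recursively instead of replacing wholesale.
            merged[key] = merge_resource(merged[key], value)
        elif key in concat_keys and isinstance(merged.get(key), list) and isinstance(value, list):
            # Template entries come first; resource entries are appended.
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            # Scalars and everything else: the resource wins.
            merged[key] = value
    return merged

template = {"license": "CC-BY-4.0", "keywords": ["energy"]}
resource = {"name": "powerplants", "keywords": ["power"], "license": "MIT"}
print(merge_resource(template, resource))
```

This is why a key you expect to inherit from the template has no effect once the resource sets its own value (see the troubleshooting note below).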

---

## API

```python
from omi.create import build_from_yaml, build_many_from_yaml
```

### `build_from_yaml(base_dir, dataset_id, output_file, *, index_file=None) -> None`

Assemble **one** dataset and write `<output_file>` (JSON).

* `base_dir` (`str | Path`): Root that contains:

* `datasets/<dataset_id>.dataset.yaml`
* `datasets/<dataset_id>.template.yaml` *(optional)*
* `resources/<dataset_id>/*.resource.yaml`
* `dataset_id` (`str`): Logical dataset name (e.g. `"powerplants"`).
* `output_file` (`str | Path`): Path to write the generated OEMetadata JSON.
* `index_file` (`str | Path | None`): Optional explicit mapping file (`metadata_index.yaml`). If provided, paths are taken from the index instead of convention.

### `build_many_from_yaml(base_dir, output_dir, *, dataset_ids=None, index_file=None) -> None`

Assemble **multiple** datasets and write each as `<output_dir>/<dataset_id>.json`.

* `base_dir` (`str | Path`): Same as above.
* `output_dir` (`str | Path`): Destination directory for one JSON file per dataset.
* `dataset_ids` (`list[str] | None`): Limit to specific datasets. If `None`, we:

* Use keys from `index_file` when provided, **else**
* Discover all `datasets/*.dataset.yaml` in `base_dir`.
* `index_file` (`str | Path | None`): Optional `metadata_index.yaml`.
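
For orientation, a `metadata_index.yaml` could map dataset IDs to their YAML parts roughly like this. The exact index schema is defined by the `creation` module and documented in the Assembly Guide; treat the layout below as a hypothetical sketch, not the authoritative format:

```yaml
# Hypothetical layout -- consult the Assembly Guide for the real schema.
powerplants:
  dataset: datasets/powerplants.dataset.yaml
  template: datasets/powerplants.template.yaml   # optional
  resources:
    - resources/powerplants/generation.resource.yaml
households:
  dataset: datasets/households.dataset.yaml
```

When `dataset_ids=None` and an index like this is provided, its top-level keys (`powerplants`, `households`) determine which datasets are built.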

---

## Quick examples

### One dataset (convention-based discovery)

```python
from omi.create import build_from_yaml

build_from_yaml(
base_dir="./metadata",
dataset_id="powerplants",
output_file="./out/powerplants.json",
)
```

Directory layout:

```bash
metadata/
datasets/
powerplants.dataset.yaml
powerplants.template.yaml # optional
resources/
powerplants/
*.resource.yaml
```

### One dataset (explicit index)

```python
from omi.create import build_from_yaml

build_from_yaml(
base_dir="./metadata",
dataset_id="powerplants",
output_file="./out/powerplants.json",
index_file="./metadata/metadata_index.yaml",
)
```

### Many datasets (discover all)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
base_dir="./metadata",
output_dir="./out",
)
# writes ./out/<dataset_id>.json for each dataset found
```

### Many datasets (index + subset)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
base_dir="./metadata",
output_dir="./out",
dataset_ids=["powerplants", "households"],
index_file="./metadata/metadata_index.yaml",
)
```

---

## Notes & behavior

* Output JSON is written with `indent=2` and **`ensure_ascii=False`** to preserve characters like `©`.
* Validation happens via `OEMetadataCreator` using the official schema provided by `oemetadata` (imported through `omi.base.get_metadata_specification`).
* If a dataset YAML is missing, `FileNotFoundError` is raised.
* If schema validation fails, you’ll get an exception from `omi.validation`. Catch it where you call the entry point if you want to handle/report errors.
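
If you want builds to fail soft (e.g. in a batch job), wrap the call site. The sketch below assumes only what this guide states: missing parts raise `FileNotFoundError`, and schema failures raise some exception from `omi.validation` (caught broadly here because the concrete exception class isn't pinned down on this page):

```python
def safe_build(builder, base_dir, dataset_id, output_file) -> bool:
    """Run an OMI build entry point (e.g. build_from_yaml) and report
    failures instead of crashing; returns True on success."""
    try:
        builder(base_dir, dataset_id, output_file)
    except FileNotFoundError as err:
        # Raised when a dataset/resource YAML part is missing.
        print(f"[{dataset_id}] missing YAML part: {err}")
        return False
    except Exception as err:  # schema validation errors from omi.validation
        print(f"[{dataset_id}] validation failed: {err}")
        return False
    return True
```

Usage: `safe_build(build_from_yaml, "./metadata", "powerplants", "./out/powerplants.json")`.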

---

## Using in third-party code such as data pipelines

```python
from pathlib import Path
from omi.create import build_from_yaml

def build_oemetadata_callable(**context):
base = Path("/project/metadata")
out = Path("/project/metadata/out/powerplants.json")
build_from_yaml(base, "powerplants", out)
# optionally push to airflow XCom, publish, upload, etc.
```

---

## Testing tips

* For **unit tests** of `omi.create`, patch `omi.create.assemble_metadata_dict` / `assemble_many_metadata` and verify files are written.
* For **integration tests**, put real example YAMLs under `tests/test_data/create/metadata/` and call `build_from_yaml` end-to-end.

---

## Troubleshooting

* **“Dataset YAML not found”**
Check `base_dir/datasets/<dataset_id>.dataset.yaml` exists, or supply the correct `index_file`.

* **Unicode characters appear escaped (`\u00a9`)**
Ensure you’re not re-writing the JSON elsewhere with `ensure_ascii=True`.

* **Template not applied**
Confirm your template file name matches `<dataset_id>.template.yaml` (or is correctly referenced from the index), and the keys you expect to inherit aren’t already set in the resource (resource values win).
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -78,3 +78,6 @@ unfixable = ["UP007", "I001"]
"*/__init__.py" = [
"D104", # Missing docstring in public package
]

[project.scripts]
omi = "omi.cli:main"
115 changes: 102 additions & 13 deletions src/omi/cli.py
@@ -1,29 +1,118 @@
"""
Module that contains the command line app.
Command line interface for OMI.

Why does this file exist, and why not put this in __main__?
This CLI only supports the split-files layout:
- datasets/<dataset_id>.dataset.yaml
- datasets/<dataset_id>.template.yaml (optional)
- resources/<dataset_id>/*.resource.yaml
(optionally wired via metadata_index.yaml)

You might be tempted to import things from __main__ later, but that will cause
problems: the code will get executed twice:
Usage:
omi assemble \
--base-dir ./metadata \
--dataset-id powerplants \
--output-file ./out/powerplants.json \
--index-file ./metadata/metadata_index.yaml # optional

- When you run `python -m omi` python will execute
``__main__.py`` as a script. That means there won't be any
``omi.__main__`` in ``sys.modules``.
- When you import __main__ it will get executed again (as a module) because
there's no ``omi.__main__`` in ``sys.modules``.

Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration
"""

from __future__ import annotations

from pathlib import Path
from typing import Optional

import click

from omi.creation.creator import OEMetadataCreator
from omi.creation.init import init_dataset, init_resources_from_files
from omi.creation.utils import apply_template_to_resources, load_parts


@click.group()
def grp() -> None:
"""Init click group."""
"""OMI CLI."""


@grp.command("assemble")
@click.option(
"--base-dir",
required=True,
type=click.Path(file_okay=False, path_type=Path),
help="Root directory containing 'datasets/' and 'resources/'.",
)
@click.option("--dataset-id", required=True, help="Logical dataset id (e.g. 'powerplants').")
@click.option(
"--output-file",
required=True,
type=click.Path(dir_okay=False, path_type=Path),
help="Path to write the generated OEMetadata JSON.",
)
@click.option(
"--index-file",
default=None,
type=click.Path(dir_okay=False, path_type=Path),
help="Optional metadata index YAML for explicit mapping.",
)
def assemble_cmd(base_dir: Path, dataset_id: str, output_file: Path, index_file: Optional[Path]) -> None:
"""Assemble OEMetadata from split YAML files and write JSON to OUTPUT_FILE."""
# Load pieces
version, dataset, resources, template = load_parts(base_dir, dataset_id, index_file=index_file)
merged_resources = apply_template_to_resources(resources, template)

# Build & save with the correct spec version
creator = OEMetadataCreator(oem_version=version)
creator.save(dataset, merged_resources, output_file, ensure_ascii=False, indent=2)


@click.group()
def init() -> None:
"""Scaffold OEMetadata split-files layout."""


@init.command("dataset")
@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
@click.argument("dataset_id")
@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
@click.option("--resource", "resources", multiple=True, help="Initial resource names (repeatable).")
@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
def init_dataset_cmd(
base_dir: Path,
dataset_id: str,
oem_version: str,
resources: tuple[str, ...],
*,
overwrite: bool,
) -> None:
"""Initialize a split-files OEMetadata dataset layout under BASE_DIR."""
res = init_dataset(base_dir, dataset_id, oem_version=oem_version, resources=resources, overwrite=overwrite)
click.echo(f"dataset: {res.dataset_yaml}")
click.echo(f"template: {res.template_yaml}")
for p in res.resource_yamls:
click.echo(f"resource: {p}")


@init.command("resources")
@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
@click.argument("dataset_id")
@click.argument("files", nargs=-1, type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
def init_resources_cmd(
base_dir: Path,
dataset_id: str,
files: tuple[Path, ...],
oem_version: str,
*,
overwrite: bool,
) -> None:
"""Create resource YAML files for DATASET_ID from the given FILES."""
outs = init_resources_from_files(base_dir, dataset_id, files, oem_version=oem_version, overwrite=overwrite)
for p in outs:
click.echo(p)


cli = click.CommandCollection(sources=[grp])
# Keep CommandCollection for backwards compatibility with your entry point
cli = click.CommandCollection(sources=[grp, init])


def main() -> None:
75 changes: 75 additions & 0 deletions src/omi/create.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
"""Entry point for OEMetadata creation (split-files layout only)."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Optional, Union

from omi.creation.assembler import assemble_many_metadata, assemble_metadata_dict


def build_from_yaml(
base_dir: Union[str, Path],
dataset_id: str,
output_file: Union[str, Path],
*,
index_file: Optional[Union[str, Path]] = None,
) -> None:
"""
Assemble one dataset and write the resulting OEMetadata JSON to a file.

Parameters
----------
base_dir : Union[str, Path]
Base directory containing the split-files dataset structure.
dataset_id : str
The dataset ID to assemble.
output_file : Union[str, Path]
Path to write the resulting OEMetadata JSON file.
index_file : Optional[Union[str, Path]], optional
Optional path to an index file for resolving cross-dataset references,
by default None.
"""
md = assemble_metadata_dict(base_dir, dataset_id, index_file=index_file)
Path(output_file).parent.mkdir(parents=True, exist_ok=True)
Path(output_file).write_text(json.dumps(md, indent=2, ensure_ascii=False), encoding="utf-8")


def build_many_from_yaml(
base_dir: Union[str, Path],
output_dir: Union[str, Path],
*,
dataset_ids: Optional[list[str]] = None,
index_file: Optional[Union[str, Path]] = None,
) -> None:
"""
Assemble multiple datasets and write each as <dataset_id>.json to output_dir.

Parameters
----------
base_dir : Union[str, Path]
Base directory containing the split-files dataset structure.
output_dir : Union[str, Path]
Directory to write the resulting OEMetadata JSON files.
dataset_ids : Optional[list[str]], optional
Optional list of dataset IDs to assemble. If None, all datasets found
in base_dir will be assembled, by default None.
index_file : Optional[Union[str, Path]], optional
Optional path to an index file for resolving cross-dataset references,
by default None.
"""
out_dir = Path(output_dir)
out_dir.mkdir(parents=True, exist_ok=True)

results = assemble_many_metadata(
base_dir,
dataset_ids=dataset_ids,
index_file=index_file,
as_dict=True, # keep it as a mapping id -> metadata
)
for ds_id, md in results.items():
(out_dir / f"{ds_id}.json").write_text(
json.dumps(md, indent=2, ensure_ascii=False),
encoding="utf-8",
)