Merged
37 commits
34988f5
#126: add empty test for metadata yaml generation
jh-RLI Jul 29, 2025
720cb12
#126: Implement oemetadata creator class to create valid oemetadata j…
jh-RLI Jul 29, 2025
ad0fe47
#126: Add utility module which offers general purpose functionality. …
jh-RLI Jul 29, 2025
a722b0b
#126: setup creation module
jh-RLI Jul 29, 2025
63b2a30
#126: add test for creation functionality
jh-RLI Jul 29, 2025
c0f559f
#126: add generator for functionality to generate metadata. Currently…
jh-RLI Jul 29, 2025
7ad24dc
#126: make linter happy
jh-RLI Jul 29, 2025
6a51fbc
#126: add entry point for metadata creation with function to create J…
jh-RLI Jul 29, 2025
ac555f6
#126: add cli function to create metadata json from yaml file
jh-RLI Jul 29, 2025
107ab33
#126: make sure when inspecting data resources to infer the fields me…
jh-RLI Jul 29, 2025
2191eac
#126: Add documentation for the creation module
jh-RLI Nov 6, 2025
79da3a2
#126: Add test for yaml based oemetadata layout assembly module
jh-RLI Nov 6, 2025
a902885
#126: Add test for yaml based oemetadata creation -> as dict or save …
jh-RLI Nov 6, 2025
8f5fc0c
#126: Add todo to extend inspection tests
jh-RLI Nov 6, 2025
1fdd91e
#126: Add test for yaml based oemetadata creation -> as dict or save …
jh-RLI Nov 6, 2025
2c18055
#126: Move all utility functionality in creation module here.
jh-RLI Nov 6, 2025
1494928
#126: Update create entrypoint to build oemetadata form yaml parts (d…
jh-RLI Nov 6, 2025
7239484
#126: Rename test for assembler and add test case to check if assembl…
jh-RLI Nov 6, 2025
5567f73
#126: Add assembler module which handles the assembling of yaml file …
jh-RLI Nov 6, 2025
eeb9c27
#126: update cli functionality to include omi creation module
jh-RLI Nov 6, 2025
5f4d3fc
#126: add method to save generated metadata to file
jh-RLI Nov 6, 2025
d4e285f
#126: Update docs
jh-RLI Nov 6, 2025
2477b1b
#126: Update the create module as entry point for the oemetadata crea…
jh-RLI Nov 6, 2025
7070d3b
#126: Add test data for "create" integration test
jh-RLI Nov 6, 2025
666242b
#126: Add test for creation module entry point "create" as integratio…
jh-RLI Nov 6, 2025
41aafda
#126: Add docs on how to use the create module (entry point for creat…
jh-RLI Nov 6, 2025
70435e9
deactivate test
jh-RLI Nov 6, 2025
59f4263
remove irritating info from example resource name
jh-RLI Nov 6, 2025
b37ecf0
#126: Update create docs
jh-RLI Nov 19, 2025
c269469
#126: Add CLI command to initialize a new metadata workspace with tem…
jh-RLI Nov 19, 2025
a4bedf2
#126: add omi scripts to project
jh-RLI Nov 19, 2025
b1dbbf8
#126: enhance docstring
jh-RLI Nov 19, 2025
47117b0
#126: Add creation init module to provide backend for CLI functionali…
jh-RLI Nov 19, 2025
2169357
#126: enhance docstings
jh-RLI Nov 19, 2025
90f4ae0
#126: add more test to creation test module
jh-RLI Nov 19, 2025
1b2a38f
126: fix test
jh-RLI Nov 19, 2025
32b3b53
#126: update changelog
jh-RLI Dec 3, 2025
3 changes: 2 additions & 1 deletion CHANGELOG.rst
@@ -4,7 +4,8 @@ Changelog

current
--------------------
*
* Add the creation module and ``create`` entry point: YAML-based metadata creation, a template feature to keep metadata creation DRY, functionality to set up the metadata structure and generate metadata from existing sources such as datapackages and CSV files, and functionality to create the full datapackage.json and save it to file (`#127 <https://github.com/rl-institut/super-repo/pull/127>`_)


1.1.0 (2025-03-25)
--------------------
159 changes: 159 additions & 0 deletions docs/create.md
@@ -0,0 +1,159 @@
# OMI “Create” Entry Point

This mini-guide explains how to use the **programmatic entry points** that turn your split YAML metadata (dataset + template + resources) into a single OEMetadata JSON document.

> If you’re looking for how to author the YAML files and how templating works, see the main **Assembly Guide** in the `creation` module directory. This page just shows how to *call* the entry points.

---

## What it does

The functions in `omi.create` wrap the full assembly pipeline:

1. **Discover / load** your YAML parts (dataset, optional template, resources).
2. **Apply the template** to each resource (deep merge; resource wins; keywords/topics/languages concatenate).
3. **Generate & validate** the final OEMetadata JSON using the official schema (via `OEMetadataCreator`).
4. **Write** the result to disk (`build_from_yaml`) or many results to a directory (`build_many_from_yaml`).
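
The template merge policy in step 2 can be sketched as follows. This is a simplified illustration, not the actual implementation (which lives in the `creation` module); in particular, whether concatenation de-duplicates entries is an assumption here:

```python
def merge_resource(template: dict, resource: dict) -> dict:
    """Deep-merge a template into a resource: resource values win on
    conflicts, and list-valued keys like keywords/topics/languages
    concatenate (de-duplicated here; an assumption of this sketch)."""
    concat_keys = {"keywords", "topics", "languages"}
    merged = dict(template)
    for key, value in resource.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            # Nested mappings merge recursively instead of replacing wholesale.
            merged[key] = merge_resource(merged[key], value)
        elif key in concat_keys and isinstance(merged.get(key), list) and isinstance(value, list):
            # Template entries come first; resource entries are appended.
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            # Scalars and everything else: the resource wins.
            merged[key] = value
    return merged

template = {"license": "CC-BY-4.0", "keywords": ["energy"]}
resource = {"name": "powerplants", "keywords": ["power"], "license": "MIT"}
print(merge_resource(template, resource))
```

This is why a key you expect to inherit from the template has no effect once the resource sets its own value (see the troubleshooting note below).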

---

## API

```python
from omi.create import build_from_yaml, build_many_from_yaml
```

### `build_from_yaml(base_dir, dataset_id, output_file, *, index_file=None) -> None`

Assemble **one** dataset and write `<output_file>` (JSON).

* `base_dir` (`str | Path`): Root that contains:

* `datasets/<dataset_id>.dataset.yaml`
* `datasets/<dataset_id>.template.yaml` *(optional)*
* `resources/<dataset_id>/*.resource.yaml`
* `dataset_id` (`str`): Logical dataset name (e.g. `"powerplants"`).
* `output_file` (`str | Path`): Path to write the generated OEMetadata JSON.
* `index_file` (`str | Path | None`): Optional explicit mapping file (`metadata_index.yaml`). If provided, paths are taken from the index instead of convention.

### `build_many_from_yaml(base_dir, output_dir, *, dataset_ids=None, index_file=None) -> None`

Assemble **multiple** datasets and write each as `<output_dir>/<dataset_id>.json`.

* `base_dir` (`str | Path`): Same as above.
* `output_dir` (`str | Path`): Destination directory for one JSON file per dataset.
* `dataset_ids` (`list[str] | None`): Limit to specific datasets. If `None`, we:

* Use keys from `index_file` when provided, **else**
* Discover all `datasets/*.dataset.yaml` in `base_dir`.
* `index_file` (`str | Path | None`): Optional `metadata_index.yaml`.
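
For orientation, a `metadata_index.yaml` could map dataset IDs to their YAML parts roughly like this. The exact index schema is defined by the `creation` module and documented in the Assembly Guide; treat the layout below as a hypothetical sketch, not the authoritative format:

```yaml
# Hypothetical layout -- consult the Assembly Guide for the real schema.
powerplants:
  dataset: datasets/powerplants.dataset.yaml
  template: datasets/powerplants.template.yaml   # optional
  resources:
    - resources/powerplants/generation.resource.yaml
households:
  dataset: datasets/households.dataset.yaml
```

When `dataset_ids=None` and an index like this is provided, its top-level keys (`powerplants`, `households`) determine which datasets are built.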

---

## Quick examples

### One dataset (convention-based discovery)

```python
from omi.create import build_from_yaml

build_from_yaml(
base_dir="./metadata",
dataset_id="powerplants",
output_file="./out/powerplants.json",
)
```

Directory layout:

```bash
metadata/
datasets/
powerplants.dataset.yaml
powerplants.template.yaml # optional
resources/
powerplants/
*.resource.yaml
```

### One dataset (explicit index)

```python
from omi.create import build_from_yaml

build_from_yaml(
base_dir="./metadata",
dataset_id="powerplants",
output_file="./out/powerplants.json",
index_file="./metadata/metadata_index.yaml",
)
```

### Many datasets (discover all)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
base_dir="./metadata",
output_dir="./out",
)
# writes ./out/<dataset_id>.json for each dataset found
```

### Many datasets (index + subset)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
base_dir="./metadata",
output_dir="./out",
dataset_ids=["powerplants", "households"],
index_file="./metadata/metadata_index.yaml",
)
```

---

## Notes & behavior

* Output JSON is written with `indent=2` and **`ensure_ascii=False`** to preserve characters like `©`.
* Validation happens via `OEMetadataCreator` using the official schema provided by `oemetadata` (imported through `omi.base.get_metadata_specification`).
* If a dataset YAML is missing, `FileNotFoundError` is raised.
* If schema validation fails, you’ll get an exception from `omi.validation`. Catch it where you call the entry point if you want to handle/report errors.
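
If you want builds to fail soft (e.g. in a batch job), wrap the call site. The sketch below assumes only what this guide states: missing parts raise `FileNotFoundError`, and schema failures raise some exception from `omi.validation` (caught broadly here because the concrete exception class isn't pinned down on this page):

```python
def safe_build(builder, base_dir, dataset_id, output_file) -> bool:
    """Run an OMI build entry point (e.g. build_from_yaml) and report
    failures instead of crashing; returns True on success."""
    try:
        builder(base_dir, dataset_id, output_file)
    except FileNotFoundError as err:
        # Raised when a dataset/resource YAML part is missing.
        print(f"[{dataset_id}] missing YAML part: {err}")
        return False
    except Exception as err:  # schema validation errors from omi.validation
        print(f"[{dataset_id}] validation failed: {err}")
        return False
    return True
```

Usage: `safe_build(build_from_yaml, "./metadata", "powerplants", "./out/powerplants.json")`.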

---

## Using in third-party code such as data pipelines

```python
from pathlib import Path
from omi.create import build_from_yaml

def build_oemetadata_callable(**context):
base = Path("/project/metadata")
out = Path("/project/metadata/out/powerplants.json")
build_from_yaml(base, "powerplants", out)
# optionally push to airflow XCom, publish, upload, etc.
```

---

## Testing tips

* For **unit tests** of `omi.create`, patch `omi.create.assemble_metadata_dict` / `assemble_many_metadata` and verify files are written.
* For **integration tests**, put real example YAMLs under `tests/test_data/create/metadata/` and call `build_from_yaml` end-to-end.

---

## Troubleshooting

* **“Dataset YAML not found”**
Check `base_dir/datasets/<dataset_id>.dataset.yaml` exists, or supply the correct `index_file`.

* **Unicode characters appear escaped (`\u00a9`)**
Ensure you’re not re-writing the JSON elsewhere with `ensure_ascii=True`.

* **Template not applied**
Confirm your template file name matches `<dataset_id>.template.yaml` (or is correctly referenced from the index), and the keys you expect to inherit aren’t already set in the resource (resource values win).
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -78,3 +78,6 @@ unfixable = ["UP007", "I001"]
"*/__init__.py" = [
"D104", # Missing docstring in public package
]

[project.scripts]
omi = "omi.cli:main"
115 changes: 102 additions & 13 deletions src/omi/cli.py
@@ -1,29 +1,118 @@
"""
Module that contains the command line app.
Command line interface for OMI.

Why does this file exist, and why not put this in __main__?
This CLI only supports the split-files layout:
- datasets/<dataset_id>.dataset.yaml
- datasets/<dataset_id>.template.yaml (optional)
- resources/<dataset_id>/*.resource.yaml
(optionally wired via metadata_index.yaml)

You might be tempted to import things from __main__ later, but that will cause
problems: the code will get executed twice:
Usage:
omi assemble \
--base-dir ./metadata \
--dataset-id powerplants \
--output-file ./out/powerplants.json \
--index-file ./metadata/metadata_index.yaml # optional

- When you run `python -m omi` python will execute
``__main__.py`` as a script. That means there won't be any
``omi.__main__`` in ``sys.modules``.
- When you import __main__ it will get executed again (as a module) because
there's no ``omi.__main__`` in ``sys.modules``.

Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration
"""

from __future__ import annotations

from pathlib import Path
from typing import Optional

import click

from omi.creation.creator import OEMetadataCreator
from omi.creation.init import init_dataset, init_resources_from_files
from omi.creation.utils import apply_template_to_resources, load_parts


@click.group()
def grp() -> None:
"""Init click group."""
"""OMI CLI."""


@grp.command("assemble")
@click.option(
"--base-dir",
required=True,
type=click.Path(file_okay=False, path_type=Path),
help="Root directory containing 'datasets/' and 'resources/'.",
)
@click.option("--dataset-id", required=True, help="Logical dataset id (e.g. 'powerplants').")
@click.option(
"--output-file",
required=True,
type=click.Path(dir_okay=False, path_type=Path),
help="Path to write the generated OEMetadata JSON.",
)
@click.option(
"--index-file",
default=None,
type=click.Path(dir_okay=False, path_type=Path),
help="Optional metadata index YAML for explicit mapping.",
)
def assemble_cmd(base_dir: Path, dataset_id: str, output_file: Path, index_file: Optional[Path]) -> None:
"""Assemble OEMetadata from split YAML files and write JSON to OUTPUT_FILE."""
# Load pieces
version, dataset, resources, template = load_parts(base_dir, dataset_id, index_file=index_file)
merged_resources = apply_template_to_resources(resources, template)

# Build & save with the correct spec version
creator = OEMetadataCreator(oem_version=version)
creator.save(dataset, merged_resources, output_file, ensure_ascii=False, indent=2)


@click.group()
def init() -> None:
"""Scaffold OEMetadata split-files layout."""


@init.command("dataset")
@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
@click.argument("dataset_id")
@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
@click.option("--resource", "resources", multiple=True, help="Initial resource names (repeatable).")
@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
def init_dataset_cmd(
base_dir: Path,
dataset_id: str,
oem_version: str,
resources: tuple[str, ...],
*,
overwrite: bool,
) -> None:
"""Initialize a split-files OEMetadata dataset layout under BASE_DIR."""
res = init_dataset(base_dir, dataset_id, oem_version=oem_version, resources=resources, overwrite=overwrite)
click.echo(f"dataset: {res.dataset_yaml}")
click.echo(f"template: {res.template_yaml}")
for p in res.resource_yamls:
click.echo(f"resource: {p}")


@init.command("resources")
@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
@click.argument("dataset_id")
@click.argument("files", nargs=-1, type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
def init_resources_cmd(
base_dir: Path,
dataset_id: str,
files: tuple[Path, ...],
oem_version: str,
*,
overwrite: bool,
) -> None:
"""Create resource YAML files for DATASET_ID from the given FILES."""
outs = init_resources_from_files(base_dir, dataset_id, files, oem_version=oem_version, overwrite=overwrite)
for p in outs:
click.echo(p)


cli = click.CommandCollection(sources=[grp])
# Keep CommandCollection for backwards compatibility with your entry point
cli = click.CommandCollection(sources=[grp, init])


def main() -> None:
75 changes: 75 additions & 0 deletions src/omi/create.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
"""Entry point for OEMetadata creation (split-files layout only)."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Optional, Union

from omi.creation.assembler import assemble_many_metadata, assemble_metadata_dict


def build_from_yaml(
base_dir: Union[str, Path],
dataset_id: str,
output_file: Union[str, Path],
*,
index_file: Optional[Union[str, Path]] = None,
) -> None:
"""
Assemble one dataset and write the resulting OEMetadata JSON to a file.

Parameters
----------
base_dir : Union[str, Path]
Base directory containing the split-files dataset structure.
dataset_id : str
The dataset ID to assemble.
output_file : Union[str, Path]
Path to write the resulting OEMetadata JSON file.
index_file : Optional[Union[str, Path]], optional
Optional path to an index file for resolving cross-dataset references,
by default None.
"""
md = assemble_metadata_dict(base_dir, dataset_id, index_file=index_file)
Path(output_file).parent.mkdir(parents=True, exist_ok=True)
Path(output_file).write_text(json.dumps(md, indent=2, ensure_ascii=False), encoding="utf-8")


def build_many_from_yaml(
base_dir: Union[str, Path],
output_dir: Union[str, Path],
*,
dataset_ids: Optional[list[str]] = None,
index_file: Optional[Union[str, Path]] = None,
) -> None:
"""
Assemble multiple datasets and write each as <dataset_id>.json to output_dir.

Parameters
----------
base_dir : Union[str, Path]
Base directory containing the split-files dataset structure.
output_dir : Union[str, Path]
Directory to write the resulting OEMetadata JSON files.
dataset_ids : Optional[list[str]], optional
Optional list of dataset IDs to assemble. If None, all datasets found
in base_dir will be assembled, by default None.
index_file : Optional[Union[str, Path]], optional
Optional path to an index file for resolving cross-dataset references,
by default None.
"""
out_dir = Path(output_dir)
out_dir.mkdir(parents=True, exist_ok=True)

results = assemble_many_metadata(
base_dir,
dataset_ids=dataset_ids,
index_file=index_file,
as_dict=True, # keep it as a mapping id -> metadata
)
for ds_id, md in results.items():
(out_dir / f"{ds_id}.json").write_text(
json.dumps(md, indent=2, ensure_ascii=False),
encoding="utf-8",
)