@jh-RLI jh-RLI commented Aug 7, 2025

Fixes #1177, adds #1305, and reworks the current metadata implementation and its Airflow pipeline integration.

Fair warning up front: this is a larger PR ;)

  • Currently requires Python 3.10 for merge

Before merging into dev-branch, please make sure that

  • the CHANGELOG.rst was updated.
  • new and adjusted code is formatted using black and isort.
  • the dataset version is updated when existing datasets are adjusted.
  • the workflow runs successfully in test mode.
  • the workflow runs successfully in Everything mode.

Closes #1177
Closes #1305

…already part of the metadata specification
@jh-RLI jh-RLI self-assigned this Aug 7, 2025
jh-RLI added 3 commits August 7, 2025 13:18
…ed to v2, old metadata is not fully supported anymore (<v1.5)
- Add settings module to manage general settings and specific input values which are important for the metadata integration
- Update to latest OMI ... using local python 3.10 venv for now

jh-RLI commented Aug 7, 2025

For now I have to work with a local venv using Python 3.10, because OMI and recent oemetadata releases do not support Python < 3.10.


jh-RLI commented Aug 13, 2025

After working on this for some time, I have concluded that this code should be moved to OMI later on. It seems the egon-data Python version will be updated at some point.

The implementation has evolved into an OemetadataBuilder tool. It implements the oemetadata v2 structure using classes that manage the metadata of a single resource (dataset) and of a full datapackage (all datasets in a collection). It relies on OMI validation to ensure that the input has a valid structure, and it can be used to generate the properties required for valid oemetadata.

Developers can create YAML files, which are collected in the new overlays module. To keep things tidy, each file should be stored in a directory named after its dataset. Template functionality will soon be available in OMI (see the PR there). It will make it possible to cut down on repeated information that is static across all resources: with the template system, this kind of metadata can be added from a single YAML definition.
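A minimal sketch of the template idea, not the OMI implementation: static metadata is defined once and each dataset overlay only adds or overrides what differs. The function name `apply_template`, the keys, and all values below are hypothetical placeholders, and the overlay dict stands in for what would be loaded from a YAML file. The actual template system in OMI may merge differently.

```python
from copy import deepcopy


def apply_template(template: dict, overlay: dict) -> dict:
    """Return a new dict with template values overridden by overlay values.

    Nested dicts are merged recursively; any other value in the overlay
    (including lists) replaces the template value as a whole.
    """
    merged = deepcopy(template)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_template(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical static metadata shared by all resources (placeholder values).
template = {
    "context": {"title": "placeholder project title"},
    "licenses": [{"name": "placeholder-license"}],
}

# Hypothetical per-dataset overlay, as it might look after loading a YAML file.
overlay = {
    "name": "example_schema.example_table",
    "context": {"grantNo": "placeholder"},
}

resource = apply_template(template, overlay)
```

Here `resource` carries the shared `licenses` entry and both `context` keys, while `template` itself stays unchanged and can be reused for the next dataset.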

The builder can be added to the tasks of a dataset. This triggers an instance of the ResourceBuilder and the OemetadataBuilder as a step in the DAG pipeline definition, which collects all resources. Each table can still use the SQL COMMENT ON TABLE to store its metadata. We could also change this approach and store the oemetadata in a JSONB column as part of a new model with an FK relation to the table resource.
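For illustration, storing metadata as a table comment could be sketched roughly as follows. This is not the code from this PR: `quote_ident` and `comment_on_table_sql` are hypothetical helpers, the schema and table names are placeholders, and the quoting assumes PostgreSQL with `standard_conforming_strings` enabled (the default). In real code the database driver's own quoting helpers would be the safer choice.

```python
import json


def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def comment_on_table_sql(schema: str, table: str, metadata: dict) -> str:
    """Build a COMMENT ON TABLE statement carrying the metadata as JSON."""
    payload = json.dumps(metadata, ensure_ascii=False)
    # Single quotes inside a string literal are escaped by doubling them.
    literal = "'" + payload.replace("'", "''") + "'"
    return (
        f"COMMENT ON TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"IS {literal};"
    )


# Placeholder names and metadata, only to show the shape of the statement.
sql = comment_on_table_sql(
    "example_schema", "example_table", {"title": "it's a placeholder"}
)
```

The JSONB alternative mentioned above would replace this statement with an insert into a separate metadata model, which makes the metadata queryable but adds a relation to maintain.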

Another step would be publishing to the OEP, but that is a separate PR.

jh-RLI added 4 commits August 13, 2025 21:11
…mplementation a bit more stable until it is torn down
- Builder to generate, create and customize oemetadata for egon-data datasets
- Implement draft builder tool -> will be enhanced further by using mixins to reduce class complexity
- The oemetadataBuilder will become a more complex module and be split into a resource and a datapackage builder

jh-RLI commented Aug 14, 2025

FYI @CarlosEpia

I went further down the road of updating egon-data to Python 3.10. Locally I can now run the pipeline, but some tasks are failing, so I assume more updates are needed:

  • I used uv to set everything up (including the Python install), which seems to solve some dependency issues.
  • Mainly, we have to update to a new major Airflow version (2).
  • The provider package version matters if we want to keep using the PostgresOperator, and the import has to be updated to the location below:
    • apache-airflow-providers-postgres==5.14.0
    • from airflow.providers.postgres.operators.postgres import PostgresOperator
  • For the shapely update, see https://shapely.readthedocs.io/en/stable/migration.html#other-deprecated-functionality
    • switch to shapely.ops.unary_union
  • Then I was able to install some packages manually (simply by running uv pip install package again for packages that had failed at first; unfortunately I did not document which ones) :(
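A small sketch of how the changed PostgresOperator import could be handled during the migration, assuming it is useful to tolerate both layouts for a while. The first module path is the one listed above; the second is the Airflow 1.x location and is included here only as an assumed fallback. `load_postgres_operator` is a hypothetical helper, not part of egon-data or Airflow.

```python
import importlib

# Candidate module paths, newest first. The second entry is the Airflow 1.x
# location and is only an assumed fallback for not-yet-migrated environments.
CANDIDATES = (
    "airflow.providers.postgres.operators.postgres",
    "airflow.operators.postgres_operator",
)


def load_postgres_operator():
    """Return the PostgresOperator class, or None if it cannot be imported."""
    for module_path in CANDIDATES:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Module (or Airflow itself) is not available; try the next path.
            continue
        operator = getattr(module, "PostgresOperator", None)
        if operator is not None:
            return operator
    return None


PostgresOperator = load_postgres_operator()
```

Once the whole project runs on the pinned provider version, the plain import from the list above is simpler and this helper can go away.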

This is my current env setup:

$> uv pip list

Using Python 3.10.12 environment at: .venv_py310
Package Version Editable project location

affine 2.4.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
alembic 1.16.4
annotated-types 0.7.0
anyio 4.10.0
apache-airflow 2.11.0
apache-airflow-providers-common-compat 1.7.3
apache-airflow-providers-common-io 1.6.2
apache-airflow-providers-common-sql 1.27.4
apache-airflow-providers-fab 1.5.3
apache-airflow-providers-ftp 3.13.2
apache-airflow-providers-http 5.3.3
apache-airflow-providers-imap 3.9.2
apache-airflow-providers-postgres 5.14.0
apache-airflow-providers-sendgrid 4.1.3
apache-airflow-providers-smtp 2.1.2
apache-airflow-providers-sqlite 4.1.2
apispec 6.8.2
appdirs 1.4.4
argcomplete 3.6.2
asgiref 3.9.1
async-timeout 5.0.1
asyncpg 0.30.0
atlite 0.2.11
attrs 25.3.0
babel 2.17.0
beautifulsoup4 4.13.4
black 25.1.0
blinker 1.9.0
blosc2 3.6.1
bottleneck 1.5.0
cachelib 0.13.0
cdsapi 0.7.6
certifi 2025.8.3
cffi 1.17.1
cftime 1.6.4.post1
chardet 5.2.0
charset-normalizer 3.4.2
click 8.2.1
click-plugins 1.1.1.2
clickclick 20.10.2
cligj 0.7.2
cloudpickle 3.1.1
colorama 0.4.6
colorlog 6.9.0
configargparse 1.7.1
configupdater 3.2
connection-pool 0.0.3
connexion 2.14.2
contourpy 1.3.2
cron-descriptor 1.4.5
croniter 6.0.0
cryptography 45.0.6
cycler 0.12.1
dask 2023.10.1
datrie 0.8.2
deprecated 1.2.18
deprecation 2.1.0
descartes 1.1.0
dill 0.4.0
disaggregator 0.0.0
dnspython 2.7.0
docutils 0.22
ecdsa 0.19.1
ecmwf-datastores-client 0.4.0
egon-data 1.0.0 /home/jh/projekte/reGon/code/eGon-data
email-validator 2.2.0
entsoe-py 0.7.1
et-xmlfile 2.0.0
exceptiongroup 1.3.0
fastjsonschema 2.21.1
filelock 3.18.0
fiona 1.9.6
flake8 7.3.0
flask 2.2.5
flask-appbuilder 4.5.3
flask-babel 2.0.0
flask-caching 2.3.1
flask-jwt-extended 4.7.1
flask-limiter 3.12
flask-login 0.6.3
flask-session 0.5.0
flask-sqlalchemy 2.5.1
flask-wtf 1.2.2
fonttools 4.59.0
frictionless 5.18.1
frozenlist 1.7.0
fsspec 2025.7.0
fuzzywuzzy 0.18.0
geoalchemy2 0.6.3
geographiclib 2.0
geopandas 1.1.1
geopy 2.4.1
geovoronoi 0.4.0
gitdb 4.0.12
gitpython 3.1.45
google-re2 1.1.20250805
googleapis-common-protos 1.70.0
greenlet 3.2.3
grpcio 1.74.0
gunicorn 23.0.0
h11 0.16.0
holidays 0.78
httpcore 1.0.9
httpx 0.28.1
humanize 4.12.3
idna 3.10
importlib-metadata 8.7.0
importlib-resources 5.13.0
inflection 0.5.1
iniconfig 2.1.0
isodate 0.7.2
isort 6.0.1
itsdangerous 2.2.0
jinja2 3.1.6
jmespath 1.0.1
jsonschema 4.25.0
jsonschema-specifications 2025.4.1
jupyter-core 5.8.1
kiwisolver 1.4.8
lazy-object-proxy 1.11.0
limits 5.5.0
linkify-it-py 2.0.3
locket 1.0.0
lockfile 0.12.2
loguru 0.7.3
mako 1.3.10
markdown-it-py 3.0.0
marko 2.1.4
markupsafe 3.0.2
marshmallow 3.26.1
marshmallow-oneofschema 3.2.0
marshmallow-sqlalchemy 0.28.2
matplotlib 3.10.5
mccabe 0.7.0
mdit-py-plugins 0.4.2
mdurl 0.1.2
methodtools 0.4.7
more-itertools 10.7.0
msgpack 1.1.1
multidict 6.6.3
multiurl 0.3.7
mypy-extensions 1.1.0
nbformat 5.10.4
ndindex 1.10.0
netcdf4 1.7.2
networkx 3.4.2
numexpr 2.11.0
numpy 2.2.6
oedialect 0.0.8
oemetadata 2.0.4
omi 1.1.0
openpyxl 3.1.0
opentelemetry-api 1.36.0
opentelemetry-exporter-otlp 1.36.0
opentelemetry-exporter-otlp-proto-common 1.36.0
opentelemetry-exporter-otlp-proto-grpc 1.36.0
opentelemetry-exporter-otlp-proto-http 1.36.0
opentelemetry-proto 1.36.0
opentelemetry-sdk 1.36.0
opentelemetry-semantic-conventions 0.57b0
ordered-set 4.1.0
packaging 25.0
pandas 2.3.1
partd 1.4.2
pathspec 0.12.1
pendulum 3.1.0
petl 1.7.17
pillow 11.3.0
platformdirs 4.3.8
pluggy 1.6.0
ply 3.11
prison 0.2.1
progressbar2 4.5.0
propcache 0.3.2
protobuf 6.31.1
psutil 7.0.0
psycopg 3.2.9
psycopg-binary 3.2.9
psycopg2 2.9.10
psycopg2-binary 2.9.10
pulp 2.7.0
py-cpuinfo 9.0.0
pyaml 25.7.0
pycodestyle 2.14.0
pycparser 2.22
pydantic 2.11.7
pydantic-core 2.33.2
pyflakes 3.4.0
pygments 2.19.2
pyjwt 2.10.1
pyogrio 0.11.1
pyomo 6.9.3
pyparsing 3.2.3
pyproj 3.7.1
pypsa 0.20.1
pytest 8.4.1
python-daemon 3.1.2
python-dateutil 2.9.0.post0
python-http-client 3.3.7
python-nvd3 0.16.0
python-slugify 8.0.4
python-utils 3.9.1
pytz 2025.2
pyyaml 6.0.2
rasterio 1.4.3
ratelimiter 1.2.0.post0
rdflib 7.1.4
referencing 0.36.2
requests 2.32.4
requests-toolbelt 1.0.0
rfc3339-validator 0.1.4
rfc3986 2.0.0
rich 13.9.4
rich-argparse 1.7.1
rioxarray 0.19.0
rpds-py 0.27.0
rtree 1.4.0
ruamel-yaml 0.17.40
ruamel-yaml-clib 0.2.12
saio 0.2.1
scipy 1.15.3
seaborn 0.13.2
sendgrid 6.12.4
setproctitle 1.3.6
setuptools 80.9.0
shapely 2.1.1
shellingham 1.5.4
simpleeval 1.0.3
six 1.17.0
smart-open 7.3.0.post1
smmap 5.0.2
snakemake 6.15.5
sniffio 1.3.1
soupsieve 2.7
sqlalchemy 1.4.54
sqlalchemy-jsonfield 1.0.2
sqlalchemy-utils 0.41.2
sqlparse 0.5.3
stopit 1.1.2
stringcase 1.2.0
tables 3.10.1
tabulate 0.9.0
tenacity 9.1.2
termcolor 3.1.0
text-unidecode 1.3
tomli 2.2.1
toolz 1.0.0
toposort 1.10
tqdm 4.67.1
traitlets 5.14.3
typer 0.16.0
typing-extensions 4.14.1
typing-inspection 0.4.1
tzdata 2025.2
uc-micro-py 1.0.3
universal-pathlib 0.2.6
urllib3 2.5.0
validators 0.35.0
werkzeug 2.2.3
wirerope 1.0.0
wrapt 1.17.2
wtforms 3.2.1
xarray 2025.6.1
xlrd 2.0.2
yarl 1.20.1
zipp 3.23.0

… submit comment functionality

- Also update path that reads the static oemetadata.json files stored in the metadata/results directory
- Update metadataVersion to at least v1.5.2, omi does not support older versions
add testing / zensus dags to gitignore
…ch works for the new metadata module and new omi functionality
… to the existing egon-data metadata JSON files
- remove string wrappers required for older versions of omi (used to parse metadata)

Development

Successfully merging this pull request may close these issues.

[FEATURE] Implement OMI based metadata validation
Update to OEMetadata Standard v2