Skip to content

stelar-eu/klms-deploy

Repository files navigation

Overview

This repository provides instructions, documentation, and examples regarding deployment of the Knowledge Lake Management System (KLMS) developed by the STELAR project. The STELAR KLMS supports and facilitates a holistic approach for FAIR (Findable, Accessible, Interoperable, Reusable) and AI-ready (high-quality, reliably labeled) data. It allows to (semi-)automatically turn a raw data lake into a knowledge lake by: (a) enhancing the data lake with a knowledge layer; and (b) developing and integrating a set of data management tools and workflows. The knowledge layer comprises: (a) a data catalog that offers automatically enhanced metadata for the raw data assets in the lake; and (b) a knowledge graph that semantically describes and interlinks these data assets using suitable domain ontologies and vocabularies. The provided STELAR tools and workflows offer novel functionalities for: (a) data discovery and quality management; (b) data linking and alignment, and (c) data annotation and synthetic data generation.

alt text

KLMS core components

  • STELAR API. The main entry point to the KLMS system, exposing RESTful endpoints for managing and searching resources in the KLMS. Houses the core services of the KLMS, including user management, dataset management, metadata extraction, and search functionalities, task and workflow invocation. Exposes a GUI for interacting with the KLMS system, the STELAR KLMS Console, supporting the full spectrum of KLMS functionalities.

  • Data Catalog of datasets in KLMS, deployed as a CKAN instance, mainly utilized under the hood.

  • Keycloak is used for Identity and Access Management;

  • PostgreSQL serves as the main relational database backbone for storing KLMS metadata and user information.

  • Ontop a knowledge graph, employing mappings from the database to a virtual RDF graph according to the KLMS ontology.

  • QUAY Registry, via a custom distribution for managing STELAR Data Analysis Tools container images.

  • MinIO serves as a storage layer for the data assets tracked by the Data Catalog as well as for tool images.

  • Redis is used as an in-memory data structure store for caching.

  • LLM-powered Semantic Dataset Search Facility is a tool for enhancing dataset search capabilities using large language models. It is integrated into the KLMS Console and implemented as a FastAPI service under the hood.

  • STELAR Resource Previewer is a streamlit-based tool for visualizing and exploring the resources of the data catalog artifacts. It is exposed via the central ingress controller of the KLMS deployment and embedded in the KLMS Console.

  • STELAR Profile Visualizer is a tool for visualizing and exploring the profiles of the data catalog artifacts. It is also exposed via the central ingress controller of the KLMS deployment and embedded in the KLMS Console.

The STELAR KLMS supports two alternative workflow engines:

  • In its Community Edition, it supports Apache Airflow, which is a very popular open-source platform for this purpose.

  • In its Professional and Enterprise editions, it supports the RapidMiner Studio & AI Hub, which is a widely used commercial platform for machine learning and data science workflows.

While both options have been well-tested in regards with their compatibility, the range of open-source tools and systems STELAR can integrate with is limitless. Integration can be achieved by the STELAR API directly or indirectly through the STELAR Python SDK.

Access to the STELAR API is provided either directly via its RESTful endpoints or via the STELAR Python SDK, a client library for interacting with the STELAR API. The SDK is available via PyPI and can be installed via pip:

pip install stelar_client

The source code of the SDK is available at its GitHub repository.

STELAR Toolkit — Tools Index

Discovery

  • Synopsis Data Engine (SDE) — data stream summarization with persistent synopses.
    Lang: Java · Integration: In-cluster via client · Partners: ARC, TUE
    GitHub: stelar-eu/Synopses-Data-Engine

  • Correlation Detective — scalable multivariate correlation mining for vector datasets.
    Lang: Java · Integration: In-cluster · Partner: TUE
    GitHub: stelar-eu/correlation-detective · Docker: stelareu/correlation-detective

  • Forecasting Model Orchestrator (FOMO) — orchestrates/optimizes time-series forecasting models under a compute budget.
    Lang: Python 3.10 · Integration: In-cluster · Partner: TUE
    GitHub: stelar-eu/fomo · Docker: stelareu/fomo

  • TableSage — LLM-powered tabular profiling, summarization, and metadata enrichment.
    Lang: Python 3.10 · Integration: In-cluster · Partner: ARC
    GitHub: stelar-eu/TableSage-Docker · Docker: stelareu/tablesage

  • Data Profiler — automatic profiling for tabular, time-series, raster, text, hierarchical, and RDF data.
    Lang: Python 3.8 · Integration: In-cluster · Partner: ARC
    GitHub: stelar-eu/stelardataprofiler-docker · Docker: stelareu/data-profiler

Interlinking

  • pyJedAI Entity Matching (pyJedAI EM) — duplicate detection across datasets via multi-stage pipelines.
    Lang: Python 3.9 · Integration: In-cluster · Partner: UoA
    GitHub: stelar-eu/pyjedai-em · Docker: stelareu/pyjedai-em

  • pyJedAI Schema Matching (pyJedAI SM) — schema alignment for highly heterogeneous datasets.
    Lang: Python 3.9 · Integration: In-cluster · Partner: UoA
    GitHub: stelar-eu/pyjedai-sm · Docker: stelareu/pyjedai-sm

  • JedAI-spatial — interlinking for geospatial RDF; computes DE9IM topological relations.
    Lang: Java · Integration: In-cluster · Partner: UoA
    GitHub: stelar-eu/jedai-spatial · Docker: stelareu/jedai-spatial

  • Spatio-Temporal Time Series Extraction (TS-Extraction) — extracts per-pixel/field LAI statistics over time from satellite imagery.
    Lang: Python 3.10 · Integration: In-cluster · Partner: TUE
    GitHub: stelar-eu/spatiotemporal_timeseries_extraction · Docker: stelareu/ts-extract

  • Time Series Imputation (TS-Imputation) — SOTA imputation for time series (DL, statistical, and LLM-based methods).
    Lang: C#, Python 3.12 · Integration: In-cluster / Remote · Partner: ARC
    GitHub: stelar-eu/TS-Impute · Docker: stelareu/ts-impute

  • Missing Data Interpolation — fills gaps in daily weather data via inverse-distance weighted interpolation.
    Lang: Python 3.8 · Integration: In-cluster · Partner: ABACO
    GitHub: stelar-eu/missing-data-interpolation · Docker: stelareu/missing-data-interpolation

Annotation

  • Field Segmentation — automatic agricultural field boundary extraction from satellite imagery (RGB/NIR).
    Lang: Python 3.9 · Integration: In-cluster · Partner: TUE
    GitHub: stelar-eu/field_segmentation · Docker: stelareu/field-segmentation

  • AvengER — LLM ensembling/fine-tuning for entity resolution with configurable workflows and evaluation.
    Lang: Python 3.8 · Integration: In-cluster · Partner: ARC
    GitHub: stelar-eu/AvengER-Docker · Docker: stelareu/avenger

  • Generic NER — translation, summarization, NER, main-entity selection, and entity linking pipeline.
    Lang: Python 3.12 · Integration: In-cluster · Partner: ARC
    GitHub: stelar-eu/GenericNER · Docker: stelareu/generic-ner

  • Crop Classification — DL pipeline on LAI-derived time series for crop type & growth prediction.
    Lang: Python 3.11 · Integration: In-cluster / Remote · Partner: UniBwM
    GitHub: stelar-eu/crop_prediction_tool · Docker: stelareu/crop-prediction

  • Vocational Score Raster — generates rasterized vocational-skill score maps across regions.
    Lang: Python 3.8 · Integration: In-cluster · Partner: ABACO
    GitHub: stelar-eu/vocational-score-raster · Docker: stelareu/vocational-score-raster

  • Agri Products Match — matches fertilizers/pesticides to reference products via NPK/active substances; multilingual.
    Lang: Python 3.8 · Integration: In-cluster · Partner: ABACO
    GitHub: stelar-eu/agri-products-match · Docker: stelareu/agri-products-match

  • Hazard Classification — incident reporting for the agri-food domain.
    Lang: Python · Integration: None · Partner: UniBwM
    GitHub: stelar-eu/Hazard-classification · Docker: stelareu/hazard-classification

License

The contents of this project are licensed under the GPL-2.0 license.

About

Deployment related code and configuration artifacts

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •