Description
Process only what changed instead of reprocessing everything. Use versioned graphs and deltas to drive fast, cost‑efficient pipelines.
Relevant Modules and Test Folders
- Implementation is split across:
  - semantica/change_management/ (snapshot/version handling)
  - semantica/triplet_store/ (delta computation between graphs)
  - semantica/pipeline/ (delta-aware steps)
- Tests belong under existing folders:
  - tests/change_management/
  - tests/triplet_store/
  - tests/pipeline/
- No new packages; everything reuses the current semantica/ and tests/ layout.
High‑Level Behaviour
- Captures snapshots of graphs over time (e.g. nightly loads, ingestion runs).
- Computes deltas between snapshots (added/removed triples).
- Allows validation and transformation jobs to operate on deltas instead of full datasets.
- Integrates with the pipeline engine to support delta‑aware workflows.
Operational Benefits
- Large datasets become practical to manage and update frequently.
- Dramatically reduces compute cost and processing time.
- Enables near real‑time pipelines and more frequent deployments.
- Essential for production‑grade, large‑scale semantic infrastructure.
Current Implementation
- Triplet store and SPARQL
- semantica/triplet_store/triplet_store.py
- semantica/triplet_store/query_engine.py
- Change management and versioning
- semantica/change_management/version_storage.py
- semantica/change_management/managers.py
- Pipeline
- semantica/pipeline/pipeline_builder.py
- semantica/pipeline/execution_engine.py
- Docs and tests
- docs/reference/change_management.md
- docs/reference/triplet_store.md
- docs/reference/pipeline.md
- tests/change_management/test_temporal_versioning.py
- tests/change_management/test_integration_realworld.py
- tests/triplet_store/test_triplet_store.py
- tests/pipeline/test_pipeline_comprehensive.py
Required Additions and Modifications
- Snapshot APIs
- In semantica/change_management/version_storage.py:
- Provide explicit functions to create and reference graph snapshots by version ID.
- Delta computation
- In semantica/triplet_store/triplet_store.py or query_engine.py:
- Implement compute_delta(old_graph_uri, new_graph_uri) returning added/removed triples or delta graphs.
- Delta‑aware pipeline configuration
- In semantica/pipeline/pipeline_builder.py:
- Add delta_mode and version references to step configuration.
- In semantica/pipeline/execution_engine.py:
- Use version_storage and compute_delta to run steps on deltas when enabled.
- Retention and cleanup
- In semantica/change_management/managers.py:
- Provide policies and utilities for pruning old snapshots and deltas.
- Documentation and tests
- Extend docs and tests listed above with:
- Examples and assertions for full‑graph vs delta‑mode processing.
Step‑by‑Step Implementation Guide
- Graph snapshot capture
- In change_management/version_storage.py:
- Ensure there is an API to:
- Persist a snapshot of a given graph (by URI or logical name).
- Tag snapshots with version IDs, timestamps, and metadata.
- Backing storage can reuse existing SQLite or in‑memory backends.
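The issue does not pin down the snapshot API, so the following is only a minimal in-memory sketch; `create_snapshot` and `get_snapshot` are hypothetical names standing in for whatever version_storage.py ends up exposing, and the dict-backed store stands in for the existing SQLite or in-memory backends:

```python
import time
import uuid
from typing import Any, Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# In-memory stand-in for the real backing storage in version_storage.py.
_SNAPSHOTS: Dict[str, Dict[str, Any]] = {}

def create_snapshot(graph_uri: str, triples: Set[Triple],
                    metadata: Optional[Dict[str, Any]] = None) -> str:
    """Persist a snapshot of a graph and return its version ID."""
    version_id = uuid.uuid4().hex
    _SNAPSHOTS[version_id] = {
        "graph_uri": graph_uri,
        "triples": frozenset(triples),   # immutable copy of the graph state
        "timestamp": time.time(),
        "metadata": metadata or {},
    }
    return version_id

def get_snapshot(version_id: str) -> Dict[str, Any]:
    """Look up a previously stored snapshot by version ID."""
    return _SNAPSHOTS[version_id]
```

The frozen copy matters: a snapshot must not change when the live graph does, otherwise later deltas are computed against a moving target.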
- Delta computation helpers
- In triplet_store/triplet_store.py or query_engine.py:
- Implement a function such as:
- compute_delta(old_graph_uri: str, new_graph_uri: str) -> Dict[str, Any]
- Use SPARQL or native backend operations to compute:
- added_triples
- removed_triples
- Represent deltas as:
- lists of triples OR
- named graphs like old_graph_uri + "-delta-added".
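The core of the helper is plain set difference. The real `compute_delta` would take graph URIs and resolve them through the triplet store or SPARQL, as described above; this sketch skips that resolution and operates directly on triple sets to show the shape of the result:

```python
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def compute_delta(old_triples: Set[Triple],
                  new_triples: Set[Triple]) -> Dict[str, FrozenSet[Triple]]:
    """Return the triples added to and removed from a graph between snapshots."""
    return {
        # Present now but not before -> added.
        "added_triples": frozenset(new_triples - old_triples),
        # Present before but not now -> removed.
        "removed_triples": frozenset(old_triples - new_triples),
    }
```

Returning both directions is what lets downstream steps handle retractions as well as insertions; a delta of only added triples cannot express a deleted fact.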
- Delta-aware jobs in pipeline
- In pipeline_builder.py:
- Extend pipeline step configuration with:
- delta_mode: bool
- base_version_id, target_version_id (or equivalent).
- In execution_engine.py:
- When delta_mode is true:
- Resolve the relevant snapshot URIs via version_storage.
- Compute delta using the helper.
- Pass only the delta graph or triple set to downstream steps (validation, enrichment, export).
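One way to sketch the step configuration and the dispatch in the execution engine; `StepConfig` and `resolve_input` are hypothetical names, and the `snapshots` mapping stands in for lookups through version_storage:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]

@dataclass
class StepConfig:
    """Pipeline step configuration extended with delta-mode fields."""
    name: str
    delta_mode: bool = False
    base_version_id: Optional[str] = None
    target_version_id: Optional[str] = None

def resolve_input(step: StepConfig,
                  snapshots: Dict[str, Set[Triple]]) -> Set[Triple]:
    """Pick the triples a step should process: the full graph or a delta."""
    target = snapshots[step.target_version_id]
    if not step.delta_mode:
        return set(target)  # full-graph processing, current behaviour
    base = snapshots[step.base_version_id]
    # Delta mode: only triples added since the base snapshot flow downstream.
    # A full implementation would also surface removed triples (see the
    # delta computation helper) so steps can process retractions.
    return set(target) - set(base)
```

Keeping `delta_mode` off by default means existing pipelines keep their current full-graph semantics unchanged.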
- Configuration and cleanup
- In change_management/managers.py:
- Provide options for:
- snapshot retention policies (e.g. keep last N versions).
- automatic cleanup of obsolete snapshot graphs and delta graphs.
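A keep-last-N policy is the simplest retention rule; `prune_snapshots` below is a hypothetical helper that only decides *which* version IDs are obsolete, leaving the actual deletion of snapshot and delta graphs to the storage backend:

```python
from typing import Dict, List

def prune_snapshots(snapshots: Dict[str, float],
                    keep_last: int = 3) -> List[str]:
    """Return the version IDs to delete, keeping the keep_last newest snapshots.

    `snapshots` maps version ID -> creation timestamp (as stored in the
    snapshot metadata).
    """
    # Newest first, then everything past the retention window is prunable.
    ordered = sorted(snapshots, key=snapshots.get, reverse=True)
    return ordered[keep_last:]
```

Separating the policy decision from the deletion itself makes the policy easy to unit-test and to swap (e.g. for an age-based rule) without touching storage code.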
- Documentation
- In docs/reference/change_management.md:
- Add “Incremental / Delta Processing” section with:
- A simple example: two snapshots, computed delta, and a delta‑aware validation job.
- Update pipeline documentation to show how to enable delta_mode on steps.
Done When
- Versioned snapshots of graphs can be created and referenced through change_management.
- A public helper computes deltas between two snapshots using the existing triplet store.
- At least one pipeline test demonstrates:
- Full‑graph processing vs delta‑based processing, with equivalent logical results.
- Documentation clearly describes how to configure and use incremental / delta processing in Semantica.
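The equivalence check in the third criterion could be sketched like this, assuming a deterministic per-triple step (the `validate` function here is a placeholder, not Semantica's actual validation API):

```python
def validate(triples):
    """Placeholder per-triple step: keeps every well-formed triple."""
    return {t for t in triples if len(t) == 3}

def test_full_vs_delta_equivalence():
    old = {("s1", "p", "o1"), ("s2", "p", "o2")}
    new = {("s1", "p", "o1"), ("s3", "p", "o3")}
    added = new - old
    removed = old - new
    # Full reprocessing of the new snapshot ...
    full_result = validate(new)
    # ... must equal updating the old result with just the delta.
    delta_result = (validate(old) - removed) | validate(added)
    assert full_result == delta_result
```

Any step whose output is a per-triple function of its input satisfies this property, which is exactly what makes it safe to run in delta mode.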