[FEATURE] Incremental / Delta Processing #323

@KaifAhmad1

Description

Process only what changed instead of reprocessing everything. Use versioned graphs and deltas to drive fast, cost‑efficient pipelines.


Relevant Modules and Test Folders

  • Implementation is split across:
    • semantica/change_management/ (snapshot/version handling).
    • semantica/triplet_store/ (delta computation between graphs).
    • semantica/pipeline/ (delta‑aware steps).
  • Tests belong under existing folders:
    • tests/change_management/
    • tests/triplet_store/
    • tests/pipeline/
  • No new packages; everything reuses the current semantica/ and tests/ layout.

High‑Level Behaviour

  • Captures snapshots of graphs over time (e.g. nightly loads, ingestion runs).
  • Computes deltas between snapshots (added/removed triples).
  • Allows validation and transformation jobs to operate on deltas instead of full datasets.
  • Integrates with the pipeline engine to support delta‑aware workflows.
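As a rough illustration of the first two points, a delta between two snapshots can be treated as a pair of set differences. The sketch below is not Semantica's API: `Snapshot` and `diff_snapshots` are hypothetical names, and triples are modelled as plain string tuples held in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Snapshot:
    """Immutable copy of a graph at one point in time (hypothetical type)."""
    version_id: str
    triples: FrozenSet[Triple]
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, FrozenSet[Triple]]:
    """Return the triples added to and removed from `old` to reach `new`."""
    return {
        "added": new.triples - old.triples,
        "removed": old.triples - new.triples,
    }


monday = Snapshot("v1", frozenset({
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob", "ex:worksAt", "ex:acme"),
}))
tuesday = Snapshot("v2", frozenset({
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob", "ex:worksAt", "ex:globex"),
}))

delta = diff_snapshots(monday, tuesday)
print(sorted(delta["added"]))    # [('ex:bob', 'ex:worksAt', 'ex:globex')]
print(sorted(delta["removed"]))  # [('ex:bob', 'ex:worksAt', 'ex:acme')]
```

A changed value shows up as one removed triple plus one added triple, which is why the issue describes deltas only in terms of added/removed.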

Operational Benefits

  • Large datasets become practical to manage and update frequently.
  • Reduces compute cost and processing time, because downstream steps handle only the changed triples rather than the full graph.
  • Enables near real‑time pipelines and more frequent deployments.
  • Essential for production‑grade, large‑scale semantic infrastructure.

Current Implementation

  • Triplet store and SPARQL
    • semantica/triplet_store/triplet_store.py
    • semantica/triplet_store/query_engine.py
  • Change management and versioning
    • semantica/change_management/version_storage.py
    • semantica/change_management/managers.py
  • Pipeline
    • semantica/pipeline/pipeline_builder.py
    • semantica/pipeline/execution_engine.py
  • Docs and tests
    • docs/reference/change_management.md
    • docs/reference/triplet_store.md
    • docs/reference/pipeline.md
    • tests/change_management/test_temporal_versioning.py
    • tests/change_management/test_integration_realworld.py
    • tests/triplet_store/test_triplet_store.py
    • tests/pipeline/test_pipeline_comprehensive.py

Required Additions and Modifications

  • Snapshot APIs
    • In semantica/change_management/version_storage.py:
      • Expose explicit functions to create graph snapshots and to look them up by version ID.
  • Delta computation
    • In semantica/triplet_store/triplet_store.py or query_engine.py:
      • Implement compute_delta(old_graph_uri, new_graph_uri) returning added/removed triples or delta graphs.
  • Delta‑aware pipeline configuration
    • In semantica/pipeline/pipeline_builder.py:
      • Add delta_mode and version references to step configuration.
    • In semantica/pipeline/execution_engine.py:
      • Use version_storage and compute_delta to run steps on deltas when enabled.
  • Retention and cleanup
    • In semantica/change_management/managers.py:
      • Provide policies and utilities for pruning old snapshots and deltas.
  • Documentation and tests
    • Extend docs and tests listed above with:
      • Examples and assertions for full‑graph vs delta‑mode processing.
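For the retention item, a "keep last N versions" policy might look roughly like the sketch below. Everything here is an assumption for illustration: `prune_old_versions` is a hypothetical helper, snapshots are held in an insertion-ordered dict keyed by version ID, and deltas are keyed by `(base_version_id, target_version_id)`. The real `managers.py` interfaces may differ.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
DeltaKey = Tuple[str, str]  # (base_version_id, target_version_id)


def prune_old_versions(
    snapshots: Dict[str, Set[Triple]],
    deltas: Dict[DeltaKey, dict],
    keep_last: int,
) -> List[str]:
    """Drop all but the newest `keep_last` snapshots, plus deltas that use them.

    Assumes `snapshots` preserves insertion order, oldest first (true for
    plain dicts in Python 3.7+). Returns the pruned version IDs.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = list(snapshots)
    pruned = ordered[:-keep_last]
    for version_id in pruned:
        del snapshots[version_id]
    # Design choice of this sketch: a delta is dropped once either endpoint is gone.
    for key in [k for k in deltas if k[0] in pruned or k[1] in pruned]:
        del deltas[key]
    return pruned


snapshots = {"v1": set(), "v2": set(), "v3": set(), "v4": set()}
deltas = {("v1", "v2"): {}, ("v2", "v3"): {}, ("v3", "v4"): {}}
removed = prune_old_versions(snapshots, deltas, keep_last=2)
print(removed)          # ['v1', 'v2']
print(list(snapshots))  # ['v3', 'v4']
print(list(deltas))     # [('v3', 'v4')]
```

Whether orphaned deltas should be deleted or archived for audit purposes is a policy question the issue leaves open.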

Step‑by‑Step Implementation Guide

  1. Graph snapshot capture

    • In change_management/version_storage.py:
      • Ensure there is an API to:
        • Persist a snapshot of a given graph (by URI or logical name).
        • Tag snapshots with version IDs, timestamps, and metadata.
      • Backing storage can reuse existing SQLite or in‑memory backends.
  2. Delta computation helpers

    • In triplet_store/triplet_store.py or query_engine.py:
      • Implement a function such as:
        • compute_delta(old_graph_uri: str, new_graph_uri: str) -> Dict[str, Any]
      • Use SPARQL or native backend operations to compute:
        • added_triples
        • removed_triples
      • Represent deltas as:
        • lists of triples OR
        • named graphs like old_graph_uri + "-delta-added".
  3. Delta‑aware jobs in pipeline

    • In pipeline_builder.py:
      • Extend pipeline step configuration with:
        • delta_mode: bool
        • base_version_id, target_version_id (or equivalent).
    • In execution_engine.py:
      • When delta_mode is true:
        • Resolve the relevant snapshot URIs via version_storage.
        • Compute delta using the helper.
        • Pass only the delta graph or triple set to downstream steps (validation, enrichment, export).
  4. Configuration and cleanup

    • In change_management/managers.py:
      • Provide options for:
        • snapshot retention policies (e.g. keep last N versions).
        • automatic cleanup of obsolete snapshot graphs and delta graphs.
  5. Documentation

    • In docs/reference/change_management.md:
      • Add “Incremental / Delta Processing” section with:
        • A simple example: two snapshots, computed delta, and a delta‑aware validation job.
    • Update pipeline documentation to show how to enable delta_mode on steps.
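Step 2 mentions SPARQL as one way to compute a delta. A minimal sketch of that option, assuming both snapshots live as named graphs in the same store, is to build one `FILTER NOT EXISTS` query per direction. `build_delta_queries` is a hypothetical helper, not an existing function in `query_engine.py`, and the `urn:graph:` URIs are placeholders.

```python
from typing import Dict


def _check_uri(uri: str) -> str:
    """Reject characters that would break out of the <...> IRI syntax."""
    if not uri or any(ch in uri for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in uri):
        raise ValueError(f"unsafe graph URI: {uri!r}")
    return uri


def build_delta_queries(old_graph_uri: str, new_graph_uri: str) -> Dict[str, str]:
    """Build SPARQL 1.1 SELECT queries for triples added/removed between two named graphs."""
    old, new = _check_uri(old_graph_uri), _check_uri(new_graph_uri)
    # Select triples present in one graph that have no exact match in the other.
    template = (
        "SELECT ?s ?p ?o WHERE {{\n"
        "  GRAPH <{present}> {{ ?s ?p ?o }}\n"
        "  FILTER NOT EXISTS {{ GRAPH <{absent}> {{ ?s ?p ?o }} }}\n"
        "}}"
    )
    return {
        "added": template.format(present=new, absent=old),
        "removed": template.format(present=old, absent=new),
    }


queries = build_delta_queries("urn:graph:v1", "urn:graph:v2")
print(queries["added"])
```

The query results would then populate `added_triples` and `removed_triples`. One caveat worth testing: triples containing blank nodes may not compare reliably across snapshots, so a real implementation would probably need a policy for them.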

Done When

  • Versioned snapshots of graphs can be created and referenced through change_management.
  • A public helper computes deltas between two snapshots using the existing triplet store.
  • At least one pipeline test demonstrates:
    • Full‑graph processing vs delta‑based processing, with equivalent logical results.
  • Documentation clearly describes how to configure and use incremental / delta processing in Semantica.
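The equivalence check in the pipeline test could be shaped roughly like this. It is a sketch under stated assumptions: `validate` and `validate_delta` are hypothetical stand-ins for a pipeline step, and the rule is deliberately per-triple, so earlier results can be updated from the delta alone.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
ALLOWED_PREDICATES = {"ex:worksAt", "ex:knows"}


def validate(triples: Set[Triple]) -> Set[Triple]:
    """Return the triples whose predicate is not allowed (a per-triple rule)."""
    return {t for t in triples if t[1] not in ALLOWED_PREDICATES}


def validate_delta(
    previous_violations: Set[Triple], added: Set[Triple], removed: Set[Triple]
) -> Set[Triple]:
    """Update earlier results using only the changed triples."""
    return (previous_violations - removed) | validate(added)


def test_delta_mode_matches_full_graph() -> None:
    old = {
        ("ex:alice", "ex:worksAt", "ex:acme"),
        ("ex:bob", "ex:likes", "ex:tea"),      # violation, later removed
    }
    new = {
        ("ex:alice", "ex:worksAt", "ex:acme"),
        ("ex:carol", "ex:owns", "ex:car"),     # violation, newly added
    }
    added, removed = new - old, old - new

    full = validate(new)                                          # full-graph mode
    incremental = validate_delta(validate(old), added, removed)   # delta mode

    assert incremental == full == {("ex:carol", "ex:owns", "ex:car")}


test_delta_mode_matches_full_graph()
```

Rules that span several triples (for example cardinality constraints) generally need context from the full graph around each changed triple, so the test suite may want at least one such case as well.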

Metadata

Labels

core (Core Semantica logic and abstractions), enhancement (New feature or request), help wanted (Extra attention is needed)

Projects

Status

In progress
