A modular, efficient and config-driven Change Data Capture (CDC) micro-framework.
⚠️ Status: Active development. APIs, configuration, and semantics may change.
DeltaForge is a lightweight framework for building CDC pipelines that stream database changes into downstream systems such as Kafka, Redis, and NATS. It focuses on:
- User control: an embedded JavaScript engine lets users fully control what happens to each event.
- Config-driven pipelines: YAML-defined pipelines instead of bespoke code per use case.
- Cloud-native: designed and operated cloud-native first.
- Extensibility: add your own sources, processors, and sinks.

However, DeltaForge is NOT a DAG-based stream processor. It is meant to replace tools like Debezium and similar CDC connectors.
Get DeltaForge running in under 5 minutes:
```yaml
# pipeline.yaml
apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: my-first-pipeline
  tenant: demo
spec:
  source:
    type: mysql
    config:
      id: mysql-src
      dsn: ${MYSQL_DSN}
      tables: [mydb.users]
  processors: []
  sinks:
    - type: kafka
      config:
        id: kafka-sink
        brokers: ${KAFKA_BROKERS}
        topic: users.cdc
```
```bash
docker run --rm \
  -e MYSQL_DSN="mysql://user:pass@host:3306/mydb" \
  -e KAFKA_BROKERS="kafka:9092" \
  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \
  ghcr.io/vnvo/deltaforge:latest \
  --config /etc/deltaforge/pipeline.yaml
```

That's it! DeltaForge streams changes from `mydb.users` into the `users.cdc` Kafka topic.

Want Debezium-compatible output? Configure the sink envelope:

```yaml
sinks:
  - type: kafka
    config:
      id: kafka-sink
      brokers: ${KAFKA_BROKERS}
      topic: users.cdc
    envelope:
      type: debezium
```

Output: see the Envelope Formats section below for the wire format.

📘 Full docs · Configuration reference
| Built with | Sources | Processors | Sinks | Output Formats |
|---|---|---|---|---|
| Rust | MySQL · PostgreSQL | JavaScript | Kafka · Redis · NATS | Native · Debezium · CloudEvents |
- **Sources**
  - MySQL binlog CDC with GTID support
  - PostgreSQL logical replication via pgoutput
  - Turso/libSQL CDC (experimental, behind the `turso` feature flag)
- **Schema Registry**
  - Source-owned schema types (source-native semantics)
  - Schema change detection and versioning
  - SHA-256 fingerprinting for stable change detection
- **Schema Sensing**
  - Automatic schema inference from JSON event payloads
  - Deep inspection for nested JSON structures
  - Configurable sampling with warmup and cache optimization
  - Drift detection comparing DB schema vs observed data
  - JSON Schema export for downstream consumers
- **Checkpoints**
  - Pluggable backends (file, SQLite with versioning, in-memory)
  - Configurable commit policies (all, required, quorum)
  - Transaction boundary preservation (best effort)
- **Processors**
  - JavaScript processors using `deno_core`: run user-defined functions (UDFs) in JS to transform batches of events (see the sketch after this list)
- **Sinks**
  - Kafka producer sink (via `rdkafka`)
  - Redis stream sink
  - NATS JetStream sink (via `async_nats`)
  - Configurable envelope formats: Native, Debezium, CloudEvents
  - JSON wire encoding (Avro planned and more to come)
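To give a feel for the UDF model, here is a minimal inline processor sketch. It assumes each event in the batch exposes `op` and `after` fields, as in the native envelope shown below; the exact shape handed to processors may differ, so treat the field names as illustrative:

```javascript
// Illustrative sketch only: the `after` / `password_hash` field names are assumptions,
// not a documented event contract.
function processBatch(events) {
  return events.map((event) => {
    // Redact a hypothetical sensitive column before the event reaches any sink.
    if (event.after && "password_hash" in event.after) {
      delete event.after.password_hash;
    }
    return event;
  });
}
```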
DeltaForge supports multiple envelope formats for ecosystem compatibility:
| Format | Output | Use Case |
|---|---|---|
| `native` | `{"op":"c","after":{...},"source":{...}}` | Lowest overhead, DeltaForge consumers |
| `debezium` | `{"schema":null,"payload":{...}}` | Drop-in Debezium replacement |
| `cloudevents` | `{"specversion":"1.0","type":"...","data":{...}}` | CNCF-standard, event-driven systems |
📝 Debezium Compatibility: DeltaForge uses Debezium's schemaless mode (`schema: null`), which matches Debezium's `JsonConverter` with `schemas.enable=false`, the recommended configuration for most Kafka deployments. This provides wire compatibility with existing Debezium consumers without the overhead of inline schemas (~500+ bytes per message).
💡 Migrating from Debezium? If your consumers already use `schemas.enable=false`, configure `envelope: { type: debezium }` on your sinks for drop-in compatibility. For consumers expecting inline schemas, you'll need Schema Registry integration (Avro encoding, planned).
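For a concrete picture, a single insert event in the `debezium` envelope with schemaless JSON looks roughly like the sketch below. The values are illustrative and the payload fields (`before`, `after`, `source`, `op`, `ts_ms`) follow Debezium's well-known conventions; consult the Envelope Formats docs for the exact fields DeltaForge emits:

```json
{
  "schema": null,
  "payload": {
    "before": null,
    "after": { "id": 42, "email": "user@example.com" },
    "source": { "connector": "mysql", "db": "mydb", "table": "users" },
    "op": "c",
    "ts_ms": 1700000000000
  }
}
```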
See Envelope Formats for detailed examples and wire format specifications.
- 🌐 Online docs: https://vnvo.github.io/deltaforge
- 📖 Local: `mdbook serve docs` (browse at http://localhost:3000)
Use the bundled `dev.sh` CLI to spin up the dependency stack and run common workflows consistently:

```bash
./dev.sh up      # start Postgres, MySQL, Kafka, Redis, NATS from docker-compose.dev.yml
./dev.sh ps      # view container status
./dev.sh check   # fmt --check + clippy + tests (matches CI)
```

See the Development guide for the full layout and additional info.
Pre-built multi-arch images (amd64/arm64) are available:
```bash
# From GitHub Container Registry
docker pull ghcr.io/vnvo/deltaforge:latest

# From Docker Hub
docker pull vnvohub/deltaforge:latest

# Debug variant (includes shell)
docker pull ghcr.io/vnvo/deltaforge:latest-debug
```

Or build locally:

```bash
docker build -t deltaforge:local .
```

Run it by mounting your pipeline specs (environment variables are expanded inside the YAML) and exposing the API and metrics ports:
```bash
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -v $(pwd)/examples/dev.yaml:/etc/deltaforge/pipelines.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  deltaforge:local \
  --config /etc/deltaforge/pipelines.yaml
```

Or with environment variables expanded inside the provided config:
```bash
# pull the container
docker pull ghcr.io/vnvo/deltaforge:latest

# run it
docker run --rm \
  -p 8080:8080 -p 9000:9000 \
  -e MYSQL_DSN="mysql://user:pass@host:3306/db" \
  -e KAFKA_BROKERS="kafka:9092" \
  -v $(pwd)/pipeline.yaml:/etc/deltaforge/pipeline.yaml:ro \
  -v deltaforge-checkpoints:/app/data \
  ghcr.io/vnvo/deltaforge:latest \
  --config /etc/deltaforge/pipeline.yaml
```

The container runs as a non-root user, writes checkpoints to `/app/data/df_checkpoints.json`, and listens on `0.0.0.0:8080` for the control plane API, with metrics served on `:9000`.
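Once the container is up, the control plane can be smoke-tested with the health endpoints documented in the API section below (ports as published above):

```bash
# Liveness probe: returns "ok"
curl http://localhost:8080/healthz

# Readiness view: returns {"status":"ready","pipelines":[...]}
curl http://localhost:8080/readyz
```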
DeltaForge guarantees at-least-once delivery through careful checkpoint ordering:
```
Source → Processor → Sink (deliver) → Checkpoint (save)
                          ↓
                  Sink acknowledges
                 successful delivery
                          ↓
                  THEN checkpoint saved
```
Checkpoints are never saved before events are delivered. A crash between delivery and checkpoint causes replay (duplicates possible), but never loss.
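Which sink acknowledgements a checkpoint waits for is controlled by the commit policy (`all`, `required`, or `quorum`). The fragment below mirrors the full configuration example further down; the comments are an interpretation of the fields, not documented semantics:

```yaml
sinks:
  - type: kafka
    config:
      id: orders-kafka
      brokers: ${KAFKA_BROKERS}
      topic: orders
    required: true     # presumably: this sink must acknowledge before checkpointing
commit_policy:
  mode: quorum         # one of: all | required | quorum
  quorum: 2            # presumably: at least 2 sinks must acknowledge
```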
The schema registry tracks schema versions with sequence numbers and optional checkpoint correlation. During replay, events are interpreted with the schema that was active when they were produced - even if the table structure has since changed.
Unlike tools that normalize all databases to a universal type system, DeltaForge lets each source define its own schema semantics. MySQL schemas capture MySQL types (`bigint(20) unsigned`, `json`), while PostgreSQL schemas preserve arrays and custom types. No lossy normalization, no universal type maintenance burden.
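As a purely conceptual illustration (not the registry's actual storage format), a source-owned MySQL schema entry can keep the native column type strings intact instead of mapping them to a universal type system:

```json
{
  "source": "mysql",
  "table": "shop.orders",
  "version": 3,
  "columns": [
    { "name": "id", "type": "bigint(20) unsigned" },
    { "name": "metadata", "type": "json" }
  ]
}
```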
The REST API exposes JSON endpoints for liveness, readiness, and pipeline lifecycle management. Routes are keyed by the `metadata.name` field from the pipeline spec and return `PipeInfo` payloads that include the pipeline name, status, and full configuration.
- `GET /healthz` - lightweight liveness probe returning `ok`.
- `GET /readyz` - readiness view returning `{"status":"ready","pipelines":[...]}` with the current pipeline states.
- `GET /pipelines` - list all pipelines with their current status and config.
- `POST /pipelines` - create a new pipeline from a full `PipelineSpec` document.
- `GET /pipelines/{name}` - get a single pipeline by name.
- `PATCH /pipelines/{name}` - apply a partial JSON patch to an existing pipeline (e.g., adjust batch or connection settings) and restart it with the merged spec.
- `DELETE /pipelines/{name}` - permanently delete a pipeline.
- `POST /pipelines/{name}/pause` - pause ingestion and processing for the pipeline.
- `POST /pipelines/{name}/resume` - resume a paused pipeline.
- `POST /pipelines/{name}/stop` - stop a running pipeline.
- `GET /pipelines/{name}/schemas` - list DB schemas for the pipeline.
- `GET /pipelines/{name}/sensing/schemas` - list inferred schemas (from sensing).
- `GET /pipelines/{name}/sensing/schemas/{table}` - get inferred schema details.
- `GET /pipelines/{name}/sensing/schemas/{table}/json-schema` - export as JSON Schema.
- `GET /pipelines/{name}/drift` - get drift detection results.
- `GET /pipelines/{name}/sensing/stats` - get schema sensing cache statistics.
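A few illustrative calls against the routes above, assuming the control plane on `localhost:8080` and the `orders-mysql-to-kafka` pipeline from the configuration example below (the `{table}` path value is an assumption):

```bash
# List all pipelines with their status and config
curl http://localhost:8080/pipelines

# Pause and resume a pipeline by its metadata.name
curl -X POST http://localhost:8080/pipelines/orders-mysql-to-kafka/pause
curl -X POST http://localhost:8080/pipelines/orders-mysql-to-kafka/resume

# Export an inferred table schema as JSON Schema
curl http://localhost:8080/pipelines/orders-mysql-to-kafka/sensing/schemas/shop.orders/json-schema
```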
Pipelines are defined as YAML documents that map directly to the internal PipelineSpec type.
Environment variables are expanded before parsing, so secrets and URLs can be injected at runtime.
```yaml
metadata:
  name: orders-mysql-to-kafka
  tenant: acme
spec:
  sharding:
    mode: hash
    count: 4
    key: customer_id
  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders
  processors:
    - type: javascript
      id: my-custom-transform
      inline: |
        function processBatch(events) {
          return events;
        }
      limits:
        cpu_ms: 50
        mem_mb: 128
        timeout_ms: 500
  sinks:
    - type: kafka
      config:
        id: orders-kafka
        brokers: ${KAFKA_BROKERS}
        topic: orders
      envelope:
        type: debezium
        encoding: json
      required: true
      exactly_once: false
    - type: redis
      config:
        id: orders-redis
        uri: ${REDIS_URI}
        stream: orders
      envelope:
        type: native
        encoding: json
  batch:
    max_events: 500
    max_bytes: 1048576
    max_ms: 1000
    respect_source_tx: true
  commit_policy:
    mode: quorum
    quorum: 2
  schema_sensing:
    enabled: true
    deep_inspect:
      enabled: true
      max_depth: 3
    sampling:
      warmup_events: 50
      sample_rate: 5
```
📘 Full reference: Configuration docs · View actual examples: Example Configurations
- Outbox pattern support
- Persistent schema registry (SQLite, then PostgreSQL)
- Protobuf encoding
- PostgreSQL/S3 checkpoint backends for HA
- MongoDB source
- ClickHouse sink
- Event store for time-based replay
- Distributed coordination for HA
Licensed under either of
- MIT License (see `LICENSE-MIT`)
- Apache License, Version 2.0 (see `LICENSE-APACHE`)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you shall be dual licensed as above, without additional terms or conditions.
